So I’m trying to decide whether my recent encounter with rogue punctuation is an extension of Julian’s arguments about case insensitivity, or a counter-argument.
I’m sure you all read Julian’s blog, but allow me to briefly summarise his argument. Basically he says that when capitalisation rules are ignored, the meaning of words is unchanged. So whether I write JULIAN or Julian, it is obvious that I am referring to the same person. Obvious to humans, that is. Hence it is desirable for computers to use similar rules when dealing with information obtained from humans.
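To make that concrete, here is a minimal sketch of a case-insensitive comparison in Python. The function name `same_ignoring_case` is made up for illustration and does not come from Julian's post; `str.casefold()` is the standard library's caseless-matching helper.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings while ignoring capitalisation.

    Hypothetical helper for illustration only. casefold() is a more
    aggressive lower() intended for caseless matching.
    """
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(same_ignoring_case("JULIAN", "Julian"))  # same person, different case
    print(same_ignoring_case("Julian", "Julien"))  # genuinely different names
```

Note that this only ignores case. Any other difference between the two strings, however trivial it looks to a human, still makes them unequal.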
There is no longer any excuse for making humans learn and handle the quirks of the way computers store upper- and lower-case characters. Instead, software should handle the quirks of human language.
It’s hard to disagree with this. But it does pose the question: are there other types of transformations besides case changing that should be considered semantically neutral? And if so, should we expect software to deal with them?
Consider the following sentence:
You are 2-3 times more likely to use correct punctuation than I.
Did you spot the deliberate punctuation mistake?
No? It’s not an easy one. I used a hyphen (-) when I should have used an en dash (–). Don’t know or care about the difference? You’re not alone. I think it’s fair to say that in everyday writing, hyphens and en dashes are more or less interchangeable.
In this case it seems a fairly easy prospect for a computer to make the relevant transformation in order to determine equivalence between two passages of text. So it could, for example, realise that “Accounts 2004-2005.doc” is really the same file as “Accounts 2004–2005.doc”.
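A rough sketch of that transformation in Python might look like the following. The names `DASH_LOOKALIKES`, `fold_dashes` and `same_name` are invented for this example, and the choice of which characters count as "the same as a hyphen" is my assumption rather than any agreed standard.

```python
# Assumption: these characters are treated as interchangeable with the
# ASCII hyphen-minus. U+2010 is HYPHEN, U+2011 is NON-BREAKING HYPHEN,
# U+2013 is EN DASH. The list is illustrative, not exhaustive.
DASH_LOOKALIKES = "\u2010\u2011\u2013"

# Translation table mapping each look-alike to a plain "-".
_DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_LOOKALIKES})


def fold_dashes(text: str) -> str:
    """Replace dash look-alikes with the ASCII hyphen-minus."""
    return text.translate(_DASH_TABLE)


def same_name(a: str, b: str) -> bool:
    """Treat two names as equal if they differ only in dash style."""
    return fold_dashes(a) == fold_dashes(b)


if __name__ == "__main__":
    print(same_name("Accounts 2004-2005.doc", "Accounts 2004\u20132005.doc"))
```

The folding itself is trivial. The hard part is agreeing on the contents of the table.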
Software developers may find it more relevant to ask whether underscore (_) is really an upper-case hyphen? After all, on my keyboard anyway, the former is the shifted version of the latter. So does this deserve to get the case-insensitivity rule applied? Is ip-address really a typo for ip_address? I would say most people would regard these as equivalent, at least to the degree of similarity between
Apostrophe’s For Sale
Sometimes punctuation can be added without loss in meaning. The infamous grocer’s apostrophe is a good example of this: “Apple’s for sale” means the same thing as “Apples for sale”. And there’s also my personal punctuational bugbear, the possessive “it’s”. Just as with capitalisation, there is no argument about the grammatical incorrectness of these examples. But humans are of course fallible, and computers should accommodate our flaws, particularly when the intended meaning is still easily discernible.
However, unlike the relatively simple case transformation, it’s not obvious to me whether or not a computer could perform these transformations correctly in every instance.
The challenge here is in detecting false positives, or in other words, detecting transformations which are not semantically neutral. Just ask the proverbial panda who eats, shoots and leaves. Here, adding the comma changes the meaning of the sentence, and hence a computer should not consider the two versions to be the same sentence.
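The false-positive problem is easy to demonstrate with a deliberately naive sketch. The helpers `strip_punctuation` and `naively_equivalent` below are hypothetical, and I am assuming that "punctuation" means any character whose Unicode general category starts with "P", which covers both the typewriter apostrophe and the curly one.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character in a Unicode punctuation category (P*)."""
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(kept.split())


def naively_equivalent(a: str, b: str) -> bool:
    """Naive rule: two texts are 'the same' if they match sans punctuation."""
    return strip_punctuation(a) == strip_punctuation(b)


if __name__ == "__main__":
    # The intended, harmless case: the grocer's apostrophe.
    print(naively_equivalent("Apple\u2019s for sale", "Apples for sale"))
    # The false positive: the panda's comma changes the meaning,
    # but the naive rule cannot tell.
    print(naively_equivalent("eats, shoots and leaves", "eats shoots and leaves"))
```

Both comparisons come out equal, which is exactly the trouble: the rule that forgives the grocer also forgives the panda. Telling the two apart needs knowledge of meaning, not just of characters.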
The Slippery Slope
So if we allow computers to deal with case-abuse by making them case insensitive, should we not also empower them to deal with punctuation abuse? If not, why the discrepancy?
The pragmatic answer is probably that punctuation insensitivity is highly complex, or at least non-obvious. Perhaps punctuation problems are relatively uncommon, and so the benefit of making punctuation-insensitive software is not worth the cost of this complexity.
So while it seems quite arbitrary to say that unintentional capitalisation problems will be handled transparently, but not accidental punctuation problems, it should be accepted that there is almost certainly no one answer to What The User Expects the computer to do in all cases of the latter kind. Hence providing case insensitivity but not punctuation insensitivity is probably a reasonable compromise for now.
The Case Against
On the other hand, I can imagine a purist approach which says that unless it can be guaranteed that the computer will handle in an identical manner all possible inputs that are semantically equivalent to each other, then all bets are off. An all-or-nothing approach, in other words.
I have some sympathy for this position. Implementing case insensitivity is really a special case (erk) which does not apply to the majority of the world’s written languages (or even to a majority of the world’s writers). I shudder to think what other transformations would be reasonably demanded to support languages other than the Indo-European ones.
In short, it may be easier to just adopt a “no transformation” policy when it comes to determining if one sequence of characters should be regarded as equivalent to another (excepting the behind-the-scenes transforms that are allowed by Unicode, such as character composition). The alternative is devising a set of transforms that are universally accepted as being valid for all cultures and languages.
This is left as an exercise for the reader. Extra marks for working code.
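The one exception granted above, Unicode character composition, is at least simple to sketch. The following assumes Python's standard `unicodedata` module and NFC (composed) normalisation; the function name `same_after_composition` is invented for this example.

```python
import unicodedata


def same_after_composition(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC.

    NFC composes a base letter plus combining mark into a single
    precomposed character where one exists, so the two spellings of
    an accented letter compare equal.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    precomposed = "caf\u00e9"   # e-acute as a single code point
    decomposed = "cafe\u0301"   # plain e followed by a combining acute accent
    print(precomposed == decomposed)                         # raw comparison
    print(same_after_composition(precomposed, decomposed))   # after NFC
```

This kind of transform changes only how a character is encoded, never which character the writer meant, which is why it sits comfortably inside a “no transformation” policy.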