Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Occam's Razor is a principle easily stated in Latin (entia non sunt multiplicanda praeter necessitatem) or English (entities should not be multiplied beyond necessity). Applying it to UTF-8 is an obvious matter -- the shortest form of the encoding is the correct one. :-)
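For anyone who has not run into a non-shortest form before, here is a small sketch in Python of what one looks like. The helper name encode_two_byte and the choice of "/" as the sample character are my own assumptions for illustration, not anything from the reports discussed in this post. It packs a code point into the two-byte UTF-8 bit pattern even when a single byte would do, and then checks what Python's own strict decoder makes of the result.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack a code point below 0x800 into the two-byte pattern
    110xxxxx 10xxxxxx, without checking whether a shorter form exists.
    (Hypothetical helper, for illustration only.)"""
    if not 0 <= code_point < 0x800:
        raise ValueError("code point does not fit in two bytes")
    return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])


def is_rejected(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


shortest = "/".encode("utf-8")        # b'/'  (one byte, 0x2F)
overlong = encode_two_byte(ord("/"))  # b'\xc0\xaf'  (two bytes, same code point)

print(shortest, is_rejected(shortest))
print(overlong, is_rejected(overlong))
```

Both byte strings carry the same code point, U+002F, but only the one-byte version is the shortest form; a decoder that follows the current rules should refuse the two-byte version.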
Anyway, Tom Gewecke has been wanting me to talk about a particular issue to do with the problem of non-shortest forms for a few months now....
First he sent me the following via the Contacting Michael link:
Around the same time, a rather extensive thread was going on over on the Unicode List about the same issue... the fact that IE 6.0 and Outlook HTML mail were accepting non-shortest form UTF-8 and interpreting it.
A little while later he posted a comment in a thread here:
Some Windows programs (IE and Outlook that I know of) will also interpret invalid UTF-8 sequences as if they were real characters. A test of this is at http://homepage.mac.com/thgewecke/badutf8.html
Of course, there is one of those rules that if you wait long enough (I just had not gotten to posting about the issue yet; I've been busy!), the issues will resolve themselves. He asked me just the other day via the Contacting Michael link:
First, I want to apologize to Tom; I had not meant to wait that long. I have been keeping busy lately, though, and at some point I'll probably be able to explain what has been keeping me busy.
Second, it does indeed look like folks on the IE team at the very least read the Unicode List, and were able to turn the feedback into the effort to fix the issue. I do not know the exact build that the fix is in, but I too have been told that it has been addressed. :-)
And last, I did want to point out one thing, about the issue in general. Especially as it relates to places that the final fix does not reach....
I believe that this was a very sensible issue to address, especially since a UTF-8 corrigendum went out some time ago that expressly stated that non-shortest forms of UTF-8 should not be accepted. The older text stated that while a non-shortest form should never be produced, it could be accepted by a process -- and it was under that older rule that a lot of the UTF-8 support in MS products was done.
The truth is that when Unicode makes a change to the standard such as this, it takes time before the change gets propagated (and the change does not always get propagated to every version of every product produced by every company)....
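The difference between the older rule and the corrigendum can be sketched as a single switch in a decoder. What follows is a minimal Python illustration under my own assumptions: the names decode_utf8, allow_non_shortest, and MIN_FOR_LENGTH are hypothetical, and this is not how IE, Outlook, or any other Microsoft product actually implements its decoding. With the switch on, the decoder behaves like the older text (never produce a non-shortest form, but tolerate one on input); with it off, it follows the corrigendum and refuses.

```python
# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8(data: bytes, allow_non_shortest: bool = False) -> str:
    """Decode UTF-8 by hand; allow_non_shortest=True mimics the older rule."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Work out the sequence length and the payload bits of the lead byte.
        if lead < 0x80:
            length, cp = 1, lead
        elif (lead & 0xE0) == 0xC0:
            length, cp = 2, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:
            length, cp = 3, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")

        # Every trailing byte must look like 10xxxxxx.
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)

        # The corrigendum's rule: a value that fits in fewer bytes is an error.
        if cp < MIN_FOR_LENGTH[length] and not allow_non_shortest:
            raise ValueError(f"non-shortest form at offset {i}")
        # Surrogates and values past U+10FFFF are never valid in UTF-8.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"code point out of range at offset {i}")

        out.append(chr(cp))
        i += length
    return "".join(out)


overlong_slash = b"\xc0\xaf"  # two-byte non-shortest form of "/"
print(decode_utf8(overlong_slash, allow_non_shortest=True))  # older rule: "/"
try:
    decode_utf8(overlong_slash)  # corrigendum rule
except ValueError as err:
    print("rejected:", err)
```

The reason the stricter rule matters is that a filter looking for the byte 0x2F will not see it in the non-shortest form, while a tolerant decoder further down the line will happily hand back a "/" anyway.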
It is one of the reasons that Microsoft has started taking a more active interest in adopting the most recent versions of Unicode rather than 'hanging back', an issue that I will talk more about soon. The goal is to make sure that things can be implemented sooner, and of course, if more people are reviewing things, there is a greater likelihood of finding and avoiding problems.
In the end, everybody wins (well everyone other than people with non-shortest form UTF-8 web pages?). :-)
This post brought to you by ༺ (U+0f3a, a.k.a. TIBETAN MARK GUG RTAGS GYON)