Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
This is an issue that has been around for a long time.
Back in February (geez, I really have been blogging almost a year now, haven't I?), I explained the difference between Big Endian and Little Endian Unicode. In January I also talked about the Byte Order Mark.
Neither of them is what this post is about.
This post is about the Preferred Charset Label for web pages that are encoded with Big Endian Unicode (or 'Unicode big endian' as Notepad likes to call it).
It is indeed unicodeFFFE.
"But Michael, that is not a valid Unicode code point!" cry some.
"But Michael, that is not what the big endian BOM looks like in memory if one is looking at the bytes!" cry others.
"But Michael, that is not what the big endian BOM looks like on Big Endian systems!" cry some of those remaining.
"Michael, is Microsoft off its rocker?" exclaim a few of the rest (their language is at times less polite, but one email used this language so I decided to go with it).
And believe it or not, there are actually bugs raised by people on several different product teams over the years, who are unhappy with one or more of the following:
And then some of the words from people at Microsoft....
"The byte-order mark for big-endian unicode is FEFF, so this should be UnicodeFEFF. This seems like a valid complaint, but I was wondering if it'd break something else to change it." explains Shawn Steele, the development owner of encodings in Windows and the .NET Framework.
"I think this a mistake in the original MLang data, but we have to keep it for compatibility." explains the developer who used to own MLang, now the MUI Development Lead.
"Yes, it was a misnomer that we inherited from MLang. It’s too late to change that." explains the NLS Development Lead.
"Yes, this was wrong in the initial implementation. But now that apps are coded to it, we cannot change anymore." explains Software Architect Chris Lovett on the SQL Server team.
But the original truth about why it was in MLang in the first place is not quite this insidious. Basically, Windows (and Microsoft) are predominantly Little Endian shops (even when Windows ran on platforms that supported BE, like Alpha, the installs used LE). And when someone on a little endian system reads the big endian BOM in as if it were a WCHAR (thinking it to be a UTF-16 LE code unit), they see 0xFFFE, which is of course not a valid Unicode code unit. Thus it is easy to see it as the mark of a big endian file.
The BOM is always U+FEFF. Always. ALWAYS. But that means that in memory it is (in BYTEs):
This is because big endian systems take the first (big) byte first, where little endian systems take that second byte first. Which means that in memory it is (in WORDs):
Try it yourself on any platform you happen to have handy if you don't believe me. :-)
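If you would rather not take my word for it, here is a minimal Python sketch of that experiment. It is only an illustration: the variable names are made up for this example, and it simulates both byte orders with `int.from_bytes` rather than depending on the machine it happens to run on.

```python
# The BOM is always the code point U+FEFF, whatever the byte order.
BOM = "\ufeff"

# The same code point serialized in each byte order.
be_bytes = BOM.encode("utf-16-be")  # bytes as written by a big endian writer
le_bytes = BOM.encode("utf-16-le")  # bytes as written by a little endian writer

# Reading each pair of bytes as a 16-bit WORD in its own byte order.
be_word_on_be = int.from_bytes(be_bytes, "big")
le_word_on_le = int.from_bytes(le_bytes, "little")

# The interesting case: big endian bytes read by little endian code.
be_word_on_le = int.from_bytes(be_bytes, "little")

print("BE bytes:", be_bytes.hex())               # fe ff
print("LE bytes:", le_bytes.hex())               # ff fe
print("BE read as BE:", hex(be_word_on_be))      # 0xfeff
print("LE read as LE:", hex(le_word_on_le))      # 0xfeff
print("BE read as LE:", hex(be_word_on_le))      # 0xfffe
```

The last line is the view that the unicodeFFFE label describes: a big endian BOM as seen by code that assumes little endian.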
The semantic is clear and unambiguous, just not documented very well, and perhaps some would call it a rather silly way to think of it. The name is just acting as a somewhat sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.
And as people already pointed out, it is a bit late to be talking about changing it....
This post brought to you by U+fffe, a permanently reserved code unit in Unicode so that BOM determination can remain easier....
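To make the "BOM determination" part concrete, here is a rough Python sketch of how reserving U+FFFE helps. The function name `sniff_utf16_byte_order` is hypothetical, and the sketch makes a simplifying assumption: it only considers UTF-16 and ignores UTF-32 (whose little endian BOM also starts with the bytes FF FE).

```python
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from a leading BOM, if there is one.

    Returns "big", "little", or None when no BOM is present.
    Hypothetical helper for illustration; it does not handle UTF-32.
    """
    if len(data) < 2:
        return None
    # Read the first two bytes as if the data were big endian.
    first = int.from_bytes(data[:2], "big")
    if first == 0xFEFF:
        # The BOM reads correctly, so the big endian guess was right.
        return "big"
    if first == 0xFFFE:
        # U+FFFE is reserved and never appears in real text, so this can
        # only be U+FEFF with its bytes swapped: the data is little endian.
        return "little"
    return None


print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-be")))  # big
print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-le")))  # little
print(sniff_utf16_byte_order("hi".encode("utf-16-le")))        # None
```

Because the swapped value can never be a legitimate character, seeing it at the start of a file is unambiguous, which is the whole point of keeping it reserved.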
It's quite simple: Microsoft messed it up and did absolutely nothing to fix it (not even a deeply hidden option in Notepad).
So, most users use Windows, so... if we break text files this way (not help fix the problem in any way) all Unix users will get problems... which is good for us. Sounds convincing, doesn't it?
Hmmmm. Not sure what on earth that has to do with Big Endian UTF-16, which UNIX chokes on just as often as little endian UTF-16.
If UNIX wants to claim to have UTF-8 support except for that one character (the BOM) then I have a simple solution: STOP WRITING YOUR UNIX SHELL SCRIPTS IN WINDOWS NOTEPAD AND SAVING THEM AS UTF-8!!!!
Everybody hates Microsoft. Well, not everybody. But hating Microsoft seems awfully popular.... It seems
"Note that all of the above post (about trying to figure out whether you're looking at UTF-16 in big endian or little endian form) would be irrelevant if Microsoft had, like everyone else, simply accepted UTF8 as an on the wire/ on disk storage format. Once again Microsoft's desire to be anything-but-Unix costs their customers a lot of time and money for no gain."
MS was an early Unicode adopter, and UTF-8 was created in 1992. By then it was too late: the first NT betas had already come out, and NT was released in 1993.
Sometimes it depends on your point of view. Gaurav's question was: Hi, We have a question related to