Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)

Sorting it all Out
Michael Kaplan's random stuff of dubious value


(The alternate title should be spoken with either a circa-1982 Jeff Spicoli or circa-1989 Theodore "Ted" Logan mannerism and accent)

U+feff has two jobs in the Unicode standard:

Job #1, and its namesake, is as a ZERO WIDTH NO-BREAK SPACE. The name kind of says it all. After all, we have U+00a0 (NO-BREAK SPACE) and U+200b (ZERO WIDTH SPACE). Yet we somehow needed to combine the two to create a character that has no width, yet you should not break a line between what is on one side of it and what is on the other. Later they decided that a different character is the preferred one for this job (U+200d, ZERO WIDTH JOINER), in part due to U+feff's violation of the moonlighting agreement and its apparent lack of focus (see job #2). But at its heart a conformant Unicode application does not have to do anything special with U+feff because this is a character that has no width. If it is between two characters and you completely ignore it, then you will get identical results to it not being there at all.

Update 21 January 2005 -- TLKH pointed out that the actual character that took over job #1 after U+feff was fired (well, deprecated) from this job is U+2060 (WORD JOINER) and not U+200d (ZERO WIDTH JOINER).

Job #2 is to act as a BOM, a Byte Order Mark. A signature at the beginning of a file with no "wrapper" to indicate its encoding -- someone could look at the byte stream and know by the pattern of bytes what the encoding might be:

00 00 fe ff      UTF-32, Big Endian

ff fe 00 00      UTF-32, Little Endian

fe ff ## ##      UTF-16, Big Endian

ff fe ## ##      UTF-16, Little Endian

ef bb bf         UTF-8
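
For anyone who wants to try this kind of byte sniffing, here is a minimal sketch in Python. The names sniff_bom and BOM_TABLE are made up for illustration; only the codecs.BOM_* constants come from the standard library. The one subtle part is ordering: the UTF-32 Little Endian signature (ff fe 00 00) starts with the same two bytes as the UTF-16 Little Endian one, so the four-byte patterns are tested first. Note that ff fe 00 00 could also be a UTF-16 Little Endian file whose first character is U+0000; this sketch simply assumes UTF-32 in that case.

```python
import codecs

# Four-byte signatures go first: the UTF-32 LE BOM (ff fe 00 00) begins with
# the UTF-16 LE BOM (ff fe), so testing UTF-16 first would misreport it.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 fe ff
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # ff fe 00 00
    (codecs.BOM_UTF8, "utf-8"),          # ef bb bf
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # fe ff
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # ff fe
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no signature matches."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically decode data[bom_length:] with the returned name, and fall back to some default when the result is (None, 0).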

And that last line is where it starts to get weird for people. Lots of the folks who support UTF-8 in other standards like XML note that you do not need the BOM when you have other means to document the encoding, and those standards use UTF-8 as their default encoding when none is specified anyway. And lots of other folks who support Unix tools that did not have to be completely changed to support Unicode, thanks to UTF-8, do not like these extra three bytes at the front of the file. Sometimes that is because they really only support ASCII or ISO-8859-1; other times it is because they just can't handle those three bytes right at the front, even though the same bytes later in the file would not matter.

Enter Microsoft.

(Yes, I know -- boo, hiss, etc.)

Microsoft has an application called Notepad. Application is perhaps an overstatement; it's just an uber-wrapper for a Win32 EDIT control. It stores the text in the edit control, which means if you open a file it must literally load all of the data to stick it into the control (which is by the way why it takes forever to open huge files). Over the years minor features and tweaks have been added. But in its soul it is just a plain old edit control.

When it was ported to NT the option of saving Unicode files had to be there since Unicode was there, so they added it. It was Little Endian UTF-16 (that was all the platform really supported back then) but they just called it Unicode since it was vaguely more likely that someone might have heard of that. And that is what the rest of the platform was doing.

Then in Windows 2000 they added the ability to save a file as Big Endian UTF-16, and since they were already calling Little Endian UTF-16 Unicode they decided to call the other form Unicode (Big Endian). I do not think this is so bad, certainly less controversial than calling it unicodeFFFE [1], but it definitely did irk some people who did not like one format being called Unicode as if others were not.

Incidentally, I think those people are probably right. But the number of people who don't really care what it is called since they will never use it does outnumber all of the people who do care, so I kind of understand the logic behind the lack of detail that would confuse...

But then the worst sin of all was committed -- Notepad also added UTF-8 support. And of course the issue with the BOM had to come up.

The folks on the Shell team who did this recognized that if the file only had ASCII characters that it could be called UTF-8 or it could just be using the default system code page. So if a user intentionally saved it as UTF-8 then they would be confused if opening it again would not appear to remember that it had been saved in such a way. So they add a BOM when it is UTF-8, to tag it as UTF-8 in a way that is 100% conformant with Unicode.
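
The ambiguity described above is easy to see in a small Python sketch. The code page cp1252 is only a stand-in for "the default system code page" (an assumption; the real one depends on the machine), and utf-8-sig is Python's name for UTF-8 written with a leading signature, not anything Notepad itself uses.

```python
text = "plain ASCII only"

as_utf8 = text.encode("utf-8")
as_cp1252 = text.encode("cp1252")      # assumed stand-in for a system code page
as_tagged = text.encode("utf-8-sig")   # UTF-8 with the ef bb bf signature in front

# For ASCII-only text the two untagged encodings are byte-for-byte identical,
# so nothing in the file says which one the user picked.
same_bytes = as_utf8 == as_cp1252

# The signature is the only thing that distinguishes the tagged form.
has_signature = as_tagged.startswith(b"\xef\xbb\xbf")
```

With ASCII-only content, same_bytes is True, which is exactly why a reopened file cannot "remember" that it was saved as UTF-8 unless something like the BOM tags it.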

This is completely legal and since Notepad is just a simple "Hans & Franz" wrapper around an EDIT control, it has no other means of understanding "envelope" information to tell anyone what the encoding is. What else could they do? The bug is in the people who use Notepad to edit HTML and XML, because they do not require a BOM. People still use it as a convenient editor of files, but the caveats are pretty clear....

People like Raymond Chen have posted about how some files come up strange in Notepad, but generally people do not have complaints about the way Notepad behaves.

But every 4-6 months another huge thread on the Unicode List gets started about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades [2] and about how Microsoft is evil for shipping Notepad that causes all of these problems and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, and so on, and so on.

We are about 30+ messages into such a thread right now, believe it or not. That did not inspire this post so much as the image of Sean and Keanu talking about it like surfer dudes did, though. :-)

No one ever has answers about the fact that if someone really supports Unicode, they should be able to handle a ZERO WIDTH NO-BREAK SPACE without breaking a sweat. If they can't, their tool or utility or application or whatever they have is broken, and it's their bug, not Microsoft's. At least, if they claim to support UTF-8, that is. Tools that support "all of UTF-8 as long as it starts with ASCII" and tools that cannot handle these three bytes at all are not really supporting UTF-8.
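
As a rough illustration of how little it takes to cope with those three bytes, here is a sketch in Python. The function name decode_utf8_text is hypothetical; the utf-8-sig codec is part of the standard library and, when decoding, drops one leading signature if it is there and otherwise behaves like plain UTF-8.

```python
def decode_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a single leading signature if present."""
    return data.decode("utf-8-sig")


raw = b"\xef\xbb\xbfkey=value\n"

# A plain UTF-8 decode keeps U+FEFF as the first character, so any check
# that looks at the very start of the text is thrown off by it.
naive = raw.decode("utf-8")

# The tolerant decode gives the same text with the signature removed.
tolerant = decode_utf8_text(raw)
```

Here naive.startswith("key") is False while tolerant.startswith("key") is True. A U+feff that appears anywhere other than the start is left alone by this codec, so it remains an ordinary (invisible) character in the text.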

And by the way, that includes Microsoft applications, too. In my opinion Frontpage 2000 kinda stunk (all things considered) because of this problem. Even though they added the cool "don't screw with my HTML" setting that I liked so much. I was very happy when Frontpage 2002 and 2003 fixed this problem. Just like I'm sure most others would be happy if people fixed their tools, as well....

I thought I'd briefly quote one of the posts to the Unicode List that was just done by Peter Constable:

As for whether plain text files can have a BOM, that is one of the few unending debates that arise with certain (fortunately not too frequent) regularity, each time with vociferous expressions of deeply-held beliefs but never any resolution. I'll just observe that the formal grammar for XML does not make reference to a BOM, yet the XML spec certainly assumes that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in any Unicode encoding form/scheme). Rather than have a philosophical debate about the definition of "plain text file", I suggest a more pragmatic approach: for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!

I agree 100% with his words and wish I could summarize the issues as clearly and as effectively as he can. :-)

For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM. Of course that would mean having to document it for people who probably struggle with the difference between Unicode and Unicef [3]. That does make this something of an uphill battle (doc. changes are the hardest and most resource intensive in changes like this), but perhaps worthy of a try. Maybe they could take out some of that "UTF-8 is for legacy" stuff that is in Notepad help now while they are there. What do you think? :-)

 

1 - Believe it or not, unicodeFFFE is actually documented as Internet Explorer's Preferred Charset Label for Unicode (Big-Endian). Periodically people report the name as a bug, since there is no such code point in Unicode as U+fffe. But the reason for the name is that if you look at a BOM of UTF-16 big endian on a system that is little endian, it will look like FFFE. Since that is not a valid character, it is easy to tell on a Little Endian system that the file must be Big Endian Unicode. The name is just acting as a sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.

2 - Never mind that Unicode has not existed for that long, let alone UTF-8!

3 - Someone once asked me at a conference how saving a file is able to contribute to a charity, and was it like one of those fake email chain letter things on her machine? And I did not laugh, though I admit I smiled pretty broadly as I explained to her about how Unicode was not Unicef. And I did laugh a bit afterward.

 

This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)
Though he was a little bitter about the lack of visible representation here, I was unable to find the little guy to spray paint him so that you could all see him here today. He is between those quotes, I can promise you that.

Blog - Comment List
  • Strange things are afoot at the U+004B U+20DD ;)
  • The XML specification explicitly permits a UTF-16 BOM at the beginning of the file or stream. Otherwise, it must start with the XML declaration (<?xml version=…>), no whitespace or other characters allowed. At least that’s how I’d interpret sections 4.3.3 and 2.1.
  • Heh, I used to work for Unisys. I always felt bad correcting people when they thought I said "Unicef", cause suddenly I'm not such the good samaritan that they thought I was...
  • Mike Dunn -- something not Kosher? :-)

    Centaur -- the XML spec allows the BOM; it even describes it. So anyone who does not allow it does so at their peril....
  • The Unicode FAQ talks about this issue a bit, also.

    http://www.unicode.org/faq/utf_bom.html#BOM

    With the number of bytes wasted in web/email communication over a character that takes up only 2-4 bytes in storage and no visible space, it is no wonder that people find Unicode to be complicated!
  • Back in visual studio, we had a few people who were really focussed on getting the editors to be really good Unicode citizens. My (possibly revisionist) history is that we actually introduced use of the utf-8 BOM over there around the time of win98 (vs 6). NT caught up when visual studio users were creating "text files" (whatever the heck /that/ means... :-) that other people couldn't open in notepad.

    Re: so much attention:

    My 1st dev mgr at Microsoft always noted that it was the little picayune issues that drew the most heated debates because everyone felt they understood /all/ the issues.

    to quote Kosh: the avalanche has started, it is too late for the pebbles to vote.

    UTF-8 has a BOM and people just need to learn to love it. (The tricky question is when to preserve/not preserve a BOM found in a byte stream...) I think you're right; just because something is 8-bit clean doesn't make it a good utf-8 citizen. It has to be very careful not to split an encoding (just like a good UTF-16 citizen has to know not to split high/low surrogates...)

  • Interesting! I had not heard this before... but I guess the timing is right. I never remember trying UTF-8 in VS6, did it really work?
  • I think the confusion reigns because people expect saving a file as UTF-8 to mean "Save it as UTF-8 if it contains non-ASCII characters, and ASCII otherwise", so they expect the BOM to be only present if characters with values greater than 127 are contained within the file.
  • What is the caret behaviour supposed to be when encountering such a character?

    I pasted the sponsor message into Notepad and I noticed that even though you don't see the BOM, you can definitely 'feel' it when moving the caret: you need to press the arrow key twice to get between the two quotes.

    Does that mean it's not completely true to say that apps may safely ignore it, especially at the beginning of a doc? If the app provides editing of the contents, users will have a weird experience and bug reports will flood in!

    Also, how does text rendering work? The BOM is not in the font I use in Notepad.
  • As far as I remember from the time when I implemented unicode line breaking algorithm for my editor, U+200d allows breaking before/after it.
    The real zero-width-non-breaking-space character (except for BOM) is U+2060, not mentioned in this article.
  • TLKH is right -- U+2060 (WORD JOINER) is the preferred character that took on the job formerly occupied by Job #1 of the ZWNBSP. (I will put a correction in on the page.)
  • Serge -- hard to say what the caret behavior should be here -- after all it *is* a space, even though it is zero width. The fact that it is deprecated makes it even less likely that implementations will do much more than ignore it....
  • "For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM."

    It would be convenient if UTF-8 could be selected as the "ANSI" codepage in the control panel's advanced regional and language options. Then Notepad and many applications designed for ANSI would automatically support UTF-8 (without BOM). I would prefer this because nowadays I rarely create text files with legacy ANSI encoding.

    For those few applications that make specific assumptions about the ANSI codepage (hard-coded strings with character codes >= 128 etc.), AppLocale provides a good solution:

    http://www.microsoft.com/globaldev/tools/apploc.mspx

    (A UTF-8 "ANSI" codepage may cause problems if the API implementation depends on assumptions like "ANSI character <= double-byte").
  • Unfortunately, this is not possible -- there are too many bugs in Windows and in apps for components that will not work with UTF-8 here....
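  • Over at Michael Kaplan's blog I saw Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!), which explains why Notepad in Windows 2000 and later adds a BOM (Byte Order Mark, U+FEFF) when saving a file as UTF-8. The main reason is that UTF-8 and ASCII are compatible, so to keep users from forgetting which encoding they saved with, which would lead to a UTF-8 file being treated as ASCII...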