Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
NOTEPAD adds a BOM (Byte Order Mark) when you save a file in the UTF-8 encoding.
You'd think that since Windows Notepad has been doing this for over 319,680,000 seconds[2], and since the combined usage of Windows 2000[3], Windows XP, Windows Server 2003, Windows XP 64-bit, Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2 is so high that it may well blow your mind to calculate the number, people would have gotten over this by now.
As recently as yesterday[4], people were again making comments on the "Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?" blog post, the one where I officially suggested that people who don't like Notepad's behavior of inserting a BOM at the front of UTF-8 files had a simple remedy:
STOP USING WINDOWS NOTEPAD!
Yet for some reason people are still arguing it.
Please give up, it is over. If you were in a contest or duel for this[5], then you have lost the contest, been bested in the duel. The game is over[6].
A long time ago, someone decided that:
you should not be prompted[8] in a way like this:
and so that was the way the feature was coded.
There is probably an alt.i.hate.microsoft newsgroup somewhere on USENET that would be happy to hear your complaint on the matter.
But the world has moved on.
And Notepad (the apparent premier tool of UNIX shell script authors throughout the world) has let down a segment of customers who could have updated whatever is reading the scripts in less than a day, rather than complaining about this on and off for the last ~3700[9] plus days.
Your sacrifice is appreciated.
But please, go home now.
P.S. Isn't there some tool on UNIX that does this correctly[10]?
P.P.S. I will not include a screenshot of my private Notepad; I'm not trying to tease you here that badly....
1 - Well, not on the private Notepad I build from time to time from the Windows source, but that one is not one that is released to the public.
2 - Over ten years, give or take.
3 - Where this first started happening.
4 - The day before today.
5 - Which none of you were, who are you kidding?
6 - Even more over than the Canadians in that game last night.
7 - Which ironically, most UNIX shell scripts are.
8 - This is a cool feature too, by the way.
9 - Over ten years, give or take.
10 - By your definition of "correctness", at least - a BOM-less UTF-8 save.
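On the point about updating whatever reads the scripts: the following is only a minimal Python sketch of a reader that tolerates the UTF-8 signature, not a description of any particular shell or tool. The function name `decode_tolerating_bom` is hypothetical; `utf-8-sig` is Python's standard codec that drops one leading BOM if it is present and otherwise decodes as plain UTF-8.

```python
import codecs


def decode_tolerating_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is there.

    Hypothetical helper for illustration: the "utf-8-sig" codec accepts
    input both with and without the EF BB BF signature.
    """
    return data.decode("utf-8-sig")


# The same script text, saved with and without the signature.
with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
without_bom = b"#!/bin/sh\necho hello\n"

# Both decode to identical text, so the consumer never sees the BOM.
assert decode_tolerating_bom(with_bom) == decode_tolerating_bom(without_bom)
```

A reader written this way handles both Notepad's output and BOM-less files, which is the "learn to live with it" position applied in both directions.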
But that's not a remedy at all. The usual problem is that some other donkey (not me) thought that Notepad was a suitable editor for producing an XML document.
However I'm not one of the people complaining. It's quite obvious to me that even if somebody at Microsoft did decide to fix the problem (or however you want to look at it) it would take years for the problem to actually go away. So it's easier to just deal with the situation.
For the XML case, you can blame the parser. From the spec [Emphasis mine]:
4.3.3 Character Encoding in Entities
Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
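To make the quoted requirement concrete, here is a rough Python sketch of how a processor might use the signature to tell UTF-8 from UTF-16. It is an illustration under my own assumptions, not how any real XML parser is implemented: the function name `sniff_signature` is made up, and it looks only at the BOM, ignoring encoding declarations and other encodings.

```python
import codecs
from typing import Optional


def sniff_signature(data: bytes) -> Optional[str]:
    """Return "UTF-8" or "UTF-16" if data starts with that BOM, else None.

    Hypothetical helper: it only checks the two encodings named in the
    quoted passage. A UTF-32 little-endian file also starts with FF FE,
    so a fuller detector would need extra checks.
    """
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):  # FF FE / FE FF
        return "UTF-16"
    return None  # no signature; the caller has to decide some other way


# U+FEFF encoded in each form is exactly the signature being tested for.
assert sniff_signature("\ufeff<doc/>".encode("utf-8")) == "UTF-8"
assert sniff_signature("\ufeff<doc/>".encode("utf-16-le")) == "UTF-16"
assert sniff_signature(b"<doc/>") is None
```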
Notepad++ supports syntax highlighting and multiple encodings. It seems silly to edit Unix shell scripts with vanilla Notepad, especially since it doesn't represent line breaks in the same way.
Sweet, a blog post just for me!
I'm not entirely sure what you mean by 'game over,' but it seems to be something along the lines of 'this is what Notepad does and that's final.' I have to agree that that's probably never going to be fixed (though developers for new apps probably have other options today, like using a file attribute, say). The thing we can forever lament (loudly, at every opportunity) is that people don't take your advice and we end up with people complaining about problems like http://support.microsoft.com/kb/301623
And then there's another kind of problem:
Anyone who agrees with Peter Constable's reasoning: "for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!" should also agree that processes that support UTF-8 are going to encounter UTF-8 data that doesn't begin with U+FEFF, and that they should 'learn to live with it.' For what reason wouldn't cl.exe support a switch for encoding?
To be fair, that language about UTF-8 BOMs wasn't added by the XML Core WG until the Third Edition of 2004, six years after XML first became a W3C Recommendation.
Not feeling as belligerent today as when your previous post came out, I guess.
"Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?"
Ok, I am not happy about Notepad doing that, but I got over it.
I would ask something else though: if UNIX/Linux claims to be Unicode-aware at least a little bit (after 19 years of Unicode), why doesn't the shell honor the BOM, using it to detect UTF-8 even if (for some crazy reason) my locale is set to en_US.iso88591?
Any reason to choke on it?
But... for such an old question, people have developed bash/tcsh scripts that perform "replacement of \r\n with \n" and "trim the BOM from the beginning of files" in the specified folders. *nix fans should be familiar with the usage of such tool(s).
Why should people complain when handy workarounds are readily available? :O
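For anyone who hasn't seen such a tool, the two steps mentioned above are small. This is a sketch in Python rather than the bash/tcsh scripts being described, and the function name `normalize_script_bytes` is invented for the example; it works on raw bytes so it makes no assumption about the rest of the file's contents.

```python
import codecs


def normalize_script_bytes(data: bytes) -> bytes:
    """Strip one leading UTF-8 BOM and convert CRLF line endings to LF.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # drop the 3 signature bytes
    return data.replace(b"\r\n", b"\n")  # Windows line endings to Unix


# A script as Notepad might save it: signature first, CRLF line endings.
saved_by_notepad = codecs.BOM_UTF8 + b"#!/bin/sh\r\necho hello\r\n"
assert normalize_script_bytes(saved_by_notepad) == b"#!/bin/sh\necho hello\n"
```

Applying it to every file in a folder is then a matter of reading each file's bytes, passing them through the function, and writing the result back.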
Just happened upon your blog because of some posts about Khmer. (I've always been interested in Unicode and now I'm trying to teach myself a thing or two about how foreign writing systems actually work.) Just thought I'd say, I like what you have to say here and might keep reading. :)
@Seth: "I have to agree that that's probably never going to be fixed"
How would you propose "fixing" it? With the message box in Michael's post?
What about the Subsystem for UNIX-based Applications?
I have been waiting for a long time to suggest that for a reason -- and no one has, yet!
One way to maintain that feature of Notepad without prohibiting the storage of UTF-8 text that doesn't begin with a certain character would be to stick the encoding metadata in a file attribute. Someone else may be able to think of a better way though.
Another possibility I hope to see someday is Windows using a UTF-8 codepage. Michael has said that's not possible, and if that's the case then it's a real shame that we'll never be able to move away from using legacy encodings as the default.
Seth, there is no file attribute that can be supported on every file system that Windows can support as not all of them have such mechanisms.
And a UTF-8 "code page" (which exists now -- 65001) would not change the nature of the problem.
The signature fixes the problem, and since the user who might hit that bug has no workaround while the user of Unix shell scripts has many, the current resolution, determined over a decade ago, can and will stand.
I guess I should have specified 'using a UTF-8 codepage _by default_', i.e. set the CP_ACP to UTF-8.
Currently Notepad has the option to save files using the CP_ACP, UTF-16 and UTF-8. As the UTF-8 signature is intended to disambiguate between files specified as UTF-8 and the ACP, if the ACP and UTF-8 options were the same then that dialogue you showed would never be needed even without a disambiguating mark. It seems to me that that does change the nature of the problem.
"The signature fixes the problem, and since the user who might hit that bug has no workaround while the user of Unix shell scripts has many, the current resolution, determined over a decade ago, can and will stand."
Again, I think the solution you offered (STOP USING WINDOWS NOTEPAD!) is perfectly reasonable. As long as that solution is always preferred over, say, changing standards to require parsers to accept a ZWNBSP that doesn't fit into their grammar at the beginning of files, then there's little to complain about here. (Though we can still complain about other programs failing to handle UTF-8 that doesn't begin with ZWNBSP.)
Re: file attributes
Well, perhaps it's enough if just NTFS supports it. Or maybe someday the file system support could be extended to support attributes on any file system the way some other systems support storing their arbitrary attributes even on e.g. FAT32.
I don't care too much about this, since really I'd like to just get away from legacy encodings, but something like this is necessary to support legacy encodings. Without it you have to put a special signature at the beginning of all text files to identify which encoding, or you have to guess at encoding, or you just have to put up with potential data corruption when a file created on a system with one code page is edited on a system with a different code page.
1. I would suggest that the dialogue box you showed, or better yet, a dialogue that presents the encoding options directly, _is_ a workaround.