Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Please read the disclaimer; content of Michael Kaplan's blog not approved by Microsoft!
Everybody hates Microsoft.
Well, not everybody.
But hating Microsoft seems awfully popular....
It seems like to try to be the best at anything you have to make choices that lots of people won't like. And then before you know it, people are hating you.
Everyone hates what Microsoft does with the BOM (Byte Order Mark). That thing I talked about in Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!).
Lots of people hate it so much that they will complain about it when it is not completely on topic, like in that other post (unicodeFFFE... is Microsoft off its rocker?).
But I feel I must ask one question.
Why are people writing their UNIX shell scripts in Notepad in the first place, such that Notepad saving a BOM in UTF-8 files becomes a problem?
I mean, people who are writing UNIX shell scripts are not guaranteed to be among the Microsoft haters, but all things being equal they are probably more likely to be than the people who pay their own fees to go to TechEd or PDC.
So why are they writing their UNIX shell scripts in Windows Notepad, exactly?
I'd just like it if someone could explain this one. It just makes no sense to me....
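For anyone wondering why the BOM matters to a shell script at all: an interpreter line is only recognized when the file's first two bytes are `#!`. Here is a minimal Python sketch of what the three extra bytes do to that check. The helper name `has_shebang` is made up for illustration, and `utf-8-sig` is Python's name for "UTF-8 with a BOM".

```python
import codecs

def has_shebang(data: bytes) -> bool:
    """Return True if the content starts with the two bytes '#!'."""
    return data.startswith(b"#!")

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")          # UTF-8 without a BOM
with_bom = script.encode("utf-8-sig")   # UTF-8 with a BOM prepended

print(with_bom[:3] == codecs.BOM_UTF8)  # True: EF BB BF comes first
print(has_shebang(plain))               # True
print(has_shebang(with_bom))            # False
```

The script text is identical in both cases; only the leading bytes differ.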
This post brought to you by U+fffe, a permanently reserved code point in Unicode so that BOM determination can remain easier....
I don't understand where the problem comes from.
If a program is Unicode aware, then the only time it takes any notice of the BOM is when it is trying to work out how the string was stored on disk. If the program is not Unicode aware, why did you save the file as UTF8?
Any program that doesn't care about encoding will not even notice the BOM anyway, since it is simply passing the contents of a file around.
If you wanted a plain text file, why did you ask for a UTF8 encoded one? To me it sounds more like a PICNIC (Problem In Chair Not In Computer) type problem than anything else.
Like I said at the start, I may have missed something, but it wouldn't surprise me if it's just people bitching at Microsoft for their own mistakes.
Saying that something is UTF8 compatible so long as the first few bytes aren't UTF8 is basically lying. What should be said is "it works but it's a bit of a bodge, and breaks unless the data has been specifically crafted in such a way that the bodge doesn't fail. So remember to keep the bodge in mind.... or else!"
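To make the "Unicode-aware program" case concrete, here is a minimal Python sketch of BOM sniffing: look at the first few bytes, and if they match a known signature, use that to decide how the rest was stored. The function name `sniff_bom` and the lookup table are illustrative, not taken from any particular program.

```python
import codecs

# Longer signatures go first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))  # ('utf-8', 3)
print(sniff_bom(b"hello"))              # (None, 0)
```

A caller would decode `data[bom_length:]` with the returned encoding and fall back to some default when no BOM is found.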
Actually, Unix folks do complain about this all the time; it isn't Microsoft people.
But I think you are right, and it still requires an overt act (on the part of the person changing the encoding in the Save As dialog), and it is therefore pilot error....
I think in summary my point could be expressed as: it is not UTF8 compatible if it can't handle all legal UTF8 strings.
If it can't, then fix the app rather than blaming the perfectly legal data :). If that means that you need to maintain something that has been hobbling along for decades, then so be it.
Also, when I said "Bitching at Microsoft" I probably should have said "Bitching to Microsoft" or "Bitching about Microsoft". I see where the confusion came from; I was trying to say that it seems people are all too ready to blame Microsoft for their own mistakes (it appears this is especially the case in the Open Source and Apple communities). My fault, and for that I apologise :).
Agree++ and no worries -- it is in fact the reason I wrote this [very direct] post. :-)
>>Actually, Unix folks do complain about this all the time; it isn't Microsoft people.
I am not a Unix guy, but it still bothers me that there is no plain text editor in Windows. Notepad comes close, but then it tries to play smart by adding a BOM to something that did not have one.
I hate applications that think they know better than me and I would prefer a "pipe safe" behavior.
If you wanted plain text you should not have asked for UTF8. UTF8 is not plain text, simple as that.
What you ended up with is perfectly valid, explicitly marked UTF8. Exactly what you asked for. Unfortunately, or probably fortunately*, computers can't read minds at this stage of their development.
*Presumably if they could read minds they would probably be capable of independent thought. I have a few ideas about what a computer might think about people who ask for one thing but want another and blame the computer for what they end up with. I suggest you try asking for something in a fast food restaurant and then complaining that deep down you really wanted something else even though you didn't say so. I also suggest the results of such an experiment should be posted online somewhere for everyone to see (with photos of the contents of your replacement order, of course). :)
One of the most compelling features of UTF-8 is backwards compatibility with old programs. The additional character at the start throws that away. Suddenly, UTF-8 is no better than UTF-16 or UTF-32 or something. So why have the option at all? To save some space? I don't think Notepad is designed for really big files for which the difference matters.
That said, back when I still used Windows enough to end up having to edit a UTF-8 file, I had SciTE installed and did not use Notepad. But I probably will have to add BOM-stripping to a game I work on because a lot of the game's content authors use Notepad...
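For what it's worth, BOM-stripping on the reading side tends to be a few lines. A minimal Python sketch follows; the function names are hypothetical and nothing here is taken from the game mentioned above. Python's built-in `utf-8-sig` codec skips a leading BOM when decoding and is harmless when there is none.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM from raw bytes, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def read_text(data: bytes) -> str:
    """Decode UTF-8 content, tolerating an optional leading BOM."""
    return data.decode("utf-8-sig")

from_notepad = codecs.BOM_UTF8 + "name = Zoë\n".encode("utf-8")
from_elsewhere = "name = Zoë\n".encode("utf-8")

print(strip_utf8_bom(from_notepad) == from_elsewhere)         # True
print(read_text(from_notepad) == read_text(from_elsewhere))   # True
```

Only a BOM at the very start is removed; a U+FEFF that appears later in the content is left alone.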
<<If you wanted plain text you should not have asked for UTF8. UTF8 is not plain text, simple as that.>>
Unicode (and utf-8) is all about plain text (http://unicode.org/glossary/#plain_text)
Or, to quote someone else <<All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs.>>
(http://www.joelonsoftware.com/articles/Unicode.html)
I agree with Mihai. UTF-8 is plaintext. Microsoft's use of BOM in Notepad is just a very clever hack to deal with the problem that without external metadata, it can be difficult or impossible to identify which plaintext encoding you mean. In the case of Notepad, they simply put the encoding metadata in the file itself if it is any flavor of Unicode.
I agree the BOM in Notepad is ok sometimes.
If I start from scratch, or from UTF-16LE/BE, or from an ANSI file, and I say "Save as UTF-8": fine, do whatever.
But if I open a file with no BOM, Notepad thinks it is UTF-8, and I change one character and save, I don't want Notepad adding the BOM.
The BOM was not there, Notepad was able to detect UTF-8, so why mess with it?
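The behavior being asked for here can be sketched in a few lines of Python: remember whether the input had a BOM, and write the output the same way. This is only an illustration of the idea, not a description of how Notepad works, and `edit_preserving_bom` is a made-up name.

```python
import codecs

def edit_preserving_bom(data: bytes, edit) -> bytes:
    """Apply `edit` (a str -> str function) to UTF-8 content,
    keeping a leading BOM only if the input already had one."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    text = data.decode("utf-8-sig")        # skips the BOM if present
    out = edit(text).encode("utf-8")       # plain UTF-8, no BOM
    return codecs.BOM_UTF8 + out if had_bom else out

no_bom = "héllo".encode("utf-8")
with_bom = codecs.BOM_UTF8 + no_bom

print(edit_preserving_bom(no_bom, str.upper))    # no BOM in, no BOM out
print(edit_preserving_bom(with_bom, str.upper))  # BOM in, BOM out
```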
This is an old post, but I still want to comment:
"the UTF-8 BOM [...] is the only way to distinguish UTF-8 from ASCII"
Distinguishing between the two is pointless. If a program understands UTF-8 then it doesn't need any special ASCII mode and therefore doesn't need to distinguish. If a program does not understand UTF-8 then it's not able to understand the BOM and therefore isn't able to distinguish. The program would have to be taught at least enough about UTF-8 to read the BOM. However, most of such programs are legacy and won't be updated to take advantage of a distinguishing mark.
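Both halves of that argument can be checked with a short Python sketch. ASCII-only text produces the same bytes whether it is encoded as ASCII or as UTF-8, and a program that reads bytes in a legacy code page does not see a marker at all, just three stray characters. Windows-1252 (`cp1252`) stands in for "a legacy encoding" here; that choice is an assumption for the example.

```python
import codecs

text = "echo hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)   # True: byte-for-byte identical

# What a legacy, non-UTF-8 reader makes of a UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + utf8_bytes
legacy_view = with_bom.decode("cp1252")
print(legacy_view)                 # ï»¿echo hello
```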
Um, as a Notepad feature, it is NOT pointless. Perhaps people could do less UNIX shell script authoring in Notepad? :-)
Even just as a Notepad feature UTF-8 BOM seems dubious at best. The argument seems to be that the UTF-8 BOM is helpful in preventing users from being confused when opening files previously saved as UTF-8, but which contain only ASCII characters, and they see the file marked as being a legacy encoding. This reasoning seems specious since in retrospect it doesn't seem to have held on any of the other platforms. It'd be interesting to know if this reasoning originated in some programmer's gut or if Microsoft actually had significant real-world data at the time they were implementing UTF-8 in Notepad.
Then there are the downsides. The feature inevitably escapes the domain where it acts as a Notepad feature. This feature manufactures tricky questions like when to preserve/not preserve a BOM found in a byte stream. Standards have to be redefined, as John Cowan mentions of XML. Far more time has been spent arguing about UTF-8 BOMs than ever would have been spent by users confused over why they have to re-select the encoding for their files. Notepad fails to meet the needs of what is apparently, according to your comments, one of its major customer demographics. ; )
Actually, it is used by Visual Studio, the C/C++ compiler, and many other apps that want to make it easier to detect without having to look ahead in the file.
There are even apps like FrontPage which can handle either.
Almost all apps outside of the niche Unix scenario work just fine, so they should STOP USING NOTEPAD!!!
I think the same reasoning I applied to Notepad also applies to other applications. And even if it didn't hold, and UTF-8 BOM was actually useful in some circumstances, that doesn't answer any of the other criticisms.
"Almost all apps [...] work just fine,"
Arguing that UTF-8 BOM breaks almost no apps doesn't inspire confidence that this was the right decision back when it was made.