Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
A few days ago, Raymond was talking about the Notepad file encoding problem, again. And the comments were pretty funny, like watching a traffic accident as people started going off the rails in all kinds of directions.
For the record, here is the official, UNDOCUMENTED, Notepad encoding detection story, only mildly changed from Windows 2000 Beta 2 through now (into Longhorn Server, which hasn't shipped yet):
Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so, it's a bug in Notepad beyond any bug in IsTextUnicode).
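To see why the BOM-less case is the hard one, here is a minimal sketch of BOM sniffing. It is an illustration only, not Notepad's actual code: the helper name `sniff_bom` is hypothetical, and the labels are simply the names Notepad shows in its dialog. With a BOM the answer is a simple prefix match; without one there is nothing to match, so anything further has to be a guess.

```python
import codecs

# Hypothetical helper for illustration only; not Notepad's actual logic.
# Pairs each BOM with the name Notepad's dialog uses for that encoding.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                   # EF BB BF
    (codecs.BOM_UTF16_LE, "Unicode"),             # FF FE
    (codecs.BOM_UTF16_BE, "Unicode big endian"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

with_bom = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
without_bom = "hi".encode("utf-16-be")  # same text, no BOM

print(sniff_bom(with_bom))     # Unicode big endian
print(sniff_bom(without_bom))  # None
```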
And frankly, if people were happy with the IsTextUnicode behavior in general, or with small files in particular, then the big hubbub I mentioned here and here wouldn't have been such a mini-phenomenon (as if people needed Notepad to comment on whether Bush hid the facts or not!).
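The "Bush hid the facts" case is easy to reproduce outside Notepad. The sketch below does not call IsTextUnicode and says nothing about how it scores its input; it only shows that the same 18 bytes are a perfectly valid UTF-16 LE string too, which is the ambiguity any BOM-less heuristic has to resolve.

```python
data = b"Bush hid the facts"  # 18 ASCII bytes, no BOM

as_ansi = data.decode("ascii")
as_utf16 = data.decode("utf-16-le")  # same bytes read as UTF-16 LE code units

print(as_ansi)        # Bush hid the facts
print(len(as_utf16))  # 9

# Every resulting code point lands in the CJK Unified Ideographs block,
# so the "wrong" reading is still plausible-looking text.
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16))  # True
```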
But then again, I already mentioned I don't like IsTextUnicode, for roughly some of the same reasons that the whole Notepad "detection" thing is a pain.
I don't like step 3 above, either -- the code may be fast, but it is also way behind the current algorithm used by MultiByteToWideChar, which has done a pretty good job of keeping up with the ever-changing Unicode conformance guidelines. I still haven't gotten my head around what it means for a file to meet the 1998 guidelines but not the latest UTF-8 conformance rules. Probably a lot of U+FFFD characters in the future, UTF-8 style (EF BF BD).
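As a rough illustration of that drift, here is a sketch using Python's decoder as a stand-in for "current rules". The choice of example is an assumption: RFC 2279 (1998) described UTF-8 sequences up to six bytes long, while RFC 3629 (2003) caps them at four, and nothing here confirms that this is the specific difference in Notepad's check. It does show where the EF BF BD bytes come from: input a current decoder rejects becomes U+FFFD, which is then written back out as those three bytes.

```python
# A five-byte sequence of the shape RFC 2279 (1998) described.
# RFC 3629 (2003) caps UTF-8 at four bytes, so current decoders reject it.
old_style = b"\xf8\x88\x80\x80\x80"

try:
    old_style.decode("utf-8")  # strict decoding under current rules
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD for whatever it cannot accept.
lenient = old_style.decode("utf-8", errors="replace")
round_trip = lenient.encode("utf-8")

print(strict_ok)                      # False
print("\ufffd" in lenient)            # True
print(b"\xef\xbf\xbd" in round_trip)  # True
print("\ufffd".encode("utf-8").hex()) # efbfbd
```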
But in the end I think it is unfair to pick on Notepad here. IsTextUnicode needs to be updated, as I said over two years ago here, and after that is done someone needs to go update Notepad to use the new detection stuff that is added.
In the meantime folks should not be so busy complaining about stuff before they understand it; as the above makes clear there is plenty of material to complain about accurately, later. :-)
This post brought to you by � (U+fffd, a.k.a. REPLACEMENT CHARACTER)
I would really like to see the Unicode names that Notepad uses clarified in a future version. Particularly confusing is "Unicode". At the very least, I suggest renaming this to "Unicode (Little Endian)"; however, "UTF-16 LE" would be better. Same for "Unicode big endian". UTF-16 BE or UTF-32 BE? I know the answer, but not from looking at that dialog. It already lists "UTF-8", so it would make things more consistent as well. These are very minor changes and should not affect any code.
Do you know if step 3 is done against the whole file or the first 256 chars or something like that?
I believe step #3 is done against the entire file, which is not what happens with step #2 since the function itself limits how many bytes it looks at....
CORRECTION -- I looked at the code a little closer -- it reads the first 1024 bytes to try to make the determination.
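For a feel of what a prefix-only check implies, here is a minimal sketch. It is not Notepad's code: the function name is hypothetical, the 1024-byte limit is taken from the correction above, and it validates against Python's current UTF-8 rules rather than the older ones the post describes. Two consequences fall out of looking only at a prefix: a character split by the cutoff has to be tolerated, and an invalid byte later in the file is never seen.

```python
import codecs

def prefix_looks_like_utf8(data: bytes, limit: int = 1024) -> bool:
    """Hypothetical check: is the first `limit` bytes well-formed UTF-8?"""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off by the limit.
        decoder.decode(data[:limit], final=False)
    except UnicodeDecodeError:
        return False
    return True

# 1 + 2*600 = 1201 bytes; the 1024-byte cutoff lands in the middle of an "é".
split_char = ("a" + "é" * 600).encode("utf-8")
late_error = b"a" * 2000 + b"\xff"   # invalid byte beyond the prefix
early_error = b"\xff" + b"a" * 10    # invalid byte inside the prefix

print(prefix_looks_like_utf8(split_char))   # True
print(prefix_looks_like_utf8(late_error))   # True: the bad byte is never examined
print(prefix_looks_like_utf8(early_error))  # False
```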
Although this would be better for people who understand things about Unicode, it is actually worse, and less understandable, for the majority of people. And it's not like the first group can't easily see what each one means just by process of elimination (or by looking in help!), which makes change even less likely.
Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!
I couldn't understand your statement related to EF BF BD.
Post .NET 2.0, I have seen that StreamReader converts all invalid characters into the EF BF BD sequence. (I couldn't find any official documentation about this, though.)
Are you talking about this?
(Excuse the Shaggy reference!) It wasn't me. Well, this time it wasn't me. I mean, yes, it was me in