About every ten months,
somebody new discovers
the Notepad file encoding problem.
Let's see what else there is to say about it.
First of all, can we change Notepad's detection algorithm?
The problem is that there are a lot of different text files out there.
Let's look just at the ones that Notepad supports.
If a BOM is found, then life is easy, since the BOM tells you
what encoding the file uses.
The problem is when there is no BOM.
Now you have to guess, and when you guess, you can guess wrong.
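The BOM check itself is mechanical. Here is a minimal sketch of that first step (a hypothetical helper, not Notepad's actual code), covering the three BOMs Notepad understands:

```python
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is no BOM."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'      # UTF-8 with BOM
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    return None                  # no BOM: now you have to guess
```

Everything that follows in this article is about the `None` case.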
For example, consider this file:
Depending on which encoding you assume, you get very different results.
Okay, so this file can be interpreted in four different ways.
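The original file's bytes aren't reproduced here, but you can cook up the same ambiguity yourself. The byte sequence below is a made-up example (not the file from the article) that happens to be legal in all four encodings Notepad supports, so every interpretation "succeeds":

```python
# Ten bytes that decode cleanly as 8-bit ANSI (cp1252 here), UTF-8,
# UTF-16LE, and UTF-16BE -- four different, equally "valid" readings.
data = b'\xd0\xa0\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbe'

readings = {enc: data.decode(enc)
            for enc in ('cp1252', 'utf-8', 'utf-16-le', 'utf-16-be')}

# As UTF-8 it's five Cyrillic letters; as cp1252 it's ten Latin-1-ish
# characters; as UTF-16 it's five code units in either byte order.
assert readings['utf-8'] == '\u0420\u0430\u0434\u0438\u043e'  # "Радио"
assert len(set(readings.values())) == 4   # all four readings differ
```

No amount of byte inspection can tell you which reading the author intended; that information simply isn't in the file.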
Are you going to use the "try to guess" algorithm from
IsTextUnicode?
(Michael Kaplan has some thoughts on this subject.)
If so, then you are right where Notepad is today.
Notice that all four interpretations are linguistically plausible.
Some people might say that the rule should be "All files without
a BOM are 8-bit ANSI."
In that case, you're going to misinterpret all the files
that use UTF-8 or UTF-16 and don't have a BOM.
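This is the classic mojibake failure mode. A quick illustration of what "misinterpret" means in practice (using cp1252 as a stand-in for the 8-bit ANSI code page):

```python
# A BOM-less UTF-8 file containing "café"...
raw = 'café'.encode('utf-8')          # b'caf\xc3\xa9'

# ...read under the "everything is 8-bit ANSI" rule:
mojibake = raw.decode('cp1252')

assert mojibake == 'cafÃ©'            # the é has become two junk characters
```

Every non-ASCII character in the file gets shredded this way, and a UTF-16 file fares even worse, since every other byte is a NUL.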
Note that the Unicode standard even advises against
using a BOM for UTF-8,
so you're already throwing out everybody who follows the
standard's recommendation.
Okay, given that the Unicode folks recommend against using a BOM for
UTF-8, maybe your rule is "All files without a BOM are UTF-8."
Well, that messes up all 8-bit ANSI files that use characters
outside the 7-bit ASCII range.
Maybe you're willing to accept that ambiguity, and use the
rule, "If the file looks like valid UTF-8, then use UTF-8;
otherwise use 8-bit ANSI, but under no circumstances should you
treat the file as UTF-16LE or UTF-16BE."
In other words, "never auto-detect UTF-16".
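That rule is easy enough to write down. Here is a sketch of it (again a hypothetical helper, with cp1252 standing in for ANSI), including a comment on where it goes wrong:

```python
def guess_encoding(data: bytes) -> str:
    """If the file looks like valid UTF-8, call it UTF-8; otherwise ANSI.
    Never auto-detect UTF-16."""
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'cp1252'

# The trap: a BOM-less UTF-16LE file of ASCII text is *also* valid UTF-8,
# because NUL bytes are legal UTF-8 -- so this rule happily misreads it.
assert guess_encoding('dir'.encode('utf-16-le')) == 'utf-8'
```

So even the "safe" rule quietly swallows UTF-16 files rather than detecting them.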
First, you still have ambiguous cases, like the file above,
which could be either 8-bit ANSI or UTF-8.
And second, you are going to be flat-out wrong when
you run into a UTF-16 file that
lacks a BOM, since you're going to misinterpret it as either
UTF-8 or (more likely) 8-bit ANSI.
You might decide that programs that generate UTF-16 files without
a BOM are broken, but that doesn't mean that they don't exist.
cmd /u /c dir >results.txt
This generates a UTF-16LE file without a BOM.
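You can see the same BOM-less output from Python's codecs, which make the distinction explicit: the `utf-16-le` codec writes raw little-endian code units with no BOM, while the `utf-16` codec prepends one.

```python
no_bom = 'dir'.encode('utf-16-le')   # what "cmd /u" effectively emits
with_bom = 'dir'.encode('utf-16')    # BOM in the platform's byte order

assert no_bom == b'd\x00i\x00r\x00'                   # no marker at all
assert with_bom[:2] in (b'\xff\xfe', b'\xfe\xff')     # BOM present
```

A reader that refuses to auto-detect UTF-16 has no way to tell `no_bom` apart from an ANSI file full of NULs.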
If you poke around your Windows directory, you'll probably
find other Unicode files without a BOM.
(For example, I found COM+.log.)
These files still "worked" under the old IsTextUnicode
algorithm, but now they are unreadable.
Maybe you consider that an acceptable loss.
The point is that no matter how you decide to resolve the ambiguity,
somebody will win and somebody else will lose.
And then people can start experimenting with the "losers" to find
one that makes your algorithm look stupid for choosing "incorrectly".