Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Everybody hates Microsoft.
Well, not everybody.
But hating Microsoft seems awfully popular....
It seems that to be the best at anything you have to make choices that lots of people won't like. And then, before you know it, people are hating you.
Everyone hates what Microsoft does with the BOM (Byte Order Mark). That thing I talked about in Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!).
Lots of people hate it so much that they will complain about it when it is not completely on topic, like in that other post (unicodeFFFE... is Microsoft off its rocker?).
But I feel I must ask one question.
Why are people writing their UNIX Shell scripts in Notepad such that the issue of Notepad saving the BOM in UTF-8 is such an issue?
I mean, people who are writing UNIX shell scripts are not guaranteed to be among the Microsoft haters, but all things being equal they are probably more likely to be than the people who pay their own fees to go to TechEd or PDC.
So why are they writing their UNIX shell scripts in Windows Notepad, exactly?
I'd just like it if someone could explain this one. It just makes no sense to me....
This post brought to you by U+fffe, a permanently reserved code unit in Unicode so that BOM determination can remain easier....
This one is simple... a whole lot of folks writing UNIX shell scripts are interfacing with UNIX servers via MS Windows workstations. Rather than use VI/Brief/EMacs etc. they write the scripts using notepad and upload them to the server. Been there, done that...
It's not just shell scripts, and it's not always notepad.
I recently had to deal with PHP files which had been edited in a text editor which added the so-called UTF-8 BOM to them. PHP is quite transparent, so it happily output the BOM -- and later the script tried to set a header, which was not possible since the output had already started (you can only set the headers before the first output byte, unless you enable output buffering). That particular PHP script needed a specific header value to be output, so it stopped working.
Shell files are just a particularly troublesome instance of the problem (not only does the kernel look for the magic value on the first two *bytes* of the file, but also a stray \r before the final \n on the line is included in the command line -- often causing strange error messages).
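The kernel's magic-value check described above can be sketched in a few lines; the script content here is a made-up example, but the byte values are the real ones involved:

```python
# Sketch: how a UTF-8 BOM hides the "#!" magic that the kernel
# looks for in the first two bytes of an executable script.

clean = b"#!/bin/sh\necho hello\n"
bommed = b"\xef\xbb\xbf" + clean  # what a BOM-writing editor saves

# The kernel's check only ever sees the first two bytes:
print(clean[:2])   # b'#!'        -> recognized as a script
print(bommed[:2])  # b'\xef\xbb'  -> not "#!", so exec fails
```

The same mechanism explains the stray `\r` problem: the kernel and shell treat everything after `#!` up to the `\n` as part of the interpreter command line, carriage return included.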
The main cause of the problem is that the so-called UTF-8 BOM breaks the very useful property of UTF-8 that, if your text can be represented as pure 7-bit ASCII, its UTF-8 representation will be bytewise identical. A lot of Unix tools depend on that property (which is also true for several other character encodings).
The reality is that the UTF-8 BOM (which is not "so-called" -- it IS one, and is described in the Unicode Standard) exists and is the only way to distinguish UTF-8 from ASCII -- so if one does not like it, one should use another editor?
Or another OS that deals with things as they are instead of things as they were? :-)
Unless the text file only has 7-bit characters (in which case it makes no difference), it's very easy to distinguish UTF-8 from ASCII: UTF-8 has bytes with the eighth bit set :-)
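The commenter's point can be shown directly; this is a trivial sketch of the "eighth bit set" test, with made-up sample strings:

```python
# Sketch: non-ASCII UTF-8 text identifies itself without any BOM,
# because every non-ASCII character contributes bytes >= 0x80.

def has_high_bytes(data: bytes) -> bool:
    """True if any byte has the eighth bit set (i.e. not pure ASCII)."""
    return any(b > 0x7F for b in data)

print(has_high_bytes(b"plain ASCII text"))      # False
print(has_high_bytes("café".encode("utf-8")))   # True ('é' -> 0xC3 0xA9)
```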
Most Unix tools are "eight-bit transparent": they don't care about which character encoding you are using, they simply pass most bytes unchanged (the exception is the byte values they care about, which are almost always in the ASCII range; for a trivial example, the filesystem and filesystem-related tools only care about '\0', '/', and '.'). This is how they were able to use an encoding (UTF-8) which didn't exist when they were designed, as long as the encoding is ASCII-compatible (UTF-8 was designed to be ASCII-compatible from the beginning).
To these tools, the UTF-8 BOM is just another charset-specific sequence of characters to be passed unchanged. This breaks when the context doesn't accept any extra character (like on the first line of a shell script), or when the presence of a non-whitespace (ASCII whitespace, that is) character makes a difference (the PHP example is one case where any character, even an ASCII whitespace character, would break). The UTF-8 BOM, being invisible "junk", is not noticed by the person editing the file, but is noticed by these programs, causing the breakage.
Unix tools don't add a UTF-8 BOM, and either are charset-agnostic (this is the case with the shell and AFAIK also with PHP), or use the current encoding (LC_CTYPE, nowadays UTF-8 by default on most distributions), or are able to autodetect defaulting to the current encoding (this is the case mostly for text editors, like vim). It's the charset-agnostic ones who break with the UTF-8 BOM (they interpret it as valid data, not some odd sort of embedded metadata).
Unless the text file only has 7-bit characters (in which case it makes no difference)
Actually, it does.
If the user says that they want to save a file as UTF-8, then it makes sense to remember that fact. It is much friendlier than forgetting what they did!
If you want to keep things in ASCII, that works too - just don't save it as UTF-8 and it works just fine. :-)
It takes an overt act to break the shell script -- you have to explicitly choose an encoding that will do so....
Do the Microsoft Interix tools (Services for UNIX before Vista, a different package on Vista and later) play nice with the BOM?
Good question -- I am not sure (I have only ever had them installed briefly, for the irony of the name and to look at the code pages added thereby).
I couldn't disagree with you more on this one, Michael.
It's true that the 8-BOM is now part of the standard, and it was even necessary for the XML Core WG to backpatch XML 1.0 to accept 8-BOMs after it became clear that they *would* appear in XML files, will we nill we. But it's still a gratuitous incompatibility between full UTF-8 applications and applications that can simply be ASCII-aware as long as they are 8-bit clean.
In the research OS "Plan 9 from Bell Labs", for which UTF-8 was actually designed, there were 170 command-line programs packaged with it at the time of the conversion to UTF-8, from simple utilities to compilers and interpreters. Only 23 of these needed to be made UTF-8 aware in the sense above; the rest could treat all their string inputs and outputs as 8-bit vectors or streams, assuming only ASCII.
Here's a nice quote from Rob Pike and Ken Thompson's paper on the UTF-8 conversion:
The Unicode Standard [as it then was] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines [that is, using diverse operating systems] by different manufacturers, it is impossible.
Cesar: actually, Posix filesystems don't care about '.' at all. The specific *names* "." and ".." are reserved, that's all. In fact, although we all think of filenames as strings, use them as strings, refer to them with strings, the truth is that Posix filenames are byte vectors with content restrictions, and Windows filenames are 16-bit-code-unit vectors with different content restrictions.
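The "byte vectors with content restrictions" point is observable from Python, which deliberately round-trips arbitrary filename bytes through surrogate-escaped strings; the sample byte string here is made up:

```python
# Sketch: POSIX filenames are byte vectors, not guaranteed-valid
# strings in any encoding. Python's os.fsencode/os.fsdecode expose
# this by round-tripping arbitrary bytes losslessly.
import os

name_bytes = b"not-valid-utf8-\xff"     # legal POSIX filename bytes
name_str = os.fsdecode(name_bytes)      # surrogate-escaped str
assert os.fsencode(name_str) == name_bytes
print(repr(name_str))
```

On Windows the native unit really is the 16-bit code unit, so the same API pair does the mapping in the other direction.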
It shouldn't be /that/ hard to get (UNIX shell of your choice) to be BOM-aware.
And I must admit that I've written my share of UNIX-intended text files in Notepad.
The Unicode Consortium was thinking of files, not pipes.
Actually, with the UTF-8 BOM, they are still clearly thinking about files. As is Notepad -- so they all have something in common!
Though perhaps a note could be added to the help file to explain that Notepad is not "pipe-safe" ? :-)
Maurits: The point is that plain text in Unix (using the term "Unix" generically, of course) is a universal representation: absent a compelling reason to do otherwise, everything is represented as text. So it's not about fixing one particular shell: it's about making *every* program encoding-aware even when it's completely unnecessary.
Disclaimer: When I'm stuck with using Windows, I install Cygwin first thing and then live in it as much as possible. I also added BOM-stripping to the text conversion utility (dos2unix) that changes CR+LF pairs to just LFs.
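The conversion the commenter describes is small enough to sketch; this is a minimal illustration of the idea, not the actual dos2unix implementation:

```python
# Sketch of a dos2unix-style cleanup: strip a leading UTF-8 BOM,
# then turn CR+LF pairs into bare LFs.

BOM = b"\xef\xbb\xbf"

def to_unix(data: bytes) -> bytes:
    if data.startswith(BOM):
        data = data[len(BOM):]
    return data.replace(b"\r\n", b"\n")

print(to_unix(b"\xef\xbb\xbf#!/bin/sh\r\necho hi\r\n"))
# b'#!/bin/sh\necho hi\n'
```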
Well, we take this all another way -- if you are moving to another platform, then there are a bunch of things you have to change, to fit in well with that platform -- from CRLF -> LF conversions to UTF-8 BOM prefix stripping to Unicode normalization for tools that do not understand canonical equivalence, and so on. If you are not willing to do these things, then it is a self-imposed bug in the process of the person working cross-platform without being willing to understand the full requirements of doing so. :-)
With that said, I have coded the change to add a "BOM-less UTF-8" save option three times over the last seven years, and the option for "CR-less new lines" twice, each time forwarding to the owners of Notepad at the time; in every case the code was not integrated into the product as neither change targets a core scenario for NOTEPAD.EXE that was of significant importance to merit the test, UA/UE, localization, and servicing costs thereof....
> everything is represented as text
Yeah, but BOM is precisely intended to disambiguate text.
> it's about making *every* program encoding-aware even when it's completely unnecessary.
The task of making *every* program encoding-aware is not as complicated as you imply. It probably suffices to make a few file-reading libraries encoding-aware, at least in the 99% case of ASCII-only text files.
I mentioned filesystem and filesystem-related tools; ls, for instance, hides files starting with a '.' by default, and several programs use file extensions (separated by a '.') to guess the file type when not told otherwise. This together with the two special directory entries is enough to make '.' also a significant character (together with '/' and '\0').
> It probably suffices to make a few file-reading libraries encoding-aware, at least in the 99% case of ASCII-only text files.
I'd say in 99% of the problematic cases the file-reading library is either the C library's stdio or the POSIX lower-level functions (open(), read(), write(), close(), ...). They are used both for text files and for binary files (which must not be converted). For fopen(), there's a mode flag, but SUSv3 says it "shall have no effect". For open(), there's no mode flag at all (O_BINARY and O_TEXT seem to be a Microsoft extension).
After you add a mode flag to open() (and either change the kernel or make open() no longer be a thin wrapper around the system call), you still have to choose either text or binary mode on each file-opening call of each program which reads or writes a file. Sometimes the program cannot determine this by itself, and would need new command line switches (for instance, consider cat(1) being used to concatenate two files: if the second file is a text file it should strip the UTF-8 BOM from it; if the second file is not a text file, it must pass the data unchanged, even if it looks like a BOM). For backwards compatibility and to avoid accidental data corruption, these switches would all have to default to "binary" mode. Since fopen() defaults to text mode, every program must be audited to add the binary-mode flag unless it really wants to use text mode (the flag isn't required on SUSv3, since it makes no difference). In the end, it's as much work as making every program fully encoding-aware.
There's also the question of what to do if a file is *both* a text file and a binary file. These do exist; the Sun Java self-extracting installer is an example (a shell script concatenated with an ELF binary). For these, opening as a text file would be wrong, but they cannot easily be identified (and if you open all shell scripts as text and try to convert the text, scripts like that Java installer will stop working).
All this for a problem which wouldn't exist if the so-called UTF-8 BOM didn't exist (why is it called a "byte order mark" if it isn't marking any byte order?). Different unicode normalization (or lack of it) or even extraneous CR characters (which most programs pass unchanged or discard; the kernel script loader is one exception) don't cause so many problems.
Back to the original question, the reason shell scripts are ever edited in Notepad is that Windows is too common (meaning even hardcore Unix users end up having to use it sometimes), and the only editors guaranteed to be on a Windows machine are Notepad and WordPad, and I'm not sure about the latter.
This is not about hating Microsoft; it's about hating one particular bad technical decision (the UTF-8 BOM, in this case), and the way it spills over into unrelated systems.
Sorry, I can't agree here. Notepad is supporting the scenario it needs, and not supporting a scenario it never agreed to -- so it behaves as designed to the betterment of those it was designed for.
The UNIX shell script scenario on Windows? The "text file and a binary file" being created in Notepad? Way out of scope here....