Sunday, May 22, 2005 5:38 PM
Michael S. Kaplan
When will this line end? And how?
I have talked about Chris Walker before.
He is one of the guys behind Notepad.exe for several versions, watching this uber-layer around a Win32 EDIT control be morphed into what some consider to be the most-used plain text editor on the planet.
Often when people complain about the behavior of international text in Word or WordPad, I ask them to try it in Notepad -- I can easily determine whether the problem is an issue in Word, RichEdit, or Uniscribe this way.
Anyway, after the first time I had posted about Notepad, Chris had suggested a bunch of interesting topics, and this post is about one of those topics.
How can you tell how a line ends?
Easy on Windows -- just put in a CARRIAGE RETURN followed by a LINEFEED (U+000d U+000a).
Easy in a completely incompatible way on UNIX platforms -- just a LINEFEED (U+000a) and nothing else (the C standard kind of does this, too, thus the rules about files opened in TEXT mode in the C Runtime).
And also easy in a completely different, completely incompatible way for some Apple systems, which use the CARRIAGE RETURN (U+000d) alone (although the fact that the newer versions have a UNIX base makes me wonder whether all of this is harder on an Apple now given the CR backcompat and the LF platform issue!).
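To make the three conventions concrete, here is a minimal Python sketch that counts each kind of terminator in a block of bytes and reports the most common one. The function names are made up for illustration, and it assumes an encoding in which CR and LF are the single bytes 0x0D and 0x0A (ASCII, UTF-8, or the Windows code pages, but not UTF-16).

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs, and lone LFs in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF to the raw counts,
    # so subtract the pairs to get the lone terminators.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def dominant_line_ending(data: bytes) -> str:
    """Return "CRLF", "CR", or "LF", or "none" if there are no terminators."""
    counts = count_line_endings(data)
    if not any(counts.values()):
        return "none"
    # Ties go to whichever convention is listed first in the dict.
    return max(counts, key=counts.get)
```

A file that mixes conventions is perfectly possible, which is why the sketch counts rather than stopping at the first terminator it sees.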
As Raymond Chen discussed last year in Why is the line terminator CR+LF?, there are a lot of people who wish that Notepad dealt with files that had only an LF, since lots of text files (such as the ones in the Unicode Character Database) have a .TXT filetype but Notepad cannot open them directly without assuming the whole file is on one line.
But of course it is not Notepad that is responsible for this functionality as much as the system EDIT control, which has its own rules about lines used by messages like EM_GETLINE and EM_GETLINECOUNT. Rules that would need to undergo some pretty big changes if the fundamental plain text definition of a line delimiter on Windows platforms ever changed. It would probably have to be a new set of messages, or a mode for the control. Or people could just use WordPad and the RichEdit control, which does the right thing with different line delimiters already. With some very interesting (where interesting is defined as potentially scary!) performance concerns....
Fixing an occurrence of this problem was actually one of the changes I was able to make in the Microsoft Access Import Text Wizard, which had the same problem for many versions. Then Jet 4.0 came out, with the ability to not only handle the multiple line terminators (which existed before) but also different encodings (which was definitely a new feature). The problem for these prior versions was that the wizard was using VBA's file I/O functions to load its sample text, and VBA is limited to the default system code page and CRLF (so the wizard would either show junk, or throw an error for a single line being too big -- a problem described in the KB in article 149946). It was a pleasure to fix both problems at the same time by getting away from VBA's inflexible file I/O system here. :-)
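Another way around the problem, outside the control entirely, is to rewrite the text so that every terminator is a CRLF before anything that only understands CRLF ever sees it. The Python sketch below shows the idea; `to_crlf` is a hypothetical helper, not something Notepad or the EDIT control does.

```python
import re

# Match CRLF first so a CR LF pair is treated as one terminator,
# then fall back to a lone CR or a lone LF.
_ANY_TERMINATOR = re.compile(r"\r\n|\r|\n")


def to_crlf(text: str) -> str:
    """Rewrite every CRLF, lone CR, or lone LF as CRLF."""
    return _ANY_TERMINATOR.sub("\r\n", text)
```

Running it on text that is already CRLF-only changes nothing, so it is safe to apply without first checking which convention a file uses. If the text comes from a file, open it with `newline=""` so that Python does not translate the terminators before the function sees them.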
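For illustration only, here is roughly what "load a few sample lines without assuming the system code page or CRLF" can look like in Python. This is not the wizard's or Jet's code; `read_sample_lines` and its parameters are hypothetical names, and the caller is assumed to already know the encoding.

```python
import io
from itertools import islice


def read_sample_lines(data: bytes, encoding: str, limit: int = 20) -> list:
    """Decode bytes with an explicit encoding and return the first few lines."""
    # newline=None turns on universal newlines: CR, LF, and CRLF each end
    # a line, and each is handed back as "\n".
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline=None)
    # islice stops after `limit` lines, so a huge file is never split in full.
    return [line.rstrip("\n") for line in islice(stream, limit)]
```

For example, `read_sample_lines("é,1\rb,2\nc,3\r\n".encode("utf-16"), "utf-16")` gives back three clean lines even though the input mixes all three terminators and is not in any single-byte code page.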