Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
I have people ask me with an alarming frequency why LCIDs jump around so that for example en-US is 0x0409 and en-GB is 0x0809, and so forth. Why wouldn't it be 0x0409, 0x0509, 0x0609, etc.?
To answer the question, let's look at the diagram of LANGID contents, found in winnt.h:
//
//  A language ID is a 16 bit value which is the combination of a
//  primary language ID and a secondary language ID.  The bits are
//  allocated as follows:
//
//       +-----------------------+-------------------------+
//       |     Sublanguage ID    |   Primary Language ID   |
//       +-----------------------+-------------------------+
//        15                   10 9                       0   bit
//
//
//  Language ID creation/extraction macros:
//
//    MAKELANGID    - construct language id from a primary language id and
//                    a sublanguage id.
//    PRIMARYLANGID - extract primary language id from a language id.
//    SUBLANGID     - extract sublanguage id from a language id.
//
If you look at the makeup, the primary language is 10 bits, and the sublanguage id is 6. Therefore, the "count by 0x0400" is there because that is what happens when you start in the eleventh bit (2^10 = 0x0400).
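To make the arithmetic concrete, here is a minimal sketch of that bit layout in portable C. The lower-case function names are hypothetical stand-ins for the MAKELANGID, PRIMARYLANGID and SUBLANGID macros (chosen so they do not clash with winnt.h); on Windows you would simply use the real macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the winnt.h macros, following the diagram:
   sublanguage in bits 15..10, primary language in bits 9..0. */
static inline uint16_t make_langid(uint16_t primary, uint16_t sub)
{
    return (uint16_t)(((sub & 0x3F) << 10) | (primary & 0x3FF));
}

static inline uint16_t primary_langid(uint16_t langid)
{
    return (uint16_t)(langid & 0x3FF);   /* low ten bits */
}

static inline uint16_t sub_langid(uint16_t langid)
{
    return (uint16_t)(langid >> 10);     /* high six bits */
}

/* Each new sublanguage of English (primary 0x09) is 0x0400 further along. */
static inline void langid_layout_demo(void)
{
    assert(make_langid(0x09, 1) == 0x0409);   /* en-US */
    assert(make_langid(0x09, 2) == 0x0809);   /* en-GB */
    assert(primary_langid(0x0809) == 0x09);
    assert(sub_langid(0x0809) == 2);
}
```

Shifting the sublanguage left by ten bits is the whole story behind the "count by 0x0400" pattern.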
Now there is another important consequence of this, as more and more LCIDs are added. Soon we will run out of eight-bit LCIDs, and suddenly there will be LCID values of 0x0500, 0x0501, 0x0502, and so on. At that point, all of the people who do not use the macros to parse through LCIDs (preferring to chop these 16-bit values into two equal 8-bit pieces) will assume that 0x0501 is a locale based on LANG_ARABIC. Yikes!
Unfortunately, I have seen a bunch of code over the years that does this.
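A small sketch of why the byte-split shortcut goes wrong, assuming nothing beyond the bit layout in the winnt.h comment. Both function names are made up for illustration; the second one does what PRIMARYLANGID does.

```c
#include <assert.h>
#include <stdint.h>

/* The shortcut: treat the low byte as the primary language. */
static inline uint16_t primary_by_byte_split(uint16_t langid)
{
    return (uint16_t)(langid & 0xFF);
}

/* The PRIMARYLANGID approach: keep all ten low bits. */
static inline uint16_t primary_by_mask(uint16_t langid)
{
    return (uint16_t)(langid & 0x3FF);
}

/* The two agree while primary IDs fit in eight bits, then diverge. */
static inline void byte_split_demo(void)
{
    assert(primary_by_byte_split(0x0409) == primary_by_mask(0x0409));
    assert(primary_by_byte_split(0x0501) == 0x01);  /* looks like LANG_ARABIC */
    assert(primary_by_mask(0x0501) == 0x101);       /* really a different primary language */
}
```

The shortcut happens to work for every value whose primary language ID is below 0x100, which is why such code can survive for years before it breaks.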
If you look back to the post I did about Lions and tigers and bearsELKs, Oh my! then you'll see that at the rate we are adding them, it will not take too much longer to hit this issue. If you have written or are the owner of such code, you have been warned. Fix your code today!
Now some people actually like to look at these numbers in decimal form (Raymond Chen talked about this last year when he asked What are these directories called 0409 and 1033?). The split seems to be:
HEXADECIMAL: Windows, Windows CE
DECIMAL: Office, VB/VBA/VS .NET, SQL Server
Since I have actually done work at one time or another for all of these groups, I got pretty good at knowing both of them (and doing fast conversions for unfamiliar ones). The hexadecimal LCIDs feel more natural to me, for what it's worth. Though some may think this is due to a bias now that my main home is Windows, for me it is just easier to parse through what the LCIDs are this way....
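For anyone who has not memorized both forms, the two notations are just different spellings of the same number. A minimal sketch, with a hypothetical helper name, that reads an LCID written either way:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse an LCID written in hex ("0409", base 16)
   or in decimal ("1033", base 10). */
static inline uint32_t lcid_from_text(const char *text, int base)
{
    return (uint32_t)strtoul(text, NULL, base);
}

/* The directory names 0409 and 1033 refer to the same locale. */
static inline void lcid_notation_demo(void)
{
    assert(lcid_from_text("0409", 16) == lcid_from_text("1033", 10));
    assert(lcid_from_text("0809", 16) == 2057);
}
```

The hex form keeps the sublanguage and primary language visible at a glance; the decimal form hides them.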
I'll talk more about LCIDs another time. There are a fascinating number of oddities with them....
This post brought to you by "Λ" (U+039b, GREEK CAPITAL LETTER LAMDA)
Helen Custer, in Inside Windows NT, describes the situation back then in an interesting way:
The lowest layer of localization is the representation of individual characters, the code sets. The United States has traditionally employed the ASCII (American Standard Code for Information Interchange) for representing data. For European and other countries, however, ASCII is not adequate because it lacks the common symbols and punctuation. For example, the British pound sign is omitted, as are the diacritical marks used in French, German, Dutch, and Spanish.
The International Standards Organization (ISO) established a code set called Latin1 (ISO standard 8859-1), which defines codes for all of the European characters omitted by ASCII. Microsoft Windows uses a slight modification of Latin1 called the Windows ANSI code set. Windows ANSI is a single-byte coding scheme because it uses 8 bits to represent each character. The maximum number of characters that can be expressed using 8 bits is 256 (2^8).
A script is a set of letters required to write in a particular language. The same script is often used for several languages. (For example the Cyrillic script is used for both the Russian and Ukrainian languages.) Windows ANSI and other single-byte coding schemes can encode enough characters to express the letters in Western scripts. However, Eastern scripts such as Japanese and Chinese, which employ thousands of separate characters, cannot be encoded using a single-byte encoding scheme. These scripts are typically stored using a double-byte encoding scheme, which uses 16 bits for each character, or a multibyte encoding scheme, in which some characters are represented by an 8-bit sequence and others are represented by a 16-bit, 24-bit, or 32-bit sequence. The latter scheme requires complicated parsing algorithms to determine the storage width of a particular character. Furthermore, a proliferation of different code sets means that a particular code might yield entirely different characters on two different computers, depending on the code set each computer uses.
I thought it was interesting the way some of the technology terms were framed. It definitely does not fit the terminology we use today for several different terms. But what really caught my eye was the implicit idea that each of these code pages was enough for a language, and that the only real problems were the lack of good cross-code page support and the difficulty of parsing some of the more complex cases.
The truth is much further from these points than you might guess, because there are very few languages for which a code page (especially one of the 'Windows ANSI' code pages) actually has adequate coverage. I'd say that these code pages are perhaps 'good enough' for some languages but do not really contain all of the characters one might want to use to fully express information in most languages. Unicode in this context becomes more than just a luxury -- if you are missing letters you need in your language then it becomes a necessity.
There was a recent thread in the microsoft.public.win32.programmer.international forum entitled "Developing ANSI application for multi-national Windows" where someone was strongly advocating not moving to Unicode because they believed their application (written in C, over 1 million lines, with over 50,000 strings, heavily relying on pragmas giving the code page and locale per source file to get their work done) was better served by keeping it all out of Unicode and relying on code page support. Of course almost immediately there were problems:
My biggest wonderment, which perhaps you can answer or even solve, is why a non-Unicode localized application (for MBCS languages) will only run properly if the *system* default locale is set to the proper language.
I run the international versions of XP and 2000, but only Unicode applications run properly unless the system default locale is set; there are no provisions that I have found that let me say, "This application uses Japanese.Japan.932." Dialog boxes, drawn text, and other problems are abundant.
These issues are obviated by Unicode, but for a project my size that is an undertaking that will take quite a while and detract from product enhancements that are necessary for the marketplace.
Though people did point to AppLocale as a workaround, the fundamental problems in trying to make a complex application work with such methods will (in my opinion) quickly outweigh the "benefits" of avoiding the move to Unicode. Because in the end, code pages are not really enough....
This post brought to you by "©" (U+00a9, a.k.a. COPYRIGHT SIGN)
One of the most common code points people complain they lose in their non-Unicode applications since it is not on all ACPs
More non-technical content....
You know the step you miss at the bottom of the stairs, or the one you try to take that is not there at the top? It takes your breath away. Kind of like today's news did. I'll explain...
I am taking Copaxone every day, for my MS. It's like taking Insulin or something.
I used to take Avonex and to be frank I liked the schedule better (just once a week). But I'd kind of feel like I had the flu for the next 36 hours, which kind of stunk, if you know what I mean.
So I was biding my time with the Copaxone.
Though some time this month I was going to be switching to the new drug that used to be called Antegren but was then renamed to Tysabri.
I was kind of annoyed at the wait (the hospital wanted to set up stuff and the infusions have to be in the doctor's office). But a drug that you only have to take once a month seemed like a dream come true, you know?
But then everything changed.
Yesterday morning, my brother-in-law forwarded me an article through email and asked if I had heard about it -- a headline today on cnn.com. It read MS drug pulled after patient dies. The drug companies (Biogen Idec and Elan Pharmaceuticals) voluntarily pulled Tysabri for an investigation after one patient died and another contracted Progressive Multifocal Leukoencephalopathy (PML), a rare but often fatal disease of the central nervous system. Both patients were taking Avonex and Tysabri together for over two years.
My first thought was how terrible that was. And I will admit my second thought was that maybe that would have been me some day, if they had actually rushed through that process stuff at the UWMC neurology clinic a bit faster.
The stock of both companies reportedly dropped on the news (small wonder, huh?).
Now tonight, I am looking at this Copaxone syringe and wondering if I am taking my life into my hands by falling into this trap that the drug companies have set up. It's a pretty profitable scam they have going there, you know?
Drugs that have a statistically significant chance of helping me will by definition have a statistically significant chance of doing the fractional value diddly/squat.
And I don't even want to think about what the rush to get what seemed like a promising drug "fast tracked" through the FDA could do, however minuscule the chance, to someone about to be on the bleeding edge of MS treatment. The Copaxone may be doing nothing at all for me (one has to love a $1000 per month placebo), but no one has died yet from taking it after over ten years. I'll take those odds over 1 in 5000 after just two years any day.
Now I know none of this probably even applies to me -- they have a ton of Avonex data and nothing like this has ever been seen. Same thing for Copaxone. And even if I were on the Tysabri, I would never have gone on a combination therapy with Avonex. Even bothering to think that I had a close call is like thinking I had a close call when I was stranded in Los Angeles on 9/11, waiting for a next day flight to San Jose. In other words, I was never in any kind of danger.
I am simply not feeling quite so experimental, if you know what I mean. I think waiting for the longer studies sounds like a safer plan....