Sunday, February 06, 2005 11:37 AM
Michael S. Kaplan
Can a codepage be changed? How about which codepage a locale points to?
Earlier today I explored the question "Can I get my characters into Unicode?", but Ivan Petrov was also asking what could be done about code page 1251, which is also missing those 20 Cyrillic characters.
Unfortunately, there is nothing that can be done with it, for several reasons.
First, as Mike Dimmick tried to point out in a comment to that post (moderated to avoid spoilers, sorry Mike!), code page 1251 has only one free slot, and there is really no way to add 20 characters to it. This of course makes it impossible on its face to update cp1251.
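For anyone who wants to check the "one free slot" claim themselves, here is a minimal sketch. It assumes that Python's bundled `cp1251` codec matches the Windows table closely enough for this purpose; the helper name `free_slots` is made up for illustration.

```python
def free_slots(codepage):
    """Return the byte values that a single-byte code page leaves unassigned."""
    free = []
    for value in range(256):
        try:
            # Decoding fails only when the table has no character at this byte.
            bytes([value]).decode(codepage)
        except UnicodeDecodeError:
            free.append(value)
    return free


if __name__ == "__main__":
    # Print the unassigned bytes of cp1251 in hex.
    print([hex(b) for b in free_slots("cp1251")])
```

With Python's table this should report a single unassigned byte, which is nowhere near the 20 slots that would be needed.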
Second, as a matter of policy, Microsoft does not update the so-called ANSI[1] code pages. Ever. We can't. We have tried, twice:
- Some time between Windows NT 4.0 and 2000 (also between Windows 98 and Me), some but not all of the code points required for Farsi were added to cp1256 (there was not room for all of them).
- In the Windows 2000 timeframe, almost all of the ANSI code pages were updated to include the Euro[2].
We are still dealing with the fallout of both of those changes, and have promised many interested parties both inside and outside of Microsoft that we would not make the same mistake again. Changing a code page affects persistence formats, application compatibility, and platform/cross-platform compatibility.
The Microsoft ANSI code pages are weird anyway. They are not an ANSI standard, and most of them are modelled after ISO-8859 code pages. The main difference is that the C1 area (0x80 to 0x9F, which ISO-8859 reserves for control codes that are also seen in Unicode) is used for characters. On the positive side this makes them more honest-to-goodness useful; on the negative side the data is often mistaken for the analogous ISO-8859 code page and Microsoft gets to be called evil for messing up standards.
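The C1 difference is easy to see by decoding the same bytes both ways. This is only a sketch using Python's `cp1252` and `latin-1` (ISO-8859-1) codecs as the example pair; the function name `decode_both` is hypothetical.

```python
import unicodedata


def decode_both(raw):
    """Decode the same bytes as Windows-1252 and as ISO-8859-1."""
    windows_text = raw.decode("cp1252")
    iso_text = raw.decode("latin-1")
    return windows_text, iso_text


if __name__ == "__main__":
    # 0x93 and 0x94 sit in the C1 area.
    windows_text, iso_text = decode_both(b"\x93quoted\x94")
    for label, text in (("cp1252", windows_text), ("latin-1", iso_text)):
        first = text[0]
        # Category "Cc" is a control code; "Pi" is an opening quotation mark.
        print(label, "U+%04X" % ord(first), unicodedata.category(first))
```

The Windows code page turns those two bytes into curly quotation marks, while the ISO reading yields control codes, which is exactly how data ends up mislabeled.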
The simple fact is that for many languages, 8 bits are really not enough.
Dr. International was talking about it back in August of 2000, going so far as to suggest that in many cases the GetLocaleInfo/LOCALE_IDEFAULTANSICODEPAGE of a locale was more of a "best fit mapping" which may not contain all of the characters a language needs. In poker terms, I can see Ivan's concerns about cp1251 and Bulgarian, and raise him the doctor's examples for cp1256 (inadequate for Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, and Urdu, two of which are supported now, some more of which will come later).
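A rough way to test whether a code page really covers a piece of text is to try encoding each character and collect the failures. The sketch below assumes Python's `cp1256` codec reflects the Windows table; `missing_characters` is an illustrative name, and the Farsi sample word is my own choice rather than one of Dr. International's examples.

```python
def missing_characters(text, codepage):
    """List the distinct characters of text that the code page cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # The Farsi word for "Farsi", ending in U+06CC ARABIC LETTER FARSI YEH.
    sample = "\u0641\u0627\u0631\u0633\u06cc"
    for ch in missing_characters(sample, "cp1256"):
        print("not in cp1256: U+%04X" % ord(ch))
```

The same helper works for any other language and code page pair, for example Bulgarian text against `cp1251`.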
There are four different ways that this has traditionally been solved:
- The "Microsoft" method, which I mentioned above, where we added code points to fill in the unallocated spaces. We have abandoned this approach.
- The "ISO" method, by which I am referring to the ISO-8859 series, where a new code page is issued as an update whenever characters have to be added. Obviously this causes interoperability problems galore since only some people pick up the updates.
- The "IBM" method, by which I am referring to the original DOS "OEM" code pages and also the EBCDIC series. The solution is similar to the ISO method but much more free about issuing new code pages. As far as I know IBM is not doing this anymore, though I could be mistaken (I do know that Microsoft is not picking up new OEM/EBCDIC code pages).
- The "Unicode" method, by which I am referring to getting out of using the non-Unicode code pages.
This fourth method is the one Microsoft uses now (after seeing from our own experience and that of others how bad the other three can be).
Anyone is free to limp along as best they can without Unicode (easy for some languages, not so easy for others), or they can move to Unicode and see their language supported as well as the current definitions allow (which is actually quite a long way!).
Then there is an additional question, about changing the value of the code page that is used by a locale, i.e. changing the OEMCP value returned by GetLocaleInfo/LOCALE_IDEFAULTCODEPAGE so that a default system locale setting will have updated behavior. Doing this would cause any file previously saved from the console or in the OEM code page to be corrupted, and there is no possible way that the benefit to any language can outweigh the pain of data corruption. The answer here is also definitely Unicode.
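To make the corruption concrete, here is a small sketch of what happens when bytes written under one OEM code page are read back under another. It assumes cp866 as the code page the text was saved in and uses cp855 purely as a stand-in for a hypothetical "updated" OEM setting; neither choice is taken from the actual Windows locale data.

```python
def reread_under(text, saved_as, reopened_as):
    """Encode text with one code page, then decode the bytes with another."""
    stored = text.encode(saved_as)
    # errors="replace" keeps the demo from raising if a byte is unassigned.
    return stored.decode(reopened_as, errors="replace")


if __name__ == "__main__":
    original = "\u0411\u044a\u043b\u0433\u0430\u0440\u0438\u044f"  # "Bulgaria" in Bulgarian
    print("saved and reopened with cp866:", reread_under(original, "cp866", "cp866"))
    print("saved as cp866, reopened as cp855:", reread_under(original, "cp866", "cp855"))
```

The bytes on disk never change; only their interpretation does, which is why every previously saved file would suddenly read as garbage.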
The final related question that Ivan raised has to do with the Bulgarian MIK OEM code page, which is one we cannot add to Windows and, even if it were there, could not switch Bulgarian over to. The time has come to move to Unicode, especially if you are using a language that needs it. Bulgarian is in the same spot as Urdu and about another 600+ languages for whom 8 bits are insufficient.
1 - Raymond Chen discusses why this code page is misnamed as "ANSI" in his post Why is the default 8-bit codepage called "ANSI"? in May 2004. I will probably add to it one day, as there is more to tell....
2 - If you know which one it is without looking at the Windows code pages, I would be impressed, but since I have no way of knowing whether you looked I will have to stay unimpressed today.
This post sponsored by "?" (U+003F, a.k.a. QUESTION MARK)
The character that appears for almost all code pages when you try to convert from Unicode into them and the character does not exist....
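A quick sketch of the sponsor in action, using Python's codecs with `errors="replace"` as a rough analogue of the default-character behavior (it is not the Windows conversion API itself). The sample letter U+045D is my assumption of a Bulgarian character that cp1251 lacks; the function name `to_codepage` is made up.

```python
def to_codepage(text, codepage):
    """Convert Unicode text to a code page, substituting '?' where no mapping exists."""
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    # U+0438 CYRILLIC SMALL LETTER I is in cp1251; U+045D (I WITH GRAVE) is not.
    print(to_codepage("\u0438", "cp1251"))
    print(to_codepage("\u045d", "cp1251"))
```

Once the question mark has been written there is no way to recover the original character from the bytes.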