Can a codepage be changed? How about which codepage a locale points to?

Sorting it all Out
Michael Kaplan's random stuff of dubious value

Earlier today I explored the question Can I get my characters into Unicode? but Ivan Petrov's question was also asking about what could be done about code page 1251, which is also missing these 20 Cyrillic characters.

Unfortunately, there is nothing that can be done with it, for several reasons.

First, as Mike Dimmick tried to point out in a comment to that post (moderated to avoid spoilers, sorry Mike!), code page 1251 has only one free slot, and there is really no way to add 20 characters to it. This of course makes it impossible on its face to update cp1251.
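
The "one free slot" claim can be checked with a minimal sketch like the one below. It assumes that Python's bundled cp1251 codec mirrors the Windows table, and the three sample characters are my own picks for illustration, not Ivan's actual list of 20.

```python
import unicodedata

def free_slots(codec):
    """Return the byte values that the codec leaves undefined."""
    free = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            free.append(b)
    return free

def missing_from(codec, chars):
    """Return the characters in `chars` that the codec cannot encode."""
    missing = []
    for ch in chars:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

# Illustrative sample only (an assumption, not the list from the question):
# two precomposed Cyrillic letters with grave, plus the combining grave accent.
candidates = "\u040D\u045D\u0300"

slots = free_slots("cp1251")
absent = missing_from("cp1251", candidates)

print("undefined bytes:", [hex(b) for b in slots])
for ch in absent:
    print(f"not in cp1251: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same two helpers work for any single-byte codec name Python knows, so the check can be repeated for cp1256 or others.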

Second, as a matter of policy, Microsoft does not update the so-called ANSI[1] code pages. Ever. We can't. We have tried, twice:

  • Some time between Windows NT 4.0 and 2000 (and also between Windows 98 and Me), some but not all of the code points required for Farsi were added to cp1256 (there was not room for all of them).
  • In the Windows 2000 timeframe, almost all of the ANSI code pages were updated to include the Euro[2].

We are still dealing with the fallout of both of those changes, and have promised many interested parties both inside and outside of Microsoft that we would not make the same mistake again. Doing so affects persistence formats, application compatibility, and platform/cross-platform compatibility.

The Microsoft ANSI code pages are weird anyway. They are not an ANSI standard and most of them are modelled after ISO-8859 code pages. The main difference is that the C1 area (in ISO-8859 reserved for control codes that are also seen in Unicode) is used for characters. On the positive side this makes them more honest-to-goodness useful; on the negative side the data is often mistaken for the analogous ISO-8859 code page and Microsoft gets to be called evil for messing up standards.
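
The C1 difference is easy to see with a small sketch. It uses Python's bundled cp1252 and iso8859_1 codecs as stand-ins for the Windows and ISO tables, and the sample bytes are made up for illustration.

```python
import unicodedata

# The C1 area: bytes 0x80 through 0x9F.
c1_bytes = bytes(range(0x80, 0xA0))

def decode_c1(codec):
    """Decode the C1 byte range; undefined bytes become U+FFFD."""
    return c1_bytes.decode(codec, errors="replace")

as_iso = decode_c1("iso8859_1")  # every byte is a control code here
as_win = decode_c1("cp1252")     # most bytes are printable characters here

# Made-up sample: curly quotes and a euro sign as cp1252 bytes.
sample = b"\x93quoted\x94 \x80 5"
right = sample.decode("cp1252")
wrong = sample.decode("iso8859_1")  # the classic mislabelling mistake

print(repr(right))
print(repr(wrong))
print(all(unicodedata.category(c) == "Cc" for c in as_iso))
```

Decoding the same bytes under the ISO label does not fail; it silently produces invisible control codes, which is why the mix-up so often goes unnoticed.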

The simple fact is that for many languages, 8 bits are really not enough.

Dr. International was talking about it back in August of 2000, going so far as to suggest that in many cases the GetLocaleInfo/LOCALE_IDEFAULTANSICODEPAGE of a locale was more of a "best fit mapping" which may not contain all of the characters a language needs. In poker terms, I can see Ivan's concerns about cp1251 and Bulgarian, and raise him the doctor's examples for cp1256 (inadequate for Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, and Urdu, two of which are supported now, some more of which will come later).

There are four different ways that this has traditionally been solved:

  • The "Microsoft" method, which I mentioned above where we added code points to fill in the unallocated spaces. We have abandoned this approach.
  • The "ISO" method, by which I am referring to the ISO-8859 series where, if characters had to be added, a new code page would be issued as an update. Obviously this causes interoperability problems galore since only some people pick up the updates.
  • The "IBM" method, by which I am referring to the original DOS "OEM" code pages and also the EBCDIC series. The solution is similar to the ISO method but much more free about issuing new code pages. As far as I know IBM is not doing this anymore, though I could be mistaken (I do know that Microsoft is not picking up new OEM/EBCDIC code pages).
  • The "Unicode" method, by which I am referring to getting out of using the non-Unicode code pages.

This fourth method is the one Microsoft uses now (after seeing from our own experience and that of others how bad the other three can be).

Anyone is free to limp along as best they can without Unicode (easy for some languages, not so easy for others), or they can move to Unicode and see their language supported as well as the current definitions allow (which is actually quite a long way!).

Then there is an additional question, about changing the value of the code page that is used by a locale, i.e. changing the OEMCP value returned by GetLocaleInfo/LOCALE_IDEFAULTCODEPAGE so that a default system locale setting will have updated behavior. Doing this would cause any file previously saved from the console or in the OEM code page to be corrupted, and there is no possible way that the benefit to any language can outweigh the pain of data corruption. The answer here is also definitely Unicode.
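
A small sketch of the corruption in question: bytes written under one OEM code page and read back under another. The pairing of cp866 as the "old" code page and cp855 as the "new" one is purely an assumption for illustration, using Python's bundled codecs.

```python
text = "Здравей"

# A file saved from the console under the old OEM code page.
stored = text.encode("cp866")

# Read back with the code page it was written in: intact.
same_cp = stored.decode("cp866")

# Read back after the locale's OEM code page has been switched
# (hypothetical switch to cp855): the bytes are unchanged, the text is not.
other_cp = stored.decode("cp855", errors="replace")

print(same_cp)
print(other_cp)
print(same_cp == text, other_cp == text)
```

Nothing in the stored bytes records which code page produced them, so there is no way for the reader to detect the switch after the fact.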

The final related question that Ivan raised has to do with the Bulgarian MIK OEM Codepage, which is one we cannot add to Windows and, even if it were there, we could not switch Bulgarian over to use it. The time has come to move to Unicode, especially if you are using a language that needs it. Bulgarian is in the same spot as Urdu and about another 600+ languages for whom 8 bits are insufficient.
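
For existing MIK data, moving to Unicode is a one-time conversion per file. The sketch below is one possible shape for such a tool, under stated assumptions: the table is deliberately partial (ASCII plus the Cyrillic letters, which MIK is commonly documented as placing in alphabetical order at 0x80 through 0xBF), and the remaining bytes would have to be filled in from an authoritative MIK table. The function names and the 0xF0 entry are hypothetical.

```python
def build_partial_mik_table():
    """Build a partial byte -> string table.

    Values are strings rather than single characters so that one byte
    can expand to more than one Unicode code point if needed.
    """
    table = {b: chr(b) for b in range(0x80)}  # ASCII passes through
    # Assumption: 0x80..0xBF hold U+0410..U+044F in order.
    for i in range(0x40):
        table[0x80 + i] = chr(0x0410 + i)
    return table

def convert(data, table, unknown="\ufffd"):
    """Map each byte through the table; unmapped bytes become `unknown`."""
    return "".join(table.get(b, unknown) for b in data)

table = build_partial_mik_table()

# Hypothetical entry, NOT part of MIK: shown only to demonstrate one byte
# expanding to a base letter plus a combining grave accent.
table[0xF0] = "\u0438\u0300"

legacy = bytes([0x87, 0xA4, 0xB0, 0xA0, 0xA2, 0xA5, 0xA9])
print(convert(legacy, table))
print(len(convert(b"\xf0", table)))
```

Keeping the table as plain data makes it easy to review against a reference chart before running it over real files.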

1 - Raymond Chen discusses why this code page is misnamed as "ANSI" in his post Why is the default 8-bit codepage called "ANSI"? in May 2004. I will probably add to it one day, as there is more to tell....
2 - If you know which one it is without looking at the Windows code pages, I would be impressed, but since I have no way of knowing whether you looked I will have to stay unimpressed today.

This post sponsored by "?" (U+003F, a.k.a. QUESTION MARK)
The character that appears for almost all code pages when you try to convert from Unicode into them and the character does not exist....
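
The question mark substitution can be reproduced with a short sketch. This shows Python's codec behavior with errors="replace", which I am assuming is analogous to (not identical with) the default-character substitution done on Windows; the sample string is made up.

```python
# U+045D is not in cp1251; the other characters are.
text = "\u045D \u0438 \u20ac"

# Lossy conversion: anything the code page lacks becomes b"?".
lossy = text.encode("cp1251", errors="replace")

# Strict conversion reports the problem instead of hiding it.
try:
    text.encode("cp1251")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(lossy)
print(strict_failed)
```

Once the "?" is written out, the original character cannot be recovered from the bytes, so a strict conversion is the safer default when data matters.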

Blog - Comment List

  • Every time I get some more detail on how things adapted to the euro symbol, I get more annoyed about how stupid they were to invent a whole new symbol. Nobody with a US keyboard can type it; old printers can't handle it; it's just a disaster. What were our EU masters thinking?

  • Do you know if Microsoft ever entertained the idea of creating a UTF-8 codepage, providing Unicode support in the way most Un*x do? I know that there is a codepage code for conversions with MultiByteToWideChar, et al, but is this a real codepage that the system can be set to?
  • Thought? Perhaps... but facts trump thought here every time. It is not possible given both the current architecture (which must work in both user and kernel mode) and also the inherent assumption in several subsystems (like USER) that the ACP's maximum number of bytes per character is 2.

    It is before my time but people have led me to believe it was considered for the "Unicode only" locales until it became clear that it wasn't really possible to change that much legacy....
  • Hi again Michael ;-).

    First of all I want to tell you that I'm very satisfied with your answers!
    I think I clearly understood you about the 'It's UNICODE time!' tendency, but at that moment two simple questions arose for me:

    1) What to do with the tons of OEM-encoded text files and documents, strings in compiled command prompt (DOS) programs and utilities, etc., ESPECIALLY those encoded with OEM codepages not supported by Windows (such as the Bulgarian MIK OEM Codepage), to just read them correctly using UNICODE?

    2) How to type characters like 'CYRILLIC CAPITAL LETTER A WITH GRAVE' in applications like Word, for example, as they are supported in UNICODE but not supported in codepages like 1251?

    Thank You in advance.

    Regards,
    Ivan.
  • Hi Ivan! Regarding your questions --

    #1 -- If the code page exists then it is certainly outside the bounds of Unicode today -- so someone needs to do work to convert them to Unicode. As it is obviously a one-time operation per data file, a permanent mapping is likely not the best solution here. But a one-time tool that mapped each byte to the appropriate Unicode code point or code points would be best (note that a code page would not work anyway since there is no good way to map one byte to two Unicode characters with MultiByteToWideChar).

    #2 -- A ligature can easily be authored in MSKLC for any such character that is needed (dead keys would not work here).

  • "The 'IBM" method"

    IBM's codepages are much more precise than MS's codepages. For example, when MS added the euro sign to codepage 1252, IBM issued a new codepage instead.

  • Um, that is what I said:

    "The solution is similiar to the ISO method but much more free about issuing new code pages."
