Disclaimer: This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me. I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages.
Taking a snapshot of the current Windows code pages (or any other code pages), one can wonder how some of them ended up in their current state, along with related oddities such as the peculiar behavior of certain function calls.
It is important to remember that modern computer systems evolved from earlier systems and “we”, as in the entire computer science community on the planet, have learned a lot since the beginnings of computer science. Rarely do we get the chance to start with “a clean slate” and redesign APIs or systems. Even when we do, we only have our best intentions and previous lessons to learn from, and sometimes those new designs prove to have weaknesses that weren’t originally seen.
In the DOS days of PC history, “code pages” were the byte values used to print directly to the console. Apple, Commodore, IBM, and probably many others mapped bytes to characters on the console. (Before that there were the values that showed up on Teletypes or punch cards, but I’m focusing mostly on Windows history.) The US and other “western” cultures had a great influence on the development of early PCs, and the ASCII standard was very common. Many later behaviors were based on ASCII or similar work.
ASCII only specified 7 bits of information, but since PCs used 8-bit bytes, most manufacturers extended the code pages to provide additional glyphs, such as diacritics or additional scripts (besides Latin). This provided the ability to represent many languages, but at a hidden cost to data portability. Since most data was confined to single companies, and global exchange of data wasn’t a primary concern, this wasn’t a big problem at first.
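As an illustration, the same extended byte decodes to different characters depending on which code page interprets it. This is a sketch using Python’s codec names (the names are this example’s assumption; the underlying mappings are the historical ones):

```python
# One byte in the "extended" 0x80-0xFF range that ASCII never defined.
raw = bytes([0x82])

print(raw.decode("cp437"))   # 'é' on the original IBM PC code page
print(raw.decode("cp1252"))  # '‚' on the Windows western code page
print(raw.decode("cp850"))   # 'é' on the DOS multilingual code page

# The low 7 bits are shared, but this byte isn't valid 7-bit ASCII at all.
print(raw.decode("ascii", errors="replace"))  # replacement character
```

The portability cost is visible right there: the byte stream alone doesn’t tell you which of those interpretations the author intended.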
Additionally, since these bytes were used to render glyphs on the screen, it seemed wasteful to ignore the non-printable control codes from 0x01–0x1F, so smiley faces, hearts, spades and the like were added.
As computing evolved users wanted more glyphs and several techniques evolved to solve that problem.
For CJK (Chinese, Japanese & Korean) scripts, 8 bits (256 code points) aren’t nearly enough. CJK code pages are usually still ASCII compatible, but they evolved other techniques for the additional characters, typically multi-byte encodings where certain “lead” byte values signal that the following byte belongs to the same character.
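A minimal sketch of that lead-byte technique, using Shift-JIS (Python codec `"shift_jis"`) as one example of such a code page: ASCII characters stay one byte each, while each kanji occupies two bytes.

```python
# Shift-JIS keeps the ASCII range single-byte, but uses high "lead"
# bytes to introduce two-byte sequences for kanji and kana.
text = "abc漢字"          # 3 ASCII characters + 2 kanji
encoded = text.encode("shift_jis")

print(len(text))     # 5 characters
print(len(encoded))  # 7 bytes: 3 single-byte + 2 two-byte characters

# The encoding is still ASCII compatible at the front of the stream.
print(encoded.startswith(b"abc"))  # True
```

A decoder has to track lead bytes statefully; naive byte-at-a-time processing (searching, truncating) can split a character in half, which is exactly the class of bug these code pages made possible.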
In addition to the evolution of techniques, the repertoire of supported characters has been evolving. Unfortunately the drivers of this process are rarely coordinated across the industry. As a need for a new character becomes apparent, organizations add it to the standards that they influence or control, but this doesn’t guarantee adoption across the industry, particularly if they don’t coordinate with other standards or organizations.
This repertoire evolution can cause the behavior of code pages to evolve as well. For example, the Euro was invented well after the creation of ASCII and many other code pages. Obviously it was needed, so it was added to most code pages, squeezed into unused spaces where possible. For single-byte code pages, that could mean replacing a previously rarely used code point. Of course, if a vendor had used that rarely used code point for something special in their application, this caused behavioral changes.
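The Euro is a concrete case of both strategies. Windows-1252 tucked it into an unused slot (0x80), while ISO-8859-15 displaced the rarely used generic currency sign ¤ that ISO-8859-1 had at 0xA4:

```python
# Windows-1252 placed the Euro in a previously unused C1 slot.
assert b"\x80".decode("cp1252") == "€"

# ISO-8859-15 replaced the rarely used currency sign at 0xA4...
assert b"\xa4".decode("iso8859_15") == "€"

# ...which ISO-8859-1 still decodes the old way.
assert b"\xa4".decode("iso8859_1") == "¤"
```

Any application that had stored ¤ (or repurposed that code point) saw its data silently change meaning when reinterpreted under the newer standard.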
For other standards the repertoire evolution has meant evolving iterations of the standard. Several organizations add characters to their standards, but it can take a while for those to make it to the font vendor or other level necessary for complete support. Shifting standards can also change existing user data or private use behavior, so supporting new standards isn’t always a trivial undertaking.
Some character sets have been complicated by standards dependencies. For example, if a desirable standard assigns a bunch of characters and users want Windows support, then Windows has to find space in Unicode, since Windows is Unicode based. In the best case the desired characters are already assigned in Unicode, so Windows can “just” add font support (not necessarily trivial) and is good to go. Historically, however, characters are usually created by some other authority and may take a while to get official Unicode support. In those cases the characters can remain unsupported, or someone can add Private Use Area (PUA) characters to support them until Unicode does.
If PUA characters are used to temporarily support additional characters, then there are additional problems when those characters are eventually added to Unicode, since existing data will need to be migrated from the PUA to the actual Unicode code point. Migration may also be complicated by the fact that not all users can upgrade at the same time.
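A hypothetical sketch of what that migration looks like. The specific code points here are this example’s assumptions: U+E000 stands in for wherever a vendor temporarily parked the glyph, and U+20B9 (the Indian rupee sign, which really did go through a gap between its 2010 introduction and widespread Unicode/font support) stands in for the eventual official assignment.

```python
INTERIM_PUA = "\ue000"  # hypothetical Private Use Area slot a vendor chose
OFFICIAL = "\u20b9"     # the code point Unicode later assigned (₹)

def migrate(text: str) -> str:
    """Rewrite legacy data to use the official code point."""
    return text.replace(INTERIM_PUA, OFFICIAL)

legacy = "Price: \ue000 500"
print(migrate(legacy))  # "Price: ₹ 500"
```

The hard part isn’t the string replacement; it’s that documents, fonts, and peers on older systems keep producing and expecting the PUA value, so the two representations coexist for years.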
Another problem impacting the way code pages behave is how (and when) they’ve been implemented. Occasionally standards have had errors that were corrected in later versions. Other times a platform vendor may have interpreted the behavior in an unexpected way. Sometimes a font vendor for a common font could make an error with a code point. Additionally users may commonly confuse a glyph with a similar glyph and abuse the existing standard.
All of these contribute to variations in the way code page data is handled. Once data is coded in a particular way, correcting the data may be complicated. It can be easy to identify an implementation bug and find the “correct” solution, but making the fix can break existing behavior or data portability.
For historical reasons there are also some oddities in encoded data. Remember that code points were often merely glyphs on the computer screen? And those glyphs depended on the rendering of that machine? Well, DOS used the \ character to delimit folders on the file system. CJK users, however, wanted to be able to type their currency symbol on their machines. Since \ isn’t used very often in ordinary text, it got replaced with the appropriate currency symbol on those machines (¥ on Japanese systems, for example). Internally it was always 0x5C, however, and the machine always used that byte value to delimit folders. The end result is a mess where 0x5C doesn’t convert to Unicode very well, users have different file system delimiter characters, and fonts end up hacked to render ¥ instead of \ if you have a certain system code page. This is obviously undesirable, yet it is pretty obvious how this happened and pretty difficult to “fix” at this time.
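You can see the Unicode side of this ambiguity with Python’s cp932 codec (Windows code page 932, the Shift-JIS-based Japanese code page); using that codec as the example is this sketch’s assumption. The codec maps 0x5C to U+005C backslash, so the yen rendering on Japanese systems is purely a font-level hack over the same byte:

```python
# On disk, path separators are the byte 0x5C regardless of what the
# screen shows. cp932 decodes that byte to U+005C (backslash)...
path_bytes = b"C:\\Users"
print(path_bytes.decode("cp932"))       # 'C:\Users'
assert b"\x5c".decode("cp932") == "\\"

# ...while the yen sign users *see* on Japanese systems is a separate
# Unicode character, U+00A5, leaving two plausible interpretations of
# the same byte once data crosses machine boundaries.
print("\u00a5")  # ¥
```

So the byte stream, the Unicode conversion, and the pixels on screen can all disagree about whether a given 0x5C “is” a backslash or a yen sign, which is exactly why this one is so hard to fix.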
I find it helpful to remember this stuff when confronted with another code page oddity. One of my goals is to avoid adding further complexity to this evolutionary tree of code pages. It is often “clear” what the desired or proper behavior should be when you consider only current standards, or when you know the lessons we’ve since learned. With a living system of data, it isn’t always possible to get from the current state to the perfect state in a simple manner without causing pain to some users. In that case I try to limit the long-term pain and confine the problems to as few users as possible.
Similar examples exist for nearly all API sets and programming languages, OS’s and techniques. The global “we” of computer science has learned a lot and continues to learn a lot, but sometimes it’s helpful to remember how an API may have evolved when it doesn’t seem to be doing the most appropriate thing.
For code pages this is a good reason to use Unicode. Windows is natively Unicode and most other systems understand it. It is also reasonably unambiguous, although it does have its own evolutionary quirks. By focusing on a single encoding (Unicode), we can reduce the complexities caused by natural variations introduced as encodings evolve.
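The payoff is easiest to see with mixed-script text, which is where byte-oriented code pages break down entirely. A small sketch (codec names are Python’s):

```python
# A string mixing Latin, Cyrillic, and kanji round-trips through
# UTF-8 without loss or ambiguity.
text = "café Привет 漢字"
assert text.encode("utf-8").decode("utf-8") == text

# A single-byte legacy code page simply can't hold it all.
try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("cp1252 cannot represent this string:", err)
```

With Unicode there is one repertoire and one answer for each character, instead of a per-code-page negotiation about which 256 (or few thousand) characters survive.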
Reminder: This is mostly my conjecture and seems reasonable to me, although it might be wrong or lack specifics.