People wonder if we're going to update our best fit code page mappings, or even our code page mappings.  The answer is no.  Changing character mappings causes difficulties for applications, and our experience has been that doing so breaks as much as it "fixes".  We'd prefer that applications move to Unicode; then you don't have to worry about best fit, or about whether a character is supported.

Best fit behavior is the tendency of some code pages to map Unicode characters they don’t support to a supported character that someone thought was similar.  Examples would be mapping ｋ (U+FF4B, fullwidth k) to k, ĩ (U+0129, Latin small letter i with tilde) to i, or ∞ (U+221E, infinity symbol) to 8.  Some of these seem reasonable; however, we aren’t consistent in our mappings, most break the meaning, and some mappings (∞->8) change the meaning completely.
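
As a rough illustration of the alternative to best fit, here is a minimal Python sketch that reports which characters a code page cannot represent instead of silently substituting a look-alike.  The function name find_unmappable is made up for this example, and Python's built-in codecs are only a stand-in here: they do not implement the Windows best fit tables, so a strict encode simply fails on the characters discussed above.

```python
def find_unmappable(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent.

    Hypothetical helper for illustration.  Each character is encoded in
    strict mode, so nothing is replaced by a "similar" character.
    """
    unmappable = []
    for ch in text:
        try:
            ch.encode(encoding, errors="strict")
        except UnicodeEncodeError:
            unmappable.append(ch)
    return unmappable


if __name__ == "__main__":
    # Fullwidth k, i with tilde, and infinity are not in Python's cp1252 table;
    # the plain ASCII k is.
    sample = "\uFF4B\u0129\u221Ek"
    for ch in find_unmappable(sample, "cp1252"):
        print(f"U+{ord(ch):04X} cannot be encoded")
```

Reporting the failure lets the caller decide what to do (reject the input, switch to Unicode, ask the user) rather than storing an 8 where an ∞ was intended.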

The best fit mappings were created “a long time ago”, contained “omissions”, and haven’t been updated to include new Unicode characters.  “Newer” code pages don’t necessarily include the same best fit mappings, and, by now, the mappings are fairly inconsistent and incomplete.  So we don’t recommend that the mappings be used, and we don’t intend to change or “fix” the best fit behaviors.

We don’t like to change other code page data either.  “Unassigned” code points can have arbitrary behavior or map to Unicode PUA code points.  Some applications use those code points (perhaps unwisely) as formatting codes or to cause special behavior.  Adding a mapping could break such an application.  Other applications or systems may provide a glyph for an unassigned code point that round trips; however, that might not be the designed intent, and changing the code point behavior could break those applications or fonts.
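
To see which code points an implementation treats as unassigned, something like the following Python sketch may help.  The helper name undefined_bytes is hypothetical, it only makes sense for single-byte code pages, and it reports what Python's codec tables do, which is not necessarily what Windows or any other system does with the same bytes.

```python
def undefined_bytes(encoding):
    """List the byte values that a single-byte codec refuses to decode.

    Hypothetical helper for illustration.  For multi-byte encodings a lone
    lead byte also fails to decode, so the result would be misleading there.
    """
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding, errors="strict")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


if __name__ == "__main__":
    # Python's cp1252 table leaves five byte values undefined.
    print([hex(b) for b in undefined_bytes("cp1252")])
    # Python's latin-1 codec assigns every byte value, so nothing is reported.
    print(undefined_bytes("latin-1"))
```

Data that contains any of the reported bytes depends on behavior that another implementation is free to handle differently, which is exactly the situation described above.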

Code page standards are also sometimes extended, modified, or corrected.  Changing the behavior, however, impacts all applications using that behavior, and our experience is that such changes across the installed Windows code base cause as much trouble as they solve.

So we like to keep the code page mappings stable.  My recommendations for code page use are:

  • Use Unicode unless explicitly required for some standard or protocol (and try to upgrade the standards or protocols to allow Unicode).
  • If you can’t use Unicode, explicitly specify the mapping that is used.  (Some applications or standards presume whatever the OS uses, e.g. the Windows ANSI code page, which causes serious interoperability problems.)
  • Avoid best fit mappings.  At best they cause spelling errors or offend customers.  At worst they can cause security problems.
  • Avoid unassigned code points.  Their behavior is undefined and could cause difficulty if different machines or software interpret them differently.
  • Use care when using the Unicode private use area (PUA).  Its use is private.  If data is persisted in the PUA, then there is a risk that future versions or other machines may not read the data correctly.  Eventually migration of data between different PUA mappings may become necessary, and migrating such data is rarely trivial.  The Hong Kong HKSCS mappings are an example of such a difficulty.
  • Don’t rely on illegal or undefined code page behavior.  Illegal sequences might change between versions or software.  Shift modes that aren’t implemented could be implemented on other machines, etc.
  • Don't presume that illegal or undefined code page behavior will remain stable.
  • Don’t pretend binary data is text in some code page (or Unicode).  Variations in code page mappings could then prevent the data from round tripping, particularly if the binary data happens to contain undefined or illegal code points.
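
A couple of these recommendations can be sketched in Python.  This is only an illustration under assumptions: the helper names save_text, load_text and survives_as_text are invented for the example, and Python's codecs stand in for whatever mapping your platform actually uses.  The first two helpers name the encoding explicitly and use strict error handling instead of presuming the OS default; the last one checks whether arbitrary bytes would survive being treated as text.

```python
import os
import tempfile


def save_text(path, text, encoding):
    # Name the encoding explicitly; strict mode raises instead of substituting.
    with open(path, "w", encoding=encoding, errors="strict", newline="") as f:
        f.write(text)


def load_text(path, encoding):
    # The reader must be told the same encoding the writer used.
    with open(path, "r", encoding=encoding, errors="strict", newline="") as f:
        return f.read()


def survives_as_text(data, encoding):
    """True if `data` decodes and re-encodes to exactly the same bytes."""
    try:
        return data.decode(encoding, errors="strict").encode(encoding) == data
    except UnicodeError:
        return False


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "sample.txt")
    save_text(path, "na\u00EFve \u221E", "utf-8")
    print(load_text(path, "utf-8") == "na\u00EFve \u221E")
    # 0x81 is undefined in Python's cp1252 table, so these bytes are not text.
    print(survives_as_text(b"\x81\x00\xff", "cp1252"))
    os.remove(path)
    os.rmdir(folder)
```

If binary data has to travel through a text channel, a check like survives_as_text failing is a hint to keep the data as bytes rather than hoping a particular code page happens to round trip it.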

Hope that helps.  More posts about common code page concerns are at http://blogs.msdn.com/shawnste/pages/code-pages-unicode-encodings.aspx

Shawn