If the shoe [best-]fits....

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


When you call the WideCharToMultiByte API with almost all code pages¹, the number of characters that can be represented on the target code page is always going to be smaller than what Unicode can represent. When a character has no exact mapping on the target code page, one of two things happens:

  1. If you did not pass the WC_NO_BEST_FIT_CHARS flag and there is a "best fit" mapping, then the best fit mapping will happen.
  2. If you did pass the WC_NO_BEST_FIT_CHARS flag or if there is no "best fit" mapping, then the default character will be placed in the target.
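The two cases above can be modeled in a few lines. This is only a sketch of the decision logic, not a call to the real API: `to_code_page`, its parameters, and the two-entry `BEST_FIT_1250` table are hypothetical names made up for illustration (the two entries are the Å and Æ examples mentioned at the end of this post), and Python's `cp1250` codec stands in for the exact mappings.

```python
# Illustrative model only: this is NOT WideCharToMultiByte, just its decision logic.
# BEST_FIT_1250 is a hypothetical two-entry table built from examples in this post.
BEST_FIT_1250 = {"\u00c5": "A", "\u00c6": "A"}


def to_code_page(text, codec="cp1250", best_fit=BEST_FIT_1250,
                 no_best_fit_chars=False, default_char="?"):
    """Convert text one character at a time, mimicking the two fallback cases."""
    out = bytearray()
    for ch in text:
        try:
            # An exact mapping exists on the target code page.
            out += ch.encode(codec)
        except UnicodeEncodeError:
            if not no_best_fit_chars and ch in best_fit:
                # Case 1: flag not passed and a best fit mapping exists.
                out += best_fit[ch].encode(codec)
            else:
                # Case 2: flag passed, or no best fit mapping exists.
                out += default_char.encode(codec)
    return bytes(out)


print(to_code_page("\u00c51\u00c6"))                          # b'A1A'
print(to_code_page("\u00c51\u00c6", no_best_fit_chars=True))  # b'?1?'
```

Either way the original characters cannot be recovered from the output.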

But what is a best fit mapping?

Well, there is really little more than a warning in the Platform SDK:

For strings that require validation, such as file, resource and user names, always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme e.g., symbol for ‘∞’ (infinity) maps to 8 (eight) in some code pages.
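Here is a rough sketch of why that warning matters when a string is validated before it is converted. Everything in it is an assumption for illustration: the `BEST_FIT` table holds only the infinity-to-eight mapping that the SDK text attributes to "some code pages", `convert` is a toy stand-in for the real conversion, and the "no digits allowed" rule in `is_valid_name` is made up.

```python
# Hypothetical best fit table: the SDK text only says that the infinity sign
# maps to 8 "in some code pages", so no particular code page is implied here.
BEST_FIT = {"\u221e": "8"}


def convert(text, no_best_fit_chars):
    """Toy conversion: ASCII passes through, everything else falls back."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif not no_best_fit_chars and ch in BEST_FIT:
            out.append(BEST_FIT[ch])   # looks similar, means something else
        else:
            out.append("?")            # default character
    return "".join(out)


def is_valid_name(name):
    """Hypothetical validation rule: names must not contain ASCII digits."""
    return not any(ch in "0123456789" for ch in name)


name = "room\u221e"
print(is_valid_name(name))                           # True: no digit yet
print(convert(name, no_best_fit_chars=False))        # room8
print(is_valid_name(convert(name, no_best_fit_chars=False)))  # False
print(convert(name, no_best_fit_chars=True))         # room?
```

The string passes validation and only afterwards turns into something the validator would have rejected, which is the situation the flag is meant to prevent.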

This hints at the extremes to which these best fit mappings can take us (unless you are one of those who feel that the infinity sign is just a hungover digit eight that has fallen and cannot get up -- in which case the mapping only goes in one direction).

What do these two behaviors have in common? Well, in both cases information has been lost -- whether you replace with the wrong character or a question mark, you are always losing a little bit of data. The best fit mappings are also pretty uneven. Here are some quick approximate counts:

  •  874 -- 138 characters
  • 1250 -- 437 characters
  • 1251 -- 384 characters
  • 1252 -- 442 characters
  • 1253 -- 366 characters
  • 1254 -- 438 characters
  • 1255 --  96 characters
  • 1256 -- 288 characters
  • 1257 --  94 characters
  • 1258 --  94 characters

The above just notes that for single-byte code pages the MBTABLE always has 256 entries and the WCTABLE has more. The DBCS code pages are a bit tougher to do since they are designed differently. But I think the above shows that the actual number of best-fit entries varies from code page to code page.

Some of the entries even make sense (e.g. 1256 does not have Arabic digits in it, so those digits are best-fit mapped to ASCII 0 to 9 -- beats question marks any day!).
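The idea behind that digit mapping can be sketched with the Unicode decimal digit property. This is not how the code page tables are actually built, and `best_fit_digits` is a hypothetical helper; it just shows that every Arabic-Indic digit has an obvious ASCII counterpart with the same numeric value.

```python
import unicodedata


def best_fit_digits(text):
    """Replace non-ASCII decimal digits with the ASCII digit of the same value."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        if value is not None and ord(ch) >= 128:
            out.append(str(value))
        else:
            out.append(ch)
    return "".join(out)


# U+0661, U+0662, U+0663 are ARABIC-INDIC DIGIT ONE, TWO and THREE.
print(best_fit_digits("\u0661\u0662\u0663"))  # 123
```

Unlike the infinity case, nothing about the meaning of the number changes here, which is why this kind of entry is easy to defend.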

Other entries are just funny (like the infinity turns to eight thing -- the next time someone tells you it's just an eight-hour work day you will know why it seems to take forever!).

But most fall in between -- arguably better than nothing.

Perhaps that is what they should have been called (rather than "best fit" mappings) -- "better than nothing fit" mappings. Seems like the real mappings are the ones that are the best fit. :-)

 

1 - Pretty much all of them other than UTF-7, UTF-8, and GB18030, in fact.

 

This post sponsored by "Å" and "Æ" (U+00c5 and U+00c6, a.k.a. LATIN CAPITAL LETTER A WITH RING ABOVE and LATIN CAPITAL LETTER AE)
Both of which "better than nothing fit" map to U+0041 (LATIN CAPITAL LETTER A) on code page 1250!

Blog - Comment List
  • I have talked a bunch of times about the way that different forms of strings that are canonically equivalent...
  • Some people may recall when I talked about how It does not always pay to be compatible. In that post...
  • William Hooper pointed me to an interesting bug report: Help Strings in .NET not always transisitve!!!!!!

  • Recently while paying attention to The Unicode List I was once again reminded why I don't pay more attention

  • I guess that explains this: http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (in short, non-quote characters can be best-fit into a quote character, post-validation, leading to SQL Injection in some situations).

    Apparently SQL Server calls WideCharToMultiByte with the WC_NO_BEST_FIT_CHARS set, when it shouldn't be. I was curious as to how this occurs, now I know :-). (If you can convince MSRC that this should be changed, I'd be thrilled)

  • Functions like GetShortPathName have been around for a long time.

    Too long, if you ask me.

    Because
