Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
As I mentioned yesterday, I have talked a bunch of times about how strings can exist in the world in different forms that are canonically equivalent according to Unicode and that actually look identical visually.
Yesterday, I mentioned it while I was talking about a few of the gotchas of WideCharToMultiByte. Today I thought I would talk about the other direction, the MultiByteToWideChar API.
First of all, almost all code pages are in Normalization Form C (a.k.a. precomposed) at all times (I will talk about the exceptions in a second). Of course Unicode (by which I mean UTF-16 Little Endian, which Microsoft always calls Unicode) can be either Form C (a.k.a. precomposed) or Form D (a.k.a. composite).
If you would like to choose, then you get that option; you can pass either the MB_PRECOMPOSED or MB_COMPOSITE flags. For the reasons of having data that is consistent with the rest of the platform, I would recommend the MB_PRECOMPOSED flag, but either one is legal (just not both).
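To make the two forms concrete, here is a small Python sketch. It uses the standard unicodedata module rather than the Win32 API, on the assumption that NFC and NFD are a fair stand-in for what MB_PRECOMPOSED and MB_COMPOSITE are aiming at. The helper name code_points is made up for illustration.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

# The same visible letter, once in each normalization form.
precomposed = unicodedata.normalize("NFC", "A\u0300")  # Form C
composite = unicodedata.normalize("NFD", "\u00C0")     # Form D

print(code_points(precomposed))  # ['U+00C0']
print(code_points(composite))    # ['U+0041', 'U+0300']
```

Both strings render as the same letter, which is exactly why it pays to pick one form and stick with it.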
There is also an MB_USEGLYPHCHARS flag. Now I already beat that particular horse to death when I answered the question what the &%#$ does MB_USEGLYPHCHARS do? So if you want to know more you can look there. You probably do not, at least I hope you do not....
Finally, there is the MB_ERR_INVALID_CHARS flag. The documentation says it all on this flag:
If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.
Now after the MultiByteToWideChar topic covers these four flags, it gets confusing. It says:
For the code pages in the following table, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS.
50220
50221
50222
50225
50227
50229
52936
54936
57002 through 57011
65000 (UTF7)
65001 (UTF8)
42 (Symbol)
Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by Code page 65001 (UTF-8).
Call me crazy, but there probably was not a need to have the sentence before the table and the table itself conflict with the sentence after the table. It is kind of understandable, but as topics go it has the flavor of a WTF sentence, if you ask me!
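One way to read the quoted rules is sketched below in Python. The flag values are the ones from the Windows headers. The function flags_look_valid is a hypothetical pre-check of my own, not anything the API exposes, and it assumes the "Windows XP and later" note is meant as an exception to the table for code page 65001.

```python
MB_PRECOMPOSED = 0x00000001
MB_COMPOSITE = 0x00000002
MB_USEGLYPHCHARS = 0x00000004
MB_ERR_INVALID_CHARS = 0x00000008

# Code pages from the documentation table quoted above.
ZERO_FLAG_CODE_PAGES = {50220, 50221, 50222, 50225, 50227, 50229,
                        52936, 54936, 65000, 65001, 42}
ZERO_FLAG_CODE_PAGES.update(range(57002, 57012))  # 57002 through 57011

def flags_look_valid(code_page, flags):
    """Hypothetical pre-check mirroring the quoted rules (not the API's own logic)."""
    if flags & MB_PRECOMPOSED and flags & MB_COMPOSITE:
        return False  # either one is legal, just not both
    if code_page == 65001:
        # Assumption: the XP-and-later note overrides the table for UTF-8.
        return flags in (0, MB_ERR_INVALID_CHARS)
    if code_page in ZERO_FLAG_CODE_PAGES:
        return flags == 0
    return True

print(flags_look_valid(1258, MB_PRECOMPOSED))         # True
print(flags_look_valid(54936, MB_PRECOMPOSED))        # False
print(flags_look_valid(65001, MB_ERR_INVALID_CHARS))  # True
```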
It does end on a better note by defining what an invalid character is:
The function fails if MB_ERR_INVALID_CHARS is set and encounters an invalid character in the source string. An invalid character is either, a) a character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or b) for DBCS strings, a character which has a lead byte but no valid trailing byte. When an invalid character is found, and MB_ERR_INVALID_CHARS is set, the function returns 0 and sets GetLastError with the error ERROR_NO_UNICODE_TRANSLATION.
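Case (b) is easy to see outside of Windows, too. The sketch below uses Python's own cp932 codec, with strict error handling playing the role of MB_ERR_INVALID_CHARS. That pairing is only an analogy on my part, and decode_strict is a made-up helper name.

```python
def decode_strict(data, codec):
    """Return the decoded text, or None if the bytes contain an invalid sequence."""
    try:
        return data.decode(codec, errors="strict")
    except UnicodeDecodeError:
        return None

# 0x81 is a cp932 lead byte; here nothing follows it.
truncated = b"abc\x81"

print(decode_strict(truncated, "cp932"))                  # None
print(ascii(truncated.decode("cp932", errors="replace")))  # bad byte becomes U+FFFD
print(ascii(decode_strict(b"abc\x81\x40", "cp932")))      # 'abc\u3000'
```

The lenient call quietly swaps in a replacement character, which is the kind of silent data loss the strict behavior is there to prevent.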
Oh, and before that it talks about some security considerations (more on these another day).
I am forgetting something now. What was it?
Oh yeah, I was going to talk about the code pages that are not Normalization Form C.
Obviously there is UTF-7 (65000), UTF-8 (65001), and GB-18030 (54936). Since each of these code pages covers the entire Unicode repertoire, each can have characters in Unicode normalization Form C, Form D, or any combination thereof. Some of the other code pages in the table above also fall into this category, but in the case of these three and all the rest, the MB_PRECOMPOSED and MB_COMPOSITE flags are both at best ignored and at worst will cause an ERROR_INVALID_FLAGS to be returned. So you will want to not pass either flag with any of them.
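A quick Python sketch of that point, using Python's codecs of the same names as stand-ins for the three code pages (an assumption; this is not the Win32 conversion path). Whatever form goes in is the form that comes back out.

```python
form_c = "\u00C0"   # precomposed
form_d = "A\u0300"  # composite

# Each of these encodings can carry either form unchanged.
for codec in ("utf-7", "utf-8", "gb18030"):
    for text in (form_c, form_d):
        round_tripped = text.encode(codec).decode(codec)
        print(codec, [hex(ord(ch)) for ch in round_tripped])

utf8_c = form_c.encode("utf-8")  # b'\xc3\x80'
utf8_d = form_d.encode("utf-8")  # b'A\xcc\x80'
```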
But there is one code page that can have data in either composite or precomposed form -- it is the Vietnamese ACP, code page 1258. It has all of the following entries:
CC = U+0300 : COMBINING GRAVE ACCENT
D2 = U+0309 : COMBINING HOOK ABOVE
DE = U+0303 : COMBINING TILDE
EC = U+0301 : COMBINING ACUTE ACCENT
F2 = U+0323 : COMBINING DOT BELOW
The reason for doing this is that there was really not enough room in the code page, otherwise. Unfortunately, there are also some precomposed characters with these accents:
C0 = U+00C0 : LATIN CAPITAL LETTER A WITH GRAVE
C1 = U+00C1 : LATIN CAPITAL LETTER A WITH ACUTE
C8 = U+00C8 : LATIN CAPITAL LETTER E WITH GRAVE
C9 = U+00C9 : LATIN CAPITAL LETTER E WITH ACUTE
CD = U+00CD : LATIN CAPITAL LETTER I WITH ACUTE
D1 = U+00D1 : LATIN CAPITAL LETTER N WITH TILDE
D3 = U+00D3 : LATIN CAPITAL LETTER O WITH ACUTE
D9 = U+00D9 : LATIN CAPITAL LETTER U WITH GRAVE
DA = U+00DA : LATIN CAPITAL LETTER U WITH ACUTE
E0 = U+00E0 : LATIN SMALL LETTER A WITH GRAVE
E1 = U+00E1 : LATIN SMALL LETTER A WITH ACUTE
E8 = U+00E8 : LATIN SMALL LETTER E WITH GRAVE
E9 = U+00E9 : LATIN SMALL LETTER E WITH ACUTE
ED = U+00ED : LATIN SMALL LETTER I WITH ACUTE
F1 = U+00F1 : LATIN SMALL LETTER N WITH TILDE
F3 = U+00F3 : LATIN SMALL LETTER O WITH ACUTE
F9 = U+00F9 : LATIN SMALL LETTER U WITH GRAVE
FA = U+00FA : LATIN SMALL LETTER U WITH ACUTE
So it looks like maybe you could have mixed "Form C" and "Form D" code page 1258 text, doesn't it?
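The dual representation is easy to demonstrate with Python's cp1258 codec, which is a plain byte-to-code-point table. Take it as an illustration of the mapping only, not of what the Win32 functions return.

```python
import unicodedata

precomposed_bytes = b"\xc0"  # C0 = U+00C0
composite_bytes = b"A\xcc"   # 0x41, then CC = U+0300

a = precomposed_bytes.decode("cp1258")
b = composite_bytes.decode("cp1258")

print([hex(ord(ch)) for ch in a])  # ['0xc0']
print([hex(ord(ch)) for ch in b])  # ['0x41', '0x300']
print(a == b)                      # False: different code points
# Canonically equivalent once both are normalized to the same form.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```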
Unfortunately, it's not that perfect. There are two error patterns among the results below:
0xc0 with MultiByteToWideChar/MB_PRECOMPOSED --> U+00c0
0xc0 with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300
0x41 0xcc with MultiByteToWideChar/MB_PRECOMPOSED --> U+0041 U+0300
0x41 0xcc with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300
and going the other way:
U+00c0 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
U+00c0 with WideCharToMultiByte --> 0x41 0xcc
U+0041 U+0300 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
U+0041 U+0300 with WideCharToMultiByte --> 0xc0
The pattern is clear, right? MultiByteToWideChar is not quite smart enough to precompose in Unicode what is composite in cp1258, and WideCharToMultiByte is not quite smart enough to keep composite what is composite in Unicode.
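If you want a consistent form no matter how the bytes were written, one approach is to normalize explicitly after decoding. The Python sketch below does that with the cp1258 codec and unicodedata. It shows the idea, not the Win32 functions, and cp1258_to_unicode is a hypothetical helper name.

```python
import unicodedata

def cp1258_to_unicode(data, form="NFC"):
    """Decode cp1258 bytes, then normalize so the result is in one form."""
    return unicodedata.normalize(form, data.decode("cp1258"))

# Both byte spellings now land on the same Unicode string.
print(ascii(cp1258_to_unicode(b"\xc0")))          # '\xc0'
print(ascii(cp1258_to_unicode(b"A\xcc")))         # '\xc0'
print(ascii(cp1258_to_unicode(b"\xc0", "NFD")))   # 'A\u0300'

# Going the other way, Python's table-driven codec keeps whatever form it is given.
print("\u00c0".encode("cp1258"))   # b'\xc0'
print("A\u0300".encode("cp1258"))  # b'A\xcc'
```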
Ah well, nothing is perfect -- the Vietnamese code page is missing some characters used in Vietnamese, anyway.
But the real reason for these combining characters is to handle the many letters used in Vietnamese that have double diacritics on them -- the cases of dual representations are somewhat accidental, all things considered, in the face of the need to support letters like "ẳằẵắặầẩẫấậ" and so forth....
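Here is a rough Python sketch of how one of those letters ends up as two bytes. It assumes Python's cp1258 codec and uses a made-up helper, encode_cp1258, which peels off one level of canonical decomposition when a letter has no byte of its own. It is deliberately naive: a letter whose one-level decomposition is still not in the code page simply fails.

```python
import unicodedata

def encode_cp1258(text):
    """Encode to cp1258, splitting off one combining mark where needed."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1258")
        except UnicodeEncodeError:
            # One level of canonical decomposition, e.g. U+1EB3 -> U+0103 U+0309.
            parts = unicodedata.decomposition(ch).split()
            if not parts or parts[0].startswith("<"):
                raise  # no canonical decomposition to fall back on
            pieces = "".join(chr(int(p, 16)) for p in parts)
            out += pieces.encode("cp1258")  # raises if the pieces do not fit either
    return bytes(out)

raw = encode_cp1258("\u1eb3")  # a with breve and hook above
print(raw.hex())               # e3d2
print(unicodedata.normalize("NFC", raw.decode("cp1258")) == "\u1eb3")  # True
```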
This post brought to you by "À" (U+00c0, a.k.a. LATIN CAPITAL LETTER A WITH GRAVE)
my multi-byte string contains NULL characters 0x00 within it.
It appears that MultiByteToWideChar does not work on the FULL string, it stops as soon as it encounters the first NULL byte in the multi-byte string, although i've passed the full length of the multi-byte buffer to MultiByteToWideChar in the fourth parameter.
Any clues???
What code page are you using? And what string, exactly?
code page 932.
the multi-byte string is basically from a file, that i've read in a char buffer.
actually i'm detecting the code page for the data from that file, by implementing the technique "Detecting a String's Character Set" given at the following URL:
http://www.microsoft.com/globaldev/DrIntl/columns/019/default.mspx
There is a file on the disk, that I read in a char buffer using MFC's CFile class. Now this file has some NULL bytes in it.
I then pass this buffer to MultiByteToWideChar.
I'm actually trying to detect the code page for the text in the file using the technique described under heading "Detecting a String's Character Set" at the following URL: