Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Over in the Suggestion Box, Aaron asked:
Hi again - question about one of your favorite codepages - 1258 (Vietnamese) and combining diacritics in regards to Unicode character U+1EB7 (LATIN SMALL LETTER A WITH BREVE AND DOT BELOW - http://www.fileformat.info/info/unicode/char/1eb7/index.htm)

Windows Installer, for some unknown reason, has never gone unicode, so it always requires an ANSI codepage. For Vietnamese, this character, while extremely common, isn't actually in the codepage list, so I assume the correct way to get it is via usage of combining diacritics. Unfortunately, no matter how I try to break out the letter into diacritical marks, MSI refuses to show it correctly (yes, I have the font packs, etc...). I can get it to stop showing question marks, but not actually the correct character.

Which leaves me with a few questions:

1) Any ideas on how to get this character to show up correctly in MSI's UI for codepage 1258?

2) What's the best way to automatically convert from System.String for this character into a StreamWriter'd out GetEncoding(1258) version to get this decomposition correctly? Normalizing into FormD or FormKD first is not working, unfortunately, which was my hope. (Managed or Unmanaged is fine with me - just hope there's a way!)

3) Any other codepages you can think of that we should watch out for with problems like this?

4) Have you heard any push from the MSI team on getting to UTF-8 or UTF-16 support any time soon? It seems way behind the times.

Thanks!
Now I won't go so far as to say that 1258 is a favorite code page, so I'll take that as sarcasm. :-)
When I read the questions here, I immediately thought of the Robert Frost poem The Road Not Taken:
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
Now the poem has been interpreted by others much greater than I, so I'll just go so far as to say that one popular interpretation is an ironic one -- that despite the speaker's bold proclamation at the end about how this different choice made all the difference, in truth it made very little difference, and the two paths were really pretty much equivalent.
I thought I'd twist that up a bit with Aaron's example that leads to a specific scenario where refusing to go down the road not taken can more or less kick your ass!
I'll start by pointing to a blog that comes close to helping here but in the end fails (Getting intermediate forms).
Then I'll point to a blog that comes a bit closer to the mark (Harder intermediate forms of characters).
It's all well and good to talk about U+1eb7 (ặ), aka LATIN SMALL LETTER A WITH BREVE AND DOT BELOW.
Now this character is in Unicode Normalization Form C.
If you decompose it once via the data in the Unicode Character Database, you get U+1ea1 U+0306, aka LATIN SMALL LETTER A WITH DOT BELOW + COMBINING BREVE.
And if you decompose it again, you get U+0061 U+0323 U+0306, aka LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING BREVE.
That last sequence? That is the character in Unicode Normalization Form D.
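For anyone who wants to watch those two decomposition steps happen, here is a minimal sketch using Python's standard unicodedata module. The helper name code_points is made up for this illustration, and the exact data comes from whatever Unicode Character Database version ships with your Python.

```python
import unicodedata

def code_points(text):
    """Render a string as space-separated U+XXXX values (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

composed = "\u1eb7"  # LATIN SMALL LETTER A WITH BREVE AND DOT BELOW

# One step of canonical decomposition, straight from the UCD data.
first_step = unicodedata.decomposition(composed)    # "1EA1 0306"

# Decomposing the resulting base letter gives the second step.
second_step = unicodedata.decomposition("\u1ea1")   # "0061 0323"

# Full canonical decomposition (Normalization Form D).
nfd = unicodedata.normalize("NFD", composed)

print(first_step)
print(second_step)
print(code_points(nfd))   # U+0061 U+0323 U+0306
```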
You might notice a bit of a lack of coverage of either form in Windows code page 1258, even with the odd combining characters it has, which I mentioned in A few of the gotchas of MultiByteToWideChar.
However, as that blog points out, Windows code page 1258 does have U+0323 (COMBINING DOT BELOW).
And if you look at the code page table itself, you will see that it does have U+0103 (LATIN SMALL LETTER A WITH BREVE).
Now this is one of those harder intermediate forms: U+0103 U+0323 (ặ), aka LATIN SMALL LETTER A WITH BREVE + COMBINING DOT BELOW. While it is definitely the road not taken by Unicode normalization, it is actually the de facto road taken by (for lack of an official term) Microsoft's "Normalization Form V", as used by its code page 1258.
Note that if you normalize this sequence, both Unicode Normalization Forms C and KC will convert it to U+1eb7, and both Unicode Normalization Forms D and KD will convert it to U+0061 U+0323 U+0306.
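A quick way to see why none of the four standard forms helps is to try encoding each of them. This is only a sketch: it uses Python's built-in cp1258 codec as a stand-in for the Windows conversion (an assumption, since the codec is a strict table lookup and does no best-fit mapping), and try_encode is a hypothetical helper.

```python
import unicodedata

def try_encode(text):
    """Return the cp1258 bytes, or None if some character is not in the code page."""
    try:
        return text.encode("cp1258")
    except UnicodeEncodeError:
        return None

form_v = "\u0103\u0323"   # a-with-breve followed by COMBINING DOT BELOW

# Every standard normalization form moves away from the encodable sequence.
results = {form: unicodedata.normalize(form, form_v)
           for form in ("NFC", "NFKC", "NFD", "NFKD")}

for form, text in results.items():
    print(form, [f"U+{ord(ch):04X}" for ch in text], try_encode(text))

# Only the un-normalized intermediate form survives the trip.
print(try_encode(form_v))   # b'\xe3\xf2'
```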
Now there is no conversion built into Windows or .NET to get this form not taken that will look right using code page 1258. Though if I ever had an interview candidate who understood all about code pages, I suspect that writing a converter that could do such a job would make for a fascinating interview question!
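As a starting point for that interview answer, here is a rough sketch of what such a converter might look like in Python. Everything in it is an assumption for illustration: the name to_form_v, the TONE_MARKS set (the five combining marks found in code page 1258), and the rule of always splitting tone marks off even where the code page also has a precomposed letter. It is not Microsoft's actual behavior, it is only meant for Vietnamese text (Hangul, for one, would not survive the per-cluster recomposition), and it leans on Python's cp1258 codec for the final encoding step.

```python
import unicodedata

# The five Vietnamese tone marks that exist as combining characters in cp1258:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def to_form_v(text):
    """Recompose everything except tone marks, which trail their letter."""
    out = []
    cluster = []   # base letter plus its non-tone marks
    tones = []     # tone marks waiting to be appended after the letter

    def flush():
        if cluster:
            # Recompose base + breve/circumflex/horn into one precomposed letter.
            out.append(unicodedata.normalize("NFC", "".join(cluster)))
        out.extend(tones)
        cluster.clear()
        tones.clear()

    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            flush()              # a new base character starts a new cluster
            cluster.append(ch)
        elif ch in TONE_MARKS:
            tones.append(ch)     # hold tone marks back until the letter is built
        else:
            cluster.append(ch)
    flush()
    return "".join(out)

converted = to_form_v("Vi\u1ec7t l\u1eb7p")
print([f"U+{ord(ch):04X}" for ch in converted])
print(converted.encode("cp1258"))
```

The design choice here is to decompose fully first and then rebuild, rather than maintaining a hand-written mapping table, so that the input can arrive in any normalization form.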
So, getting back to Aaron's questions: I handled #1 above, and as for #2, though there is no good built-in way to do it, I'd be very impressed by the person who wrote the code to do it!
For #3, code page 1258 is the only conventional ACP under Windows with this specific problem.
And as for #4, while technically UTF-8 (code page 65001) is unsupported by Windows Installer, as I pointed out in MSI Databases and Unicode, MSKLC was able to successfully use UTF-8 and support the setup packages for many Unicode-only languages such as Hindi and Lao and Tibetan. Which suggests that it can in fact be used for Vietnamese.
Note that there are some characters that, even if you do manage to create your own implementation of the so-called Microsoft Normalization Form V, can't be represented by the code page. Thus UTF-8 is really the only option that can support the Vietnamese language itself in the long run.
And although using UTF-8 here will make that conversion code unnecessary, I'd still want to hire the person who came up with an elegant code solution there. :-)
Thinking of all the work that fonts do to support the fictional Form V as well as the other, more valid and less valid forms, it is unfortunate that this support was never made more widespread so that it would be easier to support languages like Vietnamese while waiting for everyone like MSI and others to move to Unicode.
Now Windows code page 1258 is clearly an example where The Road Not Taken may well look the same in fonts and rendering. But code pages and components that do not use Unicode and/or do not normalize will see the road not taken as one with very, very tall weeds blocking the way of folks like Aaron, whose actual work might not afford them the ironic detachment of Robert Frost when it turns out that the two paths aren't the same....
This blog brought to you by ặ (U+1eb7, aka LATIN SMALL LETTER A WITH BREVE AND DOT BELOW)
There's a Vietnamese-specific logic to CP 1258 that transcends the arbitrary Unicode normalization rules. The breve, circumflex, and horn accents, unlike the rest, affect vowel quality. If you look at a Vietnamese alphabet like the one at Wikipedia, you'll see that A WITH BREVE, A WITH CIRCUMFLEX, E WITH CIRCUMFLEX, O WITH CIRCUMFLEX, O WITH HORN, and U WITH HORN (as well as D WITH STROKE, which isn't Unicode-decomposable) are considered separate letters from their unaccented correspondents. Consequently, in 1258 they are encoded using seven precomposed characters.
On the other hand, the grave, acute, hook above, tilde, and dot below accents are tone marks, conceptually not part of the letters they appear on. They're encoded using combining characters, since encoding them using precomposed characters would create a combinatorial explosion of 12 x 6 x 2 = 144 distinct vowel characters. (The VISCII encoding actually does that, at the expense of filling the whole 0x80-0xFF space with letters and even usurping six of the control characters!)
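The 144 figure can be checked by brute force. The sketch below assumes the twelve vowel letters of the modern Vietnamese alphabet and counts "no tone mark" as the sixth tone; the names VOWELS, TONES, and combos are made up for the illustration.

```python
import unicodedata
from itertools import product

# a, a-breve, a-circumflex, e, e-circumflex, i, o, o-circumflex, o-horn, u, u-horn, y
VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# No mark, grave, acute, hook above, tilde, dot below.
TONES = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]

combos = set()
for vowel, tone, upper in product(VOWELS, TONES, (False, True)):
    base = vowel.upper() if upper else vowel
    # NFC gives the single precomposed character for each combination.
    combos.add(unicodedata.normalize("NFC", base + tone))

print(len(VOWELS), len(TONES), len(combos))   # 12 6 144
```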
Unsurprisingly, Vietnamese conventions always place the tone mark outside any breve, circumflex, or horn diacritic (and therefore following it according to Unicode rules). The only place in which this causes a problem is the dot below, because Unicode arbitrarily wants all diacritics below to come before all diacritics above.
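The ordering rule in question is the canonical combining class. As a small sketch with Python's unicodedata module (the variable names are only for illustration), marks with a lower class sort first during normalization, which is why the dot below jumps ahead of a breve but an acute does not:

```python
import unicodedata

# Canonical combining classes: lower values sort earlier in NFD.
horn = unicodedata.combining("\u031b")        # 216 (attached above right)
dot_below = unicodedata.combining("\u0323")   # 220 (below)
breve = unicodedata.combining("\u0306")       # 230 (above)
acute = unicodedata.combining("\u0301")       # 230 (above)

# Tone mark above: same class as the breve, so the tone stays last.
breve_then_acute = unicodedata.normalize("NFD", "\u0103\u0301")

# Dot below: lower class than the breve, so it is reordered in front.
breve_then_dot = unicodedata.normalize("NFD", "\u0103\u0323")

# Horn has a lower class than dot below, so the tone stays last here too.
horn_then_dot = unicodedata.normalize("NFD", "\u01a1\u0323")

for text in (breve_then_acute, breve_then_dot, horn_then_dot):
    print([f"U+{ord(ch):04X}" for ch in text])
```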
(ObTooLateNow: IMHO the horn diacritic shouldn't have been encoded separately in Unicode. It's not used anywhere but in Vietnamese, can only appear on o and u, and (like ogonek and cedilla, but unlike most other combining diacritics) always touches the letter that it's associated with. Using undecomposable characters would have cost only 3 codepoints.)