Back in May of 2004, Quan Nguyen sent a message to Dr. International about Vietnamese collation in Windows and the .NET Framework:

I tried to sort Vietnamese characters according to Vietnamese collation rules, as prescribed in http://vietunicode.sourceforge.net/charset/vietalphabet.html. However, .NET Framework's built-in sort order for CultureInfo("vi-VN") seems not correct. What should I do to get it to sort according to Vietnamese alphabetical order?

This was not the only place the question was asked -- Quan had asked it on several newsgroups and in other places. We requested some more details, did the investigation, and were able to confirm the claim -- he was right, there were a few letters that did not sort properly. In the end, the problem basically consisted of the uppercase and lowercase versions of the following letters:

Of course since these letters are in Unicode and are used by several other languages, they have some default weights -- but they are not in the Vietnamese exception table. And their weights in the default table are not completely correct....
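To make the "default table plus exception table" idea a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the names `DEFAULT_WEIGHTS`, `VIETNAMESE_EXCEPTIONS`, and `sort_key` are hypothetical, the weights are invented, and the real Windows tables use multi-level weights that look nothing like this. The only point is the fallback behavior: a letter that is missing from the exception table silently gets whatever the default table says.

```python
# Toy model only: these weights are invented for illustration and are
# not the values in any real Windows sorting table.
DEFAULT_WEIGHTS = {"a": 10, "b": 20, "c": 30, "d": 40, "đ": 40, "e": 50}

# Hypothetical per-language exception table: it lists only the entries
# that differ from the default table.
VIETNAMESE_EXCEPTIONS = {"đ": 45}


def sort_key(word, exceptions=None):
    """Build a sort key from the default weights, overridden by exceptions."""
    table = {**DEFAULT_WEIGHTS, **(exceptions or {})}
    return [table[ch] for ch in word]


words = ["đa", "de", "da"]

# Without the exception entry, "đ" weighs the same as "d".
default_order = sorted(words, key=sort_key)

# With the exception entry, "đ" becomes its own letter after "d".
vietnamese_order = sorted(words, key=lambda w: sort_key(w, VIETNAMESE_EXCEPTIONS))
```

In this toy, leaving the "đ" entry out of `VIETNAMESE_EXCEPTIONS` would not cause an error. The sort would simply fall back to the default weight, which is the same shape of problem described above.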

Now no one had reported this problem before, so hopefully these are letters that are not used often in Vietnamese in situations where the small but definite differences in collation would be noticed.

Which is not to say it is not a bug or that it should not be fixed -- it definitely is.

But it does perhaps explain why it took so long for someone to report to Microsoft a bug that has been in the code page and sorting tables since the very first Vietnamese-enabled versions of Windows....

Now Windows code page 1258 has its own set of problems here, because the above characters are not in cp1258, either. Well, they sort of are, as combining sequences, since the code page has U+0300, U+0301, and U+0303 on it -- but the conversion to and from Unicode of the above characters can be quite nightmarish, for the reasons I mentioned when I pointed out a few of the gotchas of MultiByteToWideChar. We would have had to include them in the precomposed form listed above, and there are not enough free slots to do so (even if we were able to modify code pages, which we are not, as I explained when I wrote about why we cannot change the code pages).
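Here is a small sketch of that round trip, using the letter from this post's sign-off. It assumes Python's built-in `cp1258` codec rather than MultiByteToWideChar, so it shows the shape of the problem and not the exact behavior of the Win32 call; the helper name `fits_cp1258` is made up for the example.

```python
import unicodedata

PRECOMPOSED = "\u00dd"  # Ý, LATIN CAPITAL LETTER Y WITH ACUTE


def fits_cp1258(text):
    """Return True if every character in text has a slot in cp1258."""
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False


# NFD splits Ý into "Y" + U+0301, and both of those do have cp1258 slots.
decomposed = unicodedata.normalize("NFD", PRECOMPOSED)
encoded = decomposed.encode("cp1258")

# Decoding gives back the two-character sequence, not the single code point,
# so a plain comparison with the original fails until we recompose with NFC.
round_tripped = encoded.decode("cp1258")
recomposed = unicodedata.normalize("NFC", round_tripped)
```

So the text survives the trip only if both ends agree to normalize, and plain NFD is not a general answer either: letters that carry two marks decompose into a combining circumflex or breve, which cp1258 does not have on its own.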

So let's just assume that cp1258 is about as limited in use as all of the other (at last count 42!) attempts at 8-bit encodings of Vietnamese are (they all have problems due to the fact that there are too many characters and not enough slots to put them in) and stick with Unicode....

Getting back to collation, this particular problem that Quan Nguyen reported is fixed in the updated sorting tables in Vista Beta 1. It could not be fixed in earlier versions of Windows or the .NET Framework, as it requires a major version change for Vietnamese to change the weights of code points that already have weights defined, so Vista is our first chance to make the fix (Whidbey's sorting tables are not being updated, so the fix could not be made in .NET 2.0).

On a happier note, the font story for Vietnamese has been really good on Windows for a while now, for all of these various letters.

And the Vietnamese LIP was released in March 2005 which is also pretty awesome.

It just took a little while for the NLS side of GIFT to catch up with everyone else, that's all. :-)

 

This post brought to you by "Ý" (U+00dd, a.k.a. LATIN CAPITAL LETTER Y WITH ACUTE)