Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Yesterday, I said that CompareString prefers meaningful strings, and that while (the rare) inconsistencies are always bugs, we have to prioritize those bugs based on whether or not the data is actually valid/meaningful.
Many people stopped and wondered how one defines the word 'meaningful' here. Is it a definition that is useful for developers?
I'll ignore the cross-script strings that have little clear semantic or pragmatic meaning and focus on the strings that have code points not defined by the MS collation tables (sometimes not even by Unicode!) as discussed in The jury will give this string no weight.
Some developers might think they could use the CompareString API and compare characters against a zero-length string. Others think about using LCMapString and looking for a "no weight" sort key. But both of these ideas share problems that keep them from acting as practical solutions.
So, what can you do? You can use the IsNLSDefinedString API! You pass it a string and it tells you whether every character in the string has a defined result (which in this case is exactly what you need).
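IsNLSDefinedString itself is a Win32 call, so as a rough, platform-independent sketch of the same idea (a hypothetical analogue, not the actual API), you can check whether every code point in a string is even assigned in the Unicode Character Database:

```python
import unicodedata

def is_defined_string(s: str) -> bool:
    """Rough analogue of IsNLSDefinedString: True only if every code
    point in s is assigned in the Unicode Character Database.
    (The real API checks the NLS collation tables, which can lag
    behind Unicode itself, so this is only an approximation.)"""
    # General category 'Cn' means "Other, not assigned".
    return all(unicodedata.category(ch) != 'Cn' for ch in s)

# U+10E5 GEORGIAN LETTER KHAR is assigned, so it passes the check;
# U+E0000 is a reserved, unassigned code point, so it fails.
print(is_defined_string("\u10e5"))      # True
print(is_defined_string("\U000E0000"))  # False
```

Note that this sketch only approximates the "defined by Unicode" half of the question; a code point can be assigned in Unicode and still have no weight in the Windows collation tables.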
It is intimately related to the GetNLSVersion API, which also helps out with the question of stability in collation.
Both APIs were added in Windows Server 2003, and the Whidbey release of the .NET Framework includes a method analogous to IsNLSDefinedString (CompareInfo.IsSortable; you will see it starting in Beta 2).
GetNLSVersion is used by major databases (such as Active Directory) to know when they need to re-index their data. Looking at the NLSVERSIONINFO struct, the dwDefinedVersion member is incremented any time a major collation change happens, and the dwNLSVersion member is incremented any time a minor collation change happens.
Now, consider IsNLSDefinedString when you have a database whose indexes are built from LCMapString sort keys or from B-trees built with CompareString calls.
Obviously, major version changes are expensive and would be expected to be rare -- not even every major release of Windows requires a new major version.
Why is that? Well, usually a new version just means a whole bunch of new characters has been added, and thus there is no need to re-index strings that are already indexed; that suggests a minor version. Minor version changes would be much more common. With them you can trust all existing index values, and only need to re-index strings that previously contained one or more unsortable elements.
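The re-indexing decision described above might be sketched like this (the two struct member names follow the Win32 NLSVERSIONINFO header, but the surrounding helper is purely illustrative and not part of any Windows API):

```python
from dataclasses import dataclass

@dataclass
class NlsVersionInfo:
    # Mirrors the two version members of the Win32 NLSVERSIONINFO struct.
    dwNLSVersion: int      # bumped on minor collation changes
    dwDefinedVersion: int  # bumped on major collation changes

def reindex_plan(stored: NlsVersionInfo, current: NlsVersionInfo) -> str:
    """Hypothetical helper: decide how much of an index to rebuild
    after comparing the version stored with the index against the
    version the system reports now."""
    if current.dwDefinedVersion != stored.dwDefinedVersion:
        # Major change: no existing sort key can be trusted.
        return "full"
    if current.dwNLSVersion != stored.dwNLSVersion:
        # Minor change: existing keys are fine; only strings that
        # previously contained unsortable elements need re-indexing.
        return "unsortable-only"
    return "none"  # Same version: the index is still valid.
```

This is also why it pays to record which stored strings contained unsortable elements at index time: a minor version bump then touches only that (hopefully small) subset.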
If you follow the guidance above and always store information about unsortable strings, you can use these APIs to get the most out of the collation of meaningful strings on Windows.
This post brought to you by "ქ" (U+10E5, a.k.a. GEORGIAN LETTER KHAR)