Tuesday, January 18, 2005 7:54 AM
Michael S. Kaplan
The jury will give this string no weight
(the title was inspired by a decade and a half of Law & Order on NBC, then A&E, and now TNT!)
I don't want to knock collation on Windows, because I think it rocks. It covers a lot of territory, and it gets the job done (and done well) in a lot of the world. But every once in a while you may find yourself on the bleeding edge of what is supported, and it is important when you are on the bleeding edge to keep from wounding yourself. Thus I am going to talk about one of those edge cases now....
It starts with the NLS APIs that handle collation. When you use the CompareString API to compare two strings or the LCMapString API with the LCMAP_SORTKEY flag to get the sort key of one string, an important bit of NLS architecture is involved. That bit is the weight tables that these NLS APIs use to make linguistic comparisons.
The weights are something I discussed a bit in a previous post entitled How do sort keys work? and this post is going to talk a little bit more about those weights.
The main problem is that although the weight tables that are used by Windows and the .NET Framework are great for all of the languages and scripts that Windows supports, they are not quite as useful when the weights are not present.
There are many reasons for a code point to have no weight. It may actually not be a valid encoded Unicode code point, in which case it would be expected to have no weight.
Or it may be a code point that was not encoded in Unicode until after a given version of the operating system shipped (in which case it will have no weight in that version, since we do not have clairvoyants on staff!).
Or finally (and this is the one that kind of sucks a bit) it may not have been added to our tables yet. So....
If you try to compare strings containing (for example) Tibetan script on any shipping version of Windows, they will all be considered equal to each other. If you try to get sort keys for them, you will see that they have no weight. Therefore any kind of linguistic comparison will not return useful results; all strings will be equal. And this will happen even though the strings may not be the same length!
There are probably some developers around right now who are objecting to that last point, but I'll give a counterpoint. Let us say that you are comparing "hello" (U+0068 U+0065 U+006c U+006c U+006f) and "hëllô" (U+0068 U+0065 U+0308 U+006c U+006c U+006f U+0302) using CompareString with the NORM_IGNORENONSPACE flag. You would expect them to be considered equal since you are ignoring diacritics, which means "give the diacritics no weight", even though the length of the two strings is different. So the length is not important -- what is important is that the weights on the two strings are the same.
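To make the "length is not important" point concrete, here is a rough, portable sketch in Python. It is not what CompareString does internally; it only approximates "give the diacritics no weight" by dropping nonspacing marks (Unicode general category Mn) before comparing. The helper name strip_nonspacing is made up for this illustration.

```python
import unicodedata


def strip_nonspacing(text: str) -> str:
    """Decompose the text, then drop nonspacing marks (category Mn).

    Hypothetical helper: a stand-in for "ignore diacritics", not the
    actual NORM_IGNORENONSPACE implementation.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


plain = "hello"
# "hëllô" in decomposed form: base letters followed by combining marks.
accented = unicodedata.normalize("NFD", "h\u00ebll\u00f4")

print(len(plain), len(accented))  # the two strings differ in length
print(strip_nonspacing(plain) == strip_nonspacing(accented))
```

The two strings have different numbers of code points, yet once the marks are given "no weight" there is nothing left to tell them apart.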
You'll get the same results if you try to compare strings in other scripts that do not yet have weight (such as Yi Syllables or Khmer).
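The effect of a missing weight can be sketched with a toy model. Everything below is an assumption made for illustration: the table TOY_WEIGHTS, its numbers, and the helper names are invented, and the real Windows weight tables are far larger and multi-level. The only idea carried over from the post is that a code point with no entry contributes nothing to the sort key.

```python
# Hypothetical, tiny weight table: code point -> single weight.
# The numbers are invented; nothing here comes from the real Windows tables.
TOY_WEIGHTS = {"a": 1, "b": 2, "c": 3}


def toy_sort_key(text):
    """Collect weights for code points in the table; others add nothing."""
    return tuple(TOY_WEIGHTS[ch] for ch in text if ch in TOY_WEIGHTS)


def toy_compare(left, right):
    """Return -1, 0, or 1 by comparing the toy sort keys."""
    left_key, right_key = toy_sort_key(left), toy_sort_key(right)
    return (left_key > right_key) - (left_key < right_key)


short_tibetan = "\u0f42"              # TIBETAN LETTER GA
long_tibetan = "\u0f42\u0f44\u0f51"   # three Tibetan letters

print(toy_sort_key(short_tibetan))                # empty key: no weights at all
print(toy_compare(short_tibetan, long_tibetan))   # 0, despite different lengths
print(toy_compare("ab", "a\u0f42b"))              # 0, the unweighted letter is invisible
print(toy_compare("ab", "ac"))                    # -1, weighted letters still sort
```

In this model every string made up only of unweighted code points collapses to the same empty key, which is why such strings all look equal to a weight-based comparison.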
It will happen in Windows 2000 and in SQL Server 2000 with CJK Unified Ideographs Extension A or CJK Unified Ideographs Extension B -- though we addressed this in Windows XP and Windows Server 2003, and there are new collations in SQL Server 2005 that give these ideographs some weight as well.
And in Longhorn we plan to give everything that is defined some type of default weight, at least.
On a side note, the original version of the post included a bunch of Tibetan strings in it, but .Text actually fails to post when that text is there (it probably has trouble with those "weightless" strings in its parsing logic?). This only affected the initial post; I was able to edit after the post and add characters (like the sponsor line). Weird bug....
With (a) MSKLC available, (b) a publicly defined OpenType spec, and (c) custom cultures coming in the "Whidbey" release of Visual Studio and the .NET Framework, Microsoft is clearly working to "get out of the way" of those who do not want to wait for us to support their language. Such people are right; we should get out of their way. And giving every defined code point a default weight is yet another step in that process to help enable them.
And yes, there will be more on these plans in future posts, especially as Beta 2 of VS 2005 and Beta 3 of SQL Server 2005 make it out into the world, and then especially as more gets said about the "Longhorn" release of Windows. Stay tuned... because it's gonna keep being interesting. :-)
This post sponsored by "ག" (U+0f42, a.k.a. TIBETAN LETTER GA)