Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Yesterday I contrasted sort elements and text elements. I am now going to leave text elements aside for a bit, because linguistic collation on Windows, at its heart, is an ordering based on sort elements, not text elements.
Every time I look at the text in the Platform SDK for the CompareString function, I cannot help but smile:
If the two strings are different lengths, they are compared up to the length of the shortest one. If they are equivalent at that point, the return value indicates that the longer string is greater.
The truth is that all throughout the string, it is the sort elements that are being compared. There are times that one code point actually represents two sort elements (think Æ, a.k.a. U+00c6 a.k.a. LATIN CAPITAL LETTER AE in some languages) or three sort elements (think ﬃ, a.k.a. U+fb03 a.k.a. LATIN SMALL LIGATURE FFI in other languages). There are other times that two code points (think ch in Traditional Spanish) or three code points (think dzs in Hungarian) make up a single sort element. Other times code points have no weight and they are ignored entirely, having no sort element at all.
So if each code point can have anywhere between 0 and 3 sort elements (with fractional values supported, since several code points can share a single sort element), it is hard to equate string length to any operation beyond when to stop looking. The string length is definitely not a count of relevant elements to consider!
It makes that sentence from the documentation almost comical. Since CompareString is looking at each string one sort element at a time, the only length that is meaningful to it is the length in sort elements; it is only when the sort elements are equivalent until one string ends that the issue of the longer string being greater comes into play.
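To make the idea concrete, here is a minimal Python sketch of comparing by sort elements rather than by code points. Everything in it is an assumption for illustration: the tables, the `sort_elements` and `compare` helper names, and the use of plain string ordering as a stand-in for real weights are all invented, and none of it reflects the actual Windows sorting tables or how CompareString is implemented.

```python
# Toy model only: these tables are invented for illustration and are NOT
# the real Windows collation data.
EXPANSIONS = {
    "\u00c6": ["a", "e"],       # one code point, two sort elements
    "\ufb03": ["f", "f", "i"],  # one code point, three sort elements
}
CONTRACTIONS = ["dzs", "ch"]    # several code points, one sort element (longest first)
IGNORABLE = {"\u200d"}          # assumed weightless here: no sort element at all


def sort_elements(text):
    """Split text into a list of (toy) sort elements."""
    elements = []
    i = 0
    while i < len(text):
        for contraction in CONTRACTIONS:
            if text.startswith(contraction, i):
                elements.append(contraction)
                i += len(contraction)
                break
        else:  # no contraction matched at position i
            ch = text[i]
            if ch in EXPANSIONS:
                elements.extend(EXPANSIONS[ch])
            elif ch not in IGNORABLE:
                elements.append(ch)
            i += 1
    return elements


def compare(left, right):
    """Return -1, 0, or 1, walking both strings one sort element at a time."""
    a, b = sort_elements(left), sort_elements(right)
    for x, y in zip(a, b):
        if x != y:
            # Plain string ordering stands in for real weights in this sketch.
            return -1 if x < y else 1
    # Equal up to the end of the shorter list: the one with more
    # sort elements (not more code points) is greater.
    return (len(a) > len(b)) - (len(a) < len(b))


print(len("o\ufb03"), len(sort_elements("o\ufb03")))  # 2 code points, 4 sort elements
print(len("off"), len(sort_elements("off")))          # 3 code points, 3 sort elements
print(compare("o\ufb03", "off"))                      # 1
```

In this toy model the string with fewer code points comes out greater, because it is the longer one when measured in sort elements, which is the only length the comparison ever sees.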
On the other hand, I would hate to suggest trying to inject the notion of sort elements into the Platform SDK just to have a nicer sentence in the one doc topic.
I guess that is what this blog is for. :-)
Now, lest you think it is all easy once you add this one "conceptual simplification", I promise to make it seem harder again while talking about the reverse diacritics used in French, the double compressions used in Hungarian, tricks with Jamo and Old Hangeul, the full story on Hiragana and Katakana, the stuff happening in Longhorn, and more.
But it is still a good start. This whole subject ought to be a lot easier, conceptually. Any subject that just about every single person in the world who can read is able to intuitively understand ought to be easier conceptually, even if most of those people cannot explain how it works. Maybe if they have been reading here, and plan to keep doing so, they will be able to. :-)
This post brought to you by "ﬃ" (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)