More on sort elements

Sorting it all Out
Michael Kaplan's random stuff of dubious value

Yesterday I contrasted sort elements and text elements. I am now going to leave text elements aside for a bit, because linguistic collation on Windows, at its heart, is an ordering based on sort elements, not text elements.

Every time I look at the text in the Platform SDK for the CompareString function, I cannot help but smile:

If the two strings are different lengths, they are compared up to the length of the shortest one. If they are equivalent at that point, the return value indicates that the longer string is greater.

The truth is that all throughout the string, it is the sort elements that are being compared. There are times that one code point actually represents two sort elements (think Æ, a.k.a. U+00c6 a.k.a. LATIN CAPITAL LETTER AE in some languages) or three sort elements (think ﬃ, a.k.a. U+fb03 a.k.a. LATIN SMALL LIGATURE FFI in other languages). There are other times that two code points (think ch in Traditional Spanish) or three code points (think dzs in Hungarian) make up a single sort element. Other times code points have no weight and they are ignored entirely, having no sort element at all.
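To make the bookkeeping concrete, here is a minimal sketch in Python of how a string might be broken into sort elements. Everything in it is an assumption for illustration: the names `EXPANSIONS`, `CONTRACTIONS`, `IGNORABLE` and `sort_elements` are made up, the tiny table mixes rules that really belong to different locales, and the choice of U+00ad as the weightless code point is hypothetical. It is not how Windows stores or applies its sorting tables.

```python
# Toy collation data. Every entry is an illustrative assumption,
# not data taken from Windows.
EXPANSIONS = {
    "\u00c6": ["A", "E"],        # one code point, two sort elements
    "\ufb03": ["f", "f", "i"],   # one code point, three sort elements
}
CONTRACTIONS = ["dzs", "ch"]     # several code points, one sort element (longest first)
IGNORABLE = {"\u00ad"}           # hypothetical weightless code point


def sort_elements(text):
    """Split text into a list of toy sort elements."""
    elements = []
    i = 0
    while i < len(text):
        for seq in CONTRACTIONS:
            if text.startswith(seq, i):
                elements.append(seq)   # the whole sequence is one element
                i += len(seq)
                break
        else:
            ch = text[i]
            if ch in EXPANSIONS:
                elements.extend(EXPANSIONS[ch])
            elif ch not in IGNORABLE:
                elements.append(ch)    # ordinary case: one code point, one element
            i += 1
    return elements


for s in ["\u00c6", "\ufb03", "ch", "dzs", "a\u00adb"]:
    print(len(s), "code point(s) ->", len(sort_elements(s)), "sort element(s)")
```

The only point of the loop at the end is to show the two counts drifting apart in both directions.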

So if each code point can contribute anywhere between 0 and 3 sort elements (fractional values included, as when three code points share a single sort element), it is hard to equate string length with anything beyond knowing when to stop looking. The string length is definitely not a count of relevant elements to consider!

It makes the notion of that sentence from the documentation almost comical. Since CompareString is looking at each string, one sort element at a time, the only length that is meaningful to it is the length in sort elements; it is only when the sort elements are equivalent until one string ends that the issue with the longer string being greater comes into play.
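A rough sketch of that idea in Python, under loud assumptions: `TABLE`, `to_elements` and `compare` are hypothetical names, the table is a toy, the weightless U+00ad entry is invented for the example, and the "weights" are just strings compared by code point rather than anything resembling real sort weights. It only illustrates comparing element by element and falling back to length in sort elements.

```python
# Hypothetical mini-table: code point sequence -> list of toy sort elements.
TABLE = {
    "\ufb03": ["f", "f", "i"],  # one code point, three sort elements
    "ch": ["ch"],               # two code points, one sort element
    "\u00ad": [],               # assumed weightless: no sort element at all
}
KEYS = sorted(TABLE, key=len, reverse=True)  # try longer sequences first


def to_elements(text):
    """Turn a string into its list of toy sort elements."""
    out, i = [], 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.extend(TABLE[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # anything not in the table is its own element
            i += 1
    return out


def compare(a, b):
    """Return -1, 0 or 1, comparing one sort element at a time."""
    ea, eb = to_elements(a), to_elements(b)
    for x, y in zip(ea, eb):
        if x != y:
            return -1 if x < y else 1
    # Equal up to the end of the shorter one: the string that is
    # longer *in sort elements* is the greater one.
    return (len(ea) > len(eb)) - (len(ea) < len(eb))


print(compare("\ufb03", "ff"))    # 1 code point vs 2, yet the ligature is "longer"
print(compare("a\u00adb", "ab"))  # different string lengths, same sort elements
```

In this toy table the one-code-point ligature compares greater than the two-code-point "ff", and two strings of different lengths compare as equal, which is exactly why the string length tells you so little.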

On the other hand, I would hate to suggest trying to inject the notion of sort elements into the Platform SDK just to have a nicer sentence in the one doc topic.

I guess that is what this blog is for. :-)

Now lest you think it is all easy now once you add this one "conceptual simplification", I promise to make it seem harder again while talking about the reverse diacritics used in French, the double compressions used in Hungarian, tricks with Jamo and Old Hangeul, the full story on Hiragana and Katakana, the stuff happening in Longhorn, and more.

But it is still a good start. This whole subject ought to be a lot easier, conceptually. Any subject that just about every single person in the world who can read is able to intuitively understand ought to be easier conceptually, even if most of those people cannot explain how it works. Maybe if they have been reading here and plan to keep doing so, they will be able to. :-)

 

This post brought to you by "ﬃ" (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)

Blog - Comment List
  • This is actually on the "reverse diacritics" post--the circumflexes have not been abolished from the words you mention there, only from words where no preëxisting word is identical but for the circumflex (in other words, all circumflexes needed to distinguish words remain).
  • How do those lyrics go again? Wild[card] thing You make my CHAR sing. You make everything query Wild[card]

  • Developer Andrew Arnott asks: Michael, David Kline recommended I forward my question on to you. If you

  • There is a problem with the notion of both trailing spaces and fixed width in SQL Server, when you are

  • Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The

  • The first blog in this series was On reversing the irreversible (the introduction) and the second was
