A Microsoft convention for compressions in sorting

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!



The other day, a developer named Stephanie sent me an email about compressions. (These are used in collation when two or more characters are given a single sort weight; the Unicode Collation Algorithm calls its analogous construction a contraction, in part to avoid confusion with other meanings of the term compression that are described in Unicode.) She had just read Dr. International's description of the difference between Traditional and Modern Spanish here, and asked:

I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepancies?

Also, why wouldn't one of these be an alternate sort?

Stephanie, you are right -- for every compression we define for a cased script, we handle the UU, UL, and LL forms (all uppercase, initial capital, and all lowercase), but we skip the LU form (lowercase followed by uppercase).

This was originally a point of confusion for me as well, but Cathy Wissink set me straight back in the early days when she pointed out to me that words may be ALL CAPS or they may be all lowercase and they may be Initial caps, but there is in most languages not a pattern that has capital letters in the middle of text that is not capitalized. The convention we use for compressions is designed to take this reality into account: it handles the expected cases and discards the one that is unexpected.
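
To make the convention concrete, here is a minimal Python sketch. It is not the actual Windows sorting code: the function names `compression_forms` and `toy_sort_key` and the weights are made up for illustration, and the key only models a primary-level comparison that ignores case. It assumes a two-letter compression such as Traditional Spanish "ch", which sorts as a single element between c and d.

```python
def compression_forms(digraph):
    """Return the UU, UL, and LL case forms of a two-letter compression.

    The LU form (lowercase then uppercase, e.g. "cH") is deliberately left out.
    """
    first, second = digraph[0], digraph[1]
    return {
        first.upper() + second.upper(),  # UU, e.g. "CH"
        first.upper() + second.lower(),  # UL, e.g. "Ch"
        first.lower() + second.lower(),  # LL, e.g. "ch"
    }


def toy_sort_key(word, digraph="ch"):
    """Build a toy primary-level sort key (hypothetical weights).

    A plain letter gets (code point of its lowercase form, 0).
    A recognized compression gets (code point of its first letter, 1),
    which places it after every "c..." string and before "d".
    """
    forms = compression_forms(digraph)
    key = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in forms:
            # One sort element for the whole compression.
            key.append((ord(digraph[0].lower()), 1))
            i += 2
        else:
            key.append((ord(word[i].lower()), 0))
            i += 1
    return key


words = ["d", "ch", "cz", "cH", "ci", "cg"]
print(sorted(words, key=toy_sort_key))
```

In this sketch "ch", "Ch", and "CH" all produce the same key, while "cH" is treated as two separate letters and so lands between "cg" and "ci".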

The Dr. International article isn't wrong here, though. I will often speak of a compression by just naming the one form when I mean all three forms; it is just a convenient way to express what compressions exist for a language, or a particular sort within a language.

As to your final concern, I agree with you -- there ought to be an alternate sort used here. I actually even pointed this out in the past (described here). The truth is that alternate sorts did not exist then. They were added specifically in the postmortem over handling this issue with Spanish!

 

This post brought to you by "ש" (U+05e9, a.k.a. HEBREW LETTER SHIN)

Blog - Comment List
  • "there is in most languages not a pattern that has capital letters in the middle of text that is not capitalized"

    I can think of two counter-examples off-the-cuff...

    1. Ronald McDonald <-- cD
    2. random-capitalization for password complexification

    The first is not much of a problem as combining characters that happen to meet in this fashion should probably *not* be combined as they are from separate semantic objects.

    The second is more of a problem but is mitigated somewhat by the "why would you need to sort passwords anyway" question. Unless you're writing a crypt() function.
  • ah, the #2 case is not meaningful for us, and the #1 case I would argue that the intent was not to treat the two chars as a sort element (if there were a CD compression, that is; no one has one now).

    Now if some language wanted it, we could always add it for that language. We just don't have any right now.... :-)
  • Suppose I was a developer in Spain, tasked with creating a phone directory. In .NET, of course. I use the CompareString culture options to implement a Spanish-sensitive sort, in particular sorting ch as a unique element between c and d.

    CA CB ... CE CF CG CI CJ CK ... CZ
    CH
    D

    cA cB ... cE cF cG cH cI cJ cK ... cZ
    d

    Fine. I complete the work and go on my merry way.

    After I'm long gone, a shipment of Scottish soldiers arrives at the UK's port of Gibraltar. This ship is full of people with last names like McHenry, McTavish, McDonald, ...

    Some of these lads decide to leave Gibraltar and intermingle with the native señoritas, marrying and starting families. And getting phone numbers, and listings in the directory I created.

    All very well and good. My algorithm even sorts McHenry correctly between McDonald and McTavish!

    But then one day someone decides to uppercase all the data...

    Pobre Sra. McHenry! She now sorts differently...

    McDonald
    McHenry
    McTavish

    vs.

    MCDONALD
    MCTAVISH
    MCHENRY

    ¡Ay, dios mio!
  • Well, this was indeed a colorful spelling out of the underlying scenario. :-)
