Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
The other day, a developer named Stephanie sent me an email about compressions (these are used in collation when two or more characters are given a single sort weight -- the Unicode Collation Algorithm calls their analagous construction a contraction, in part to avoid confusion with other meanings of the term compression that are described in Unicode). She had just read Dr. International's description of the difference between Traditional and Modern Spanish here, and asked:
I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepencies? Also, why wouldn't one of these be an alternate sort?
I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepencies?
Also, why wouldn't one of these be an alternate sort?
Stephanie, you are right -- every compression we define for a cased script we handle the UU, UL, and LL forms, but we skip the LU form.
This was originally a point of confusion for me as well, but Cathy Wissink set my straight back in the early days when she pointed out to me that words may be ALL CAPS or they may be all lowercase and they may be Initial caps, but there is in most languages not a pattern that has capital letters in the middle of text that is not capitalized. The convention we use for compressions is designed to take this reality into account and handle the expected cases while discarding the one that is unexpected.
The Dr. International article isn't wrong here, though. I will often speak of a compression by just naming the one form when I mean all three forms; it is just a convenient way to express what compressions exist for a language, or a particular sort within a language.
As to your final concern, I agree with you -- there ought to be an alternate sort used here. I actually even pointed this out in the past (described here). The truth is that alternate sorts did not exist then. They were added specifically in the postmortem over handling this issue with Spanish!
This post brought to you by "ש" (U+05e9, a.k.a. HEBREW LETTER SHIN)
A couple of months ago I got a phone call from someone at Addison-Wesley who wanted to send me some books.
You may recall last week when I mentioned that In any CASE, it is somewhat INSENSITIVE to point out to
Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The
The first blog in this series was On reversing the irreversible (the introduction) and the second was