Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
This is a question with a true 'drive on a parkway, but park on a driveway' feel to it, but one that I have been asked by many people.
If you look at the surrogate range and its definition in Section 3.8 of the Unicode Standard:
3.8 Surrogates
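D25 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
D25a High-surrogate code unit: A 16-bit code unit in the range D800 to DBFF (hexadecimal), used in UTF-16 as the leading code unit of a surrogate pair.
D26 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
D26a Low-surrogate code unit: A 16-bit code unit in the range DC00 to DFFF (hexadecimal), used in UTF-16 as the trailing code unit of a surrogate pair.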
• High-surrogate and low-surrogate code points are designated only for that use.
you may not find the conformance definitions to be too terribly useful here -- they confirm what we already knew. So what is the story?
Well, a lot of it has to do with the way human beings try to equate what we understand to what we do not.
I remember trying to explain our collation weighting system to someone, and the way we gave the items that come earlier 'less weight' so that they come first. He was confused because he was thinking about it like an indicator that went from 0 to 100 -- the items you wanted to have first would thus be 'heavier' so they would sink to the bottom of the list.
Now this person was not 'wrong' conceptually, it was just that his model did not match ours. :-)
So it is with the high and low surrogates. The high ones, which come first any time you have a legal surrogate pair, were assigned first. Since they were assigned earlier in the range of possible code points, they have lower numbers (0xd800 is a lower number than 0xdc00 in any computer language I have ever heard of). But no one was really thinking about the low/high surrogate thing in terms of code point values; they were thinking of the 'high that comes first' instead.
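For anyone who would rather see the numbers than take my word for it, here is a minimal Python sketch of how a supplementary code point splits into a surrogate pair. The function names (`to_surrogate_pair`, `is_high_surrogate`, `is_low_surrogate`) are made up for illustration and are not from any library; the arithmetic is the standard UTF-16 encoding, and the sample character is U+1D11E (MUSICAL SYMBOL G CLEF).

```python
def is_high_surrogate(unit):
    # Leading code unit of a pair: 0xD800..0xDBFF
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Trailing code unit of a pair: 0xDC00..0xDFFF
    return 0xDC00 <= unit <= 0xDFFF


def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits go in the unit that comes first
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits go in the unit that comes second
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))             # 0xd834 0xdd1e
print(high < low)                      # True: the 'high' one has the smaller value
```

The 'high' surrogate comes first in the pair and yet is always the numerically smaller of the two, which is exactly the mismatch people trip over.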
In case you are still rebelling against the conceptual disconnect, keep in mind that people say "WE'RE #1" to mean that they have a higher ranking than #2 and #3 and so on, despite the fact that the numbers are lower. That may help people to see that we each have our own assumptions about ranking and ordering that do not always use the same model....