Wednesday, September 12, 2007 12:16 AM
Michael S. Kaplan
A&P of Sort Keys, part 2 (aka The string that won? Didn't have a mark on him!)
Previous posts in this series:
Because the nature of comparisons is such that the essentially question one is asking is which comes first?, it is very natural to think of it as a competition, almost a fight.
Of course the order is deterministic (minus the odd bug or two) so one could claim that to the extent it is a competition, the game is somewhat rigged. :-)
I could make the claim that the occasional a < b, b < c, c < a scenario is just an attempt to be fair and make everyone a winner, but I am pretty sure that neither Brandon non Jun (both devs on the SQL team) would buy it....
Anyway, given that it can be thought of as a fight, it is only natural that you might expect the loser of the fight to be a bit bruised, to have some marks on himself.
Which brings us to our test strings today:
aaá (or aaá)
aáa (or aáa)
áaa (or áaa)
aaa
The strings are basically various random combinations of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) on the left and the canonically equivalent versions made with U+0061 (LATIN SMALL LETTER A) and U+00e1 (LATIN SMALL LETTER A WITH ACUTE) on the right. The two strings in each row will have identical sortkey values (if they don't it is a bug), and the sort keys look like this (secondary a.k.a. diacritic weights are red):
0e 02 0e 02 0e 02 01 02 02 0e 01 01 01 00
0e 02 0e 02 0e 02 01 02 0e 01 01 01 00
0e 02 0e 02 0e 02 01 0e 01 01 01 00
0e 02 0e 02 0e 02 01 01 01 01 00
Okay, so since I did it all with the default table, the primary weights are identical for all four strings. In order to provide the proper ordering of the strings, the secondary weight for the ACUTE is that 0e (actually 02 + 0c, the minimal weight is always added in there), and when a letter in the string does not have a diacritic weight, a minimal weight of 02 has to be put in as a placeholder so that the proper comparison can be made depending on where the acute actually is.
When there are no additional weights, the extra padding is not needed -- which saves some space.
Any time a binary comparison (e.g. memcmp) of two of the byte arrays is done, it will return the same results as if the full string comparison is happening.
And if you have multiple diacritics, the weights are simply added together -- thus when looking at:
s (LATIN SMALL LETTER S)
ș (LATIN SMALL LETTER S + COMBINING COMMA BELOW)
ś (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT)
ș́ (LATIN SMALL LETTER S + COMBINING COMMA BELOW + COMBINING ACUTE ACCENT)
ș́ (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT + COMBINING COMMA BELOW)
The weights:
0e 91 01 01 01 01 00
0e 91 01 5b 01 01 01 00
0e 91 01 0e 01 01 01 00
0e 91 01 67 01 01 01 00
0e 91 01 67 01 01 01 00
are figured out with easy addition operations:
-
the comma below has a weight of 59 (that plus the minimal 02 weight makes 5b)
-
the acute has a weight of 0c (that plus the minimal 02 weight makes 0e)
-
the two diacritics combined have a weight of 65 (0c plus 59 plus the minimal 02 weight makes 67)
-
no matter what order the diacritics are in, they have the same weight -- note that in this case only one of the last two strings is valid, due to canonical reordering rules in Unicode. Can you guess which one is valid and why?
Easy, right?
Now, there is the occasional bug like:
But for normal strings (in normalization form C pre Vista, any form Vista and beyond), the method works quite well.
Looking back to Brett's original question from Part 0, we are starting to build up some interesting rules related to sort key size:
-
As described in
Part 0, every sort key has a fixed overhead of five bytes per sortkey;
-
According to
Part 1, most of Unicode will give you two bytes of primary weight per character;
-
As described above, no matter how many diacritics you pile on a character, it will never add more than one byte per character;
-
As also described above, the need to pad in order to have the diacritic weights line up can actually add one additional byte per character preceding one with a diacritic.
So far, that last point is probably the first one that seems potentially concerning -- imagine a string 2mb in size with the last character only having a diacritic -- your sort key would be 2mb bigger!
Luckily such sort keys with mostly duplicated values can be readily compressed.... :-)
So how are we doing? Is it getting interesting yet?
If not, then just stay tuned for Part 3, coming up tomorrow!
This post brought to you by 2 (U+0032, a.k.a. DIGIT TWO)