Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Previous posts in this series:
Today's contribution to this series is one that has been the source of several misunderstandings.
It has also led to more than a few bug reports, though in truth most of the bugs were duplicates of the very same issue. :-)
It is about certain types of punctuation, and the impact of WORDSORT (as opposed to STRINGSORT) in linguistic comparisons.
I first talked about the feature in A few of the gotchas of CompareString and I will quote the relevant bit here:
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file: //// Sorting Flags.//// WORD Sort: culturally correct sort// hyphen and apostrophe are special cased// example: "coop" and "co-op" will sort together in a list//// co_op <------- underscore (symbol)// coat// comb// coop// co-op <------- hyphen (punctuation)// cork// went// were// we're <------- apostrophe (punctuation)////// STRING Sort: hyphen and apostrophe will sort with all other symbols//// co-op <------- hyphen (punctuation)// co_op <------- underscore (symbol)// coat// comb// coop// cork// we're <------- apostrophe (punctuation)// went// were//
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:
//// Sorting Flags.//// WORD Sort: culturally correct sort// hyphen and apostrophe are special cased// example: "coop" and "co-op" will sort together in a list//// co_op <------- underscore (symbol)// coat// comb// coop// co-op <------- hyphen (punctuation)// cork// went// were// we're <------- apostrophe (punctuation)////// STRING Sort: hyphen and apostrophe will sort with all other symbols//// co-op <------- hyphen (punctuation)// co_op <------- underscore (symbol)// coat// comb// coop// cork// we're <------- apostrophe (punctuation)// went// were//
The reasons for this feature are fairly clear -- in many contexts, the default word sorting is more useful and much more intuitive.
Of course among the reasons that it is not always expected:
The most recent report of the first issue (by far the most common) was just a week ago, and it led to ond of the developers over in MSN concluding that:
It just shows how evil String.Compare is (for being completely counter-intuitive.)
Hard to argue with that!
It almost makes one want to do something smarter in the function to try and detect the two cases and handle them differently -- perhaps not too hard since the actual cases are so very different?
Plus it would be cool to add a SORT_SMARTSORT constant here. :-)
Let's take a step back and see what the sort keys say.
Remmeber that sort keys are only run on a single string and are thus not subject to the "a < b but b < a" type bugs that can lead to real problems. Although as a general principle any time transitivity is not there or when behavior is different between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY deciding which one is wrong can vary, in practice it is usually not the sort keys -- thus they make the best baseline for us, functionally1.
(WS) -0.67:-0.33:0.33 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 1b 06 82 00(WS) -0.67:0.33:-0.33 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 2f 06 82 00(WS) 0.67:-0.33:0.33 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 1b 06 82 00(SS) -0.67:-0.33:0.33 06 82 0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00(SS) -0.67:0.33:-0.33 06 82 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 06 82 0c 03 07 33 0c 46 0c 46 01 01 01 01 00(SS) 0.67:-0.33:0.33 0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00
Clearly, there is a specific preferred ordering for the sort keys based on the earlier hyphen trumping the later one, so in choosing which CompareString/CompareStringEx to prefer, one has a way to go. This is thankfully much less controversial of a decision than when one has to choose between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY, since in those cases clients like SQL Server are already contending with index corruption due to the "not so very transitive" results, so as long as one is correct and the other is treated as a bug, they can fix it without waiting for new collations, etc.
At this point, many people thinking about the fillers one sees in the DW and CW values will wonder exactly how these punctuation weights are being defined. I'll give you a very big hint:
-012345 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 07 06 82 000-12345 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 0b 06 82 00012-345 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 13 06 82 000123-45 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 17 06 82 0001234-5 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1b 06 82 00012345- 0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1f 06 82 00
See what is going on? It still is acting as a filler -- it is just compressing it some. This is much less feasible to do for DW and CW portions of the weight given the additive nature of the values in them, but at least you know punctuation does not waste too much space in sort keys!
There are other characters that have this special weight value; to find them all, you could grab the sort key of each character from 0x0000 to 0xFFFF when the SORT_STRINGSORT flag is included -- these "special punctuation" values all have a first byte value of 06.
In Vista, these values (in weight order, with their other weights present as well) are:
CODEPOINT SM AW DW CW COMMENT0x0001 6 3 2 2 ;Start Of Heading0x0002 6 4 2 2 ;Start Of Text0x0003 6 5 2 2 ;End Of Text0x0004 6 6 2 2 ;End Of Transmission0x0005 6 7 2 2 ;Enquiry0x0006 6 8 2 2 ;Acknowledge0x0007 6 9 2 2 ;Bell0x0008 6 10 2 2 ;Backspace0x000e 6 11 2 2 ;Shift Out0x000f 6 12 2 2 ;Shift In0x0010 6 13 2 2 ;Data Link Escape0x0011 6 14 2 2 ;Device Control One0x0012 6 15 2 2 ;Device Control Two0x0013 6 16 2 2 ;Device Control Three0x0014 6 17 2 2 ;Device Control Four0x0015 6 18 2 2 ;Negative Acknowledge0x0016 6 19 2 2 ;Synchronous Idle0x0017 6 20 2 2 ;End Of Transmission Block0x0018 6 21 2 2 ;Cancel0x0019 6 22 2 2 ;End Of Medium0x001a 6 23 2 2 ;Substitute0x001b 6 24 2 2 ;Escape0x001c 6 25 2 2 ;File Separator0x001d 6 26 2 2 ;Group Separator0x001e 6 27 2 2 ;Record Separator0x001f 6 28 2 2 ;Unit Separator0x007f 6 29 2 2 ;Delete0x0027 6 128 2 2 ;Apostrophe-Quote0xff07 6 128 2 3 ;Fullwidth Apostrophe-Quote0x07F4 6 129 2 2 ;NKO HIGH TONE APOSTROPHE0x07F5 6 129 20 2 ;NKO LOW TONE APOSTROPHE0x002d 6 130 2 2 ;Hyphen-Minus0xff0d 6 130 2 3 ;Fullwidth Hyphen-Minus0xfe63 6 130 2 8 ;Small Hyphen-Minus0x2212 6 131 2 2 ;Minus Sign0x208b 6 131 2 4 ;Subscript Hyphen-Minus0x207b 6 131 2 14 ;Superscript Hyphen-Minus0x2010 6 132 2 2 ;Hyphen0x058a 6 132 21 2 ;Armenian Hyphen0x2011 6 133 2 2 ;Non-Breaking Hyphen0x2027 6 134 2 2 ;Hyphenation Point0x2043 6 135 2 2 ;Hyphen Bullet0x2012 6 136 2 2 ;Figure Dash0x2013 6 137 2 2 ;En Dash0xfe32 6 144 2 12 ;Glyph For Vertical En Dash0x2014 6 144 21 2 ;Em Dash0xfe58 6 144 21 8 ;Small Em Dash0xfe31 6 144 21 12 ;Glyph For Vertical Em Dash0x2015 6 146 2 2 ;Quotation Dash0x301c 6 147 2 2 ;Wave Dash0x3030 6 148 2 2 ;Wavy Dash
The control characters are there for compatibility with prior versions and are used by some very low level pieces of Windows as sentinels (when we triedd to change the values we qwuickly made the system unbootable!).
The only other thing to note is that 08 case weight bug I mentioned in Part 3 is here too for a few characters. Did anyone from the NLS test team put that bug in yet? :-)
As a bonus, can anyone explain why it doesn't matter in the default case, and under what combination of circumstances it would matter? I'll put the answer in a comment eventually if no one else figures it out....
1 - The de facto decision to consider sort key results to be definitive is quite ironic given that SQL Server2 uses CompareString to build its sort keys.2 - It is fascinating though perhaps not unexpected to note that the single biggest reporter of CompareString anomalies is in fact SQL Server.
This post brought to you by 9 (U+0039, a.k.a. DIGIT NINE)
Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The
Lori asked: I’m seeing results with VB’s StrComp function that I would not expect. For example: StrComp("Lee-P",
The first blog in this series was On reversing the irreversible (the introduction) . First we'll lay
So the other day I was in one of those email threads again. One of those threads that makes me wonder