Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Back near the beginning of the month, when I posted about how Everybody's doing the wraparound...., I talked about how the particular implementation of multiple combining diacritic weights led to an interesting situation on Windows where the secondary, a.k.a. DW or 'diacritic weight', weight would wrap around.
Regular reader Maurits suggested a real problem that seemed to be possible:
Hmmm... I just thought of a potentially serious problem with the third method (wrap.) What if the value wraps to 0xfe or 0xfd? Then when you add 2, you get 01 or (worse?) 00. In particular: "e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1. "e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0. On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?
This was before he had actually tested things out -- once he did, his report was slightly different:
I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000) Maybe I'm trying to solve a problem that doesn't exist?
Well, let's back up a second....
He is indeed right, the sort keys never seem to get created with bogus 0x00 or 0x01 weights inserted (therefore the bug Atsushi Enomoto reported still stands alone as the bug that inserts one of these byte sequences when it is not expected!).
(That was another bug found through blogging!)
But that does not mean there is no bug in this case.
Now I suspected there might be a bug here after Maurits reported the original problem -- and after looking at it tonight I managed to find one....
By carefully applying Tester's Axiom #5 and the knowledge that there is a potentially incorrect issue with the implementation, what can we find out here?
Well, if you look at How to track down collation bugs from late last year, you will notice a couple of things:
Take the following sample code:
using System;using System.Globalization;namespace SortTest { class Class1 { [STAThread] static void Main(string[] args) { CompareInfo ci = CompareInfo.GetCompareInfo(0x007f); string stOld= "e"; for(int i = 0; i < 7; i++) { string st = "e\u0323\u0323" + ((char)(0x031d + i)).ToString(); Console.Write(string.Compare(stOld, st)); Console.Write("\t\t"); Console.WriteLine(SortKey.Compare(ci.GetSortKey(stOld), ci.GetSortKey(st))); stOld = st; } } }}
The output of this little function (which should be two rows of numbers, where each row contains two values equal to each other) is worth 1000 words....
-1 -1-1 -1-1 1-1 0-1 0-1 -1-1 -1
What are the sort key values here? They are:
\u0065\u0323\u0323\u031d 0e 21 01 fe 01 01 01 00\u0065\u0323\u0323\u031e 0e 21 01 ff 01 01 01 00\u0065\u0323\u0323\u031f 0e 21 01 01 01 01 00\u0065\u0323\u0323\u0320 0e 21 01 01 01 01 00\u0065\u0323\u0323\u0321 0e 21 01 01 01 01 00\u0065\u0323\u0323\u0322 0e 21 01 03 01 01 01 00\u0065\u0323\u0323\u0323 0e 21 01 04 01 01 01 00
So, it looks like functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results, while LCMapString for sort keys and its managed equivalent will, in an effort to avoid the problem with embedded invalid bytes will occasionally drop them.
Three bytes are dropped:
As soon as I read that comment, I knew there was a bug here. :-)
(Yet another bug found through blogging!)
This post brought to you by " ̣" (U+0323, a.k.a. COMBINING DOT BELOW)
Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The