The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen

  • Comments 19

William Hooper pointed me to an interesting bug report:

Help Strings in .NET not always transisitve!!!!!!

We need someone to help us!!!!!!

>>>>>>>>>>>>>>>>>>>>>>

https://connect.microsoft.com/feedback/viewfeedback.aspx?FeedbackID=236900&wa=wsignin1.0&siteid=210

>>>>>>>>>>>>>>>>>>>>>>

   string s1 = "-0.67:-0.33:0.33";
   string s2 = "0.67:-0.33:0.33";
   string s3 = "-0.67:0.33:-0.33";
   Console.WriteLine(s1.CompareTo(s2));
   Console.WriteLine(s2.CompareTo(s3));
   Console.WriteLine(s1.CompareTo(s3));

Bug in the hypen logic when there are two hyphens in string...

Now I have talked about string sort vs. word sort in the past. And this is a problem which appears in both Windows and the .NET Framework, in all versions -- and has for all the versions I checked.

Even more interestingly, if you reverse all three comparisons the bug still happens (it is common in these kinds of issues when they come up for there to be a difference!).

Let's look at the sort keys to see how things look for the three strings:

0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 1b 06 82 00
0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 1b 06 82 00
0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 2f 06 82 00 

 Now let's focus on that end piece that contains the hyphen info, since the main parts of the string are all identical:

80 07 06 82 80 1b 06 82 00
80 1b 06 82 00
80 07 06 82 80 2f 06 82 00 

 Now obviously from a sort key standpoint, s1 < s3 < s2, always. Which is obviously superior to s1 > s2 > s3 > s1 any day of the week, and which points to comparison that has the bug in it -- the comparison of s1 vs s2:

CompareInfo ci = CompareInfo.GetCompareInfo(0x409);
string st1 = "-0.67:-0.33:0.33";
string st2 = "0.67:-0.33:0.33";
string st3 = "-0.67:0.33:-0.33"; 
SortKey sk1 = ci.GetSortKey(st1);
SortKey sk2 = ci.GetSortKey(st2);
SortKey sk3 = ci.GetSortKey(st3);

Console.WriteLine(ci.Compare(st1, st2));
Console.WriteLine(ci.Compare(st2, st3));
Console.WriteLine(ci.Compare(st1, st3));
Console.WriteLine();
Console.WriteLine(SortKey.Compare(sk1, sk2));
Console.WriteLine(SortKey.Compare(sk2, sk3));
Console.WriteLine(SortKey.Compare(sk1, sk3));

The results?

1
1
-1

-1
1
-1 

Definitely a bug in the string comparison, one that has been around for a long time. Perhaps a better reason than ever to keep in mind the importance of linguistically meaningful strings when one is using the NLS/NLS+ collation support? Clearly the intent of the word sort is to handle an entirely different class of usage of U+002d -- basically it is meant to handle the hyphen, not the minus. A meaningful comparison of these strings would have to be based on the underlying source of the numeric data.

Do we have an NLS tester in the house? Let's make sure that Connect bug makes its way to me when it gets ported. And I suppose we need one for v.Next of Windows, too....

It might even be worth treating U+2212 (MINUS SIGN) as different than U+002d (HYPHEN-MINUS), though as a mitigation it won't help since most methods of converting a minus to a string come up with U+002d, not U+2212. Just as most hyphens do not end up as U+2010 (HYPHEN). Though of course the best fit mappings for both characters usully puts them as U+002d. Changing all of those string creating definitions would probably create even more problems with compatibility. :-(

 

This post brought to you by - (U+002d, a.k.a. HYPHEN-MINUS)

Comment on the blather
Leave a Comment
  • Please add 5 and 3 and type the answer here:
  • Post
Blog - Comment List
  • So how will you work around this ?

    Add code to check if a x2d is followed by 0-9 ? and treat it as a x2212 ?

  • At some level of the code you know that you are passing non-linguistic strings; the key is to know when to not be calling CompareString or its ilk.

  • Look at the sort keys for the three strings again. There are four 01 bytes in front of the end piece that contains the hyphen info.

    According to my local MSDN documentation the sort key is stored as an array of byte values in the following format:

    [all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00

    Online documentation (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_5s2v.asp) specifies another format:

    [all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x01 0x00

    Note one more 0x01.

    Could it be that both documentation sources are incorrect and the format is:

    [all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 0x01 [all Special weights] 0x00

    Can you make it clear?

  • Actually, it is just the punctuation weights are separate outside of the other special weights....

  • Then, do I understand it correctly that the format is:

    [all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 ["Other" Special weights] 0x01 [Punctuation weights] 0x00 ?

    Can you give an example of the string that has "other" special weights piece in the sort key? Does punctuation weights piece contain information only about hyphens and apostrophes?

  • Yes, when I talked about "punctuation weights" I was referring to those hyphen/apostrophe characters affected by word sorts. Those are not "special weights".

    Special weights are something you'll see when you deal with a lot of Kana....

  • Then example string could be SO-DIMMソケット×2.

    The sort key for this string is:

    0e 91 0e 7c 0e 1a 0e 32 0e 51 0e 51 22 16 22 0d

    22 1c 22 1e 08 1c 0c 33 01 01 12 12 12 12 12 12

    01 c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff 01 80 0f 06

    82 00

    In this key "Special weights" piece is:

    c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff

    and the "punctuation weights" piece is:

    80 0f 06 82

    And the LCMapString with the LCMAP_SORTKEY flag stores a sort key in the buffer, as an array of byte values in the following format:

    [all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x01 [Punctuation weights] 0x00

    "Punctuation weights" piece is specific for WORD sort and contains

    information about hyphen/apostrophe characters.

    Is this correct description now?

  • Yes, it is.

  • Well, apologies to Dana Carvey and all, but actually, it isn't! The other day when I talked about The

  • Different products have different ways of describing what is essentially the same behavior. Part of the

  • Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The

  • I have just spent a whole day chasing a problem where data from an SQL server database was sorted differently by SSIS Sort and SQL Server 2005.

    In both cases I have used the same collation (finnish_swedish_cs_as) Since this is a windows collation the sorts produced by SQL server should (according to SQL Server documentation) be compatible with sorts produced by CompareString.

    After lots of investigation I have identified that the reason for the problem is that SSIS uses CompareString with the above bug, but that SQL server uses something else that has fixed the bug.

    If you ever get a chance to fix this problem in CompareString, please make sure that the result is compatible with the SQL server implementation.

    The strings i have had problems with are:

    a="MX97_5_OE-SMC-24/45"

    b="MX97_5_OESMC-24/45"

    SQL server thinks that a<b while CompareString thinks that a>b

  • Not exactly ontopic here, but version of Windows where the SSIS code is running would also be needed to pin this down....

  • I have tested this on XP SP2 and Server 2003 R2. SQL Server 2005

  • The differences here between Windows and SQL Server are more or less expected since the SQLS tables are based on data from Windows 2000 pre-Beta2 and the XP/Server 2003 is based on data from a half  decade later that was not based on  non-shipping Windows collation data....

    The most comprehensively compatible results between Windows and SQLS are to be found in Windows Server 2008 and SQL Server 2008, as the latter was updated and put largely into sync with the former.

Page 1 of 2 (19 items) 12