If at first you don't succeed, there's probably still a bug

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

If at first you don't succeed, there's probably still a bug

  • Comments 38

Back near the beginning of the month, when I posted about how Everybody's doing the wraparound...., I talked about how the particular implementation of multiple combining diacritic weights led to an interesting situation on Windows where the secondary, a.k.a. DW or 'diacritic weight', weight would wrap around.

Regular reader Maurits suggested a real problem that seemed to be possible:

Hmmm... I just thought of a potentially serious problem with the third method (wrap.)  What if the value wraps to 0xfe or 0xfd?  Then when you add 2, you get 01 or (worse?) 00.

In particular:

"e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1.

"e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0.

On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?

This was before he had actually tested things out -- once he did, his report was slightly different:

I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000)

Maybe I'm trying to solve a problem that doesn't exist?

Well, let's back up a second....

He is indeed right, the sort keys never seem to get created with bogus 0x00 or 0x01 weights inserted (therefore the bug Atsushi Enomoto reported still stands alone as the bug that inserts one of these byte sequences when it is not expected!).

(That was another bug found through blogging!)

But that does not mean there is no bug in this case.

Now I suspected there might be a bug here after Maurits reported the original problem -- and after looking at it  tonight I managed to find one....

By carefully applying Tester's Axiom #5 and the knowledge that there is a potentially incorrect issue with the implementation, what can we find out here?

Well, if you look at How to track down collation bugs from late last year, you will notice a couple of things:

  • Most of the points in that post do not apply, and
  • Point #12 leads to some interesting results

Take the following sample code:

using System;
using System.Globalization;

namespace SortTest {
    class Class1 {
        [STAThread]
        static void Main(string[] args) {
            CompareInfo ci = CompareInfo.GetCompareInfo(0x007f);
            string stOld= "e";

            for(int i = 0; i < 7; i++) {
                string st = "e\u0323\u0323" + ((char)(0x031d + i)).ToString();
                Console.Write(string.Compare(stOld, st));
                Console.Write("\t\t");
                Console.WriteLine(SortKey.Compare(ci.GetSortKey(stOld),
ci.GetSortKey(st)));
                stOld = st;
            }
        }
    }
}

The output of this little function (which should be two rows of numbers, where each row contains two values equal to each other) is worth 1000 words....

-1              -1
-1              -1
-1              1
-1              0
-1              0
-1              -1
-1              -1

What are the sort key values here? They are:

\u0065\u0323\u0323\u031d        0e 21 01 fe 01 01 01 00
\u0065\u0323\u0323\u031e        0e 21 01 ff 01 01 01 00
\u0065\u0323\u0323\u031f        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0320        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0321        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0322        0e 21 01 03 01 01 01 00
\u0065\u0323\u0323\u0323        0e 21 01 04 01 01 01 00

So, it looks like functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results, while LCMapString for sort keys and its managed equivalent will, in an effort to avoid the problem with embedded invalid bytes will occasionally drop them.

Three bytes are dropped:

  • 0x00 -- reserved for the end of the string;
  • 0x01 -- reserved for a sentinel between subsections of the sort key;
  • 0x02 -- reserved for the minimal weight with no diacritics, which is dropped when it is not needed as a placeholder for later secondary weights that greater than 0x02.

As soon as I read that comment, I knew there was a bug here. :-)

(Yet another bug found through blogging!)

 

This post brought to you by " ̣" (U+0323, a.k.a. COMBINING DOT BELOW)

Comment on the blather
Leave a Comment
  • Please add 5 and 2 and type the answer here:
  • Post
Blog - Comment List
  • It seems at least one person has posted on how to use combining characters to decorate otherwise plaintext web posts:

    How to create s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶ and u̲n̲d̲e̲r̲l̲i̲n̲e̲d̲ text
    http://www.epinions.com/content_3826622596

    (As I post this, "strikethrough" in the title above is -struck-through-, and "underlined" is _underlined_)

    That raises the danger level somewhat.

    For example, this almost seems sensible (from list of 747 above:)
    229)
    e
    0x301 COMBINING ACUTE ACCENT
    0x332 COMBINING LOW LINE
    0x345 COMBINING GREEK YPOGEGRAMMENI
    wraps to 1

    Greek letters can have accents; and small following iotas;* and if this were in an epinions post by someone who read the underline tutorial, it could be underlined.

    If there were another DW character later in the string (or the word,) the sort key for the post (or the word) would have an embedded 1.

    If epinions' full-text search engine used sort keys, the bug would be triggered.

    * U+1FF4 GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI http://www.fileformat.info/info/unicode/char/1ff4/index.htm

    So here it is: ῴ̲

    Make a string with that text element, and another accented character anywhere after it, and you get a sort key with five 1's.

    String: \x03c9\x0301\x0345\x0332 \x03c9\x0301
    Sort key: (f 19 7 2 f 19) 1 [1 2 e] 1 1 1 0
  • I thought you were done?

    Told you.... :-)

    However, strikethough and underlining is for higher level markup, not plain text. And the need to not only misuse Unicode by doing such things but to need to index it with sort keys is a fair indication of piling on misuse of multiple technologies....

    It all gets back to that "meaningful string" issue. Truly!
  • 'K, found another one.

    615)
    e
    0x305 COMBINING OVERLINE
    0x32c COMBINING CARON BELOW
    0x332 COMBINING LOW LINE
    wraps to 0

    Hear me out.

    Suppose someone wants to use a username that contains a COMBINING CARON BELOW.  (Why? who knows.)

    And let's suppose they stumble onto this "HOWTO: get a box around your name:"
    http://www.mpcforum.com/showthread.php?t=105521

    And let's further suppose that the caron-below is on any character in their username but the last one.

    EVERY username that satisfies those three conditions trips the bug, because all of the characters after the one with the caron have DW.
  • My name in a box: [M̬̅a̬̅u̬̅r̬̅i̬̅t̬̅s̬̅]
  • That would have worked out better if I'd used LOW LINE instead of CARON BELOW (oops)

    Take two: [M̲̅a̲̅u̲̅r̲̅i̲̅t̲̅s̲̅]
  • Okay, you have convincingly established what I have already said -- if people do stupid things, then bad things will happen.

    Easy fix -- don't do stupid things!

    You can find as many examples as you want, but I am not sure it actually helping anything, or making anything more readable or understandable....

    Try to focus on ACTUAL TEXT needed for language support, without these IMHO lame websites that give bad advice about trying to put in plain text that which belongs in rich text.

    This is what the weights were optimized for....
  • This is looking too much like a dialog, so for the sake of other readers, I just wanted to enjoy seeing a different name for the commenter!
  • Hi,
    This looks more than a wraparound issue to me. You are doing a wrong thing when you add "values" of combining characters to calculate a sort key.
    For example, two strings U+0063 U+0307 U+030D and U+0063 U+0308 have the same sort key (0E 21 01 13 01 01 01 00). According to the mechanism to calculate sort keys you explained in the earlier entry "Everybody's doing the wraparound....", apparently this has nothing to do with wraparound.
    Moreover, string.Compare returns 0 when you compare these two strings. In other words, string.Compare("e\u0307\u030D", "e\u0308", true, CultureInfo.InvariantCulture) gives 0.
    I obtained these results by a small C# program written with Visual Studio 2005 on Windows XP SP2 Japanese. I am not familier with the .NET CLI environment, but this is not desirable, is it?

    By the way, I have been reading your blog for a few months but this is the first time I post a comment. Let me thank you for writing many interesting topics. I especially like Every Character Has a Story entries.

    - Tsuyoshi
  • Hello Tsuyoshi-san,

    There are of course other issues that come into play here -- in the case you pointed out it is the fact that each combining mark has a specific DW and the numbers under 0307 + 030d happen to equal 0308.

    Now that is not the wraparound issue, but it is the same basic issue with weights.

    I should probably cover the wider topic a bit more, since it is potentially a lot more relevant to people than the admittedly rare wraparound issue....

    Glad you are enjoying the blog! :-)
  • Wow... to solve that issue you'd almost need to give each diacritic its own byte in the sort key.  Might need sentinels anyway to mark multiply-diacritic'd characters so that they sort correctly:
    e(e'^) < (e')(e^) < (e'^)e

    Currently:
    ... 01 [2 2+'+^] 01 ... < ... 01 [2+' 2+^] 01 ... < ... 01 [2+'+^ 2] 01 ...

    But without being able to add diacritics... hmm... sentinels to the rescue...

    ... 01 [2 ff 2 ' ^] 01 ... < ... 01 [ 2+' 2+^] 01 ... < ... 01 [ ff 2 ' ^ 2 ] 01 ...

    where the "ff 2" means:
    ff: this character has multiple diacritics
    2: this character has, in particular, two diacritics
    ' and ^ are the diacritic weights

    The sort key is much longer this way, alas.

    Question: reorder diacritics that are on the same base character in some standard fashion? If so, that would allow e'^ == e^'.
  • Well, luckily that sort of problem stays theoretical since I do not know of any attested single language usage of both orders of two diacritics on a letter. :-)
  • It seems the problem is I did not know what the invariant culture is. I should have searched "invariant culture" on your blog and read any entries before I post. Sorry for a noise.

    My current understanding is that if there is a culture where both "e\u0307\u030D" and "e\u0308" are used, then they have differnt sort keys and they compare as different strings in that culture.

    (And I meant string.Compare("e\u0307\u030D", "e\u0308", false, CultureInfo.InvariantCulture) when I wrote true instead of false.)
  • Hi Tsuyoshi-san,

    I don't consider it noise -- at the very least, what you are pointing out is a more realistic case than most of the "wrap-around" issues, if for no other reason than it is easier to hit.

    Your understanding is correct -- if a language is being supported that makes such a distinction, then we do the work to make sure that collation will support the requirement. But it still makes for an interesting case, I think....
  • Hi Michael, (please call me Tsuyoshi if you like)

    Thank you for the clarification. As you know, I did not know what string.Compare do or what the invariant culture is for when I wrote the first comment. Now I know:
    - string.Compare(string, string, boolean, CultureInfo) performs a linguistic comparison, which means the two strings must be meaningful in the specified culture; otherwise, the result will be unpredictable or at least difficult to interpret.
    - Not every string is considered meaningful in the invariant culture.

    Therefore...
    http://blogs.msdn.com/michkap/archive/2004/12/29/344136.aspx
    > That problem remains to this day, though every single time I speak at a conference or answer a question in a newsgroup or get someone to look at posts like this one, then there is at least one less developer who has this problem. Maybe this time it is you? :-)
    Yay, it's me!

    ...sigh. I will never confuse the linguistic comparison in the invariant locale/culture with the ordinal comparison!
  • Hi Tsuyoshi,

    Great to hear -- welcome to the club!

    (we ought to get shirts or something)
Page 2 of 3 (38 items) 123