If at first you don't succeed, there's probably still a bug

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

If at first you don't succeed, there's probably still a bug

  • Comments 38

Back near the beginning of the month, when I posted about how Everybody's doing the wraparound...., I talked about how the particular implementation of multiple combining diacritic weights led to an interesting situation on Windows where the secondary, a.k.a. DW or 'diacritic weight', weight would wrap around.

Regular reader Maurits suggested a real problem that seemed to be possible:

Hmmm... I just thought of a potentially serious problem with the third method (wrap.)  What if the value wraps to 0xfe or 0xfd?  Then when you add 2, you get 01 or (worse?) 00.

In particular:

"e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1.

"e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0.

On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?

This was before he had actually tested things out -- once he did, his report was slightly different:

I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000)

Maybe I'm trying to solve a problem that doesn't exist?

Well, let's back up a second....

He is indeed right, the sort keys never seem to get created with bogus 0x00 or 0x01 weights inserted (therefore the bug Atsushi Enomoto reported still stands alone as the bug that inserts one of these byte sequences when it is not expected!).

(That was another bug found through blogging!)

But that does not mean there is no bug in this case.

Now I suspected there might be a bug here after Maurits reported the original problem -- and after looking at it  tonight I managed to find one....

By carefully applying Tester's Axiom #5 and the knowledge that there is a potentially incorrect issue with the implementation, what can we find out here?

Well, if you look at How to track down collation bugs from late last year, you will notice a couple of things:

  • Most of the points in that post do not apply, and
  • Point #12 leads to some interesting results

Take the following sample code:

using System;
using System.Globalization;

namespace SortTest {
    class Class1 {
        [STAThread]
        static void Main(string[] args) {
            CompareInfo ci = CompareInfo.GetCompareInfo(0x007f);
            string stOld= "e";

            for(int i = 0; i < 7; i++) {
                string st = "e\u0323\u0323" + ((char)(0x031d + i)).ToString();
                Console.Write(string.Compare(stOld, st));
                Console.Write("\t\t");
                Console.WriteLine(SortKey.Compare(ci.GetSortKey(stOld),
ci.GetSortKey(st)));
                stOld = st;
            }
        }
    }
}

The output of this little function (which should be two rows of numbers, where each row contains two values equal to each other) is worth 1000 words....

-1              -1
-1              -1
-1              1
-1              0
-1              0
-1              -1
-1              -1

What are the sort key values here? They are:

\u0065\u0323\u0323\u031d        0e 21 01 fe 01 01 01 00
\u0065\u0323\u0323\u031e        0e 21 01 ff 01 01 01 00
\u0065\u0323\u0323\u031f        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0320        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0321        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0322        0e 21 01 03 01 01 01 00
\u0065\u0323\u0323\u0323        0e 21 01 04 01 01 01 00

So, it looks like functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results, while LCMapString for sort keys and its managed equivalent will, in an effort to avoid the problem with embedded invalid bytes will occasionally drop them.

Three bytes are dropped:

  • 0x00 -- reserved for the end of the string;
  • 0x01 -- reserved for a sentinel between subsections of the sort key;
  • 0x02 -- reserved for the minimal weight with no diacritics, which is dropped when it is not needed as a placeholder for later secondary weights that greater than 0x02.

As soon as I read that comment, I knew there was a bug here. :-)

(Yet another bug found through blogging!)

 

This post brought to you by " ̣" (U+0323, a.k.a. COMBINING DOT BELOW)

Comment on the blather
Leave a Comment
  • Please add 3 and 1 and type the answer here:
  • Post
Blog - Comment List
  • how about this for a T-shirt: http://tinyurl.com/rjxyt
  • I actually own it. But I would not call it a T-shirt for *this* club! :-)
  • Ah, good point.

    Maybe this then: http://tinyurl.com/rzag5 :)
  • grrr... stupid hotlink restrictions.
    http://www.geocities.com/mvaneerde/einstein.jpg
  • Ok, that is flattering. :-)

    I am not sure I could really wear it, but if anyone walked up to me wearing that on a T-shirt I would certainly (assuming my head didn't explode!) try to buy that person a drink....
  • > attested single language usage of both orders of two diacritics on a letter

    In physics, a'^ and a^' are two different things[1], but that probably doesn't count as a language.

    Edward John Garrett, in his paper about the IPA[2] gives the tantalizing sentence "the order of diacritics is *often* irrelevant" (emphasis mine)

    ... but fails to give a counterexample as to when the order of diacritics *is* relevant.  That doesn't mean there isn't a counterexample, but it would have been nice if he'd been explicit.

    *ugh*

    It would be much better if diacritic order *did* matter and there was a clear case we could point to.  This feels like a trap... (1) invest in a structure of commutative diacritics; (2) five years later some language comes up with a diacritic construct that needs to be uncommutative; (3) tear down the structure and build a new one at terrible cost.

    [1] a'^ is the direction an object is moving (') in three-dimensional space, normalized (^) to a unit vector.  a^' is the direction (') an object APPEARS to be moving when viewed from the origin (^).  If an object is at (1, 0, 0) and is moving away from the origin, a'^ is (1, 0, 0) and a^' is (0, 0, 0).
    [2] http://altiplano.emich.edu/~egarrett/papers/ejg_CICLing2005.pdf
  • Hi Maurits,

    No need to tear anything down -- any time a language has a requirement, we can customize the sort under that language's LCID to mak sure that there would be no break....

    So no need to tear down anything. :-)
  • Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The

Page 3 of 3 (38 items) 123