"àèìòù" < "äëïöü" but "àèìòù " > "äëïöü"

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


You may remember my post "I need my SPACE, symbolically speaking" from this past March.

There are some interesting consequences of this behavior, which I thought I would talk about a bit further since they have been the subject of several recent bug reports....

Let's take a simple string like

àèìòù (U+00e0 U+00e8 U+00ec U+00f2 U+00f9)

and compare it with

äëïöü (U+00e4 U+00eb U+00ef U+00f6 U+00fc)

Just pass them both to CompareStringW using 0x0409 (English, United States) for the LCID, and you will find that "àèìòù" < "äëïöü". But if you add a space to the end of the first string, then you will see that "àèìòù " > "äëïöü".

Huh? How'd that happen?

Well, let's look at the sort keys of each of the three strings we are looking at here:

"àèìòù"
0e 02 0e 21 0e 32 0e 7c 0e 9f 01 0f 0f 0f 0f 0f 01 01 01 00

"äëïöü"
0e 02 0e 21 0e 32 0e 7c 0e 9f 01 13 13 13 13 13 01 01 01 00

"àèìòù "
0e 02 0e 21 0e 32 0e 7c 0e 9f 07 02 01 0f 0f 0f 0f 0f 01 01 01 00

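Aha, things are maybe a little clearer now. The letters have consistent weights, as do the diacritics. So the first comparison sees equal primary weights and is decided by the secondary weights, where the diacritic weight 0f sorts before 13. The second comparison never gets that far: the space adds a primary weight of its own (07 02) after the letters, at the very point where the other string's primary weights have already ended. That is a difference in the primary weights, so the diacritics are never consulted and suddenly the order is reversed. Oops!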
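A small Python sketch can make this concrete by working directly on the three sort keys quoted above. It assumes that 0x01 separates the weight levels inside a key and that 0x00 terminates it, which is how the keys above appear to be laid out; the helper name first_difference is made up for illustration.

```python
# Sort keys copied from the post, as hex bytes.
KEYS = {
    "àèìòù": bytes.fromhex(
        "0e 02 0e 21 0e 32 0e 7c 0e 9f 01 0f 0f 0f 0f 0f 01 01 01 00"),
    "äëïöü": bytes.fromhex(
        "0e 02 0e 21 0e 32 0e 7c 0e 9f 01 13 13 13 13 13 01 01 01 00"),
    "àèìòù ": bytes.fromhex(
        "0e 02 0e 21 0e 32 0e 7c 0e 9f 07 02 01 0f 0f 0f 0f 0f 01 01 01 00"),
}


def first_difference(a: bytes, b: bytes):
    """Return (byte offset, weight level) of the first differing byte.

    Assumption: each 0x01 byte seen before the difference closes one
    weight level, so level 1 is the primary weights, level 2 the
    secondary (diacritic) weights, and so on. Returns None for equal keys.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset, 1 + a[:offset].count(b"\x01")
    if len(a) != len(b):
        shorter = min(len(a), len(b))
        return shorter, 1 + a[:shorter].count(b"\x01")
    return None


for left, right in (("àèìòù", "äëïöü"), ("àèìòù ", "äëïöü")):
    a, b = KEYS[left], KEYS[right]
    offset, level = first_difference(a, b)
    # Sort keys are compared byte by byte, like Python compares bytes.
    sign = "<" if a < b else ">"
    print(f"{left!r} {sign} {right!r}: first difference at byte {offset}, level {level}")
```

With the keys above, the first pair differs at level 2 and the second pair already differs at level 1, which is exactly the reversal described.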

Now this will happen with any symbol (or for that matter with anything that has a primary weight, though for some reason the results for SPACE and similar characters seem less intuitive!). Simply passing NORM_IGNORESYMBOLS to CompareStringW will cause the space or other symbol to be ignored.
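For anyone who wants to try both comparisons with and without the flag, here is a rough sketch that calls CompareStringW through Python's ctypes. It only does real work on Windows, the wrapper name compare is made up for illustration, and the constants are the documented Win32 values. Treat it as a starting point rather than production code.

```python
import ctypes
import sys

LCID_EN_US = 0x0409              # the LCID used in the post
NORM_IGNORESYMBOLS = 0x00000004
# CompareStringW return values (0 means the call failed).
CSTR_LESS_THAN, CSTR_EQUAL, CSTR_GREATER_THAN = 1, 2, 3
SYMBOL = {CSTR_LESS_THAN: "<", CSTR_EQUAL: "=", CSTR_GREATER_THAN: ">"}


def compare(a: str, b: str, flags: int = 0) -> int:
    """Call CompareStringW and return its CSTR_* result (Windows only)."""
    if sys.platform != "win32":
        raise OSError("CompareStringW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.CompareStringW
    fn.argtypes = [ctypes.c_uint32, ctypes.c_uint32,
                   ctypes.c_wchar_p, ctypes.c_int,
                   ctypes.c_wchar_p, ctypes.c_int]
    fn.restype = ctypes.c_int
    # A length of -1 tells the API that the string is null-terminated.
    result = fn(LCID_EN_US, flags, a, -1, b, -1)
    if result == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return result


if sys.platform == "win32":
    for flags in (0, NORM_IGNORESYMBOLS):
        for left in ("àèìòù", "àèìòù "):
            r = compare(left, "äëïöü", flags)
            print(f"flags={flags:#x}: {left!r} {SYMBOL[r]} 'äëïöü'")
```

Going by the description above, the run with flags=0 should show the reversal once the trailing space is added, and the run with NORM_IGNORESYMBOLS should not, since the space is then ignored.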

Now this is the first example. I will give some of the others in a later post. And maybe some thoughts about how the issue of intuitive results could perhaps be looked into, and why the solution is less obvious than it may seem at first....

Have I scared anyone yet? If so, then Happy Halloween! :-)

 

This post brought to you by " " (U+0020, a.k.a. SPACE)

Blog - Comment List
  • I have decided that I need to channel the spirit of the father of comedian of Emo Phillips and juxtapose

  • lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

  • did you mean alt?

  • I do not understand the question....
