A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

Previous posts in this series:

In older posts on this blog (such as these two) I have talked about SORT ELEMENTS, even going so far as to define them:

A sort element is a code point or combination of code points that a user thinks of as a character.

Now in this series the definition has not yet been all that relevant, since with the exception of Part 2 each WCHAR in the string was really a letter in the user's mind. Even in Part 2 I was clearly talking about combining characters that were diacritic marks, and showed how they were equal to some precomposed characters anyway.

Well now we are going to change all that, and talk about EXPANSIONS, where a single code point can map to up to three different letters (in the underlying implementation it is only two, but some of them are nested, so the net effect for collation is that it can be three)1.

In theory the code could support even bigger nested expansions; however, the nesting rules in the code are:

  • Each step must also be a ligature, thus ﬄ (U+fb04, a.k.a. LATIN SMALL LIGATURE FFL) can be expanded to f + ﬂ (U+fb02, a.k.a. LATIN SMALL LIGATURE FL), which can be expanded to ffl;
  • Currently, LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag only allocate the space assuming one level of nesting, so if bigger ligatures were added, someone would have to modify some code there;
  • EXPANSION entries cannot overlap with COMPRESSION entries (which I'll be talking about tomorrow).
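
The nesting described in the first two bullets can be pictured with a toy table in which every entry expands to exactly two pieces, and a piece may itself be another ligature. This is only a rough Python sketch, not the actual NLS code: the `EXPANSIONS` table, the `expand` function, and the `max_nesting` parameter are hypothetical names, and only the U+fb04 and U+fb02 entries come from the text above.

```python
# Toy model of two-way EXPANSION entries. All names are hypothetical;
# this is not the real NLS data structure.
EXPANSIONS = {
    "\ufb04": ("f", "\ufb02"),  # LATIN SMALL LIGATURE FFL -> f + LIGATURE FL
    "\ufb02": ("f", "l"),       # LATIN SMALL LIGATURE FL  -> f + l
}


def expand(ch: str, max_nesting: int = 1) -> str:
    """Expand one code point, following at most `max_nesting` nested steps."""
    if ch not in EXPANSIONS:
        return ch
    out = []
    for piece in EXPANSIONS[ch]:
        if piece in EXPANSIONS:
            # A piece that is itself a ligature counts as one nesting level.
            if max_nesting == 0:
                raise ValueError("expansion nested deeper than allowed")
            out.append(expand(piece, max_nesting - 1))
        else:
            out.append(piece)
    return "".join(out)


print(expand("\ufb04"))  # ffl
```

The default of one nesting level mirrors the space assumption mentioned in the second bullet; a hypothetical deeper ligature would raise an error here rather than silently overflow.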

Okay, so let's dig in now. Here we go....

Starting with ﬃ (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI), compared against the string ffi using the default table:

en-US U+fb03               0e 23 0e 23 0e 32 01 01 01 01 00
en-US U+0066 U+0066 U+0069 0e 23 0e 23 0e 32 01 01 01 01 00

See how they are identical? And how the UNICODE WEIGHT piece (discussed in Part 1) is six bytes in size, meaning that we are looking at three sort elements?
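
To make the six-byte observation concrete, here is a rough Python sketch that builds only the UNICODE WEIGHT portion of a key and then appends the separator bytes seen in the dumps above. It is a toy model and not how LCMapString works internally: the `UNICODE_WEIGHTS` and `EXPANSIONS` tables and the `toy_sort_key` function are hypothetical, the weights for f and i are copied from the output above, and the remaining levels of the key are assumed to be empty.

```python
# Toy model; the weights for f and i are copied from the en-US dumps above.
UNICODE_WEIGHTS = {
    "f": bytes([0x0E, 0x23]),
    "i": bytes([0x0E, 0x32]),
}

# Hypothetical expansion table: one code point -> several sort elements.
EXPANSIONS = {"\ufb03": "ffi"}  # LATIN SMALL LIGATURE FFI

# Separators and terminator as they appear in the dumps above
# (assumes the other levels of the key are empty).
TAIL = bytes([0x01, 0x01, 0x01, 0x01, 0x00])


def toy_sort_key(text: str) -> bytes:
    """Build a simplified key: two weight bytes per sort element, then the tail."""
    key = bytearray()
    for ch in text:
        # An expanded code point contributes one weight per resulting element.
        for element in EXPANSIONS.get(ch, ch):
            key += UNICODE_WEIGHTS[element]
    key += TAIL
    return bytes(key)


ligature_key = toy_sort_key("\ufb03")
plain_key = toy_sort_key("ffi")
print(ligature_key.hex(" "))      # 0e 23 0e 23 0e 32 01 01 01 01 00
print(ligature_key == plain_key)  # True
```

One WCHAR goes in, but three sort elements (six weight bytes) come out, which is the whole point of an EXPANSION.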

At some binary level you may not want LATIN SMALL LIGATURE FFI to be considered the same as ffi, but any normal user looking at them would expect them to be treated like they were kind of the same....

Now let's muddy the waters a bit, and look at æ (U+00e6, a.k.a. LATIN SMALL LETTER AE). Now in many languages it makes sense to treat it like ae so that words like Cæsars and Caesars can be treated the same, thus in the default table:

en-US a  U+0061        0e 02 01 01 01 01 00
en-US ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
en-US æ  U+00e6        0e 02 0e 21 01 01 01 01 00

This is all well and good, and the folks running Windows over at Cæsars Palace probably appreciate that (though as far as I know they have not offered to fly anyone from Windows International down there yet!), but in the beautiful country of Iceland, this is not acceptable.

Because in Icelandic, æ is a little different, and thusly the weights look a little different:

is-IS a  U+0061        0e 02 01 01 01 01 00
is-IS ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
is-IS z  U+007a        0e a9 01 01 01 01 00
is-IS æ  U+00e6        0e ac 01 01 01 01 00

As you can see, æ is its own letter that comes right after z and is just one sort element -- and is thus not an EXPANSION there (although in most locales it is).
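
One way to picture the difference is a default table plus per-locale exceptions that are consulted first. The sketch below is an assumption-laden toy in Python, not the real NLS table layout: the `DEFAULT_WEIGHTS`, `DEFAULT_EXPANSIONS`, and `LOCALE_WEIGHTS` tables and the `toy_unicode_weights` function are hypothetical, and the byte values are copied from the dumps above.

```python
# Toy model; byte values are copied from the dumps above, names are hypothetical.
DEFAULT_WEIGHTS = {
    "a": bytes([0x0E, 0x02]),
    "e": bytes([0x0E, 0x21]),
    "z": bytes([0x0E, 0xA9]),
}

# Default table: æ is an EXPANSION to a + e.
DEFAULT_EXPANSIONS = {"\u00e6": "ae"}

# Locale exceptions: in is-IS, æ is a single sort element with its own weight.
LOCALE_WEIGHTS = {"is-IS": {"\u00e6": bytes([0x0E, 0xAC])}}


def toy_unicode_weights(text: str, locale: str) -> bytes:
    """Return only the UNICODE WEIGHT bytes, checking locale exceptions first."""
    overrides = LOCALE_WEIGHTS.get(locale, {})
    out = bytearray()
    for ch in text:
        if ch in overrides:
            out += overrides[ch]
            continue
        for element in DEFAULT_EXPANSIONS.get(ch, ch):
            out += DEFAULT_WEIGHTS[element]
    return bytes(out)


print(toy_unicode_weights("\u00e6", "en-US").hex(" "))  # 0e 02 0e 21
print(toy_unicode_weights("\u00e6", "is-IS").hex(" "))  # 0e ac
```

Comparing the two results byte by byte puts the Icelandic æ after z (0e ac is greater than 0e a9), while the en-US æ ties with ae.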

Now as I pointed out in these three posts:

  • It is also possible to add EXPANSION entries for a specific language only;
  • Everyone forgot about that fact for a long time;
  • The problem that was thereby caused is one that will be fixed in Windows Server 2008.

But in any case, that kind of explains the EXPANSION functionality in collation2.

 

1 - I even mentioned once (in Why doesn't FoldString take an LCID?) how you can use FoldString with the MAP_EXPAND_LIGATURES flag to access the default table's take on these EXPANSION entries (while bemoaning the fact that the locale-specific entries were unavailable, since FoldString itself doesn't accept any kind of locale parameter, name or LCID)3.
2 - And even in ligature expansion, via FoldString, though there are limitations there.
3 - I did recommend that a FoldStringEx that would take a locale name be added to the next version of Windows before I moved from NLS to the International Fundamentals group, but I have no idea what the plans are here for the future....

 

This post brought to you by 5 (U+0035, a.k.a. DIGIT FIVE)
