Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
If you are a regular reader of this blog then odds are you are a little sick of hearing about StrCmpLogicalW by now. But I thought I'd write one more post about it anyway, one that actually brings internationalization back to the forefront....
The question to answer is how would one provide an internationalized version of this functionality in a future version?
It is not a trivial question. And that is even ignoring the point I brought up previously: if it were integrated into CompareString and/or CompareStringEx as a flag (i.e. NORM_SORTDIGITSASNUMBERS), then a way to add it to sort keys (in LCMapString and/or LCMapStringEx) and to searching algorithms (in FindNLSString and/or FindNLSStringEx) might also be important, since historically these three aspects of collation have been kept in sync from a functionality standpoint whenever it makes sense. Given the needs of actual applications that provide services for the Shell, like Search, this is hardly a trivial point. Though it is one that means it would take more work to sign up for integrating the feature!
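To make the "kept in sync" point a bit more concrete, here is a minimal sketch in Python (not the Windows implementation, and not how CompareString works internally). The names logical_sort_key and logical_compare are made up for illustration, it only looks at ASCII digits, and casefold() is just a stand-in for real locale-aware collation. The idea it shows is that if the comparison is derived from the sort key, the two cannot drift apart:

```python
import re

# Capturing group, so re.split keeps the digit runs in the result.
_DIGIT_RUN = re.compile(r"([0-9]+)")

def logical_sort_key(text):
    """Hypothetical sort key: runs of ASCII digits compare by numeric value.

    re.split with a capturing group always yields text at even positions
    and digit runs at odd positions, so two keys never compare a string
    against an integer at the same position.
    """
    key = []
    for i, chunk in enumerate(_DIGIT_RUN.split(text)):
        if i % 2 == 1:
            key.append(int(chunk))        # digit run: use its numeric value
        else:
            key.append(chunk.casefold())  # text: assumption, case-insensitive
    return tuple(key)

def logical_compare(a, b):
    """Comparison derived from the sort key, so both give the same order."""
    ka, kb = logical_sort_key(a), logical_sort_key(b)
    return (ka > kb) - (ka < kb)

names = ["file10.txt", "file2.txt", "File1.txt"]
print(sorted(names, key=logical_sort_key))
```

A search routine would need the same treatment to complete the set of three, which is where the extra work comes in.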
But getting back to the locale sensitivity issues, here are some questions to consider -- ones that would have to be solved before work could really begin on the feature:
Now I am not going to fall back on implementation issues driving all of the decisions here (though I could easily see some point #8 either postponed for its complexity or just dismissed as being out of scope!). Most of the decisions here are simply ones that would have to be made, after which the implementation plan is simply something to be carried out.
But the actual questions about what the preferences would be for an internationalized version of StrCmpLogicalW are real ones that need real answers. How would you expect such a feature to work?
This post brought to you by ௫ (U+0beb, a.k.a. TAMIL DIGIT FIVE)
Hmmm.... this does lead to the weirdness of small digits coming after larger ones for Arabic/Extended Arabic and other such cases. Plus, since there is not enough space to give them all unique Unicode Weights, you'd have to make all the numbers within a script be equal -- thus (NORM_SORTDIGITSASNUMBERS | NORM_IGNORENONSPACE) would mean that "9" == "1" and so on.
I have a feeling that saying two different script versions of the number 9 are equal would be easier to convince people of than saying two different numbers in the same script are equal....
Hmmm.... well, to some extent I will agree. But I disagree that the concept of an alphabetical order is flawed, or that trying to extend it a bit is a non-intuitive or undesirable idea.
Hmmm. It might just be that we're using different interpretations of the word "script"; to me, basic Arabic and Extended Arabic are the same script. So it's a script with two versions of the number 9. In this particular case, I don't see this as a problem. On the other hand, if you split sorting between these types of numerals, I don't think you've got a problem either, because the chance of anyone mixing various numeral systems and still expecting everything to be sorted based on numeric value is probably very small. It's not illogical to see things like "these are the files sorted by Latin digits, and these are sorted by Arabic digits". Mixing them would create "mixed" results, for lack of a better pun.
I do see some problems:
- Roman numerals are Latin just like the ASCII digits (although the Unicode roman numerals are not ASCII, so you could think of these as a different script); interpreting i as 1 is tricky in, e.g., English, just like vi or cd, so you probably shouldn't.
- Are half width numbers the same as ASCII digits? Tricky one here, but as the digits are basically equivalent, and we're not stuck on proportional fonts per se, I don't see much harm here.
Note that I'm talking about equivalences here; whether the algorithm is implemented through sort weights or some hand tuned code is, frankly, a little beyond my expertise.
We would only be doing this with numbers that are DIGITS in the Unicode sense, so there are no worries with other number types as we would not be handling those the same way....
But it is (like I said) a hard choice to be made -- it is better to fold all the 2's together than 0123456789 together. So we did make a choice based on implementation issues.
This is actually what would happen if you passed both flags here -- all digits will be treated as if they were like the regular ASCII 0-9.
So not that radical. :-)
The question is what to do the rest of the time!
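For the "both flags" case described above, a rough sketch of the folding might look like the following Python. To be clear, this is an illustration and not the NLS implementation: fold_digits is a made-up name, and it takes "DIGITS in the Unicode sense" to mean characters in general category Nd (decimal digit), which is an assumption on my part about where the line would be drawn.

```python
import unicodedata

def fold_digits(text):
    """Map each Unicode decimal digit (category Nd) to its ASCII equivalent.

    Anything that is not a decimal digit, such as the Unicode roman
    numerals (category Nl), is passed through unchanged.
    """
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

# TAMIL DIGIT FIVE folds to the ASCII digit 5.
print(fold_digits("\u0beb"))

# ARABIC-INDIC and EXTENDED ARABIC-INDIC DIGIT NINE fold to the same thing.
print(fold_digits("\u0669") == fold_digits("\u06f9"))
```

Once the digits are folded like this, comparing digit runs by numeric value works the same way it does for plain ASCII. The harder question of what happens when the flags are not both passed is not something this sketch tries to answer.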
Have you posed the suggestion to internationalize StrCmpLogical to the shell team? Maybe they'd be intrigued (and the rest of the world happy when their files sort more intuitively!)
We have -- they'd like us to do it, so they can call us (of course I pointed out how much faster that call would be from inside of CompareString than from outside of it, which may be what really intrigued them!)....