Wednesday, January 05, 2005 8:41 AM
Michael S. Kaplan
What is up with number sorting?
Earlier this morning, Peter Ibbotson asked me:
Perhaps you can explain something to me (I appreciate you may be the wrong person to ask, but you seem to be a sorting expert). When XP was in beta I put in a bug report on this odd sorting behaviour with explorer, given a directory containing the following files:
C:\sorttest>dir /on /b
X12.TXT
X13.TXT
X1A.TXT
X1B.TXT
XAB.TXT
Shows in this order (when sorted by name) in explorer:
X1A.TXT
X1B.TXT
X12.TXT
X13.TXT
XAB.TXT
I've never understood why this happens (and it drives me nuts sometimes). Oh, I'm in the UK, in case that makes any difference.
I can speak to this question. :-)
There are two sorting methods in play here. The first is the method that has existed for ages on computers and with which a geek such as myself is likely to be most comfortable -- treating numbers as bits of text, where each individual digit is compared the same way an individual letter would be. You can get it by calling CompareString or LCMapString, and everything is great.
Unfortunately, it is also incredibly nonintuitive to all of the people who are not geeks.
Try explaining to your mom that 30.txt is supposed to come after 100.txt. It's not a pretty sight:
Mom: It just makes no sense to me, dear. 30 is less than 100!
Son: Yes, when it's a number. But what about when you treat it as text?
Mom: How can a number be text?
Son: Well, anything can be text. There is a difference between 4 and "4".
Mom: <shakes head> It just makes no sense to me, dear.
And the outrageous thing is that she is actually right.
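For the skeptical, here is a minimal sketch of the digits-as-text behavior in Python. It is an illustration only: Python's default string ordering compares code points, which stands in here for CompareString (the real function does a linguistic, locale-aware comparison, so the two will not agree on every string). The variable names are made up for the example.

```python
# Stand-in for "digits as text": plain string comparison goes character by
# character, so "100.txt" vs "30.txt" is decided at '1' versus '3'.
names = ["30.txt", "100.txt"]
text_order = sorted(names)
print(text_order)

# Peter's directory, sorted the same way; the result matches his dir /on listing.
listing = ["X1A.TXT", "X12.TXT", "XAB.TXT", "X1B.TXT", "X13.TXT"]
dir_like_order = sorted(listing)
print(dir_like_order)
```

The digit '2' sorts before the letter 'A' here, which is why X12.TXT lands ahead of X1A.TXT.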
Now the Shell team at Microsoft is pretty amazing (and not just the famous ones like Raymond Chen).
There are those who whine when things other people do are not able to support their customers. That is not the Shell team, though. They are the ones who go write the code to support those customers. That's pretty impressive to me.
Anyway, they have been getting feedback on this scenario for some time, and they finally did something about it: they added a second method, an API called StrCmpLogicalW. It basically recognizes that 95% of the world treats the digits as numbers rather than text, and therefore it puts 30.txt before 100.txt without anyone having to zero-pad the numbers.
I have looked at the source for that function, and after seeing the clever solution they used I would even, every once in a while, use the problem as an interview question (no, I do not use it anymore!). But it is a good solution to the problem of extending a function like CompareString to treat digits as numbers rather than text.
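To make the "digits as numbers" idea concrete, here is a rough Python sketch. It is emphatically not the Shell team's implementation, just one common way to get a similar effect: split each name into runs of text and runs of ASCII digits, and compare the digit runs as integers. The helper name natural_key and the lowercasing (a crude stand-in for a case-insensitive comparison) are assumptions made for the example.

```python
import re

# ASCII digits only, matching the behavior described in this post.
_DIGIT_RUN = re.compile(r"([0-9]+)")

def natural_key(name):
    """Hypothetical helper: text runs stay text, digit runs become ints."""
    parts = _DIGIT_RUN.split(name)
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions always hold digit runs and the types line up.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

listing = ["X12.TXT", "X13.TXT", "X1A.TXT", "X1B.TXT", "XAB.TXT"]
explorer_like_order = sorted(listing, key=natural_key)
print(explorer_like_order)
print(sorted(["100.txt", "30.txt"], key=natural_key))
```

With this key, X1A.TXT comes before X12.TXT because the number 1 is less than the number 12, which is the order Peter saw in Explorer.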
Of course the Windows Explorer, which is the Shell team's most important public face, uses this (unless you turn it off with group policy, which some people do).
In theory the command window could do the same in the Dir function, because as Peter pointed out the inconsistency is a tad noticeable.
Other people ask about hex digits, which would be easy if the letters ABCDEF did not also have part time jobs as actual letters that must always be treated as text. Luckily this can also be answered by the core scenario that drove the API -- the type of person who is confused by the "digits as text" behavior is probably not using hexadecimal digits.
Thinking of my own areas of expertise, it seems like a shame that ASCII 0-9 are handled but the other 20+ sets of digits, including the fullwidth 0 to 9, are not. Of course in some cases it is not really known what people would prefer, and since this is a function that is, in its heart of hearts, designed to be intuitive, it's probably important to understand the expectations prior to blazing forth. Getting that feedback first would be important....
Plus I have to think about sort keys. They are pretty central to databases, and there is a design principle in our sort key implementation that LCMapString and CompareString should give back equivalent results in terms of comparisons. But creating sort keys that can handle strings of digits of unbounded size is not a trivial problem (that too would maybe make a fun interview question, but also a frustrating one, so I don't think I'll try it).
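To show where the difficulty comes from, here is a deliberately limited Python sketch of a flat sort key, meaning a single string that can be compared as plain text and still order digit runs numerically. It is an assumption-laden toy, not how LCMapString works: each run of ASCII digits is prefixed with its length in a fixed-width field, and that fixed width is exactly what caps the size of the numbers it can handle. The names flat_sort_key and _WIDTH are hypothetical.

```python
import re

_DIGIT_RUN = re.compile(r"[0-9]+")
_WIDTH = 2  # assumption: a digit run has at most 99 significant digits

def _encode_run(match):
    # Leading zeros do not change the value, so "007" and "7" get the same key.
    digits = match.group(0).lstrip("0") or "0"
    if len(digits) >= 10 ** _WIDTH:
        raise ValueError("digit run too long for the fixed-width length field")
    # Length first: a shorter run is a smaller number; equal lengths
    # fall through to an ordinary left-to-right digit comparison.
    return f"{len(digits):0{_WIDTH}d}{digits}"

def flat_sort_key(name):
    """Hypothetical helper: build a key comparable as plain text."""
    return _DIGIT_RUN.sub(_encode_run, name.lower())

print(flat_sort_key("30.txt"), flat_sort_key("100.txt"))
```

Making the length field wider only moves the ceiling; removing the ceiling altogether needs some self-delimiting encoding of the length, and that is the part that stops being trivial.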
Anyway, that's why one can get back different results across all locales when one looks at lists from the Dir function versus from Explorer. In fact, you may be able to use as a test of "geek-ness" how long it takes you to admit the truth in the overall scenario above (I have been "out" as a geek for years so I revel in the fact, but not every geek realizes it about themselves!).
This post sponsored by "௲" (U+0bf2, a.k.a. TAMIL NUMBER ONE THOUSAND)
"Some of my best friends are digits!"