What is up with number sorting?

Sorting it all Out
Michael Kaplan's random stuff of dubious value

Earlier this morning, Peter Ibbotson asked me:

Perhaps you can explain something to me (I appreciate you may be the wrong person to ask, but you seem to be a sorting expert). When XP was in beta I put in a bug report on this odd sorting behaviour in Explorer. Given a directory containing the following files:

C:\sorttest>dir /on /b
X12.TXT
X13.TXT
X1A.TXT
X1B.TXT
XAB.TXT

Shows in this order (when sorted by name) in explorer:
X1A.TXT
X1B.TXT
X12.TXT
X13.TXT
XAB.TXT

I've never understood why this happens (and it drives me nuts sometimes). Oh, I'm in the UK, in case that makes any difference.

I can speak to this question. :-)

The first ordering (the one Dir gives) is the method that has existed for ages on computers, and the one with which a geek such as myself is likely to be most comfortable -- treating numbers as bits of text, so that each individual digit is compared the same way an individual letter would be. You can get it by calling CompareString or LCMapString, and everything is great.
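
As a rough illustration of the "digits as text" ordering, here is a minimal Python sketch. It assumes that plain code point comparison (what Python's built-in sorted does for strings) is a fair stand-in for the behavior described; it does not call CompareString. The file names are the ones from Peter's directory.

```python
# Digits-as-text ordering: every character, digit or not, is compared
# one position at a time.
# Assumption: Python's default string comparison (by code point) stands
# in for the Windows collation functions here.
names = ["XAB.TXT", "X1B.TXT", "X13.TXT", "X1A.TXT", "X12.TXT"]

text_order = sorted(names)
print(text_order)
```

Because "2" and "3" have lower code points than "A", the X12 and X13 files land ahead of X1A and X1B, which is the order Dir shows.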

Unfortunately, it is also incredibly unintuitive to all of the people who are not geeks.

Try explaining to your mom that 30.txt is supposed to come after 100.txt. It's not a pretty sight:

Mom: It just makes no sense to me, dear. 30 is less than 100!

Son: Yes, when it's a number. But what about when you treat it as text?

Mom: How can a number be text?

Son: Well, anything can be text. There is a difference between 4 and "4".

Mom: <shakes head> It just makes no sense to me, dear.

And the outrageous thing is that she is actually right.

Now the Shell team at Microsoft is pretty amazing (and not just the famous ones like Raymond Chen).

There are those who whine when things other people do are not able to support their customers. That is not the Shell team, though. They are the ones who go write the code to support those customers. That's pretty impressive to me.

Anyway, they have been getting feedback on this scenario for some time. And they finally did something about it. That API is StrCmpLogicalW, and it basically recognizes that 95% of the world treats the digits as numbers rather than text. And therefore it notices that 30.txt comes before 100.txt without having to zero-pad the numbers.
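
To make the "digits as numbers" idea concrete, here is a minimal Python sketch. It is not the Shell team's implementation and does not call StrCmpLogicalW; the helper name natural_key, the regex split, and the case folding are all assumptions made for illustration. Only runs of ASCII 0-9 are treated as numbers.

```python
import re

# Hypothetical helper, not a Windows API: split a name into alternating
# text and digit runs, then compare digit runs by numeric value.
_DIGIT_RUNS = re.compile(r"([0-9]+)")

def natural_key(name):
    # re.split with a capturing group always yields text, digits, text, ...
    # so odd positions hold digit runs and even positions hold text.
    parts = _DIGIT_RUNS.split(name)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

names = ["X12.TXT", "X13.TXT", "X1A.TXT", "X1B.TXT", "XAB.TXT"]
logical_order = sorted(names, key=natural_key)
print(logical_order)
print(sorted(["100.txt", "30.txt"], key=natural_key))
```

In this sketch X1A sorts ahead of X12 because the digit run "1" is compared as the number 1 against the number 12, which reproduces the order Peter saw in Explorer. Edge cases such as punctuation next to digits may well differ from what the real API does.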

I have looked at the source for that function, and every once in a while I would even use the problem as an interview question after seeing the clever solution they used (no, I do not use it anymore!). It is a good solution to the problem of extending a function like CompareString to treat digits as numbers rather than text.

Of course the Windows Explorer, which is the Shell team's most important public face, uses this (unless you turn it off with group policy, which some people do).

In theory the command window could do the same in the Dir function, because as Peter pointed out the inconsistency is a tad noticeable.

Other people ask about hex digits, which would be easy if the letters ABCDEF did not also have part time jobs as actual letters that must always be treated as text. Luckily this can also be answered by the core scenario that drove the API -- the type of person who is confused by the "digits as text" behavior is probably not using hexadecimal digits.

Thinking of my own areas of expertise, it seems like a shame that ASCII 0-9 are handled but the other 20+ sets of digits, including the fullwidth 0 to 9, are not. Of course in some cases it is not really known what people would prefer, and for a function that is in its heart of hearts designed to be intuitive, it's probably important to understand the expectations prior to blazing forth. Getting that feedback first would be important....
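
For a sense of what those other digits look like to code, here is a small Python sketch using the standard unicodedata module. The helper ascii_only_digits is hypothetical and simply mimics an "ASCII 0-9 only" check; nothing here reflects how any Windows API classifies characters.

```python
import unicodedata

fullwidth_thirty = "\uff13\uff10"  # FULLWIDTH DIGIT THREE, FULLWIDTH DIGIT ZERO
tamil_thousand = "\u0bf2"          # TAMIL NUMBER ONE THOUSAND

def ascii_only_digits(text):
    # Hypothetical check that accepts nothing but ASCII 0-9.
    return all("0" <= ch <= "9" for ch in text)

# The fullwidth run is rejected by the ASCII-only check, yet Unicode
# classifies it as decimal digits and int() can read its value.
print(ascii_only_digits(fullwidth_thirty))
print(fullwidth_thirty.isdecimal(), int(fullwidth_thirty))

# TAMIL NUMBER ONE THOUSAND has a numeric value but is not a decimal
# digit, so it cannot take part in positional digit runs at all.
print(tamil_thousand.isdecimal(), unicodedata.numeric(tamil_thousand))
```

The second case hints at why expectations need to be understood first: not every numeric character behaves like a positional digit.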

Plus I have to think about sort keys. They are pretty central to databases, and there is a design principle in our sort key implementation that LCMapString and CompareString should give back equivalent results in terms of comparisons. But creating sort keys that can handle strings of digits of unbounded size is not a trivial problem to consider (that too would maybe make a fun interview question but also a frustrating one, so I don't think I'll try it).
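
To show why unbounded digit runs make sort keys awkward, here is one possible scheme as a Python sketch. It has nothing to do with the actual LCMapString sort key format; the name flat_sort_key and the encoding are assumptions chosen for illustration. Each ASCII digit run is rewritten as a self-delimiting prefix (the length of the length in unary, then the length, then the digits) so that plain string comparison of the keys orders the numbers by value.

```python
import re

_DIGIT_RUNS = re.compile(r"[0-9]+")

def _encode_number(match):
    # Drop leading zeros so "007" and "7" encode alike (a known limitation:
    # such names get identical keys and need a separate tie-break).
    digits = match.group(0).lstrip("0") or "0"
    length = str(len(digits))
    # Unary count of the length's own width, a "0" terminator, the length,
    # then the digits. Longer numbers always get a larger prefix.
    return "1" * len(length) + "0" + length + digits

def flat_sort_key(name):
    # Hypothetical flat key: a single string, comparable character by character.
    return _DIGIT_RUNS.sub(_encode_number, name.casefold())

names = ["X12.TXT", "X13.TXT", "X1A.TXT", "X1B.TXT", "XAB.TXT"]
print(sorted(names, key=flat_sort_key))
print(flat_sort_key("30.txt"), flat_sort_key("100.txt"))
```

The cost is visible right away: the key for a number grows with the number of digits plus the size of its length prefix, and a position has to be chosen for where numbers sort relative to other characters (here they sit where "1" would).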

Anyway, that's why one can get back different results across all locales when one looks at lists from the Dir function versus from Explorer. In fact, you may be able to use as a test of "geek-ness" how long it takes you to admit the truth in the overall scenario above (I have been "out" as a geek for years so I revel in the fact, but not every geek realizes it about themselves!).

 

This post sponsored by "௲" (U+0bf2, a.k.a. TAMIL NUMBER ONE THOUSAND)
"Some of my best friends are digits!"

Blog - Comment List
  • 1/6/2005 12:24 AM Raymond Chen

    > Norman: Are you arguing that Explorer should
    > sort files as follows:
    > two.txt
    > five.txt
    [...]

    Good point. But I really thought I had read somewhere that Windows APIs had been augmented to recognize digits written in various linguistic character sets. And we both know that digits as written in Chinese characters can (and often are) strung together as digit sequences.

    Full formal wording is a more difficult question since, even though each character is numeric, how do we say if the full formal construction is a numeral or a word phrase?

    1/6/2005 1:13 AM Michael Kaplan

    > Norman, re: Thai digits you are mistaken.

    OK, I can't find where I read that Windows APIs had been augmented in that manner. On the other hand though, I saw a Thai use Thai digits and saw Windows accept them. And that was Windows XP Japanese (not Thai, not US with MUI, not anything else) with the input locale set to Thai.
  • Well, digits are accepted. But there is no Windows API that reads them as numbers.

    Either someone was mistaken when they spoke or you read them wrong. This applies to every version of Windows, no matter what the input locale is.

    Perhaps you are talking about unrelated issues with GDI and rendering based on native digits in a locale, but if so that has nothing to do with collation.
  • Btw, Valorie's been on me for literally years now about the fact that OE doesn't sort email messages with text in them.

    She hates it that OE sorts:
    This is message one of three
    This is message two of three
    This is message three of three

    as:
    This is message one of three
    This is message three of three
    This is message two of three

  • I just mentioned the notion of addressing this to the owner of the collation data and her look could have killed. :-)

    Though I am inclined to agree with her -- it's a dangerous road to go down for an API like CompareString. Or even for higher level callers like OE (luckily most of them are sorted by date anyway).
  • Thanks for that, I'd still like an option to turn it off though. One of the problems I have is that I have loads of file names with that kind of structure, and depending on what program I'm using to look at files, the file I want may be at either end of a list.
    It's a real pain as you swap between source control and explorer.
    Anyway thanks
  • Hi Peter,

    It actually *can* be changed via a Group Policy setting. Precisely because not everyone will necessarily like it....

    You can even query this policy's settings with the SHRestricted API and the REST_NOSTRCMPLOGICAL member of the RESTRICTIONS enumeration.
  • The other sorting annoyance is this sort of thing:

    The Sea II.mp3
    The Sea.mp3

    This is also seen with Windows Media Player's Album list.
  • Yes, though (to me) it actually does make sense since space comes before dot....

    Though I would not mind if Explorer removed the extension from consideration for the sort, except to break ties. It seems like the results would be more intuitive more often.
  • Back in January I asked Michael Kaplan about the explorers number sorting order here. Meanwhile over here Rolando Ramirez has pointed out that this in TweakUI. Oddly when I mentioned this in the office someone else piped up and told me how to tweak it via
  • The other day, I had to take a look at the various unmanaged case insensitive string comparison functions....