# What is up with number sorting?

 Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!

### What is up with number sorting?

Earlier this morning, Peter Ibbotson asked me:

Perhaps you explain something to me (I appreciate you may be the wrong person to ask, but you seem to be a sorting expert). When XP was in beta I put a bug report on this odd sorting behaviour with explorer, given a directory containing the following files:

C:\sorttest>dir /on /b
X12.TXT
X13.TXT
X1A.TXT
X1B.TXT
XAB.TXT

Shows in this order (when sorted by name) in explorer:
X1A.TXT
X1B.TXT
X12.TXT
X13.TXT
XAB.TXT

I've never understood why this happens (and it drives me nuts sometimes) Oh I'm in the UK in case that makes any difference.

I can speak to this question. :-)

The first method is the method that has existed for ages on computers and with which a geek such as myself is likely to be most comfortable -- treating numbers as bits of text, each individual number of which is compared to an individual letter. You can get it by calling CompareString or LCMapString, and everything is great.

Unfortunately, it is also incredibly nonintitive to all of the people who are not geeks.

Try explaining to your mom that 30.txt is supposed to come after 100.txt. Its not a pretty sight:

Mom: It just makes no sense to me, dear. 30 is less than 100!

Son: Yes, when it's a number. But what about when you treat it as text?

Mom: How can a number be text?

Son: Well, anything can be text. There is a difference between 4 and "4".

Mom: <shakes head> It just makes no sense to me, dear.

And the outrageous thing is that she is actually right.

Now the Shell team at Microsoft is pretty amazing (and not just the famous ones like Raymond Chen).

There are those who whine when things other people do are not able to support their customers. That is not the Shell team, though. They are the ones who go write the code to support those customers. Thats pretty impressive to me.

Anyway, they have been getting feedback on this scenario for some time. And they finally did something about it. That API is StrCmpLogicalW, and it basically recognizes that 95% of the world treats the digits as numbers rather than text. And therefore it notices that 30.txt comes before 100.txt without having to zero-pad the numbers.

I have looked at the source for that function and every once in a while I even would use the problem as an interview question after seeing the clever solution they used (no, I do not use it anymore!). but it is a good solution to the problem of extending a function like CompareString to treat digits as numbers rather than text.

Of course the Windows Explorer, which is the Shell team's most important public face, uses this (unless you turn it off with group policy, which some people do).

In theory the command window could do the same in the Dir function, because as Peter pointed out the inconsistency is a tad noticable.

Other people ask about hex digits, which would be easy if the letters ABCDEF did not also have part time jobs as actual letters that must always be treated as text. Luckily this can also be answered by the core scenario that drove the API -- the type of person who is confused by the "digits as text" behavior is probably not using hexidecimal digits.

Thinking to my own areas of expertise it seems like a shame that ASCII 0-9 are handled but the other 20+ streams of digits, including the fullwidth to ０ to ９. Of course in some cases it is not really known what people would prefer, and as a function that it is in its heart of hearts designed to be intuitive its probably important to understand the expectations prior to blazing forth. Getting that feedback first would be important....

Plus I have to think about sort keys. They are pretty central to databases, and there is a design principle in our sort key implementation that LCMapString and CompareString should give back equivalent results in terms of comparisons. But creating sort keys that can handle strings of digits of unbounded size is not a trivial problem to consider (that too would maybe make a fun interview question but also a frustrating one so i don't think I'll try it).

Anyway, that's why one can get back different results across all locales when one looks at lists from the Dir function versus from Explorer. In fact, you may be able to use as a test of "geek-ness" how long it takes you to admit the truth in the overall scenario above (I have been "out" as a geek for years so I revel in the fact, but not every geek realizes it about themselves!).

This post sponsored by "" (U+0bf2, a.k.a. TAMIL NUMBER ONE THOUSAND)
"Some of my best friends are digits!"

Comment on the blather
Blog - Comment List
• So what was there clever solution to sorting the numbers in StrCmpLogicalW? Just curious...
• Heh heh heh.... good question. Of course its not *my* clever solution so I don't want to make it sound like I am bragging (I'm not).

Let's just say that whoever wrote the code is a hire. :-)
• One possible way would be to go through each file name and pad any numbers with zero up to a fixed length and then sort with the regular CompareString function.

The only two problems with that would be 1. It's not very efficient to modify all the file names, especially for very long directory listings, and 2. It's difficult to know how much padding you need.

For 2. the obvious solution is to check the maximum length of a string of digits in all the file names and use that (so for example, if you have 3.txt and 1000.txt then you need to pad the first file with three zeros), but that just makes 1. more obvious (since you need to loop through all the file names once to find the maximum length of a string of digits, and again to actually pad digits).

For 1. I'm not sure how you could make it quicker, and still be able to use the normal CompareString to do the main work, because you'd /want/ to use the normal CompareString function otherwise you'd end up having to rewrite a very complex function for a fairly small gain.
• Re: clever -- Raymond Chen pointed out to me a slightly less cool version of this code that used to be around which was much less clever. To be clear I am talking about the code that shipped with Server 2003 and then later in XP service packs.... :-)

To Dean -- think of another issue -- what about the file A1A1A1A1A1A1A1A1A1A1A1A1A1A1.txt? Do you extend every one of those numbers?
• Ah, good point.

OK, another solution would be to sort it normally first. This'll get all the text-based ordering out of the way, and at least all the files with the same prefix and numbered suffixes will be together. So lets say you start off like this:

a.txt, b1.txt, def20.txt, b20.txt, def6.txt, b9.txt, c.txt

After you do your normal CompareString sort, you'd have this:

a.txt b1.txt, b20.txt, b9.txt, c.txt, def20.txt, def6.txt

Then you can group them so that files with the same prefix followed by a digit are in a group, this'd give two groups:

b1.txt, b20.txt, b9.txt and
def20.txt, def6.txt

Then it's just a matter of extracting the number from the correct position in each file name, and sorting by that number in-place. So your two groups become:

b1.txt, b9.txt, b20.txt and
def6.txt, def20.txt

And the whole thing is:

a.txt, b1.txt, b9.txt, b20.txt, c.txt, def6.txt, def20.txt

This way, you don't modify strings, you don't have to worry about padding with zero, and you'll still get the correct answer (I think :)
• > Mom: How can a number be text?

Send her back for a refresher in grade 4 arithmetic, or somewhere around there. The difference between numbers and numerals was taught there, and numerals are text. Different numeral bases came up in grade 6, which should make it more obvious that numerals are text.

Nonetheless, sorting them as numbers instead of numerals is indeed agreeable 95% of the time. I only wonder, as you said you do, why Windows Explorer only does it around 50% of the time instead of 95% of the time.

Since there are some APIs now that recognize the numericity of various kinds of characters, one would think that Explorer would properly sort the department names of the company that I work in, but no I have to find the folder in order by the pronunciation of the number[*] instead of the numeric value. Dai-ichi-nani-nani, dai-san-nani-nani, dai-yon-nani-nani[*], dai-ni-nani-nani, 1 3 4 2 and the 4 isn't even in the expected place by pronunciation[*].

[* The matter of multiple pronunciations of a Kanji has been discussed in other threads. The character meaning four is getting sorted by pronunciation shi instead of yon, which is understandable but still not the first place one would look for it. I've never heard anyone say "dai-shi-nani-nani", it's "dai-yon-nani-nani".]
• > Send her back for a refresher in grade 4 arithmetic, or somewhere around there.

Well, I will leave aside the fact that I would not talk to my mother that way, and point out that it would really not behoove Microsoft well to treat customers that way. So I will avoid that approach. :-)

> Since there are some APIs now that recognize the numericity of various kinds of characters

Well, the function in Windows today only handles 0123456789 -- the "ASCII digit" collection. It might be cool to do more, but you start running in to problems when you try to define what "more" should be since it is not always intuitive.

Go through the strings and contract all consecutive digits into a single character, for example in the Unicode Private Use Area. After that sort the strings in a normal way (provided the sort routine sorts PUAs according to their value). Prepending the private character with a fixed '0' could also ensure that it is sorted in the place of ASCII digits.

X1A.TXT -> X0'U+E001'A.TXT
X13.TXT -> X0'U+E00D'.TXT
XAB.TXT -> XAB.TXT
• >> Since there are some APIs now that
>> recognize the numericity of various kinds
>> of characters
>
> Well, the function in Windows today only
> handles 0123456789 -- the "ASCII digit"
> collection.

Where did I read that Windows APIs were added that recognized Thai digits and various other code page digits. I'm sure I asked somewhere and received an answer that Windows Explorer is only designed to recognize the numericity of ASCII digits, but the function in Windows today is comparatively internationalized.
• Norman: Are you arguing that Explorer should sort files as follows:

two.txt
five.txt
sex.txt
elf.txt
three hundred.txt

?
• Norman, re: Thai digits you are mistaken.

Speaking personally, I never claimed this API was interationalized, and I pointed out that doing so leads one to problems of non-intuitive behavior.

There are many functions in Windows that are well-internationalized. This not one of them, certainly when looking at other language digits.
• Marcel -- interestng thought. Though one of the problems with that plan is that it puts digits in a wierd place compared to other characters (since the PUA has weight that puts it somewhere very unlike where numbers go).

The fact that these code points already have weight that needs to be primary between them will also lead to problems

And of course it still isn't an unbounded answer....

FWIW, I am not going to be hiring anyone based on their anwsers. But I am impressed at the quality of the answers. :-)
• If you run a search on filenames via Explorer, the results always seem to come back in strict order (ie not using StrCmpLogicalW) and also in lower-case for some reason. Can you shed any light on this? It's a little frustrating trying to filter a folder using a filename substring and having the results presenetd differently to a regular Explorer listing.
• > Though one of the problems with
> that plan is that it puts digits
> in a wierd place compared to
> other characters

Therefore the prepended '0', which makes all those private values primarily sort into the same spot (of U+0030) and only secondarily use the PUA value.

But true, it's not unbound, though the range should be big enough to sort a few photos. Anyway, it was good enough for me considering I came up with it after a long night's work at 7am in my time zone ;-)
• Hi Mike -- The Shell filtering functionality and the way help indexing works are both topics I will have to cover another day. I find them frustrating, too....

Marcel -- I understood that part, I was more interested in how you would handle the string that was engineered to look the same (i.e. sends those exact PUA values). I am not slsmming the idea though, I was just pointing out the problems -- so you wouldn't feel like it had not been thought of and discarded regretfully. :-)
Page 1 of 4 (55 items) 1234