In SQL Server, the distance between A and Z is wider than you might think!

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

In SQL Server, the distance between A and Z is wider than you might think!

  • Comments 8

The other day George asked me via the Contact link:

I use SQL Server quite a bit, but I just hit some confusing behavior reported by a customer. It isn't a localization problem but it does involve sorting or at least matching.

I think I have tracked down the problem, but I'm not sure. Is it possible that

    LIKE '[A-Z]'

and

    LIKE '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'

would return different results?

Do you know what is going on here?

Indeed I do, George. And it is an internationalization problem, and a collation one, too!

If you look at the Windows table (which are slightly ahead of the SQL Server ones), there are about 1,229 letters in the Latin script, all but 20 of which are between A and Z, inclusive.

Now if you take the SQL Server case the numbers look slightly less scary (something like 979 letters in the Latin script, all but 15 of which are between A and Z, inclusive) but either way it puts that expression with the explicit ABCDEFGHIJKLMNOPQRSTUVWXYZ to shame, doesn't it? :-)

Now obviously if you are not using Unicode columns then you aren't really missing out on either 1,209 or 984 characters, but even then there will be some letters missed.

I'll get more info some other interesting nuances about these expressions another day....

 

This post brought to you by Z (U+005a, LATIN CAPITAL LETTER Z)

Comment on the blather
Leave a Comment
  • Please add 7 and 5 and type the answer here:
  • Post
Blog - Comment List
  • I may be a bit thick but I can't see how or why [A-Z] should be different to [ABC....XYZ]. If A = Unicode U+41 and Z = Uncode U+5a, then that to me means 26 characters and it makes no sense to interpret it any other way. I am no expert but to me it just does not make any sense for it to be otherwise

  • It does not work by saying "these discrete code units by their code points" -- it works by saying "everything that sorts between these letters" -- and all of these letters.

    And all of those various "A" like, "B" like, "C" like, and son on characters sort between.....

    Not sure if I'd call that something that makes you thick, by any means -- but there are two different possible ways for this to work, and it works in one way and you are expecting it to work in the other. That's all. :-)

  • Yesterday, I was blathering about how In SQL Server, the distance between A and Z is wider than you might

  • Previously I wrote about how In SQL Server, the distance between A and Z is wider than you might think

  • You might have started to sense a pattern developing here with the last few posts in this series: In

  • Previous posts in this series: In SQL Server, the distance between A and Z is wider than you might think

  • Previous posts in this series: In SQL Server, the distance between A and Z is wider than you might think

  • Previously in this series, I have talked about many things. Like how the distance between A and Z is

Page 1 of 1 (8 items)