How do those lyrics go again?

Wild[card] thing
You make my CHAR sing.
You make everything query
Wild[card] thing.

Wild[card] thing, your results might move me.
But I want to know for sure.
C'mon I'll run with you one more time.
Your results move me.

Or something like that?

Time to get serious now....

Documentation such as that for the LIKE Predicate in SQL Server's T-SQL queries is always something that one can easily look up if needed (though most developers who use SQL Server tend to just sort of learn these things, and they don't feel the need to look them up too often!).

In any case, an interesting point came up (while I was looking at another issue) that I have never seen any documentation about and that several people I asked honestly had no idea about either. First a bit from the above Books Online topic, where wildcards are explained:

Wildcard Description Example
% (percent) Matches zero or more of any character. 'comp%r' matches 'comp' followed by zero or more of any characters, ending in an r.
_ (underscore) Matches any single character. 'comp_ter' matches 'comp' followed by exactly one of any character, followed by 'ter'.
[ ] (square brackets) Matches any single character within the specified range or set. For example [a-z] specifies a range; [aeiou] specifies the set of vowels. 'comp[a-z]re' matches 'comp' followed by a single character in the range of lowercase a through lowercase z, followed by 're'.

'comp[ao]' matches 'comp' followed by a single character that must be either a lowercase a or a lowercase o.

[^ ] (caret) Matches any single character that is not within the specified range or set. For example, [^a-z] specifies a range that excludes lowercase a through lowercase z; [^aeiou] specifies a set that excludes lowercase vowels. 'comp[^u]' matches 'comp' followed by any single character that is not a lowercase u.

Now perhaps if you are a regular reader and remember reading When good SQL queries have trouble...., you may have some idea where I am going with this. :-)

Anyway, let's take a bit of SQL and try some things out. For the purposes of these queries, keep in mind that Traditional Spanish considers the letters ch to be a single unique sort element between c and d....

use master
IF DB_ID (N'dbTest') IS NOT NULL DROP DATABASE dbTest;
CREATE DATABASE dbTest;
GO
use dbTest
CREATE TABLE tblTest (
    colTest varchar (4) COLLATE Latin1_General_CS_AS_KS_WS NULL
)
GO
INSERT INTO tblTest (colTest) VALUES ('chap')

SELECT * FROM tblTest;

SELECT
* FROM tblTest WHERE colTest LIKE '_ap';

SELECT
* FROM tblTest WHERE colTest LIKE '_hap';

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '_ap';

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '_hap'

So if you just go ahead and run this whole script, here will be the results of the five queries:

  • QUERY #1: 1 row
  • QUERY #2: 0 rows
  • QUERY #3: 1 row
  • QUERY #4: 1 row
  • QUERY #5: 0 rows

There you have it! When SQL Server documentation claims that "_ (underscore) matches any single character", what it clearly means is not exactly a single CHARACTER at all. It means a single SORT ELEMENT!

(I have talked about sort elements in previous posts such as this one and this one.)

Of course there are problems here, since both the square brackets and caret style wildcards also refer to single characters but will only refer to sort elements in what they do not specify; everything specified has to be a single UTF-16 code unit, in most cases. Looking at some more queries:

SELECT * FROM tblTest WHERE colTest LIKE '[b-d]ap';            /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest LIKE '[b-d]hap';           /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[b-d]ap';    /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[b-d]hap';   /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[ch-d]ap';   /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest LIKE '[^ch]ap';            /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^ch]ap';    /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^ch-d]hap'; /* Returns 1 row  */

SELECT * FROM tblTest WHERE colTest
   
COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^cdh]hap';  /* Returns 0 rows */

Clearly in all of these cases, just as in the ones before, it is assuming that what you type into the expression is individual UTF-16 code units, even though the engine will usually (maybe always?) treat compressions as single characters later on. This makes the whole definition of "single character" pretty fluid since it means different things depending on whether you are looking at the query beforehand or the engine running the query later.

So has anyone noticed this before and I (and those people writing docs) just haven't ever run across it?

I mean, is that really possible?

Or is this really a pseudo-stealth "feature" created by SQL Server's implementation of linguistic collation that nobody ever noticed before?

And either way, how on earth can this be "fixed" in the future, or even documented? I mean, given the need to avoid breaking existing (albeit undocumented) behavior?

This post brought to you by  C (U+0043, a.k.a. LATIN CAPITAL LETTER C)