Hi Folks,

I’m going to pick up from last time and discuss another major problem with moving databases from one place to another: collations. Some people will have never run into collation issues because they have a consistent collation across all of their environments: they have the same collation on all of their instances, and all databases share that same collation as well.

Others are not so lucky.

Let me illustrate the problem with a few concrete examples. The most common example people encounter involves mismatches between their database collation and the collation used by tempdb. For the purposes of this post, we’ll assume that the tempdb collation is the same as the instance collation. This is not strictly true, but is almost always the case.

Here’s the typical way this rears its ugly head. A developer creates a table and a temp table, and joins the two:

CREATE TABLE foo (x varchar(100))
GO
CREATE TABLE #bar (y varchar(100))
GO

-- ...

SELECT foo.x, #bar.y
FROM foo JOIN #bar ON foo.x = #bar.y

When the developer created this, the instance and database had the same collation, say SQL_Latin1_General_CP1_CI_AS. But at some later point, an administrator moved the database to a new instance with a different collation. It doesn’t matter what collation this is, but for purposes of this example, let’s say it was Turkish_100_CI_AI. Now when we create the temp table and perform the join, things don’t go so well:

CREATE TABLE #bar (y varchar(100))
GO

-- ...

SELECT foo.x, #bar.y
FROM foo JOIN #bar ON foo.x = #bar.y

This results in:

Msg 468, Level 16, State 9, Line 5
Cannot resolve the collation conflict between "Turkish_100_CI_AI" and 
"SQL_Latin1_General_CP1_CI_AS" in the equal to operation.

To get around this, the developer has to rewrite things slightly:

CREATE TABLE #bar (y varchar(100) COLLATE DATABASE_DEFAULT)
GO

 -- ...

SELECT foo.x, #bar.y
FROM foo JOIN #bar ON foo.x = #bar.y

Now things work as we’d like: the COLLATE DATABASE_DEFAULT statement ensures that the collation used for the data in the temp table will match the database, regardless of the instance collation. But it’s unlikely that the developer would have proactively done this, and it’s a pain to do in any case.

This example is fairly common. I want to emphasize that the next example is not: I have never seen or heard of this happening in the wild; the example was invented by one of our developers to illustrate what could go wrong. I would be very interested to hear if this has ever occurred for real.

Alright: let’s blow something up!

I am going to generate some collation-dependent behavior that manifests in a very bad way. To do this, I first need to identify some collation-dependent behavior, and for this example, I’m going to exploit a difference between Latin1_General and Turkish collations. I’ve chosen this because it’s kind of fun, but you can engineer similar behavior just case sensitivity. Also, to understand what’s going on, remember that variable binding is done using the instance collation. I.e., unlike most languages where there is only one way for variables to bind, in T-SQL you can change this by changing the instance collation.

So let’s play with the Turkish “I”. Imagine I have variables named @I and @İ. (Note that the first one is a usual “I”, but the second one has a dot.) These are different variables in both Latin1_General and Turkish collations. But what happens if I refer to @i (a little “i” with a dot) and my instance collation is case insensitive? It depends:

  • In Latin1_General, @i binds to @I.
  • In Turkish, @i binds to @İ.

My next step is to amplify this difference a bit:

CREATE FUNCTION snafu (@x int) RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @I int = 1
    DECLARE int = 2
    RETURN @i * @x
END

This function exhibits behavior that is sensitive to the instance collation. On an instance with a case-insensitive Latin1_General collation, it’s the identity function on integers: for each input @x, it just returns @x times one. But on an instance with a case-insensitive Turkish collation, it doubles its input.

Now we can do something truly horrible. Imagine that we have our database on an instance with a Latin1_General collation, and we create:

CREATE TABLE baz (a int, b as dbo.snafu(a) PERSISTED)
GO

If we insert a few values and try a couple simple commands:

INSERT INTO baz VALUES (1), (2), (3)
GO

SELECT a, b
FROM baz
GO

DBCC CHECKTABLE (baz)
GO

We get back:

a           b
----------- -----------
1           1
2           2
3           3

(3 row(s) affected)

DBCC results for 'baz'.
There are 3 rows in 1 pages for object "baz".
DBCC execution completed. If DBCC printed error messages, 
contact your system administrator.

Everything looks fine. But now if we move the database to our Turkish instance and rerun the DBCC we get:

Msg 2537, Level 16, State 106, Line 1
Table error: object ID 885578193, index ID 0, partition ID 72057594039173120, 
alloc unit ID 72057594043498496 (type In-row data), page (1:223), row 0.
The record check (valid computed column) failed. The values are 2 and 0. Msg 2537, Level 16, State 106, Line 1 Table error: object ID 885578193, index ID 0, partition ID 72057594039173120,
alloc unit ID 72057594043498496 (type In-row data), page (1:223), row 1.
The record check (valid computed column) failed. The values are 2 and 0. Msg 2537, Level 16, State 106, Line 1 Table error: object ID 885578193, index ID 0, partition ID 72057594039173120,
alloc unit ID 72057594043498496 (type In-row data), page (1:223), row 2.
The record check (valid computed column) failed. The values are 2 and 0. DBCC results for 'baz'. There are 3 rows in 1 pages for object "baz". CHECKTABLE found 0 allocation errors and 3 consistency errors in table 'baz'
(object ID 885578193). repair_allow_data_loss is the minimum repair level for the errors found by
DBCC CHECKTABLE (foo.dbo.baz). DBCC execution completed. If DBCC printed error messages, contact your system
administrator.

Ack! Our table is corrupted! What’s going on? We’ve persisted values of a function that exhibits a dependency on the instance collation, and these values are no longer valid on the new instance. We can illustrate this another way by inserting the same values again and looking at the result:

INSERT INTO baz VALUES (1), (2), (3)
GO

SELECT a, b
FROM baz
GO

a           b
----------- -----------
1           1
2           2
3           3
1           2
2           4
3           6

(6 row(s) affected)

The values in column b should be dependent solely on those in column a. They aren’t.

The takeaway from this nightmare is that databases in SQL Server exhibit a strong dependency on the instance collation. There are other instance-level settings that have similar bindings, but collations are probably the most pervasive.

As with authentication, we’ve addressed this with contained databases. But before moving on to solutions, I want to step back and frame this problem a bit more generally. And that will be my next post.

Cheers,
-Isaac