I will start a series of articles describing my thoughts on UTF-16
support in SQL Server. Before I discuss
this topic, let me quote from "Supplementary
Characters, Surrogate Pairs and SQL" to
describe the difference between UTF-16 and UCS-2:
UTF-16: This encoding stores the basic Unicode
characters using single 16-bit units and other characters using two 16-bit
units. UTF-16 is the primary encoding mechanism used by Microsoft Windows 2000,
Windows 2000 Server, Windows XP and Windows Server 2003. On the other hand, SQL
Server 2000 uses the UCS-2 encoding scheme to store Unicode data.
Note: The UCS-2 encoding scheme is
actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is
identical to the encoding of the same code point in UTF-16. Also, most new
implementations using the Unicode standard now employ UTF-16, UTF-8 or UTF-32
instead of UCS-2.
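To make the difference concrete, here is a minimal sketch in Python (not SQL Server code) showing the 16-bit code units UTF-16 produces: a BMP character takes one unit (identical to its UCS-2 encoding), while a supplementary character takes two units, the surrogate pair. The helper `utf16_units` is mine, introduced just for illustration.

```python
def utf16_units(text: str) -> list[int]:
    """Return the sequence of 16-bit code units for a string."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

# U+0041 'A' is a BMP character: one unit, identical in UCS-2 and UTF-16.
print([hex(u) for u in utf16_units("A")])            # ['0x41']

# U+1D11E (musical G clef) is a supplementary character: two units,
# a high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF).
print([hex(u) for u in utf16_units("\U0001D11E")])   # ['0xd834', '0xdd1e']
```

A UCS-2 implementation sees the last string as two separate "characters"; a surrogate-aware UTF-16 implementation sees one.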
I recommend you read the above article, the Wikipedia page on UTF-16/UCS-2, and also Supplementary
Characters in SQL Server 2005. The latter
article explains why we say Windows supports UTF-16 encoding even though the core
Unicode data type WCHAR is not surrogate aware.
To answer the question of how SQL Server can support UTF-16
encoding for the nvarchar data type, we need to solve the following two issues:
There are many cases in which good
UTF-16 data might become invalid, so we need to avoid such cases. Using correct character semantics is one thing
we certainly need to do. There are other
cases in which we might generate invalid data from good data. For example, when sending or receiving data to or from the client,
we might transfer the data in chunks, and we need to be careful not to break
a surrogate pair. Another case is the nvarchar(max) and ntext types, which
store data in multiple physical pages; storing the two halves of a surrogate pair on different
pages is a potential issue.
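The chunking hazard above can be sketched in a few lines of Python (again, an illustration rather than SQL Server code): splitting a UTF-16 byte stream at an arbitrary code-unit boundary can land between the high and low surrogate, turning valid data into invalid data.

```python
text = "ab\U0001D11Ecd"            # contains one surrogate pair
data = text.encode("utf-16-le")    # 12 bytes: 6 code units of 2 bytes each

# Split after 3 code units (6 bytes): the cut lands between the high and
# low surrogate of U+1D11E.
chunk1, chunk2 = data[:6], data[6:]

try:
    chunk1.decode("utf-16-le")     # chunk1 ends with a lone high surrogate
except UnicodeDecodeError as e:
    print("chunk1 is no longer valid UTF-16:", e.reason)
```

The full stream is still valid once reassembled; only a consumer that decodes each chunk independently sees invalid data, which is exactly why chunk boundaries must respect surrogate pairs.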
Another question I would like to raise is what
happens if we fail to validate UTF-16 data.
For example, suppose the data stored on disk is already invalid. This
can happen if we upgrade from an old version of SQL Server that has some
invalid UTF-16 data stored; of course,
we could validate the data during the upgrade, but that would be a size-of-data operation,
which we usually want to avoid. What is the result of calling a string
function with such invalid data as input?
We can 1) throw an error and terminate the statement; 2) treat a lone
surrogate code point like any other normal character; or 3) change the invalid data into
valid data, such as a question mark (?), which is what we do when converting nvarchar
data into varchar data and the data is not in the code page associated with
the varchar type. Case 2 raises another
interesting question: can we just treat a lone surrogate code point
as a valid character, and skip data validation entirely?
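Python's codec error handlers happen to map onto all three strategies, so they make a convenient sketch. The handler names (`strict`, `replace`, `surrogatepass`) and the U+FFFD replacement character are Python's conventions, not SQL Server's; the post mentions '?' as SQL Server's substitute in the nvarchar-to-varchar case.

```python
# A lone high surrogate encoded as raw UTF-16-LE bytes: invalid data.
invalid = b"\x34\xd8"   # high surrogate U+D834 with no low surrogate

# Strategy 1: throw an error and terminate.
try:
    invalid.decode("utf-16-le", errors="strict")
except UnicodeDecodeError:
    print("strategy 1: error raised")

# Strategy 2: pass the lone surrogate through as if it were a normal
# character ('surrogatepass' skips the validation step).
passed = invalid.decode("utf-16-le", errors="surrogatepass")
print(hex(ord(passed)))  # 0xd834

# Strategy 3: substitute a valid character for the invalid unit
# (Python uses the replacement character U+FFFD here).
print(invalid.decode("utf-16-le", errors="replace"))
```

Strategy 2 is the cheapest but means invalid data silently flows through every downstream string function, which is precisely the trade-off the question above is about.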
Let us end today's talk with an example of UTF-16 support in
SQL Server today. In SQL Server 2005, we
have XML data type support. The XML data
type is internally stored with UTF-16 encoding (we use the varbinary(max) data type
internally, but the content itself is UTF-16 encoded text). In particular, a lone surrogate code point is not
allowed, and SQL Server will automatically detect such invalid data during
XML parsing. However, some of the XQuery functions, such as
the substring function, internally call SQL Server's substring function, which makes
these functions surrogate-unsafe.
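To close, here is a sketch of why a substring that counts 16-bit code units, as a UCS-2-era implementation would, is surrogate-unsafe: it can return a string that ends in the middle of a surrogate pair. The function `substring_by_units` below is a hypothetical stand-in for such an implementation, written in Python for illustration only.

```python
def substring_by_units(text: str, start: int, length: int) -> bytes:
    """Substring measured in 16-bit code units, like a surrogate-unaware API."""
    data = text.encode("utf-16-le")   # 2 bytes per code unit
    return data[start * 2:(start + length) * 2]

text = "a\U0001D11Eb"                     # 'a', surrogate pair, 'b': 4 code units
result = substring_by_units(text, 0, 2)   # takes 'a' plus the high surrogate only

try:
    result.decode("utf-16-le")
except UnicodeDecodeError:
    print("result is not valid UTF-16: the surrogate pair was split")
```

A surrogate-aware substring would instead treat the pair as one character and move the boundary so it never falls between the two halves.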