I will start several articles describe my thought on UTF-16 support in SQL Server. Before I start to discuss this topic, I use the content "Supplementary Characters, Surrogate Pairs and SQL" to describe the difference between UTF-16 and UCS-2:
UTF-16: This encoding stores the basic Unicode characters using single 16 bit units and others characters using two 16 bit units. UTF-16 is the primary encoding mechanism used by Microsoft Windows 2000, Windows 2000 Server, Windows XP and Windows 2003 Server. On the other hand SQL server 2000 uses the UCS-2 encoding schema to store Unicode data.
Note: The UCS-2 encoding scheme is actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is identical to the encoding of the same code point in UTF-16. Also, most new implementations using the Unicode standard now employ UTF-16, UTF-8 or UTF-32 instead of UCS-2.
I recommend your guys read the above article, Wiki page UTF-16/USC-2 and also Supplementary Characters in SQL Server 2005. Also, this article explains why we call Windows support UTF-16 encoding even the core Unicode data type WCHAR is not surrogate aware.
To answer the question of how SQL Server can support UTF-16 Encoding for nvarchar data type, we need to solve following two issues:
There are many cases that a good UTF-16 data might become invalid, so we need to avoid such case. Using correct character semantic is one case we certainly need to do. There are other cases that we might generate invalid data from good data. Examples are: when sending/receiving data to/from the client, we might send/receive data in chunk, then we need to be careful of not break the surrogate data. Another case is for nvarchar(max) and ntext type which store data in multiple physical pages, then store a surrogate pair in different page might have potential issue.
Another question I like to raise is that what if we failed to do UTF-16 data validation. For example, suppose the data stored on the disk already invalid (this can happen if we up-grade from old version of SQL Server, and there are some invalid UTF-16 data stored. Of course, we can do data validation during up-grade, but it will be a size of data operation and we usually want to avoid it), what is the result of calling a string function with such invalid data as input. We can 1) throw an error and terminate the statement 2) treat a single surrogate point as other normal character 3) or change the invalid data to a valid data (such as a question mark ? which we did when converting a nvarchar data into varchar data, but the data is not in the code page associated with the varchar type). Case 2 raise another interesting question which is that can we just treat a single surrogate point is still a valid character, and don't do any data validation?
Let us end today's talk with an example of UTF-16 Support in SQL Server today. In SQL Server 2005, we have XML data type support. The XML data type is internally stored as UTF-16 encoding (we use varbinary(max) data type internally, but the content itself is UTF-16 encoded text). In particular, single surrogate point is not allowed, and SQL Server will automatically detect such invalid data when doing XML parsing. Some of the XQuery, such as substring function internally call SQL Server's substring function, which made these function surrogate unsafe.