As discussed here, in order to support UTF-16 encoding in SQL Server, we need to handle both "good" data and "bad" data. Today, I will discuss different approaches to adding UTF-16 support in SQL Server.  For each approach, I will describe the advantages and disadvantages, and leave readers to judge which one they prefer.

The first approach mainly focuses on handling good data.  Since the existing string functions are not surrogate aware, and can break a surrogate pair (good data) into lone surrogate code points (bad data), why not just introduce a set of new functions which use the correct character length semantics?  For example, we could introduce char_len, char_substring, char_left, and char_right functions, which treat a surrogate pair as one character.  For the existing string functions, the semantics and results would not change.  As for "bad" data, I would say we probably have to treat it as "good" data, since we have not prevented the existing string functions from generating bad data out of good data.  The advantage of this approach is that we are only adding new functions and leaving the old functions unchanged (in terms of both functionality and performance). People who don't care about surrogate pairs can continue to use the old string functions without seeing any difference, and people who do care about surrogate pairs can start using the new functions.  The disadvantage is that it does not really solve the "bad" data issue, and people have to choose between the new string functions and the old ones.
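
To make the difference concrete, here is a small sketch of today's behavior. The char_len and char_substring results shown in the comments are only what the hypothetical new functions proposed above might return; they are not real SQL Server functions, and the existing-function results assume the current UCS-2 (code-unit) semantics.

```sql
-- Build a string containing a surrogate pair (U+1D11E, MUSICAL SYMBOL G CLEF)
-- out of its two UTF-16 code units.
DECLARE @s nvarchar(10) = N'A' + NCHAR(0xD834) + NCHAR(0xDD1E);

-- Existing semantics: the pair counts as two "characters", and SUBSTRING can
-- split it, turning good data into a lone surrogate (bad data).
SELECT LEN(@s)             AS len_code_units,   -- 3
       SUBSTRING(@s, 2, 1) AS broken_half;      -- high surrogate only

-- The hypothetical new functions would instead behave like this:
--   char_len(@s)             = 2   (the pair counts as one character)
--   char_substring(@s, 2, 1) = the whole pair, never half of it
```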

Let us talk about the second approach.  It is pretty simple: we just fix all existing string functions so that they use the correct character length semantics.  The good thing is that we will never break good data and will always return correct results.   The cost is that we may pay a certain performance penalty. For example, today the LEN() function only needs to return the number of WCHARs, which takes O(1) time; the new LEN() would have to traverse the whole string even when there is no surrogate character in it, which would be quite slow.  Another issue is that changing the results of existing functions is a breaking change: LEN() used to return 2 for a surrogate pair, but now it would return 1.  Even though the new result is correct, some customers might still want to keep the old behavior.   Also, I haven't checked the return types of these string functions (I will start another topic to cover this).  For example, if SUBSTRING(c1, 1, 1) returns nvarchar(1) today, it would need to return nvarchar(2) instead, since that one character could be a surrogate pair.   This behavior change is not trivial, and it is definitely a breaking change.   One possible solution is to use the database compatibility level: users who set their database's compatibility level lower than the current one would see the old behavior and also enjoy its better performance.  Then, what about the "bad" data?  I am in favor of fixing bad data whenever possible.   What about database upgrade?  Do we want to scan the whole database to verify that there is no bad data, or do we just assume there is none and only raise an error when we find bad data later?   I would also like to raise another question here: can we track whether there are any surrogate characters in a column of a table?  If so, can we smartly use the old algorithm for such a column?
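
To illustrate the traversal cost, here is a rough sketch of the extra work a surrogate-aware LEN() would have to do, written as an ordinary T-SQL scalar function on top of today's code-unit semantics. The function name and implementation are mine, purely for illustration, not a proposed product feature.

```sql
CREATE FUNCTION dbo.SurrogateAwareLen (@s nvarchar(max))
RETURNS int
AS
BEGIN
    -- DATALENGTH returns bytes, so dividing by 2 gives the UTF-16 code-unit count.
    DECLARE @i int = 1,
            @units int = DATALENGTH(@s) / 2,
            @chars int = 0;

    -- Unlike today's O(1) LEN(), we must walk every code unit.
    WHILE @i <= @units
    BEGIN
        -- A code unit in 0xD800-0xDBFF (55296-56319) is a high surrogate;
        -- it and the following code unit together form one character.
        IF UNICODE(SUBSTRING(@s, @i, 1)) BETWEEN 55296 AND 56319
            SET @i = @i + 2;
        ELSE
            SET @i = @i + 1;
        SET @chars = @chars + 1;
    END

    RETURN @chars;
END
```

With this sketch, dbo.SurrogateAwareLen(N'A' + NCHAR(0xD834) + NCHAR(0xDD1E)) returns 2 while LEN() returns 3, which is exactly the kind of result change (and extra per-string cost) described above.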

I have a third solution in mind, which is to introduce the concept of an encoding at the database level.  A database would have an Encoding option which indicates the encoding of the nvarchar type. The default encoding would be UCS-2, which is the encoding we support today, so all existing behavior would stay the same.  A UTF-16 encoding would be introduced and could be set at the database level.  Once the database encoding is set to UTF-16, all string functions would follow UTF-16 length semantics, and we would perform data validation as well.  The advantage of this approach is that users keep the better performance of the old behavior by default and have no upgrade issues, while new users can opt into the new behavior only when they intend to.  What about changing the encoding from UCS-2 to UTF-16: do we need to do data validation then?  I don't know the answer.  By the way, other database vendors use the same kind of mechanism as well.
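
The option might be surfaced with syntax along these lines. This is purely hypothetical: no ENCODING database option exists today, and the keyword and values are just placeholders for the idea.

```sql
-- Hypothetical syntax only: ENCODING is not a real database option.
ALTER DATABASE MyAppDb SET ENCODING = UTF16;

-- After the switch, string functions would follow UTF-16 length semantics, e.g.
-- LEN(N'A' + NCHAR(0xD834) + NCHAR(0xDD1E)) would return 2 instead of 3.
```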

Finally, I would like to briefly describe other approaches I thought of but personally don't like. How about introducing new collations, such as latin1_general_UTF16_ci_as?  We already have enough collations to confuse our customers, so I am not in favor of adding more.  How about introducing a column-level encoding?   I believe that in most cases people want the whole database to behave the same way, so such fine-grained control over the encoding might not be interesting to our customers.
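
Just to make these two rejected ideas concrete, they might have looked something like the sketch below. Both the collation name and the column-level ENCODING clause are hypothetical and do not exist in SQL Server.

```sql
-- Hypothetical only, shown for comparison with the database-level option above.
CREATE TABLE dbo.Articles
(
    Title nvarchar(200) COLLATE latin1_general_UTF16_ci_as,  -- new-collation idea
    Body  nvarchar(max) ENCODING UTF16                       -- column-level-encoding idea
);
```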

In summary, performance and backwards compatibility are the two issues that need to be considered carefully when implementing UTF-16 in SQL Server.   My next blog will examine all the string functions to see how they are impacted by a UTF-16 encoding.

By the way, Michael Kaplan posted a series of articles here describing his ideas for a UCS-2 to UTF-16 migration for SQL Server.  I have some difficulty following his ideas, but I can guess at one basic principle: in some cases, a sequence of Unicode code points forms a single character from the end user's point of view.  Examples are the two small 's' letters in German, Korean Jamo, and Thai.  Breaking such a character sequence is not what end users want.  The question, then, is whether the new functions should also take such "true" characters into consideration.  Here, I will quote the Unicode FAQ entry "How should characters (particularly composite characters) be counted, for the purposes of length, substrings, positions in a string, etc.?" to close this topic:

A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string. 
"a" + umlaut + greek_alpha + \uE0000. (the latter is a private use character)

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.

3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Text Boundaries, as well as in UTS #18 Regular Expressions.

The choice of which one to use depends on the tradeoffs between efficiency and comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all low-level string operations, and then also supply layers above that provide for #2 and #3 boundaries when circumstances require them. This approach allows for efficient processing, with allowance for higher-level usage. However, for a very high level application, such as word-processing macros, graphemes alone will probably be sufficient.
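
As a quick check, the first two of these counts can be reproduced in T-SQL today by building the FAQ's sample string out of the code units shown in its UTF-16BE dump. This reuses the dbo.SurrogateAwareLen sketch from earlier in this post; grapheme counting (which would give 3 here, since "a" plus the umlaut is one grapheme) has no T-SQL built-in.

```sql
-- "a" + combining umlaut + greek alpha + the supplementary character
-- encoded by the surrogate pair DB40 DC00.
DECLARE @sample nvarchar(10) =
    N'a' + NCHAR(0x0308) + NCHAR(0x03B1) + NCHAR(0xDB40) + NCHAR(0xDC00);

SELECT DATALENGTH(@sample)            AS utf16_bytes,  -- 10, as in the FAQ
       LEN(@sample)                   AS code_units,   -- 5
       dbo.SurrogateAwareLen(@sample) AS code_points;  -- 4, as in the FAQ
```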