1. We know that the Surrogate Characters have two bytes when counting, so when we design this character in DB table, we should set double size of your original type, shouldn't we?
2. When comparing string, if we can't use Binary collating sequence, then we will got different result from we think. So should we do some process to change collating sequence before comparing?
3. Since that Unicode will be supported limitedly, what should we pay our attention to?
To answer this question, I will briefly discuss the basic concept of Surrogate character and SQL Server's Unicode support. Unicode standard is a standard encoding to encode all character in the word. In Unicode standard, a Unicode code point is represented by using a sequence of Hex value from U+0000 to U+10FFFF. These code points are categorized into 16 planes. A Plane is a range of 65,536 (1000016) contiguous Unicode code points. Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000...U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (1016) is U+100000..10FFFF. U+D800 to U+DFFF are reversed for encoding other planes in UTF-16 (discussed below). We call planes 1 to 16 as Supplementary Planes, and characters defined in these plains as Supplementary Character.
The Unicode Organization defined following Unicode encoding forms: UTF-8, UTF-16, and UTF 32. Each encoding form maps the Unicode code points to unique code unit sequences. In UTF-32, each character is encoded with 4 bytes. In UTF-8, the character can be encoded by using one, two, three, or four bytes. For UTF-16, we encode the plane 0 (U+0000 to U+FFFF) with 2 bytes, and 4 bytes for all other planes. The 4 byte code point in UTF-16 should always follow rules: the first 2 bytes is in the range of 0xD800-0xDBFF) and the second Unicode character is in 0xDC00-0xDFFF. We call such 4 bytes encode character as Surrogate pair. A surrogate pair represents a single Supplementary character; it should not be break into small piece.
What is defined in Supplementary planes?
Plane 1 currently has three allocation areas: a General Scripts Area at the beginning of theplane, containing various small historic scripts; a Large Historic Scripts Area, currently containing Sumero-Akkadian Cuneiform; and a Notational Systems Area, which currently contains sets of musical symbols, alphanumeric symbols for mathematics, and a system of divination symbols similar to those used for the Yijing.
Plane 2 consists primarily of one big area, starting from the first code point in the plane,dedicated to more unified CJK character encoding. Then there is a much smaller area,toward the end of the plane, dedicated to additional CJK compatibility ideographic characters—which are basically just duplicated character encodings required for round-trip conversionto various existing legacy East Asian character sets. The CJK compatibilityideographic characters in Plane 2 are currently all dedicated to round-trip conversion forthe CNS standard and are intended to supplement the CJK compatibility ideographic charactersin the BMP, a smaller number of characters dedicated to round-trip conversion forvarious Korean, Chinese, and Japanese standards.
SQL Server's Unicode support.
In SQL Server, we use the UCS-2 encoding for nvarchar/nchar types. The UCS-2 treats each 2 bytes as one Unicode character, and thus it treat a Surrogate pairs as two characters. As a result, all string functions which take position as input/output will return "wrong" result/break the surrogate pair when the input string contains surrogate pairs. Here is the high level information about this topic:
My suggestion to the question:
It really depends on the data distribution of your input data. If nearly all your characters are Surrogate character, then it make sense to double it. However, as far as I knew, the surrogate Characters defined in the Unicode CJK Extension B session are rarely used by Chinese people, which lead me to guess that surrogate characters will rarely used. It is true that the CLR Supplement aware code can handle surrogate without breaking it, but keep in mind that the CLR-version of string operation will be far slower that SQL server's string operation. For people who can read Chinese, This paper shows that without using CJK Extension B (which is defined in Unicode Surrogate Plane 1), the Chinese characters defined in Unicode can cover 99.99% of Modern Chinese characters.
You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stoke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison. I do suggest you use these collation as your server/database/table collation instead of passing the collation name during comparison.
I think each string function has two properties: 1) whether it is sensitively to collation, i.e., whether it returns different result for different collation 2) whether it can handle surrogate characters without break it. Suppose the input of these functions are Nvarchar/NChar type, here is the information about whether these functions are collation aware, and whether they are surrogate aware.
The length of Surrogate pair is 2 instead of 1
LTRIM/RTRIM will only remove the "Space" (Unicode U+0020). Existing Surrogate characters are not affected.
Reverse will always change the order of a surrogate pair, cause unable to display/interpret the character
These functions might break surrogate pairs
Depends on the usage
These functions wouldn't break the surrogate character, and they are correctly search any input characters, including surrogate chars. However, the return result is not surrogate aware (i.e., it count 2 for surrogate character).
If the input parameter is surrogate pairs, the return value does not break them.
This function might break surrogate pairs
If the input is a surrogate pair, it only return the code point for first character.
Order By, Group By, String Comparison
Note, since we treat each character as 16bit code-point, all functional which have position as input parameter will not be surrogate aware, such functions include: LEN, LEFT, RIGHT, SUBSTRING. On the other hand, Like, Ordering, Grouping and String Comparison are collation aware operation, if user selects the appropriate collation, we have no problem with Surrogate character. Reverse always break the surrogate input.
The while paper “International Features in Microsoft SQL Server 2005” at http://msdn.microsoft.com/en-us/library/bb330962.aspx is a good reference.