Today,   I will give an example of different implementation of a string function.  Let us look at function:

SUBSTRING(value_expression ,start_expression , length_expression )

In here, I proposed four kinds of alternative functions by giving them different name to distinct the functionality of them.

1.       USC2-SUBSTRING.  This function has the same functionality as the SUBSTRING function.

2.       UTF16-SUBSTRING: This function take UTF-16 string as input, return type is also UTF-16 string

3.       UTF8-SUBSTRING: This function take UTF-8 string as input, return type is also UTF-8 string

4.       CHAR_SUBSTRING: This function is a linguistic version of substring, it wouldn't break the character in the given language, and the encoding can be either UCS-2/UTF-8 or UTF16.

For the length semantic, I will assume that it is character length with respected to the corresponding Encoding. I will use following table to show the differences between these functions when calling function:

                Declare @a nvarchar(10)

               select substring(@a, 5, 2)

, suppose the 10 here represents 10 characters  

 

Function Name

Input Encoding

Output Encoding

Output Type

Output  Buffer Bytes

Surrogate Aware

Collation

Sensitive

Algorithm

Complexity

USC2-SUBSTRING

UCS-2

UCS-2

Nvarchar(2)

4

No

No

Simple, and Fast

UTF16-SUBSTRING

UCS-16

UCS-16

Nvarchar(2)

8

Yes

No

Always scan from the beginning, need to handle surrogate pair.

UTF8-SUBSTRING

UTF-8

UTF-8

Nvarchar(2)

8

Yes

No

Always scan from the beginning. See (1)

CHAR_SUBSTRING

Variance

=input

5, see (2)

Depends on encoding

Yes, see (3)

Yes

More complex. See (4)

 

(1)     For UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and call UTF-16 version of the algorithm.  But it will have double conversion since the result need to convert back to UTF-8.

(2)    It is unknown for how many bytes one linguistic character will take until the run-time.  So we have to make the result type = input_length- input_index

(3)    A surrogate pair is always treated as one linguistic character, just it is surrogate aware

(4)    The function itself will be pretty complex. For looking for one character, it needs search deeper and the move the pointer back.  

In addition, I like to give some alternative solution for UTF16_SUBSTRING:

1)      We call UCS2-SUBSTRING internally, and the last character it return a leading surrogate character, just removing such character from the result. The advantage of this approach is that it is fast, and it wouldn't break surrogate pairs.  The drawback is that it doesn't respect UTF-16 length semantic, which will lead programming issues.

2)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(2), and here the 2 means that four UCS-2 characters.  I.e., we keep the existing return type, and semantic as the same.  Again, the result can only hold one UTF-16 character instead of 2.

3)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(4), and here the 4 means that four UCS-2 characters.  I.e., we obey the UTF-16 length semantic for input parameter, but not for output.   In these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and UCS-16 encoding.  But we still break the return type from nvarchar(2) to nvarchar(4).

Finally, I like to point out that we have CLR examples of Supplementary-Aware (Supplementary character refer to a Surrogate Pair) String Manipulation at here.  It internally uses StringInfo.ParseCombiningCharacters Method.  It not only handles Supplementary characters, but according to the API, it handles other combining character as well.  Note, the different between this method and the CHAR_SUBSTRING I presented here is that there are more language specific rules or character sequences which cannot be handled by the C# method.