Collation, DateTime, SParse Column and XML

  • Float DataType is evil

    Let us try following T-SQL script on SQL Server:

    declare @a float =300000000000000000000000000000000000
    select @a + 50 -@a
    select @a -@a +50
    go

    Guess what is the result?  the first is 0. and the second result  is 50.

    Floating point data is approximate; therefore, not all values in the data type range can be represented exactly. When you add a small number with a very big number,  the small number might just lost in the end result.

    As a end-user, we want consistent and correct result.  However,  float data type has no correct result since the value is not accurate.  Query result on float data type is also not consistent as well.  In the above example, we change the order of + and – operator, we will see different result.   Suppose we want aggregate float values, such as sum(floatcolumn),  SQL Server may choose to parallel scan the whole table with different thread, and sum up the result together.  In this case, the order of data values are random, and the end-result will be random. In one of our customer’s case,  he run the same query multiple times, and each time he get a totally different result.

    So, please try to avoid using float data type, especially  you want to do some aggregation on float types.  A workaround is to covert the type to a numeric type before doing aggregate.  For example, support X is float

    you can use  sum(cast(x as decimal(30,4))) to get consistent result

  • Tertiary Collation and the performance impact on order clause

    Today, One customer asked about collation SQL_Latin1_General_CP1_CI_AI with non unique nonclustered index we are getting a sort on the query plan.  The sample script is:

    CREATE TABLE TableWithASColumn(ID INT PRIMARY KEY, CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS)

    CREATE NONCLUSTERED INDEX IX_CharData ON TableWithASColumn(CharData)

    CREATE TABLE TableWithAIColumn(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI)

    CREATE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumn(CharData)

    CREATE TABLE TableWithAIColumnUniqueIndex(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI)

    CREATE UNIQUE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumnUniqueIndex(CharData)

    SELECT ID, CharData FROM TableWithASColumn ORDER BY CharData

    SELECT ID, CharData FROM TableWithAIColumn ORDER BY CharData

    SELECT ID, CharData FROM TableWithAIColumnUniqueIndex ORDER BY CharData

    If you run the script inside SQL Server Management Studio, and Show the Actual Execution Plan,  you will see below picture:

    clip_image002

     

    At a glance,  it seems very wired because building an index on a key column means that the index already sort the data according to the key column, we just need scan the index, and the output should already be sorted.  Another wired issue is that this only happen on certain collation.  Before we talk about the root cause,  let us discuss a little about comparison and sort. 

    A typical comparison interface is  int Compare(object Other) which compare this instance with another instance. It returns 0 if two objects are equal, –1 if this instance is smaller, and 1 if it is bigger.  The sort algorithm is implemented by calling the Compare Method for two objects, and depends on the result of Compare() method, put two objects in the correct order.   If two objects are equal, the order of these two objects on totally random since there is no other way to ordering them.

    Now, let us look at the collation cases.  Suppose I am using a case insensitive collation, such as latin1_general_ci_ai it will treat character ‘b’ and ‘B’ are equal.  So when I order by the column,  I know that character a will be before ‘b’ and ‘B’, but the order of ‘b’ or ‘B"’ will be totally random. 

    What about a user want to sort ‘b’ before ‘B’ and treat them as equal during comparison (I think it is a totally reasonable requirement).  In this case, we need a comparison case insensitive and sorting case sensitive collation.  SQL Server don’t have such collation flag to allow us to achieve this officially.  However, there are a bunch of “old” collation (When I say old, I mean these collations are not changed/updated since 2000, and are not recommended to use in future development), which actually implement this feature.  Such collations are called SQL TERTIARY Collation, sql_latin1_cp1_ci_ai is one of such kind of collation.   In such collation, we internal have a tertiary weight table which defines a weight for each character.  During sorting on columns, and if two characters are the same, we will lookup the table to find the tertiary weight of the two characters, and sorting them according to the tertiary weight. 

    As you see from the description,  if I issue an order by clause on a column with TERTIARY collation,  I can’t simply scan the index defined on column since the index is not sorting any more (index is built on top of comparison, but not sort order),  and that is reason that we have to put a SORT on top of the index scan which explain the above wired result.

    After search the Internet,  I found one solution for this issue described in KB Article at http://support.microsoft.com/kb/951935. You can also look at TERTIARY_WEIGHTS (Transact-SQL) at http://msdn.microsoft.com/en-us/library/ms186881.aspx  to get a list of SQL TERTIARY Collations

    Now, the question is that should I use such collation in my application.  My suggestion is that don’t use these collations since they are “old” collation, and they can not sort all characters correctly.  Another thing you need to keep in mind is that the above behavior ONLY apply to char/varchar column, but not NVARCHAR column.  So it is another reason is avoid using these collations (because you might want consistent behavior between char and nchar columns).

  • How to make Like Case Sensitive?

    Today, I got a question from customer:

    I have one column (Type: nVarchar) in a table which has data something like this:

    a1

    aa

    aa2

    AB

    I want to show all the rows which have all the lower case letters in the above mentioned column.

    The following query works:

    select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[abc]%' Collate Latin1_General_CS_AI  

    And Surprisingly this one does not:

    select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[a-c]%' Collate Latin1_General_CS_AI  --Would actually be [a-z]

    Is it like the case sensitivity does not work with ranges i.e. [a-c] or I am missing something here??

    Here is my answer:

    Can you try

    select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[a-c]%' Collate Latin1_General_BIN

      this should give you case sensitive result.

    The reason is that  [a-c] means include all characters which is bigger or equal than a, and smaller than c,  so uppercase A is also in the range even according to linguistic order for Latin1_General_CA_A. In other case, for case sensitive collation,  a is not equal to A, not  the order of the characters might still be  a <A <b < B <c <C, etc.

  • Unicode References

    In this article, I recommend several Unicode articles/websites for reference. Note, the list is not yet completed,  I will add more entries and make better categorization

  • Beta version of collation selecting tool availiable

    Hello, Guys

      Please the attached zip file for the SQL Server collation selection tool.  It is written in C#. 

  • SQL Server and UTF-8 Encoding (1) -True or False

    Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8.  So today's talk will be quite short. I just clarify some misunderstand.

    1.    SQL Server doesn't support Unicode, do you mean UTF-8?

        Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type.  But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding.  I copy several definitions from Internet for these concepts:

       "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.  The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as "Latin Uppercase Alphabet", and assigned a code point U+0041.  
          In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does. 
          The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File.  UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "

        In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts. 

    2.    UTF-8 Encoding is much better than UTF-16 Encoding

    There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding.  I will compare these two encoding side by side in my next article.  I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice.  The correct answer is that no encoding is absolute better than the others.  User should choose the suitable encoding according to your application software requirement.  The operation system, programming language, database platform do matter when choosing the encoding.  UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

        So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs.

    3.    SQL Server cannot store all Unicode Characters

    You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2.  I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:

    • SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
    • SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
    • SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
    • SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding.  Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters.  All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.

    In contrast, I also list the UTF-16 thing SQL Server doesn't support:

    • SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
    • SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
    • SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1)  supplementary character is rare 2) string function will only break this when it happens be the boundary.  For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.

    In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation.  Please refer to my previous blogs to details.

  • What I was doing these days?

    In recent days,  I am mainly reading test books since I am a Tester, and I need to know more about testing.  The good news for people who are interesting in collation topic is that 

    • I will have a SQL Server collation select tool available soon.
    • I hope I can start to write UTF-8 related stuff.

     So if you are interesting in the above topic, and want to get more information soon.  Please ping me by adding comments.

  • SQL Server’s Binary Collations

       Today, I will discuss SQL Server’s Binary collations.  In SQL Server, we have two kinds of binary collations: BIN collation, which has collation name ending with _BIN and BIN2 collation, which has collation name ending with _BIN2.  In SQL Server, a string value (either varchar or nvarchar value) is encoded and stored as a sequence of binaries.  For nvarchar value, we always use the UCS-2 encoding.  For varchar value, we use the codepage of corresponding collation associated with the value. 

        For “Binary” collation, we use a sorting algorithm based on the binary sequence of the string to sort the string data. Now, let us look the binary sorting algorithm.  For varchar type, the algorithm is quite simple; the sorting result of a set of varchar values with binary collation is the same as we sort them use the binary sequence.  In T-SQL syntax:

                Select * from T order by varcharC 

    Equals to

                Select * from T order by convert(varbinary(8000), varcharC) 

    Note, it is always true for both BIN collation and BIN2 collation.  

         Now, let us look at sorting the nvarchar values.  In this case, we can not sort them purely use the binary sequence, the reason is that the binary sequence represents a sequence of UCS-2 characters, which take two byte per character.  Because SQL Server internally uses Little-Endian to store the UCS-2 character, the binary sorting may result undesired result.   For example, character a (We usually use U+0061 to indicate its Unicode Code Point assigned by Unicode Org) is encoded as 0x6100 in SQL Server, so If we sort these characters in their binary sequence order,  we will not be able to get a Unicode Code Point ordered sorting result  which is  desired for many application.  In SQL Server’s BIN2 collation, we sort the nvarchar type according to their Unicode Code Points instead of the binary sequence (internally, we do comparison per WCHAR based, which is 2 bytes).  Now, comes the BIN Collation, the algorithm is wired.  It first compares two characters based on the first WCHAR values of the two characters, if the values are equal, the algorithm will do binary based comparison for the reminding binary sequences, i.e., byte per byte comparison.

         Hope your guys can follow my writing.  If not, you can simply assume that BIN2 is a Unicode Code Point based sorting algorithm, and it is desired by many customer. But BIN is not.

    From the sorting algorithm’s view, all BIN collation use the same algorithm and all binary2 collation use another algorithm.  Then why we have latin1_general_bin or Japanese_bin collation?  The reason is that they use different code page for encoding the characters sorting in the varchar type.  A Latin1_General_Bin collation can not hold any Japanese collation, so even they use the same algorithm to sort the data, but people still need to choose the language which the data will stored, such as English, Chinese, Japanese etc.  Note, this only apply to varchar data type.  For nvarchar data types, these collations have no difference at all.

         On the other hand, SQL Server has many linguistic collations which sort string linguistically using the language associated with the collation.  All collations which are not binary collation are linguistic collations.  For example, Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm compatible with several of English language and many Western European languages.  Please don’t confuse with the name of Latin1_General, it actually can sort all Unicode characters defined in Unicode 3.2 characters and it can also sort many other languages correctly as well (if the language has no sorting conflict with the latin1_general sorting rule).  I will start another article to describe the linguistic collation in deep details.

        As a conclusion, binary collations have better performance than linguistic collations, and that is the main advantage of using a binary collation. A binary collation is always case sensitive and assent sensitive. User should try use BIN2 collation instead of BIN collation.

  • Be aware of comparing unicode constant with varchar column (updated)

        Today,  I looked for SQL Server JDBC Driver's  sendStringParametersAsUnicode, this parameter controls how Unicode String (Java only have Unicode string which is UTF-16 encoding) will be sending to SQL Server.   As the name of this parameter indicate,  when the parameter equals to true, which is the default setting,  we always send unicode data, i.e., the string with UTF-16 encoding, to SQL Server.  When it is set to false,  it also send ANSI data, i.e., the string encoded with a code page to SQL Server. In this case, it uses the code page of the current connected SQL Server database to convert  the Unicode String to a codepage encoded string.

         I am a Unicode Fan, and always want recommend people using Unicode. So when I knew this parameter,  my first question is that WOW, set the value to false has potential data corruption. When we convert the UTF-16 encoded string into a codepage encoded string, we may face the issue of some character is not defined in the code page, and will be replace as question mark (?).  Then, after I read this Article, I knew the reason of doing so is to enabling index seek when the target data type in SQL Server is varchar type.  Note, the Article ONLY apply when the collation of your column is a SQL Colaltion, but NOT Windows collation.  In most of case, people will use Windows collation as the collation, except us-English which still use sql_latin1_general_cp1_ci_as.  So If the target type is nvarchar type or your collation is Windows Collation,  don't set the parameter to false. 

         Come to conclusion,  for JDBC application,  please use nvarchar data type, and you will have full Unicode support, no data lose, no issue of not being able to use index, and better performance.  If you are using Windows Collation, don't set the flag to false.

     


     

  • Is Datetime2 compliant with Daylight Savings Time?

    Today, I got following questions:

    "The documentation states that Datetime2 is unaware of the time zone.
         Does that still guarantee that when the clock gets adjusted for Daylight Saving Time this will not result in duplicate timestamps? I.e. is Datetime2 internally mapped to UTC?
    "

    Here is the answer:

    • Datetime2 does not guarantee that you will get a unique value you would have to have a unique constraint on the column. If you have several batches running at the exact same time on a multiprocessor system you may end up with duplicate values without a constraint on the column, if you need guaranteed unique values you need to work with uniqueidentifier and the NEWID function. 
    • DateTime2 in SQL Server does not contain any information about
      the timezone of the value and whether the value is UTC or local time.  The SYSDATETIME, SYSUTCDATETIME,  SYSDATETIMEOFFSET built-in functions will adjust automatically when the Windows Operating System, which the SQL Server instance running on, adjusts the clock during daylight saving switch.  I.e.,  the datetime offset returned by SYSDATETIMEOFFSET will be changed from -08:0 to -07:00 for Pacific Time.  The SYSDATETIME value generated will have one hour gap when entering daylight saving, i.e., one hour is missing, but SYSUTCDATETIME does not.
    • My recommendation is that always store UTC time in datetime2 column or use SYSDATETIMEOFFSET data type, and handle local time issue by calling Windows/.Net API.
  • Why I write so many collation topics?

       You may wonder why I keep writing collation topics in my blogs.  I found our Books Online topics related to collation is not so clearly enough, and sometime it confuse our reader.  I intend to present my idea related to collation in a series of blogs, and finally I want group all these topics together to draw a clear picture about collation. My blogs will try to answer following question in details:

    • How I should choose a collation?
    • What is the different between collation A and B?
    • How collation impact the behavior of query, such as string functions,  variables assignment?
    • Best practices for using SQL Server's collation.

     Again, I strong suggest your guys send my comments or collation related question to me,   If I can not answer, I will find some language expert to answer this, and I will also share the answer as well.

  • Got Collation conflict, How to avoid this?

    The best way to avoid SQL Server's collation conflict issue is that avoid have different collations in your database schema. 

    •  If your server and database have the same collation, you will never see the conflict collation issue.
    •  if all string columns in a database use the same collation, you will see less collation conflict unless you do cross-database query.  You don't need explicit set the collation for string collations, they will pick up the database's collation by default.
    • Keep in mind, changing existing database's collation is allowed, but it wouldn't changing existing table's collation, so they will have change of mix of different collations.
    • I believe the common case of collation conflicts happens when your tempdb collation (which is your server collation) is different with your current database' collation.  Objects created on tempdb will automaticlly pick the server collation, not the current database collation.  The best way to solve this is this article:http://sqlskills.com/BLOGS/KIMBERLY/post/Changing-Database-Collation-and-dealing-with-TempDB-Objects.aspx
    • When you hit collation conflict,  you can always use explicit collate clause to resolve the conflict.
    • If you have any other concern or suggestion,  please send me detail about your issue.  You can also vote for letting SQL Server solve this issue at http://connect.microsoft.com/SQLServer/feedback/Validation.aspx?FeedbackID=324910
  • GB18030 Character Set Support in SQL Server

    Recently, I got several request related to GB18030 Character Set support in SQL Server.  Here is one of the requests:

    When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error.
    insert myTable
    values ('32066326-6E23-4E31-B226-9F8652CE11F3', N'
    㐀㐁ᠠᠡᠢཌཌྷꀀꀁﺶﺷﺸ', 'AC1', '')
    insert myTable
    values ('32066326-6e23-4e31-b226-9f8652ce11f7', N'
    ꀀꀁﺶﺷﺸ', 'AC1', '')

    does anyone know cause / solution / workaround?

    Before going into the solution,   I need to explain about what is GB18030 Character Set.  Here, I copied the Unicode Faq on GB1830:

    What is GB 18030?

    GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980).

    What is new in GB 18030?

    It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK, which itself is a superset of GB 2312-1980.
    There are about 1.6 million valid byte sequences.
    It is not possible to determine if a byte sequence is either 2 or 4 bytes long by just examining the lead byte - the second byte must be examined as well.

    Why is GB 18030 important?

    The Chinese Government has mandated that all applications released on or after 2001-Sep-01 must support GB 18030.

    How does GB 18030 relate to Unicode?

    The specification refers directly to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most character assignments. Some characters that used to be mapped for GBK to the PUA (Private Use Area) for Unicode 2.1 are now assigned in Unicode 3.0, and their mappings from GB 18030 use only the Unicode 3.0 code points.

    In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points including unassigned and non-character ones, but excluding single surrogates. This makes GB 18030 functionally very similar to a UTF.

    So let me try to answer the customer's question.  Remember varchar type always associates with a code page, and can only stored the characters defined in the code page.  The code page which store Simplified Chinese is 936.  Code page 936 corresponds to Chinese GBK standard, which is a subset of GB18030 Encoding.   So it means that storing GB18030 characters in varchar type has potent data loss issue.  On the other hand, since Unicode 3.0 and GB18030 can be roundtrip, so the best way to store GB18030 characters is use SQL Server's nvarchar type.  Most of the new added characters in GB18030 map to the Unicode CJK Extension B block, while the block are all Supplementary characters which are represented with a surrogate pair.  So supporting Gb18030 character becomes to the same as support Surrogate character, which already discussed in previous blog

    However, the nvarchar type use UCS-2 encoding, which is not the GB18030 encoding. My next point is related to Gb18030 encoding itself.  What if user has GB18030 encoding data/files/scripts,   the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.

    References:

    Ask Dr. International,Column #15, New Chinese Encoding GB-18030

    GB 18030 WikiPage

    GB18030 Characters in SQL Server

    International Features in Microsoft SQL Server 2005

  • My recommendation of SQL Server’s Globalization Development

    In this blog, I will give some my recommendation of SQL Server's Globalization Development.

    1.       Use nchar/nvarchar instead of char/varchar type.   Nchar/nvarchar type is SQL Server's Unicode data type, which can store any characters defined by Unicode Org.  char/varchar type always associates with a code page, so the number of supported characters is limited.  I give you some fact to show why use Unicode data type is recommendation: 

    o   Starting from Windows 2000, Windows have full Unicode support implemented by using UTF-16 Encoding.  Windows APIs take WCHAR* as input which represent a Unicode String. 

    o   Virtual C++ has WCHAR, which is Unicode char type.

    o   .Net String is Unicode String only encoded in UTF-16.

    o   Java String is Unicode String only encoded in UTF-16.

    o   Char type might have data corruption issue if your client locale is different with server locale.

    So it is time to deprecated the old code page technique, and use the Unicode in your application, and also in your SQL Server database.   The only benefit of using char type is the space saving for single byte code page.  You can always use the Data Compression feature in SQL Server 2008 if you care about disk space, which can save more space than using char type.  

    2.       Also remember to put N' for your string literal in T-SQL.   String literal N'abc'   is a Unicode nvarchar string literal.  Literal ‘ab©'  is a varchar literal, it always associates with a codepage (string literal always use current database's collation). Suppose char © is not in the code page, you will get a question mark (?) when inserting the literal into a table,  even the column is nvarchar type. 

    3.       Use nvarchar instead of nchar.  The different between nvarchar and nchar is the storage.  A column with nchar(30) type always take 60 bytes to store on the disk, even the value is a single character.  The data size for a nvarchar(30) type column is not fixed,  it varies row by row or value by value.  A single character value takes 2 bytes to store, and a value with 30 characters long  takes 60 bytes to be stored.  Another different between nvarchar and nchar type is the performance.  Nchar types always stored in fixed location in every row, which can be retrieved faster than nvarchar type which is stored in different location for different row.  However, I believe the benefit of less stored space for nvarchar types usually overcomes the cost of locating the value in a row. 

    4.      If possible, avoid using a column collation which is different with the database's collation.   You will have less collation conflict.   If you want a query use a special collation's sorting rule, use explicit collate clause in that query.

    5.       Use Windows Collation instead of SQL Collation.   The only exception here is the default collation for en_US locale which is sql_latin1_general_cp1_ci_as.

    6.       Never store UTF-8 string in varchar types, you will get data corruption.

    7.       Keep in mind, string comparison always ignore trailing spaces (Unicode U+0020), no matter what collation you are using.

    8.       Keep in mind, LEN function always return the number of characters exclude the trailing spaces. DataLength function returns the storage size in term of bytes.

    9.       using _BIN2 collation, instead of  _BIN collation if you want binary, code page base string comparison behavior.

  • UTF-16 Encoding and SQL Server (4) - String Function Case Study

    Today,   I will give an example of different implementation of a string function.  Let us look at function:

    SUBSTRING(value_expression ,start_expression , length_expression )

    In here, I proposed four kinds of alternative functions by giving them different name to distinct the functionality of them.

    1.       USC2-SUBSTRING.  This function has the same functionality as the SUBSTRING function.

    2.       UTF16-SUBSTRING: This function take UTF-16 string as input, return type is also UTF-16 string

    3.       UTF8-SUBSTRING: This function take UTF-8 string as input, return type is also UTF-8 string

    4.       CHAR_SUBSTRING: This function is a linguistic version of substring, it wouldn't break the character in the given language, and the encoding can be either UCS-2/UTF-8 or UTF16.

    For the length semantic, I will assume that it is character length with respected to the corresponding Encoding. I will use following table to show the differences between these functions when calling function:

                    Declare @a nvarchar(10)

                   select substring(@a, 5, 2)

    , suppose the 10 here represents 10 characters  

     

    Function Name

    Input Encoding

    Output Encoding

    Output Type

    Output  Buffer Bytes

    Surrogate Aware

    Collation

    Sensitive

    Algorithm

    Complexity

    USC2-SUBSTRING

    UCS-2

    UCS-2

    Nvarchar(2)

    4

    No

    No

    Simple, and Fast

    UTF16-SUBSTRING

    UCS-16

    UCS-16

    Nvarchar(2)

    8

    Yes

    No

    Always scan from the beginning, need to handle surrogate pair.

    UTF8-SUBSTRING

    UTF-8

    UTF-8

    Nvarchar(2)

    8

    Yes

    No

    Always scan from the beginning. See (1)

    CHAR_SUBSTRING

    Variance

    =input

    5, see (2)

    Depends on encoding

    Yes, see (3)

    Yes

    More complex. See (4)

     

    (1)     For UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and call UTF-16 version of the algorithm.  But it will have double conversion since the result need to convert back to UTF-8.

    (2)    It is unknown for how many bytes one linguistic character will take until the run-time.  So we have to make the result type = input_length- input_index

    (3)    A surrogate pair is always treated as one linguistic character, just it is surrogate aware

    (4)    The function itself will be pretty complex. For looking for one character, it needs search deeper and the move the pointer back.  

    In addition, I like to give some alternative solution for UTF16_SUBSTRING:

    1)      We call UCS2-SUBSTRING internally, and the last character it return a leading surrogate character, just removing such character from the result. The advantage of this approach is that it is fast, and it wouldn't break surrogate pairs.  The drawback is that it doesn't respect UTF-16 length semantic, which will lead programming issues.

    2)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(2), and here the 2 means that four UCS-2 characters.  I.e., we keep the existing return type, and semantic as the same.  Again, the result can only hold one UTF-16 character instead of 2.

    3)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(4), and here the 4 means that four UCS-2 characters.  I.e., we obey the UTF-16 length semantic for input parameter, but not for output.   In these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and UCS-16 encoding.  But we still break the return type from nvarchar(2) to nvarchar(4).

    Finally, I like to point out that we have CLR examples of Supplementary-Aware (Supplementary character refer to a Surrogate Pair) String Manipulation at here.  It internally uses StringInfo.ParseCombiningCharacters Method.  It not only handles Supplementary characters, but according to the API, it handles other combining character as well.  Note, the different between this method and the CHAR_SUBSTRING I presented here is that there are more language specific rules or character sequences which cannot be handled by the C# method.

     

More Posts Next page »

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker