Collation, DateTime, SParse Column and XML

  • Unicode References

    In this article, I recommend several Unicode articles/websites for reference. Note, the list is not yet completed,  I will add more entries and make better categorization

  • Beta version of collation selecting tool availiable

    Hello, Guys

      Please the attached zip file for the SQL Server collation selection tool.  It is written in C#. 

  • SQL Server and UTF-8 Encoding (1) -True or False

    Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8.  So today's talk will be quite short. I just clarify some misunderstand.

    1.    SQL Server doesn't support Unicode, do you mean UTF-8?

        Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type.  But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding.  I copy several definitions from Internet for these concepts:

       "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.  The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as "Latin Uppercase Alphabet", and assigned a code point U+0041.  
          In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does. 
          The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File.  UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "

        In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts. 

    2.    UTF-8 Encoding is much better than UTF-16 Encoding

    There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding.  I will compare these two encoding side by side in my next article.  I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice.  The correct answer is that no encoding is absolute better than the others.  User should choose the suitable encoding according to your application software requirement.  The operation system, programming language, database platform do matter when choosing the encoding.  UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

        So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs.

    3.    SQL Server cannot store all Unicode Characters

    You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2.  I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:

    • SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
    • SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
    • SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
    • SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding.  Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters.  All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.

    In contrast, I also list the UTF-16 thing SQL Server doesn't support:

    • SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
    • SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
    • SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1)  supplementary character is rare 2) string function will only break this when it happens be the boundary.  For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.

    In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation.  Please refer to my previous blogs to details.

  • What I was doing these days?

    In recent days,  I am mainly reading test books since I am a Tester, and I need to know more about testing.  The good news for people who are interesting in collation topic is that 

    • I will have a SQL Server collation select tool available soon.
    • I hope I can start to write UTF-8 related stuff.

     So if you are interesting in the above topic, and want to get more information soon.  Please ping me by adding comments.

  • SQL Server’s Binary Collations

       Today, I will discuss SQL Server’s Binary collations.  In SQL Server, we have two kinds of binary collations: BIN collation, which has collation name ending with _BIN and BIN2 collation, which has collation name ending with _BIN2.  In SQL Server, a string value (either varchar or nvarchar value) is encoded and stored as a sequence of binaries.  For nvarchar value, we always use the UCS-2 encoding.  For varchar value, we use the codepage of corresponding collation associated with the value. 

        For “Binary” collation, we use a sorting algorithm based on the binary sequence of the string to sort the string data. Now, let us look the binary sorting algorithm.  For varchar type, the algorithm is quite simple; the sorting result of a set of varchar values with binary collation is the same as we sort them use the binary sequence.  In T-SQL syntax:

                Select * from T order by varcharC 

    Equals to

                Select * from T order by convert(varbinary(8000), varcharC) 

    Note, it is always true for both BIN collation and BIN2 collation.  

         Now, let us look at sorting the nvarchar values.  In this case, we can not sort them purely use the binary sequence, the reason is that the binary sequence represents a sequence of UCS-2 characters, which take two byte per character.  Because SQL Server internally uses Little-Endian to store the UCS-2 character, the binary sorting may result undesired result.   For example, character a (We usually use U+0061 to indicate its Unicode Code Point assigned by Unicode Org) is encoded as 0x6100 in SQL Server, so If we sort these characters in their binary sequence order,  we will not be able to get a Unicode Code Point ordered sorting result  which is  desired for many application.  In SQL Server’s BIN2 collation, we sort the nvarchar type according to their Unicode Code Points instead of the binary sequence (internally, we do comparison per WCHAR based, which is 2 bytes).  Now, comes the BIN Collation, the algorithm is wired.  It first compares two characters based on the first WCHAR values of the two characters, if the values are equal, the algorithm will do binary based comparison for the reminding binary sequences, i.e., byte per byte comparison.

         Hope your guys can follow my writing.  If not, you can simply assume that BIN2 is a Unicode Code Point based sorting algorithm, and it is desired by many customer. But BIN is not.

    From the sorting algorithm’s view, all BIN collation use the same algorithm and all binary2 collation use another algorithm.  Then why we have latin1_general_bin or Japanese_bin collation?  The reason is that they use different code page for encoding the characters sorting in the varchar type.  A Latin1_General_Bin collation can not hold any Japanese collation, so even they use the same algorithm to sort the data, but people still need to choose the language which the data will stored, such as English, Chinese, Japanese etc.  Note, this only apply to varchar data type.  For nvarchar data types, these collations have no difference at all.

         On the other hand, SQL Server has many linguistic collations which sort string linguistically using the language associated with the collation.  All collations which are not binary collation are linguistic collations.  For example, Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm compatible with several of English language and many Western European languages.  Please don’t confuse with the name of Latin1_General, it actually can sort all Unicode characters defined in Unicode 3.2 characters and it can also sort many other languages correctly as well (if the language has no sorting conflict with the latin1_general sorting rule).  I will start another article to describe the linguistic collation in deep details.

        As a conclusion, binary collations have better performance than linguistic collations, and that is the main advantage of using a binary collation. A binary collation is always case sensitive and assent sensitive. User should try use BIN2 collation instead of BIN collation.

  • Be aware of comparing unicode constant with varchar column (updated)

        Today,  I looked for SQL Server JDBC Driver's  sendStringParametersAsUnicode, this parameter controls how Unicode String (Java only have Unicode string which is UTF-16 encoding) will be sending to SQL Server.   As the name of this parameter indicate,  when the parameter equals to true, which is the default setting,  we always send unicode data, i.e., the string with UTF-16 encoding, to SQL Server.  When it is set to false,  it also send ANSI data, i.e., the string encoded with a code page to SQL Server. In this case, it uses the code page of the current connected SQL Server database to convert  the Unicode String to a codepage encoded string.

         I am a Unicode Fan, and always want recommend people using Unicode. So when I knew this parameter,  my first question is that WOW, set the value to false has potential data corruption. When we convert the UTF-16 encoded string into a codepage encoded string, we may face the issue of some character is not defined in the code page, and will be replace as question mark (?).  Then, after I read this Article, I knew the reason of doing so is to enabling index seek when the target data type in SQL Server is varchar type.  Note, the Article ONLY apply when the collation of your column is a SQL Colaltion, but NOT Windows collation.  In most of case, people will use Windows collation as the collation, except us-English which still use sql_latin1_general_cp1_ci_as.  So If the target type is nvarchar type or your collation is Windows Collation,  don't set the parameter to false. 

         Come to conclusion,  for JDBC application,  please use nvarchar data type, and you will have full Unicode support, no data lose, no issue of not being able to use index, and better performance.  If you are using Windows Collation, don't set the flag to false.

     


     

  • Is Datetime2 compliant with Daylight Savings Time?

    Today, I got following questions:

    "The documentation states that Datetime2 is unaware of the time zone.
         Does that still guarantee that when the clock gets adjusted for Daylight Saving Time this will not result in duplicate timestamps? I.e. is Datetime2 internally mapped to UTC?
    "

    Here is the answer:

    • Datetime2 does not guarantee that you will get a unique value you would have to have a unique constraint on the column. If you have several batches running at the exact same time on a multiprocessor system you may end up with duplicate values without a constraint on the column, if you need guaranteed unique values you need to work with uniqueidentifier and the NEWID function. 
    • DateTime2 in SQL Server does not contain any information about
      the timezone of the value and whether the value is UTC or local time.  The SYSDATETIME, SYSUTCDATETIME,  SYSDATETIMEOFFSET built-in functions will adjust automatically when the Windows Operating System, which the SQL Server instance running on, adjusts the clock during daylight saving switch.  I.e.,  the datetime offset returned by SYSDATETIMEOFFSET will be changed from -08:0 to -07:00 for Pacific Time.  The SYSDATETIME value generated will have one hour gap when entering daylight saving, i.e., one hour is missing, but SYSUTCDATETIME does not.
    • My recommendation is that always store UTC time in datetime2 column or use SYSDATETIMEOFFSET data type, and handle local time issue by calling Windows/.Net API.
  • Why I write so many collation topics?

       You may wonder why I keep writing collation topics in my blogs.  I found our Books Online topics related to collation is not so clearly enough, and sometime it confuse our reader.  I intend to present my idea related to collation in a series of blogs, and finally I want group all these topics together to draw a clear picture about collation. My blogs will try to answer following question in details:

    • How I should choose a collation?
    • What is the different between collation A and B?
    • How collation impact the behavior of query, such as string functions,  variables assignment?
    • Best practices for using SQL Server's collation.

     Again, I strong suggest your guys send my comments or collation related question to me,   If I can not answer, I will find some language expert to answer this, and I will also share the answer as well.

  • Got Collation conflict, How to avoid this?

    The best way to avoid SQL Server's collation conflict issue is that avoid have different collations in your database schema. 

    •  If your server and database have the same collation, you will never see the conflict collation issue.
    •  if all string columns in a database use the same collation, you will see less collation conflict unless you do cross-database query.  You don't need explicit set the collation for string collations, they will pick up the database's collation by default.
    • Keep in mind, changing existing database's collation is allowed, but it wouldn't changing existing table's collation, so they will have change of mix of different collations.
    • I believe the common case of collation conflicts happens when your tempdb collation (which is your server collation) is different with your current database' collation.  Objects created on tempdb will automaticlly pick the server collation, not the current database collation.  The best way to solve this is this article:http://sqlskills.com/BLOGS/KIMBERLY/post/Changing-Database-Collation-and-dealing-with-TempDB-Objects.aspx
    • When you hit collation conflict,  you can always use explicit collate clause to resolve the conflict.
    • If you have any other concern or suggestion,  please send me detail about your issue.  You can also vote for letting SQL Server solve this issue at http://connect.microsoft.com/SQLServer/feedback/Validation.aspx?FeedbackID=324910
  • GB18030 Character Set Support in SQL Server

    Recently, I got several request related to GB18030 Character Set support in SQL Server.  Here is one of the requests:

    When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error.
    insert myTable
    values ('32066326-6E23-4E31-B226-9F8652CE11F3', N'
    㐀㐁ᠠᠡᠢཌཌྷꀀꀁﺶﺷﺸ', 'AC1', '')
    insert myTable
    values ('32066326-6e23-4e31-b226-9f8652ce11f7', N'
    ꀀꀁﺶﺷﺸ', 'AC1', '')

    does anyone know cause / solution / workaround?

    Before going into the solution,   I need to explain about what is GB18030 Character Set.  Here, I copied the Unicode Faq on GB1830:

    What is GB 18030?

    GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980).

    What is new in GB 18030?

    It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK, which itself is a superset of GB 2312-1980.
    There are about 1.6 million valid byte sequences.
    It is not possible to determine if a byte sequence is either 2 or 4 bytes long by just examining the lead byte - the second byte must be examined as well.

    Why is GB 18030 important?

    The Chinese Government has mandated that all applications released on or after 2001-Sep-01 must support GB 18030.

    How does GB 18030 relate to Unicode?

    The specification refers directly to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most character assignments. Some characters that used to be mapped for GBK to the PUA (Private Use Area) for Unicode 2.1 are now assigned in Unicode 3.0, and their mappings from GB 18030 use only the Unicode 3.0 code points.

    In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points including unassigned and non-character ones, but excluding single surrogates. This makes GB 18030 functionally very similar to a UTF.

    So let me try to answer the customer's question.  Remember varchar type always associates with a code page, and can only stored the characters defined in the code page.  The code page which store Simplified Chinese is 936.  Code page 936 corresponds to Chinese GBK standard, which is a subset of GB18030 Encoding.   So it means that storing GB18030 characters in varchar type has potent data loss issue.  On the other hand, since Unicode 3.0 and GB18030 can be roundtrip, so the best way to store GB18030 characters is use SQL Server's nvarchar type.  Most of the new added characters in GB18030 map to the Unicode CJK Extension B block, while the block are all Supplementary characters which are represented with a surrogate pair.  So supporting Gb18030 character becomes to the same as support Surrogate character, which already discussed in previous blog

    However, the nvarchar type use UCS-2 encoding, which is not the GB18030 encoding. My next point is related to Gb18030 encoding itself.  What if user has GB18030 encoding data/files/scripts,   the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.

    References:

    Ask Dr. International,Column #15, New Chinese Encoding GB-18030

    GB 18030 WikiPage

    GB18030 Characters in SQL Server

    International Features in Microsoft SQL Server 2005

  • My recommendation of SQL Server’s Globalization Development

    In this blog, I will give some my recommendation of SQL Server's Globalization Development.

    1.       Use nchar/nvarchar instead of char/varchar type.   Nchar/nvarchar type is SQL Server's Unicode data type, which can store any characters defined by Unicode Org.  char/varchar type always associates with a code page, so the number of supported characters is limited.  I give you some fact to show why use Unicode data type is recommendation: 

    o   Starting from Windows 2000, Windows have full Unicode support implemented by using UTF-16 Encoding.  Windows APIs take WCHAR* as input which represent a Unicode String. 

    o   Virtual C++ has WCHAR, which is Unicode char type.

    o   .Net String is Unicode String only encoded in UTF-16.

    o   Java String is Unicode String only encoded in UTF-16.

    o   Char type might have data corruption issue if your client locale is different with server locale.

    So it is time to deprecated the old code page technique, and use the Unicode in your application, and also in your SQL Server database.   The only benefit of using char type is the space saving for single byte code page.  You can always use the Data Compression feature in SQL Server 2008 if you care about disk space, which can save more space than using char type.  

    2.       Also remember to put N' for your string literal in T-SQL.   String literal N'abc'   is a Unicode nvarchar string literal.  Literal ‘ab©'  is a varchar literal, it always associates with a codepage (string literal always use current database's collation). Suppose char © is not in the code page, you will get a question mark (?) when inserting the literal into a table,  even the column is nvarchar type. 

    3.       Use nvarchar instead of nchar.  The different between nvarchar and nchar is the storage.  A column with nchar(30) type always take 60 bytes to store on the disk, even the value is a single character.  The data size for a nvarchar(30) type column is not fixed,  it varies row by row or value by value.  A single character value takes 2 bytes to store, and a value with 30 characters long  takes 60 bytes to be stored.  Another different between nvarchar and nchar type is the performance.  Nchar types always stored in fixed location in every row, which can be retrieved faster than nvarchar type which is stored in different location for different row.  However, I believe the benefit of less stored space for nvarchar types usually overcomes the cost of locating the value in a row. 

    4.      If possible, avoid using a column collation which is different with the database's collation.   You will have less collation conflict.   If you want a query use a special collation's sorting rule, use explicit collate clause in that query.

    5.       Use Windows Collation instead of SQL Collation.   The only exception here is the default collation for en_US locale which is sql_latin1_general_cp1_ci_as.

    6.       Never store UTF-8 string in varchar types, you will get data corruption.

    7.       Keep in mind, string comparison always ignore trailing spaces (Unicode U+0020), no matter what collation you are using.

    8.       Keep in mind, LEN function always return the number of characters exclude the trailing spaces. DataLength function returns the storage size in term of bytes.

    9.       using _BIN2 collation, instead of  _BIN collation if you want binary, code page base string comparison behavior.

  • UTF-16 Encoding and SQL Server (4) - String Function Case Study

    Today,   I will give an example of different implementation of a string function.  Let us look at function:

    SUBSTRING(value_expression ,start_expression , length_expression )

    In here, I proposed four kinds of alternative functions by giving them different name to distinct the functionality of them.

    1.       USC2-SUBSTRING.  This function has the same functionality as the SUBSTRING function.

    2.       UTF16-SUBSTRING: This function take UTF-16 string as input, return type is also UTF-16 string

    3.       UTF8-SUBSTRING: This function take UTF-8 string as input, return type is also UTF-8 string

    4.       CHAR_SUBSTRING: This function is a linguistic version of substring, it wouldn't break the character in the given language, and the encoding can be either UCS-2/UTF-8 or UTF16.

    For the length semantic, I will assume that it is character length with respected to the corresponding Encoding. I will use following table to show the differences between these functions when calling function:

                    Declare @a nvarchar(10)

                   select substring(@a, 5, 2)

    , suppose the 10 here represents 10 characters  

     

    Function Name

    Input Encoding

    Output Encoding

    Output Type

    Output  Buffer Bytes

    Surrogate Aware

    Collation

    Sensitive

    Algorithm

    Complexity

    USC2-SUBSTRING

    UCS-2

    UCS-2

    Nvarchar(2)

    4

    No

    No

    Simple, and Fast

    UTF16-SUBSTRING

    UCS-16

    UCS-16

    Nvarchar(2)

    8

    Yes

    No

    Always scan from the beginning, need to handle surrogate pair.

    UTF8-SUBSTRING

    UTF-8

    UTF-8

    Nvarchar(2)

    8

    Yes

    No

    Always scan from the beginning. See (1)

    CHAR_SUBSTRING

    Variance

    =input

    5, see (2)

    Depends on encoding

    Yes, see (3)

    Yes

    More complex. See (4)

     

    (1)     For UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and call UTF-16 version of the algorithm.  But it will have double conversion since the result need to convert back to UTF-8.

    (2)    It is unknown for how many bytes one linguistic character will take until the run-time.  So we have to make the result type = input_length- input_index

    (3)    A surrogate pair is always treated as one linguistic character, just it is surrogate aware

    (4)    The function itself will be pretty complex. For looking for one character, it needs search deeper and the move the pointer back.  

    In addition, I like to give some alternative solution for UTF16_SUBSTRING:

    1)      We call UCS2-SUBSTRING internally, and the last character it return a leading surrogate character, just removing such character from the result. The advantage of this approach is that it is fast, and it wouldn't break surrogate pairs.  The drawback is that it doesn't respect UTF-16 length semantic, which will lead programming issues.

    2)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(2), and here the 2 means that four UCS-2 characters.  I.e., we keep the existing return type, and semantic as the same.  Again, the result can only hold one UTF-16 character instead of 2.

    3)      We call the new UTF16-SUBSTRING function above, but return type is nvarchar(4), and here the 4 means that four UCS-2 characters.  I.e., we obey the UTF-16 length semantic for input parameter, but not for output.   In these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and UCS-16 encoding.  But we still break the return type from nvarchar(2) to nvarchar(4).

    Finally, I like to point out that we have CLR examples of Supplementary-Aware (Supplementary character refer to a Surrogate Pair) String Manipulation at here.  It internally uses StringInfo.ParseCombiningCharacters Method.  It not only handles Supplementary characters, but according to the API, it handles other combining character as well.  Note, the different between this method and the CHAR_SUBSTRING I presented here is that there are more language specific rules or character sequences which cannot be handled by the C# method.

     

  • Introduce the Turkish I issue

    Today, I will discuss the famous Turkish I issue.  Below is one customer's problem related to Turkish collation?

    When I use TURKISH_CI_AS collation, some of the field names in the database becomes case-sensitive, as a result queries are returning Invalid Column Name exception if there is a difference between the case of queries and case in the table definition.
    This error is particularly evident with 'I' character. Any field name with 'I' character results in error. I'm not sure about what other characters will cause error.

    I use following script to demo this issue.

    create database TurkishDB collate turkish_ci_as;

    use TurkishDB;

    create table Information(c1 int, c2 Int);

    select * from Information;

    c1          c2

    ----------- -----------

    (0 row(s) affected)

    select * from information

    Msg 208, Level 16, State 1, Line 1

    Invalid object name 'information'.

    As you can see from the result, querying table Information get correct result, but querying information

    Does not.   The problem here is that in Turkish, the upper case of character of i (lower case letter i with dot) is İ (capital i with dot).  Also, the lower case of character of I (capital i without dot) is (lower case letter i without dot).   In one word, i and I are not a match pair any more. In SQL Server's Turkish_CI_AS collation, even we do case insensitive comparison, i and I are not equal.

    In SQL Server, the nvarchar columns of the system tables use the same collation as the database collation.  As a result, when we search for Object information in the above database, it does not match the Object name Information any more. That is the root reason of the above issue.

    This is one more trick in the above example.  In the above creating statement,  int and Int both works fine.  Here is the answer from our dev:

    Built-in function names (including min and MAX) are matched like TSQL keywords (English, case insensitive). Name binding for all user objects would follow the collation rules. Note that persisted system objects (from sys. Schema) also require name binding using current collation.

    Internally, we use Latin1_General_CI_AS to match keywords. In the later stage of query compilation, we need bind a given object name with internal object id, we use the current database's collation for name matching.  

    Personal, I think the system table should use a language invariant collation to do name matching instead of using the current database's collation.  That is another good reason for introducing language neutral, binary collation

    Finally, here is my suggestion of avoid the Turkish I issue.  First, for object name, etc, try use either all lower case letter, or all upper case letter.  In addition, when reference to the object during query, make sure you use the same object name in your query.  Second, keep in mind, this Turkish I issue also happen in your data.   For example, when you match input values in one of your stored procedure, and do different operations correspondingly.  You might consider to use a more generic collation, such as latin_general_ci_as to do string match.  Third, keep in mind, upper() and lower() built-in always was impacted by this issue,  you might also use a more generic collation for this.  For example, suppose your default collation is english collation

    select UPPER ('i') collate turkish_ci_as   --will give you english upper/case rule, but the result is covered to turkish collation

    select UPPER ('i' collate turkish_ci_as )  --will give your turkish collation rule,  the result is also turkish collation

    select @a = UPPER (@a collate turkish_ci_as )  --will give your turkish collation rule,  the result is also english collation

  • UTF-16 Encoding and SQL Server (3) - Length Semantic

     As you discuss in here and here before, in order to supporting UTF-16 encoding in SQL Server, we need to handle both "good" data and "bad" data, there are different approaches to achieve this.  Today, I will talk about the length semantic again.

    In SQL Server, we have two categories of string type: char/varchar/varchar(max)/text types always associate with a code page, and can only store the characters in the range of that code page;  nchar/nvarchar/nvarchar(max)/ntext types always use USC-2 encoding, and can store any Unicode characters.   When I declare a table has two columns, one for varchar, another for nvarchar:

                    Create table t1(c1 varchar(30), c2 nvarchar(30))

    For column c1, the 30 means that the system will allocate 30 bytes for that variable, and it can only hold as many characters as the 30 byte can hold.  The above table has no collate clause associate with the column definition, thus they will use the current database's collation as their collation.  If the codepage of column c1 (which is controlled by the collation of that column) is one of the single-byte character-set (SBCS, is used to refer to character sets which use exactly one byte for each graphic character), the column can ALWAYS store up to 30 characters. However, if the code page is one of the MultiByte Character Sets (MBCS, is used to refer to character set which use variable length of bytes for each graphic character), then it can still store up to 30 characters, but it is not always true.  Suppose the collation of column c1 is Chinese_prc_ci_as, which has code page 936 (Simplified Chinese).  The codepage encodes ASCII characters (range from 0 to 127) with one byte, and all other Chinese characters in two bytes.  In this case, column c1 can always hold at least 15 characters (when all these charactes are Chinese), and at most 30 characters (if all of them are ASCII).  In summary, for varchar type, the length semantic is the number of bytes.

    Let us consider the case for nvarchar type.  Here is the definition of this type I copied from BOL:

    Unicode data and use the UNICODE UCS-2 character set.

    nchar [ ( n ) ]

    Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.

    nvarchar [ ( n | max ) ]

    Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.

    so column c2 in above table can ALWAYS hold up to 30 Unicode UCS-2 characters.  So we can say that, for nvarchar type, the length semantic is number of characters.

    What if column c2, the nvarchar type, is encoded in UTF-16?  We have two options: 1)   we say c2 can hold up to 30 Unicode UTF-16 characters, but NOT ALWAYS.  It is because of a surrogate pair takes 4 bytes, if we still allocate 62 bytes for values stored in c2 which cannot hold 30 surrogate pair.  2) We can say c2 can ALWAYS hold up to 30 Unicode UTF-16 characters by internally allocate 122 bytes for values stored in c2.

    Let us compare about the two approaches.  Case one has less code change and no breaking change in existing functionality, but it will more like the varchar's length semantic, and customer need aware of this.  Case two has better semantic meaning of character length, and customer might desire for this.  Drawback is that user can only declare length up to 2000. For example, c2 nvarchar(2001) will be invalid since in theory, we might end-up with 8006 bytes which cannot hold in one physical page.  It is a breaking change for customer, and it also has certain benefit (I will discuss later). Oracle and MySQL both use the second approach.

    Now, let us examine one of the string functions: substring which is declared at

    SUBSTRING(value_expression ,start_expression , length_expression )

    In UTF-16, the start_expression and length_expression argument should both use the correct UTF-16 character length semantic; such that a surrogate pair is always treat as one character. If I run query

                    Select substring(c2, 10,2) from t1

    What is the return type for the above query.  Today, since we know that the query return at most two character per value, so the return type will be nvarchar(2).  It will be problematic when c2 is UTF-16 Encoding.  The reason is that if the returned two characters are both surrogate-pairs, what is the return type then?    The answer really depends on the above two approach of UTF-16 length semantic.  For approach one, we need either return nvarchar(4), or return nvarchar(2) but only hold the first surrogate pair.   For approach two, it is easy, the return type is nvarchar(2) although this nvarchar(2) has different meaning of our old nvarchar(2) in UCS-2.

    What about this query:                

                   Select substring(c2, 10,@len) from t1,

     since the system don't know the value of variable @len, the return type will be nvarchar(30).

    What about substring with varchar type since some characters are also double bytes.  I will check tomorrow, and update the result here.

  • UTF-16 Encoding and SQL Server (2)

    As you discuss in here, in order to supporting UTF-16 encoding in SQL Server, we need to handle both "good" data and "bad" data. Today, I will discuss different approaches of adding UTF-16 support in SQL Server.  For each approach I described today, I will discuss the advantage and disadvantage, and leave readers to judge what one you prefer.

    The first approach is mainly focusing on handling good data.  Since existing string functions are not surrogate aware, and have possibility of breaking a surrogate pair (good data) into single surrogate point (bad data), why want just introduce a set of new functions which use the correct character length semantic?  For example, we can introduce char_len, char_substring, char_left, char_right functions, which take a surrogate pair as one character.  For existing string functions, the semantic and result will not be changed.  What about "bad" data, I will say we possibility have to treat as "good" data since we haven't prevent existing string functions to generate bad data from good data.  The advantage of this approach is that we are adding new functions, and leave old function unchanged (in term of functionality and also performance). For people who don't care about surrogate pair, they can continue to use old string functions without seeing any difference.  For people who care about surrogate pair, they can start to use new functions.  The disadvantage of this approach is that it does not really solve the "bad" data issue. Also people have to choose between the new string functions and the old string functions.

    Let us talk about the second approach.  It is pretty simple: we just fix all existing functions which take the correct character length semantic.  The good thing is that we will not break good data and always return correct result.   The cost is that we may pay certain performance penalty. For example, Before LEN() function just can just need return number of WCHARs, which take O(1) time, however, the new LEN() will have to traverse the whole string even this is no surrogate character in the string, which will be quite slow.  Another issue is that changing the result of existing functions is a breaking change. Len() function used to return 2 for a surrogate pair, but it return 1 now.  Even the new result is correct, some customer might still want keep the old behavior.   Also, I haven't checked the return type of these string functions (I will start another topic to cover this).  For example, if substring(c1, 1,2) return nvarchar(1), then it need either return nvarchar(2)  as data type.   This behavior change is not trivial and it is definitely a breaking change.   One possible solution is to use the compatibility level.  User can change the database's compatibility level to be lower than the current one, and it will see old behavior and also enjoy the better performance of old behavior.  Then, what about the "bad" data?  I am favor of fixing the bad data whenever possible.   What about database upgrade?  Do we want to scan the whole database to verify no bad data or we just assume that there is no bad data, and only generate error when we found bad data later?   I like to raise another question here.  Can you track whether there are surrogate characters in a column of one table?  If so, can we smartly use old algorithm for such column?

    I have the third solution in mind, which is introducing the concept of Encoding in database level.  A database can have an Encoding option which indicates the encoding of nvarchar type. The default encoding is UCS-2 which is the current encoding we supported, and all existing behavior will be the same.  A UTF-16 Encoding will be introduced and can be set on database level.  Once the database encoding is set to be UTF-16, all string functions will follow the UTF-16 length semantic, and we will also do data validation as well.  The advantage of this approach is that user will enjoy the better performance of old behavior by default, and no upgrade issue at well.  New user can choose the new behavior only when they intend to do.  What about changing encoding from UCS-2 to UTF16, do we need to do data validation?  I don't know the answer.  BTW, other database vendors also use the same mechanism as well.

    Finally, I like to briefly describe other approaches which I thought of, but I don't like personally. How about introduce new collations, such as latin1_general_UTF16_ci_as.  We already have enough collations to confuse our customers, I am not favor of adding more collations.  How about introduce new column level encoding?   I believe in most of case, people want the whole database have the same behavior, having fine granular control over the encoding might not be interesting to our customer.

    In summary, performance and backwards compatibility will be the two issues which need to be carefully considered when implementing UTF-16 in SQL Server.   My next blog will exam all string functions to see what is the impact with UTF-16 Encoding.

    BTW, Michael Kaplan posted a series of articles at here to describe his idea of UCS-2 to UTF16 migration for SQL Server.  I have some difficulty to follow his idea.  However, I can guess some basic principle here. In some case, a sequence of Unicode points together as one character from the end user's point view.  Examples are: the two small ss in German language, Korean Jamo and Thai Language.  Breaking such character sequence is not desired by end users.  Then, the question is that should we also make the new function take "true" character into consideration.  Here, I reference Unicode FAQ "How should characters (particularly composite characters) be counted, for the purposes of length, substrings, positions in a string, etc. "  as the end of this topic

    A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string. 
    "a" + umlaut + greek_alpha + \uE0000. (the latter is a private use character)

    1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
    In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
    In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
    In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

    2. Codepoints: how may code points are in the string.
    The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.

    3. Graphemes: what end-users consider as characters.
    A default grapheme cluster is specified in UAX #29, Text Boundaries, as well as in UTS #18 Regular Expressions.

    The choice of which one to use depends on the tradeoffs between efficiency and comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all low-level string operations, and then also supply layers above that provide for #2 and #3 boundaries when circumstances require them. This approach allows for efficient processing, with allowance for higher-level usage. However, for a very high level application, such as word-processing macros, graphemes alone will probably be sufficient.

     

More Posts Next page »

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker