Collation, DateTime, SParse Column and XML

SQL Server and UTF-8 Encoding (1) -True or False

Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8.  So today's talk will be quite short. I just clarify some misunderstand.

1.    SQL Server doesn't support Unicode, do you mean UTF-8?

    Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type.  But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding.  I copy several definitions from Internet for these concepts:

   "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.  The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as "Latin Uppercase Alphabet", and assigned a code point U+0041.  
      In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does. 
      The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File.  UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "

    In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts. 

2.    UTF-8 Encoding is much better than UTF-16 Encoding

There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding.  I will compare these two encoding side by side in my next article.  I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice.  The correct answer is that no encoding is absolute better than the others.  User should choose the suitable encoding according to your application software requirement.  The operation system, programming language, database platform do matter when choosing the encoding.  UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

    So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs.

3.    SQL Server cannot store all Unicode Characters

You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2.  I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:

  • SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
  • SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
  • SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
  • SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding.  Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters.  All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.

In contrast, I also list the UTF-16 thing SQL Server doesn't support:

  • SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
  • SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
  • SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1)  supplementary character is rare 2) string function will only break this when it happens be the boundary.  For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.

In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation.  Please refer to my previous blogs to details.

Published Friday, April 10, 2009 6:44 AM by Qingsong Yao
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Ravi said:

Hi,

How to set the Collation to UTF-16 in SQL Server 2000 and SQL Server 2005. could you please give the steps to change the collation.

What is the name of the Collation for UTF-16 in SQL Server 2000/2005.

Thanks

April 14, 2009 6:53 AM
 

Qingsong Yao said:

Hello, Ravi

 When you choose nvarchar as your data type,  you automatically get UTF-16 Encoding.  Please note,  For nvarchar types, we always use UTF-16 encoding to store the data no matter which collation you use.  The collation is only matter on what method it will use to compare two string.  Since different language has different sorting rule, that is the reason we have many different collations.

 The most common collation is latin1_general_ci_as. Although the name seems indicate it can only sort latin based character correctly, but it actually can sort many or majority of the language correctly.

April 15, 2009 2:14 AM
 

Community Blogs said:

SharePoint Uploading Files to SharePoint Server 2007 from ASP.NET Web Applications by Using the HTTP

June 12, 2009 8:11 PM
 

William said:

Hi Qingsong, I'm dealing with chinese characters. How come when I imported mt utf-8 file with chinese chars in sql server I get shattered non chinese characters even though I use nvarchar? Can you kindly advise? Thank you

July 10, 2009 2:43 AM
 

Qingsong Yao said:

Hello, William

 If you are using SQL Server 2005, you can use bcp.exe to bulk load UTF-8 data into your table,  try

[-C code page specifier] option and use 65001 as code page specifier.  Note,  I never tried this option, but I know it works.

 If you are using SQL Server 2008, the bcp.exe shipped with SQL Server 2008 don't support import UTF-8 data in.  As a workaround, you can install the SQLNCLI library shipped in SQL Server 2005 (I beleive it is a downable package) on another machine, and bulk load UTF-8 data into SQL Server 2008 table.

 Another workaround is converting your data from UTF-8 to UTF-16 encoding, you can either write a simple C# program or use NotePad.

 Note: the default chinese character's encoding is GB2313, which is supported as code page 936. which is natively supported.

July 10, 2009 1:00 PM
 

CJK said:

So, what you're saying is that SQL Server doesn't support UTF-16 in all cases, but does support UCS-2 ? (equivalent to UTF-16 BMP only). Unfortunately that will mean that not all Chinese, Japanese and Korean characters can be stored in SQL Server. A partial implementation is not much use if you are doing global internationalization.

Thanks for taking the time to write the article.

August 23, 2009 11:45 PM
 

Qingsong Yao said:

Hello, CJK

 All Chinese, Japanese and Korean characters can be stored in SQL Server.  Do you have a case that a character failed to store or retrieve from SQL Server?

August 24, 2009 12:05 AM
 

Mary Ellen said:

Great Article!! Can you please tell me where the next article is, that you referred to, where you describe the difference between UTF-16 and UTF-8? Thanks for writing the article.

September 18, 2009 10:25 AM
 

James Brook said:

Hi,

I have to create a database (SQL2005) for multinational usage (English and Korean languages included).

This database will be used by an international application.

Here is my question :

when creating a new database I'm not sure what collation option to pick to support the Korean language.

Thanks in advance for your help

October 2, 2009 8:48 AM
 

Qingsong Yao said:

HEllo, James

 The short answer is that a collation can support many languages.  A collation has a major language, such as latin1_general_ci_as has major language us_English.  It  also support Korean language since Korean language is compatiable with us_English.  So this collation is enough for your application.  

 However, the latin1_general_ci_as only support Unicode 2.0 characters, so  if you want more advacned Korean Language support, and Korean is the majority language of your application, you should choose Korean_90_CI_As as your collation.

  Please let me know if you have further question about this.

Qingsong

October 5, 2009 1:09 PM
 

Frank Lu said:

Hi Qingsong,

Here's a case need your help.  I created a linked server in SQL Server 2005 connecting to MYSQL.  I need to query the UTF-8 data from MYSQL and then insert them into local table.  After insert, the English chars are ok but Chinese chars are not displayed correctly.  Is there a way to use linked server retrieving data from MYSQL correctly?

November 3, 2009 3:27 AM
 

Qingsong Yao said:

Hello, Frank

  First, please make sure that you are using nvarchar type in your SQL Server column.  Second, how you setup your linked server, which OLE-DB provide you are using?

November 3, 2009 10:18 AM
 

Frank Lu said:

Hi Qingsong,

Thanks for the reply.

For linked server, I use Microsoft OLE DB Provider for ODBC Drivers which connect to a DSN - MySQL ODBC 5.1 Driver.

Other information:  Database collation - Chinese_PRC_CI_AS;  Coding language for application - Java

Please advise.

November 3, 2009 8:01 PM
 

Frank Lu said:

Hello, Frank

 Are you using nvarchar type in SQL Server? What about you only select the data throught linked server, but not insert into the table.  

  The behavior really depends on how we map the character datatype defined in MySQL to OLE-DB type.    If it maps to varchar type,  you will have data loss, other it maps to nvarchar type,  you should be fine

Qingsong

November 3, 2009 10:28 PM
 

BartJ said:

I have a SQL server 2005 database that is has the default Collation of SQL_Latin1_General_CP1_CI_AS.   I am trying to store Turkish Data (8859/9 and/or codepage 1254).  There are actually only six characters that don't store correctly in varchar columns.  For example, one of them is "Ş" (LATIN CAPITAL LETTER S WITH CEDILLA, U+015E, decimal 351).   So, for a test, I created new NVarchar(10) column thinking that would be all that would be required.  However when I run the following in SSMS T-SQL command in SSMS:

INSERT INTO mytable (MynVarchar) Values ('Ş')

the "Ş" gets changed to a Latin1 Captal "S" without the Cedilla--UNLESS I preface the unicode character with an upper-case 'N'  as below

INSERT INTO mytable (MynVarchar) Values (N'Ş')

I read in the article below that this is a requirement, but I do NOT understand WHY it is this way.   Why can't I just use NVARCHARS and have the data inserted correctly as you wrote in your article above?  To me this means that it's not a simple matter of changing my columns from Varchar to NVarchar, but in addition, I have to change *ALL* of my source code to include the prefaced 'N'.   From your article above, (and many others on the internet, excluding the one above from Microsoft), I keep reading that I just need to use NVARCHAR, but this is not correct?  What am I missing here?   Very confusing...

http://support.microsoft.com/kb/239530

November 10, 2009 1:00 PM
 

BartJ said:

I should add that I also tried change the collation of the Database from the SQL_Latin1_General_CP1_CI_AS to Turkish_CI_AS.   What I found was that the command below worked WITHOUT the prefaced 'N':

INSERT INTO mytable (MynVarchar) Values ('Ş')

So at that point I said to myself "GREAT!!! No problem!--I will simply change the collation to Turkish_CI_AS on all Turkish Servers, and all Varchars to NVarchars and then we are good to go--WITHOUT having to modify all of my source code to include the 'N'......   But then I connected to server in Turkey to test the same thing, (created a DB with Turkish_CI_AS as the collation, created an nVarchar(50) column and then ran the same insert command, and BOOM, the Ş is screwed up again and is now an "S"   I am extremely puzzled!!!

November 10, 2009 1:23 PM
 

Qingsong Yao said:

The reason is that 'Ş' is a varchar type which has a code page associate with it.  N'Ş' is a nvarch type and it is a Unicode character.  the sql_latin1_general_ci_as code page don't have the character 'Ş' defined in the code page, so we will use the best matched charcter for replacement.

For your secondary question,  I don't know the reason.  It might be client side issue, can you try to use SQL Server Management Studio to run the query and see the result?

November 10, 2009 1:40 PM
 

BartJ said:

I actually remote into their server and have been doing the tests inside of SQL server management Studio.   I just tried it again though, and this time, it worked!  I must have done something wrong when I set the collation the first time (I have been switching back and forth between the default collation and the Turkish and I must have goofed in my testing.)  

Anyway, is it appears that setting collation to Turkish_CI_AS, I can then issue the Insert command below without the N and the Ş character is preserved.

There are many articles that never mention collation settings though and seem to suggest that at least in terms of storing data "the collation doesn't matter as long as you use NVarchar instead of Varchar as your data type"  These articles seem to me to suggest that the collation only has value in regards to sorting and comparing data.  

In fact I just re-read this page of yours and you write:

"When you choose nvarchar as your data type,  you automatically get UTF-16 Encoding.  Please note,  For nvarchar types, we always use UTF-16 encoding to store the data no matter which collation you use.  The collation is only matter on what method it will use to compare two string.  Since different language has different sorting rule, that is the reason we have many different collations."

This again suggests that the collation doesn't matter.   Can you explain?  Thank you.

November 10, 2009 2:39 PM
 

sai said:

I have the following problem:

server 1 : sql server 2005 sp1 with proper collation supporting UTF 16 for chinese language support

I wanted to move the entire database to new server as it is , so i took backup and then moved it to new sql 2005 sp1 server and restored the backup , now when i query chinese characters are not displaying

I have compared each and every setting but no luck.

please advice what to check.

November 10, 2009 6:30 PM
 

Qingsong Yao said:

Hello, Sai

 First,  let us confirm whether it is a display issue or data missing/lose issue. From my experience, it is unlikely the back/restore will cause data lose.  So I guess it might related to data display issue.  Can you confirm that you install the correct chinese fonts.  It happens for Winodw XP/2003 which did not install Chinese Package by default.  You can go to Control Panel -> Language, and install it

November 10, 2009 6:43 PM
 

sai said:

hi qingsong,

thank you very much for replying me back.

i have confirmed that chinese package is installed , so please assist what would be my next steps

November 10, 2009 8:22 PM
 

Qingsong Yao said:

Hello, Sai

 Can you privde some more information on what knind of query you are using,

and the table structure. Also, what you see on your screen.

November 10, 2009 10:38 PM
 

BartJ said:

Hi Qingsong,

I am hoping you did not forget about me and my last question regarding whether the database collation is essential or not for "storing" unicode.  From my tests, it appear that it is, but as I've read from your posts above, and from many others, usually people say that the collation only matters in relationship to sorting and comparing teh stored data.

November 11, 2009 8:28 AM
 

Qingsong Yao said:

Hello, BartJ

 For nvarchar type, the collation only matter for sorting and comparing data,

but for varchar type, the collation also controls which language you can insert into.

For developing International Application, please use nvarchar type, and the prefix N'',   it will help you reduce a lot of issues later.

November 11, 2009 10:14 AM
 

Resolved said:

Hi Qingsong,

Thank you for your response, it turned out to be chinese font not installed :)

November 16, 2009 12:54 AM
 

Nelins said:

Hi Quinsong,

actually a have a mysql database linked to SQLServer 2000 using oleDb for ODBC and MySQL ODBC 5.1.  The MySql Database is in UTF-8 Charset.  When I run a query, it's returns a bad word, i mean instead to returns Bogotá, it's returns Bogotá.

I can't change the collation, directly on mySQL database because it's not mine.  How can I read those fields in the right way?  I've been trying to do it using collations syntax but it doesn't works.

I'll wait for your response.

Thnx.  Nelins

November 20, 2009 1:57 PM
 

Qingsong Yao said:

Hello, Nelins

 I never have experimence using linked server with MySQL.  I will try sometime today.  Can you explore some other way to port data from MySQL to SQL Server

November 20, 2009 2:04 PM
 

Nelins said:

Quinsong,

I don't know... I've been using this way in others linked Database in the same server and it works... just on this one is failing.

November 20, 2009 2:27 PM

Leave a Comment

(required) 
(optional)
(required) 

  
Enter Code Here: Required
Submit

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker