|
|
-
In this article, I recommend several Unicode articles/websites for reference. Note, the list is not yet completed, I will add more entries and make better categorization - My blog is a good site for collation issues in SQL Server.
- Sort it all out. Michael Kaplan's random stuff of dubious value is a great source for learning Globalization techniques with Microsoft Software.
- International Features in Microsoft SQL Server 2005. The title says.
- Description of storing UTF-8 data in SQL Server: this is a article you need to read if you want to store UTF-8 data into SQL Server
- UTF-8 and Unicode FAQ for Unix/Linux. Although the title indicates about Linux/Unix, this is a very good papers talking about UTF-8. If you want to learn UTF-8, it is the best document.
- A tutorial on character code issues: one comprehensive article about character encoding. It is a good reference.
- Migrating Software to Supplementary Characters. This 57 pages presentation is a complete list of things you need to consider when migrating from UCS-2 to UTF-16. Read it, you will see that it is really not a trivial task.
- Globalization issues in ASP and ASP.NET. This is a really good article for ASP and ASP .Net
- Oracle® Database Globalization Support Guide: Programming with Unicode. I have to say oracle’s document is better sometimes. This book is pretty good reference if you want to learn Oracle’s Globalization support. it can also serve as a reference to Unicode.
- Avoid treating binary data as a String. In this blog, Shawn Steele discussed that interchange between binaries and string by using Encoding class is not always safe and round-trip since the binaries might be a mal-formed UTF-16 sequence. This link is the change in details.
- “How to: Send and Retrieve UTF-8 Data (SQL Server 2005 Driver for PHP)“. It is a must read document if you are developing PHP with SQL Server.
- UTF8 Security and Whidbey Changes. In this blog, Shawn Steel described behavior changes related to UTF-8 encoding functions in Windows Vista and .Net 2.0. For people using Windows MultiBytesToWCHAR and .Net UTF8Encoding class, this is a must read document.
- Some musings on Oracle Character Sets. A very good article about our friend Oracle’s Unicode character set support.
- Surrogates and Supplementary Characters. MSDN Win32 UTF-16 Support document.
- Implementation of Unicode in SQL Server. This is a pretty good introduction paper about Unicode Support in SQL Server
- “Saving UTF-8 in SQL Server 2005” is one forum thread where people talk about storing UTF-8 data in SQL Server.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). This is a very good article which talks about the basic about Unicode in a simple way. A good introduction ariticle.
- Unicode in XML and other Markup Languages. This is W3C ‘s official document for using Unicode Standard for XML language
- Form of Unicode: this is a pretty handy document which talking about Unicode Encoding and the different format of Unicode character. This is my favorite document.
- Unicode in Visual Studio 2003 contains good examples how C#’s Unicode Support, such as FileEncoding, Globalization Config, ASP .Net Request and Response Encoding.
- The Basic of UTF-8. In this article, Marius did a brief introduction about UTF-8 Encoding. Then he gave a C++ code example of convert from/to UTF-16 to UTF-8. This is another example of writing UTF-8 in C++
- UTF-8 Versus Windows UNICODE. In this article, Ben did a comparison between UTF-8 and UTF-16 Encoding.
- Guide to Unicode is a series very extensive articles about Unicode. My favorite articles.
- Globalization Step by Step: a comprehensive guide-line for doing globalization development with Microsoft Products.
- Supplementary-Aware String Manipulation Sample is a set of SQLCLR functions which can do supplementary-ware string manipulation. Works for SQL Server 2005 and 2008.
- Autotranslation of Character Data. If you works on SQL Server varchar type, and retrieve by using ODBC driver, you might hit the issues discussed in the article when your client OS is different with your server OS.
- Oracle’s string function length semantic
- Oracle’s data type length semantic
- MySQL’s string function length semantic
- MySQL’s Unicode support
- Implement 4-byte UTF8, UTF16 and UTF32. This is MySQL’s design note for implementing different Unicode Encodings for version 6.0. It is a good reference if you want to know the challenge of supporting different Encoding in a database
- Understanding Unicode and ODBC Data Access: this is a good article for ODBC Driver’s Unicode Support.
- Using the Easysoft ODBC-Oracle Driver with Unicode Data: another article for ODBC Driver’s Unicode Support.
- Application encoding schemes and DB2 z/OS ODBC”: DB2’s Unicode Support
- MySQL 6.0 Manual: 9.1.4. Connection Character Sets and Collations: MySQL driver/protocol support for Unicode
- Esaysoft ODBC for MS SQL SERVER with UTF-8 support: If you want to get UTF-8 data back from SQL Server, try this.
- SQL SERVER字符集的研究. If you can read the title, you will know what is talking about.
- 众多字符集编码的区别. very good article in Chinese which talks about Unicode Encoding.
- The current PHP version 5 have NO Unicode Support, according to this web site.
- The next release of PHP version 6 will have native Unicode Support, here is the description from web site "Upcoming PHP release will offer Unicode support"
- PHP 6.0 will switch from having a single, generic string type to having two: a Unicode string type for text data, implemented through UTF-16, and a binary type, which will include actual binary data and text data for legacy locales.
- Can the CP_ACP be UTF-8? If you are working with Visual C++, and wonder whether the UTF-8 can set as the default locale of the system or current thread’s locale. You might need read this document which says No
|
-
Hello, Guys
Please the attached zip file for the SQL Server collation selection tool. It is written in C#.
|
-
Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8. So today's talk will be quite short. I just clarify some misunderstand. 1. SQL Server doesn't support Unicode, do you mean UTF-8?
Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type. But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding. I copy several definitions from Internet for these concepts: "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as "Latin Uppercase Alphabet", and assigned a code point U+0041. In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does. The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File. UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "
In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts. 2. UTF-8 Encoding is much better than UTF-16 Encoding
There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding. I will compare these two encoding side by side in my next article. I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice. The correct answer is that no encoding is absolute better than the others. User should choose the suitable encoding according to your application software requirement. The operation system, programming language, database platform do matter when choosing the encoding. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.
So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs. 3. SQL Server cannot store all Unicode Characters
You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2. I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:
- SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
- SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
- SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
- SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding. Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters. All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.
In contrast, I also list the UTF-16 thing SQL Server doesn't support:
- SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
- SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
- SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1) supplementary character is rare 2) string function will only break this when it happens be the boundary. For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.
In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation. Please refer to my previous blogs to details.
|
-
In recent days, I am mainly reading test books since I am a Tester, and I need to know more about testing. The good news for people who are interesting in collation topic is that - I will have a SQL Server collation select tool available soon.
- I hope I can start to write UTF-8 related stuff.
So if you are interesting in the above topic, and want to get more information soon. Please ping me by adding comments.
|
-
Today, I will
discuss SQL Server’s Binary collations.
In SQL Server, we have two kinds of binary collations: BIN collation,
which has collation name ending with _BIN and BIN2 collation, which has
collation name ending with _BIN2. In SQL
Server, a string value (either varchar or nvarchar value) is encoded and stored
as a sequence of binaries. For nvarchar
value, we always use the UCS-2 encoding.
For varchar value, we use the codepage of corresponding collation
associated with the value.
For “Binary” collation, we use a sorting
algorithm based on the binary sequence of the string to sort the string data. Now,
let us look the binary sorting algorithm.
For varchar type, the algorithm is quite simple; the sorting result of a
set of varchar values with binary collation is the same as we sort them use the
binary sequence. In T-SQL syntax:
Select *
from T order by varcharC
Equals to
Select *
from T order by convert(varbinary(8000), varcharC)
Note, it is always true for both BIN collation and BIN2
collation.
Now, let us look
at sorting the nvarchar values. In this
case, we can not sort them purely use the binary sequence, the reason is that
the binary sequence represents a sequence of UCS-2 characters, which take two
byte per character. Because SQL Server
internally uses Little-Endian to store the UCS-2 character, the binary sorting
may result undesired result. For
example, character a (We usually use U+0061 to indicate its Unicode Code Point
assigned by Unicode Org) is encoded as 0x6100 in SQL Server, so If we sort
these characters in their binary sequence order, we will not be able to get a Unicode Code
Point ordered sorting result which is desired for many application. In SQL Server’s BIN2 collation, we sort the
nvarchar type according to their Unicode Code Points instead of the binary
sequence (internally, we do comparison per WCHAR based, which is 2 bytes). Now, comes the BIN Collation, the algorithm
is wired. It first compares two characters
based on the first WCHAR values of the two characters, if the values are equal,
the algorithm will do binary based comparison for the reminding binary sequences,
i.e., byte per byte comparison.
Hope your guys
can follow my writing. If not, you can
simply assume that BIN2 is a Unicode Code Point based sorting algorithm, and it
is desired by many customer. But BIN is not.
From the sorting algorithm’s view,
all BIN collation use the same algorithm and all binary2 collation use another
algorithm. Then why we have
latin1_general_bin or Japanese_bin collation?
The reason is that they use different code page for encoding the
characters sorting in the varchar type. A
Latin1_General_Bin collation can not hold any Japanese collation, so even they
use the same algorithm to sort the data, but people still need to choose the
language which the data will stored, such as English, Chinese, Japanese
etc. Note, this only apply to varchar
data type. For nvarchar data types,
these collations have no difference at all.
On the other
hand, SQL Server has many linguistic collations which sort string linguistically
using the language associated with the collation. All collations which are not binary collation
are linguistic collations. For example,
Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm
compatible with several of English language and many Western European languages. Please don’t confuse with the name of
Latin1_General, it actually can sort all Unicode characters defined in Unicode
3.2 characters and it can also sort many other languages correctly as well (if
the language has no sorting conflict with the latin1_general sorting rule). I will start another article to describe the linguistic
collation in deep details.
As a conclusion, binary collations have
better performance than linguistic collations, and that is the main advantage
of using a binary collation. A binary collation is always case sensitive and
assent sensitive. User should try use BIN2 collation instead of BIN collation.
|
-
Today, I looked for SQL Server JDBC Driver's sendStringParametersAsUnicode, this parameter controls how Unicode String (Java only have Unicode string which is UTF-16 encoding) will be sending to SQL Server. As the name of this parameter indicate, when the parameter equals to true, which is the default setting, we always send unicode data, i.e., the string with UTF-16 encoding, to SQL Server. When it is set to false, it also send ANSI data, i.e., the string encoded with a code page to SQL Server. In this case, it uses the code page of the current connected SQL Server database to convert the Unicode String to a codepage encoded string.
I am a Unicode Fan, and always want recommend people using Unicode. So when I knew this parameter, my first question is that WOW, set the value to false has potential data corruption. When we convert the UTF-16 encoded string into a codepage encoded string, we may face the issue of some character is not defined in the code page, and will be replace as question mark (?). Then, after I read this Article, I knew the reason of doing so is to enabling index seek when the target data type in SQL Server is varchar type. Note, the Article ONLY apply when the collation of your column is a SQL Colaltion, but NOT Windows collation. In most of case, people will use Windows collation as the collation, except us-English which still use sql_latin1_general_cp1_ci_as. So If the target type is nvarchar type or your collation is Windows Collation, don't set the parameter to false.
Come to conclusion, for JDBC application, please use nvarchar data type, and you will have full Unicode support, no data lose, no issue of not being able to use index, and better performance. If you are using Windows Collation, don't set the flag to false.
|
-
Today, I got following questions: "The documentation states that Datetime2 is unaware of the time zone.
Does that still guarantee that when the clock gets adjusted for
Daylight Saving Time this will not result in duplicate timestamps? I.e. is Datetime2 internally mapped to UTC?" Here is the answer: - Datetime2 does not guarantee that you will get a unique value you would
have to have a unique constraint on the column. If you have several
batches running at the exact same time on a multiprocessor system you
may end up with duplicate values without a constraint on the column, if
you need guaranteed unique values you need to work with
uniqueidentifier and the NEWID function.
- DateTime2 in SQL Server does not contain any information about
the timezone of the value and whether the value is UTC or local time.
The SYSDATETIME, SYSUTCDATETIME, SYSDATETIMEOFFSET built-in functions
will adjust automatically when the Windows Operating System, which the SQL Server instance running on, adjusts the clock during
daylight saving switch. I.e., the datetime offset returned by
SYSDATETIMEOFFSET will be changed from -08:0 to -07:00 for Pacific Time. The
SYSDATETIME value generated will have one hour gap when entering
daylight saving, i.e., one hour is missing, but SYSUTCDATETIME does not.
- My recommendation is that always store UTC time in datetime2 column or use SYSDATETIMEOFFSET data type, and handle local time issue by calling Windows/.Net API.
|
-
You may wonder why I keep writing collation topics in my blogs. I found our Books Online topics related to collation is not so clearly enough, and sometime it confuse our reader. I intend to present my idea related to collation in a series of blogs, and finally I want group all these topics together to draw a clear picture about collation. My blogs will try to answer following question in details:
- How I should choose a collation?
- What is the different between collation A and B?
- How collation impact the behavior of query, such as string functions, variables assignment?
- Best practices for using SQL Server's collation.
Again, I strong suggest your guys send my comments or collation related question to me, If I can not answer, I will find some language expert to answer this, and I will also share the answer as well.
|
-
The best way to avoid SQL Server's collation conflict issue is that avoid have different collations in your database schema. - If your server and database have the same collation, you will never see the conflict collation issue.
- if all string columns in a database use the same collation, you will see less collation conflict unless you do cross-database query. You don't need explicit set the collation for string collations, they will pick up the database's collation by default.
- Keep in mind, changing existing database's collation is allowed, but it wouldn't changing existing table's collation, so they will have change of mix of different collations.
- I believe the common case of collation conflicts happens when your tempdb collation (which is your server collation) is different with your current database' collation. Objects created on tempdb will automaticlly pick the server collation, not the current database collation. The best way to solve this is this article:http://sqlskills.com/BLOGS/KIMBERLY/post/Changing-Database-Collation-and-dealing-with-TempDB-Objects.aspx
- When you hit collation conflict, you can always use explicit collate clause to resolve the conflict.
- If you have any other concern or suggestion, please send me detail about your issue. You can also vote for letting SQL Server solve this issue at http://connect.microsoft.com/SQLServer/feedback/Validation.aspx?FeedbackID=324910
|
-
Recently, I got several request related to GB18030 Character Set support in SQL Server. Here is one of the requests:
When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error. insert myTable values ('32066326-6E23-4E31-B226-9F8652CE11F3', N'㐀㐁ᠠᠡᠢཌཌྷꀀꀁﺶﺷﺸ', 'AC1', '') insert myTable values ('32066326-6e23-4e31-b226-9f8652ce11f7', N'ꀀꀁﺶﺷﺸ', 'AC1', '')
does anyone know cause / solution / workaround?
Before going into the solution, I need to explain about what is GB18030 Character Set. Here, I copied the Unicode Faq on GB1830: What is GB 18030?
GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980). What is new in GB 18030?
It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK, which itself is a superset of GB 2312-1980. There are about 1.6 million valid byte sequences. It is not possible to determine if a byte sequence is either 2 or 4 bytes long by just examining the lead byte - the second byte must be examined as well. Why is GB 18030 important?
The Chinese Government has mandated that all applications released on or after 2001-Sep-01 must support GB 18030. How does GB 18030 relate to Unicode?
The specification refers directly to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most character assignments. Some characters that used to be mapped for GBK to the PUA (Private Use Area) for Unicode 2.1 are now assigned in Unicode 3.0, and their mappings from GB 18030 use only the Unicode 3.0 code points.
In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points including unassigned and non-character ones, but excluding single surrogates. This makes GB 18030 functionally very similar to a UTF.
So let me try to answer the customer's question. Remember varchar type always associates with a code page, and can only stored the characters defined in the code page. The code page which store Simplified Chinese is 936. Code page 936 corresponds to Chinese GBK standard, which is a subset of GB18030 Encoding. So it means that storing GB18030 characters in varchar type has potent data loss issue. On the other hand, since Unicode 3.0 and GB18030 can be roundtrip, so the best way to store GB18030 characters is use SQL Server's nvarchar type. Most of the new added characters in GB18030 map to the Unicode CJK Extension B block, while the block are all Supplementary characters which are represented with a surrogate pair. So supporting Gb18030 character becomes to the same as support Surrogate character, which already discussed in previous blog.
However, the nvarchar type use UCS-2 encoding, which is not the GB18030 encoding. My next point is related to Gb18030 encoding itself. What if user has GB18030 encoding data/files/scripts, the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.
References:
Ask Dr. International,Column #15, New Chinese Encoding GB-18030
GB 18030 WikiPage
GB18030 Characters in SQL Server
International Features in Microsoft SQL Server 2005
|
-
In this blog, I will give some my recommendation of SQL
Server's Globalization Development.
1.
Use nchar/nvarchar instead of char/varchar
type. Nchar/nvarchar type is SQL
Server's Unicode data type, which can store any characters defined by Unicode
Org. char/varchar type always associates
with a code page, so the number of supported characters is limited. I give you some fact to show why use Unicode
data type is recommendation:
o
Starting from Windows 2000, Windows have full
Unicode support implemented by using UTF-16 Encoding. Windows APIs take WCHAR* as input which
represent a Unicode String.
o
Virtual C++ has WCHAR, which is Unicode char
type.
o
.Net String is Unicode String only encoded in
UTF-16.
o
Java String is Unicode String only encoded in
UTF-16.
o
Char type might have data corruption issue if
your client locale is different with server locale.
So it is time to deprecated the old
code page technique, and use the Unicode in your application, and also in your
SQL Server database. The only benefit
of using char type is the space saving for single byte code page. You can always use the Data Compression
feature in SQL Server 2008 if you care about disk space, which can save more
space than using char type.
2.
Also remember to put N' for your string literal in T-SQL. String literal N'abc' is a Unicode nvarchar
string literal. Literal ‘ab©' is a varchar literal, it always associates
with a codepage (string literal always use current database's collation).
Suppose char © is not in the code
page, you will get a question mark (?) when inserting the literal into a
table, even the column is nvarchar
type.
3.
Use nvarchar instead of nchar. The different between nvarchar and
nchar is
the storage. A column with nchar(30) type
always take 60 bytes to store on the disk, even the value is a single
character. The data size for a nvarchar(30) type column
is not fixed, it varies row by row or
value by value. A single character value
takes 2 bytes to store, and a value with 30 characters long takes 60
bytes to be stored. Another different between nvarchar and nchar
type is the performance. Nchar types
always stored in fixed location in every row, which can be retrieved
faster
than nvarchar type which is stored in different location for different
row. However, I believe the benefit of
less stored space for nvarchar types usually overcomes the cost of
locating the
value in a row.
4. If possible, avoid using a column collation which is different with the database's collation. You will have
less collation conflict. If you want
a query use a special collation's sorting rule, use explicit collate clause in that query.
5. Use Windows Collation
instead of SQL Collation. The only
exception here is the default collation for en_US locale which is
sql_latin1_general_cp1_ci_as.
6.
Never store UTF-8 string in varchar types, you
will get data corruption.
7.
Keep in mind, string comparison always ignore
trailing spaces (Unicode U+0020), no matter what collation you are using.
8.
Keep in mind, LEN function always return the
number of characters exclude the trailing spaces. DataLength function returns the storage size in term of bytes.
9.
using _BIN2 collation, instead of _BIN collation if you want binary, code page
base string comparison behavior.
|
-
Today, I will give
an example of different implementation of a string function. Let us look at function:
SUBSTRING(value_expression
,start_expression , length_expression )
In here, I proposed
four kinds of alternative functions by giving them different name to distinct
the functionality of them.
1.
USC2-SUBSTRING. This function has the same functionality as the SUBSTRING
function.
2.
UTF16-SUBSTRING: This function take UTF-16 string as
input, return type is also UTF-16 string
3.
UTF8-SUBSTRING: This function take UTF-8 string as input,
return type is also UTF-8 string
4.
CHAR_SUBSTRING: This function is a linguistic
version of substring, it wouldn't break the character in the given language,
and the encoding can be either UCS-2/UTF-8 or UTF16.
For the length semantic, I will assume that it is character
length with respected to the corresponding Encoding. I will use following table
to show the differences between these functions when calling function:
Declare
@a nvarchar(10)
select substring(@a, 5, 2)
, suppose the 10 here represents 10
characters
|
Function Name
|
Input
Encoding
|
Output
Encoding
|
Output Type
|
Output Buffer Bytes
|
Surrogate
Aware
|
Collation
Sensitive
|
Algorithm
Complexity
|
|
USC2-SUBSTRING
|
UCS-2
|
UCS-2
|
Nvarchar(2)
|
4
|
No
|
No
|
Simple, and
Fast
|
|
UTF16-SUBSTRING
|
UCS-16
|
UCS-16
|
Nvarchar(2)
|
8
|
Yes
|
No
|
Always scan
from the beginning, need to handle surrogate pair.
|
|
UTF8-SUBSTRING
|
UTF-8
|
UTF-8
|
Nvarchar(2)
|
8
|
Yes
|
No
|
Always scan
from the beginning. See (1)
|
|
CHAR_SUBSTRING
|
Variance
|
=input
|
5, see (2)
|
Depends on encoding
|
Yes, see (3)
|
Yes
|
More complex.
See (4)
|
(1)
For
UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and
call UTF-16 version of the algorithm.
But it will have double conversion since the result need to convert back
to UTF-8.
(2)
It is unknown for how many bytes one linguistic
character will take until the run-time.
So we have to make the result type = input_length- input_index
(3)
A surrogate pair is always treated as one
linguistic character, just it is surrogate aware
(4)
The function itself will be pretty complex. For
looking for one character, it needs search deeper and the move the pointer back.
In addition, I like to give some
alternative solution for UTF16_SUBSTRING:
1)
We call UCS2-SUBSTRING internally, and the last
character it return a leading surrogate character, just removing such character
from the result. The advantage of this approach is that it is fast, and it
wouldn't break surrogate pairs. The drawback
is that it doesn't respect UTF-16 length semantic, which will lead programming
issues.
2)
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(2), and here the 2 means that four UCS-2 characters. I.e., we keep the existing return type, and
semantic as the same. Again, the result
can only hold one UTF-16 character instead of 2.
3)
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(4), and here the 4 means that four UCS-2 characters. I.e., we obey the UTF-16 length semantic for
input parameter, but not for output. In
these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and
UCS-16 encoding. But we still break the
return type from nvarchar(2) to nvarchar(4).
Finally, I like to point out that we have CLR examples of Supplementary-Aware
(Supplementary character refer to a Surrogate Pair) String Manipulation at here. It internally uses StringInfo.ParseCombiningCharacters
Method. It not only handles Supplementary
characters, but according to the API, it handles other combining character
as well. Note, the different between this method and the CHAR_SUBSTRING I
presented here is that there are more language specific rules or character sequences
which cannot be handled by the C# method.
|
-
Today, I will discuss the famous Turkish I issue. Below is one customer's problem related to Turkish collation?
When I use TURKISH_CI_AS collation, some of the field names in the database becomes case-sensitive, as a result queries are returning Invalid Column Name exception if there is a difference between the case of queries and case in the table definition. This error is particularly evident with 'I' character. Any field name with 'I' character results in error. I'm not sure about what other characters will cause error.
I use following script to demo this issue.
create database TurkishDB collate turkish_ci_as;
use TurkishDB;
create table Information(c1 int, c2 Int);
select * from Information;
c1 c2
----------- -----------
(0 row(s) affected)
select * from information
Msg 208, Level 16, State 1, Line 1
Invalid object name 'information'.
As you can see from the result, querying table Information get correct result, but querying information
Does not. The problem here is that in Turkish, the upper case of character of i (lower case letter i with dot) is İ (capital i with dot). Also, the lower case of character of I (capital i without dot) is (lower case letter i without dot). In one word, i and I are not a match pair any more. In SQL Server's Turkish_CI_AS collation, even we do case insensitive comparison, i and I are not equal.
In SQL Server, the nvarchar columns of the system tables use the same collation as the database collation. As a result, when we search for Object information in the above database, it does not match the Object name Information any more. That is the root reason of the above issue.
This is one more trick in the above example. In the above creating statement, int and Int both works fine. Here is the answer from our dev:
Built-in function names (including min and MAX) are matched like TSQL keywords (English, case insensitive). Name binding for all user objects would follow the collation rules. Note that persisted system objects (from sys. Schema) also require name binding using current collation.
Internally, we use Latin1_General_CI_AS to match keywords. In the later stage of query compilation, we need bind a given object name with internal object id, we use the current database's collation for name matching.
Personal, I think the system table should use a language invariant collation to do name matching instead of using the current database's collation. That is another good reason for introducing language neutral, binary collation.
Finally, here is my suggestion of avoid the Turkish I issue. First, for object name, etc, try use either all lower case letter, or all upper case letter. In addition, when reference to the object during query, make sure you use the same object name in your query. Second, keep in mind, this Turkish I issue also happen in your data. For example, when you match input values in one of your stored procedure, and do different operations correspondingly. You might consider to use a more generic collation, such as latin_general_ci_as to do string match. Third, keep in mind, upper() and lower() built-in always was impacted by this issue, you might also use a more generic collation for this. For example, suppose your default collation is english collation
select UPPER ('i') collate turkish_ci_as --will give you english upper/case rule, but the result is covered to turkish collation
select UPPER ('i' collate turkish_ci_as ) --will give your turkish collation rule, the result is also turkish collation
select @a = UPPER (@a collate turkish_ci_as ) --will give your turkish collation rule, the result is also english collation
|
-
As you discuss in here and here before, in order to supporting UTF-16 encoding in SQL Server, we need to handle both "good" data and "bad" data, there are different approaches to achieve this. Today, I will talk about the length semantic again.
In SQL Server, we have two categories of string type: char/varchar/varchar(max)/text types always associate with a code page, and can only store the characters in the range of that code page; nchar/nvarchar/nvarchar(max)/ntext types always use USC-2 encoding, and can store any Unicode characters. When I declare a table has two columns, one for varchar, another for nvarchar:
Create table t1(c1 varchar(30), c2 nvarchar(30))
For column c1, the 30 means that the system will allocate 30 bytes for that variable, and it can only hold as many characters as the 30 byte can hold. The above table has no collate clause associate with the column definition, thus they will use the current database's collation as their collation. If the codepage of column c1 (which is controlled by the collation of that column) is one of the single-byte character-set (SBCS, is used to refer to character sets which use exactly one byte for each graphic character), the column can ALWAYS store up to 30 characters. However, if the code page is one of the MultiByte Character Sets (MBCS, is used to refer to character set which use variable length of bytes for each graphic character), then it can still store up to 30 characters, but it is not always true. Suppose the collation of column c1 is Chinese_prc_ci_as, which has code page 936 (Simplified Chinese). The codepage encodes ASCII characters (range from 0 to 127) with one byte, and all other Chinese characters in two bytes. In this case, column c1 can always hold at least 15 characters (when all these charactes are Chinese), and at most 30 characters (if all of them are ASCII). In summary, for varchar type, the length semantic is the number of bytes.
Let us consider the case for nvarchar type. Here is the definition of this type I copied from BOL:
Unicode data and use the UNICODE UCS-2 character set.
nchar [ ( n ) ]
Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.
nvarchar [ ( n | max ) ]
Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.
so column c2 in above table can ALWAYS hold up to 30 Unicode UCS-2 characters. So we can say that, for nvarchar type, the length semantic is number of characters.
What if column c2, the nvarchar type, is encoded in UTF-16? We have two options: 1) we say c2 can hold up to 30 Unicode UTF-16 characters, but NOT ALWAYS. It is because of a surrogate pair takes 4 bytes, if we still allocate 62 bytes for values stored in c2 which cannot hold 30 surrogate pair. 2) We can say c2 can ALWAYS hold up to 30 Unicode UTF-16 characters by internally allocate 122 bytes for values stored in c2.
Let us compare about the two approaches. Case one has less code change and no breaking change in existing functionality, but it will more like the varchar's length semantic, and customer need aware of this. Case two has better semantic meaning of character length, and customer might desire for this. Drawback is that user can only declare length up to 2000. For example, c2 nvarchar(2001) will be invalid since in theory, we might end-up with 8006 bytes which cannot hold in one physical page. It is a breaking change for customer, and it also has certain benefit (I will discuss later). Oracle and MySQL both use the second approach.
Now, let us examine one of the string functions: substring which is declared at
SUBSTRING(value_expression ,start_expression , length_expression )
In UTF-16, the start_expression and length_expression argument should both use the correct UTF-16 character length semantic; such that a surrogate pair is always treat as one character. If I run query
Select substring(c2, 10,2) from t1
What is the return type for the above query. Today, since we know that the query return at most two character per value, so the return type will be nvarchar(2). It will be problematic when c2 is UTF-16 Encoding. The reason is that if the returned two characters are both surrogate-pairs, what is the return type then? The answer really depends on the above two approach of UTF-16 length semantic. For approach one, we need either return nvarchar(4), or return nvarchar(2) but only hold the first surrogate pair. For approach two, it is easy, the return type is nvarchar(2) although this nvarchar(2) has different meaning of our old nvarchar(2) in UCS-2.
What about this query:
Select substring(c2, 10,@len) from t1,
since the system don't know the value of variable @len, the return type will be nvarchar(30).
What about substring with varchar type since some characters are also double bytes. I will check tomorrow, and update the result here.
|
-
As you discuss in here,
in order to supporting UTF-16 encoding in SQL Server, we need to handle both "good"
data and "bad" data. Today, I will discuss different approaches of adding UTF-16
support in SQL Server. For each approach
I described today, I will discuss the advantage and disadvantage, and leave
readers to judge what one you prefer.
The first approach is mainly focusing on handling good
data. Since existing string functions are
not surrogate aware, and have possibility of breaking a surrogate pair (good data)
into single surrogate point (bad data), why want just introduce a set of new functions
which use the correct character length semantic? For example, we can introduce char_len,
char_substring, char_left, char_right functions, which take a surrogate pair as
one character. For existing string
functions, the semantic and result will not be changed. What about "bad" data, I will say we possibility
have to treat as "good" data since we haven't prevent existing string functions
to generate bad data from good data. The
advantage of this approach is that we are adding new functions, and leave old
function unchanged (in term of functionality and also performance). For people
who don't care about surrogate pair, they can continue to use old string
functions without seeing any difference.
For people who care about surrogate pair, they can start to use new functions. The disadvantage of this approach is that it
does not really solve the "bad" data issue. Also people have to choose between
the new string functions and the old string functions.
Let us talk about the second approach. It is pretty simple: we just fix all existing
functions which take the correct character length semantic. The good thing is that we will not break good
data and always return correct result.
The cost is that we may pay certain performance penalty. For example, Before
LEN() function just can just need return number of WCHARs, which take O(1) time,
however, the new LEN() will have to traverse the whole string even this is no
surrogate character in the string, which will be quite slow. Another issue is that changing the result of existing
functions is a breaking change. Len() function used to return 2 for a surrogate
pair, but it return 1 now. Even the new result
is correct, some customer might still want keep the old behavior. Also,
I haven't checked the return type of these string functions (I will start
another topic to cover this). For example,
if substring(c1, 1,2) return nvarchar(1), then it need either return nvarchar(2)
as data type. This behavior
change is not trivial and it is definitely a breaking change. One
possible solution is to use the compatibility level. User can change the database's compatibility level
to be lower than the current one, and it will see old behavior and also enjoy the
better performance of old behavior. Then,
what about the "bad" data? I am favor of
fixing the bad data whenever possible. What about database upgrade? Do we want to scan the whole database to verify
no bad data or we just assume that there is no bad data, and only generate
error when we found bad data later? I
like to raise another question here. Can
you track whether there are surrogate characters in a column of one table? If so, can we smartly use old algorithm for
such column?
I have the third solution in mind, which is introducing the
concept of Encoding in database level. A
database can have an Encoding option which indicates the encoding of nvarchar
type. The default encoding is UCS-2 which is the current encoding we supported,
and all existing behavior will be the same. A UTF-16 Encoding will be introduced and can
be set on database level. Once the
database encoding is set to be UTF-16, all string functions will follow the
UTF-16 length semantic, and we will also do data validation as well. The advantage of this approach is that user
will enjoy the better performance of old behavior by default, and no upgrade
issue at well. New user can choose the
new behavior only when they intend to do. What about changing encoding from UCS-2 to
UTF16, do we need to do data validation?
I don't know the answer. BTW, other
database vendors also use the same mechanism as well.
Finally, I like to briefly describe other approaches which I
thought of, but I don't like personally. How about introduce new collations,
such as latin1_general_UTF16_ci_as. We
already have enough collations to confuse our customers, I am not favor of
adding more collations. How about
introduce new column level encoding? I
believe in most of case, people want the whole database have the same behavior,
having fine granular control over the encoding might not be interesting to our
customer.
In summary, performance and backwards compatibility will be
the two issues which need to be carefully considered when implementing UTF-16
in SQL Server. My next blog will exam all
string functions to see what is the impact with UTF-16 Encoding.
BTW, Michael Kaplan posted a series of articles at here
to describe his idea of UCS-2 to UTF16 migration for SQL Server. I have some difficulty to follow his
idea. However, I can guess some basic
principle here. In some case, a sequence of Unicode points together as one
character from the end user's point view.
Examples are: the two small ss in German language, Korean Jamo and Thai Language. Breaking such character sequence is not
desired by end users. Then, the question
is that should we also make the new function take "true" character into consideration. Here, I reference Unicode FAQ "How should characters
(particularly composite characters) be counted, for the purposes of length,
substrings, positions in a string, etc. "
as the end of this topic
A: In
general, there are 3 different ways to count characters. Each is illustrated
with the following sample string.
"a" + umlaut + greek_alpha + \uE0000. (the latter is a private use
character)
1.
Code Units: e.g. how many bytes are in the physical representation of the
string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]
2.
Codepoints: how may code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided
by 4.
3.
Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Text Boundaries, as
well as in UTS #18 Regular
Expressions.
The
choice of which one to use depends on the tradeoffs between efficiency and
comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all
low-level string operations, and then also supply layers above that provide for
#2 and #3 boundaries when circumstances require them. This approach allows for
efficient processing, with allowance for higher-level usage. However, for a
very high level application, such as word-processing macros, graphemes alone
will probably be sufficient.
|
|
|
|