|
|
-
Let us try following T-SQL script on SQL Server: declare @a float =300000000000000000000000000000000000 select @a + 50 -@a select @a -@a +50 go Guess what is the result? the first is 0. and the second result is 50. Floating point data is approximate; therefore, not all values in the data type range can be represented exactly. When you add a small number with a very big number, the small number might just lost in the end result. As a end-user, we want consistent and correct result. However, float data type has no correct result since the value is not accurate. Query result on float data type is also not consistent as well. In the above example, we change the order of + and – operator, we will see different result. Suppose we want aggregate float values, such as sum(floatcolumn), SQL Server may choose to parallel scan the whole table with different thread, and sum up the result together. In this case, the order of data values are random, and the end-result will be random. In one of our customer’s case, he run the same query multiple times, and each time he get a totally different result. So, please try to avoid using float data type, especially you want to do some aggregation on float types. A workaround is to covert the type to a numeric type before doing aggregate. For example, support X is float you can use sum(cast(x as decimal(30,4))) to get consistent result
|
-
Today, One customer asked about collation SQL_Latin1_General_CP1_CI_AI with non unique nonclustered index we are getting a sort on the query plan. The sample script is: CREATE TABLE TableWithASColumn(ID INT PRIMARY KEY, CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS) CREATE NONCLUSTERED INDEX IX_CharData ON TableWithASColumn(CharData) CREATE TABLE TableWithAIColumn(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI) CREATE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumn(CharData) CREATE TABLE TableWithAIColumnUniqueIndex(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI) CREATE UNIQUE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumnUniqueIndex(CharData) SELECT ID, CharData FROM TableWithASColumn ORDER BY CharData SELECT ID, CharData FROM TableWithAIColumn ORDER BY CharData SELECT ID, CharData FROM TableWithAIColumnUniqueIndex ORDER BY CharData If you run the script inside SQL Server Management Studio, and Show the Actual Execution Plan, you will see below picture:
At a glance, it seems very wired because building an index on a key column means that the index already sort the data according to the key column, we just need scan the index, and the output should already be sorted. Another wired issue is that this only happen on certain collation. Before we talk about the root cause, let us discuss a little about comparison and sort. A typical comparison interface is int Compare(object Other) which compare this instance with another instance. It returns 0 if two objects are equal, –1 if this instance is smaller, and 1 if it is bigger. The sort algorithm is implemented by calling the Compare Method for two objects, and depends on the result of Compare() method, put two objects in the correct order. If two objects are equal, the order of these two objects on totally random since there is no other way to ordering them. Now, let us look at the collation cases. Suppose I am using a case insensitive collation, such as latin1_general_ci_ai it will treat character ‘b’ and ‘B’ are equal. So when I order by the column, I know that character a will be before ‘b’ and ‘B’, but the order of ‘b’ or ‘B"’ will be totally random. What about a user want to sort ‘b’ before ‘B’ and treat them as equal during comparison (I think it is a totally reasonable requirement). In this case, we need a comparison case insensitive and sorting case sensitive collation. SQL Server don’t have such collation flag to allow us to achieve this officially. However, there are a bunch of “old” collation (When I say old, I mean these collations are not changed/updated since 2000, and are not recommended to use in future development), which actually implement this feature. Such collations are called SQL TERTIARY Collation, sql_latin1_cp1_ci_ai is one of such kind of collation. In such collation, we internal have a tertiary weight table which defines a weight for each character. During sorting on columns, and if two characters are the same, we will lookup the table to find the tertiary weight of the two characters, and sorting them according to the tertiary weight. As you see from the description, if I issue an order by clause on a column with TERTIARY collation, I can’t simply scan the index defined on column since the index is not sorting any more (index is built on top of comparison, but not sort order), and that is reason that we have to put a SORT on top of the index scan which explain the above wired result. After search the Internet, I found one solution for this issue described in KB Article at http://support.microsoft.com/kb/951935. You can also look at TERTIARY_WEIGHTS (Transact-SQL) at http://msdn.microsoft.com/en-us/library/ms186881.aspx to get a list of SQL TERTIARY Collations Now, the question is that should I use such collation in my application. My suggestion is that don’t use these collations since they are “old” collation, and they can not sort all characters correctly. Another thing you need to keep in mind is that the above behavior ONLY apply to char/varchar column, but not NVARCHAR column. So it is another reason is avoid using these collations (because you might want consistent behavior between char and nchar columns).
|
-
Today, I got a question from customer: I have one column (Type: nVarchar) in a table which has data something like this: a1 aa aa2 AB I want to show all the rows which have all the lower case letters in the above mentioned column. The following query works: select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[abc]%' Collate Latin1_General_CS_AI And Surprisingly this one does not: select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[a-c]%' Collate Latin1_General_CS_AI --Would actually be [a-z] Is it like the case sensitivity does not work with ranges i.e. [a-c] or I am missing something here?? Here is my answer: Can you try select * from test WHERE LTRIM(RTRIM(a)) LIKE '%[a-c]%' Collate Latin1_General_BIN this should give you case sensitive result. The reason is that [a-c] means include all characters which is bigger or equal than a, and smaller than c, so uppercase A is also in the range even according to linguistic order for Latin1_General_CA_A. In other case, for case sensitive collation, a is not equal to A, not the order of the characters might still be a <A <b < B <c <C, etc.
|
-
In this article, I recommend several Unicode articles/websites for reference. Note, the list is not yet completed, I will add more entries and make better categorization - My blog is a good site for collation issues in SQL Server.
- Sort it all out. Michael Kaplan's random stuff of dubious value is a great source for learning Globalization techniques with Microsoft Software.
- International Features in Microsoft SQL Server 2005. The title says.
- Description of storing UTF-8 data in SQL Server: this is a article you need to read if you want to store UTF-8 data into SQL Server
- UTF-8 and Unicode FAQ for Unix/Linux. Although the title indicates about Linux/Unix, this is a very good papers talking about UTF-8. If you want to learn UTF-8, it is the best document.
- A tutorial on character code issues: one comprehensive article about character encoding. It is a good reference.
- Migrating Software to Supplementary Characters. This 57 pages presentation is a complete list of things you need to consider when migrating from UCS-2 to UTF-16. Read it, you will see that it is really not a trivial task.
- Globalization issues in ASP and ASP.NET. This is a really good article for ASP and ASP .Net
- Oracle® Database Globalization Support Guide: Programming with Unicode. I have to say oracle’s document is better sometimes. This book is pretty good reference if you want to learn Oracle’s Globalization support. it can also serve as a reference to Unicode.
- Avoid treating binary data as a String. In this blog, Shawn Steele discussed that interchange between binaries and string by using Encoding class is not always safe and round-trip since the binaries might be a mal-formed UTF-16 sequence. This link is the change in details.
- “How to: Send and Retrieve UTF-8 Data (SQL Server 2005 Driver for PHP)“. It is a must read document if you are developing PHP with SQL Server.
- UTF8 Security and Whidbey Changes. In this blog, Shawn Steel described behavior changes related to UTF-8 encoding functions in Windows Vista and .Net 2.0. For people using Windows MultiBytesToWCHAR and .Net UTF8Encoding class, this is a must read document.
- Some musings on Oracle Character Sets. A very good article about our friend Oracle’s Unicode character set support.
- Surrogates and Supplementary Characters. MSDN Win32 UTF-16 Support document.
- Implementation of Unicode in SQL Server. This is a pretty good introduction paper about Unicode Support in SQL Server
- “Saving UTF-8 in SQL Server 2005” is one forum thread where people talk about storing UTF-8 data in SQL Server.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). This is a very good article which talks about the basic about Unicode in a simple way. A good introduction ariticle.
- Unicode in XML and other Markup Languages. This is W3C ‘s official document for using Unicode Standard for XML language
- Form of Unicode: this is a pretty handy document which talking about Unicode Encoding and the different format of Unicode character. This is my favorite document.
- Unicode in Visual Studio 2003 contains good examples how C#’s Unicode Support, such as FileEncoding, Globalization Config, ASP .Net Request and Response Encoding.
- The Basic of UTF-8. In this article, Marius did a brief introduction about UTF-8 Encoding. Then he gave a C++ code example of convert from/to UTF-16 to UTF-8. This is another example of writing UTF-8 in C++
- UTF-8 Versus Windows UNICODE. In this article, Ben did a comparison between UTF-8 and UTF-16 Encoding.
- Guide to Unicode is a series very extensive articles about Unicode. My favorite articles.
- Globalization Step by Step: a comprehensive guide-line for doing globalization development with Microsoft Products.
- Supplementary-Aware String Manipulation Sample is a set of SQLCLR functions which can do supplementary-ware string manipulation. Works for SQL Server 2005 and 2008.
- Autotranslation of Character Data. If you works on SQL Server varchar type, and retrieve by using ODBC driver, you might hit the issues discussed in the article when your client OS is different with your server OS.
- Oracle’s string function length semantic
- Oracle’s data type length semantic
- MySQL’s string function length semantic
- MySQL’s Unicode support
- Implement 4-byte UTF8, UTF16 and UTF32. This is MySQL’s design note for implementing different Unicode Encodings for version 6.0. It is a good reference if you want to know the challenge of supporting different Encoding in a database
- Understanding Unicode and ODBC Data Access: this is a good article for ODBC Driver’s Unicode Support.
- Using the Easysoft ODBC-Oracle Driver with Unicode Data: another article for ODBC Driver’s Unicode Support.
- Application encoding schemes and DB2 z/OS ODBC”: DB2’s Unicode Support
- MySQL 6.0 Manual: 9.1.4. Connection Character Sets and Collations: MySQL driver/protocol support for Unicode
- Esaysoft ODBC for MS SQL SERVER with UTF-8 support: If you want to get UTF-8 data back from SQL Server, try this.
- SQL SERVER字符集的研究. If you can read the title, you will know what is talking about.
- 众多字符集编码的区别. very good article in Chinese which talks about Unicode Encoding.
- The current PHP version 5 have NO Unicode Support, according to this web site.
- The next release of PHP version 6 will have native Unicode Support, here is the description from web site "Upcoming PHP release will offer Unicode support"
- PHP 6.0 will switch from having a single, generic string type to having two: a Unicode string type for text data, implemented through UTF-16, and a binary type, which will include actual binary data and text data for legacy locales.
- Can the CP_ACP be UTF-8? If you are working with Visual C++, and wonder whether the UTF-8 can set as the default locale of the system or current thread’s locale. You might need read this document which says No
|
-
Hello, Guys
Please the attached zip file for the SQL Server collation selection tool. It is written in C#.
|
-
Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8. So today's talk will be quite short. I just clarify some misunderstand. 1. SQL Server doesn't support Unicode, do you mean UTF-8?
Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type. But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding. I copy several definitions from Internet for these concepts: "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as "Latin Uppercase Alphabet", and assigned a code point U+0041. In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does. The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File. UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "
In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts. 2. UTF-8 Encoding is much better than UTF-16 Encoding
There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding. I will compare these two encoding side by side in my next article. I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice. The correct answer is that no encoding is absolute better than the others. User should choose the suitable encoding according to your application software requirement. The operation system, programming language, database platform do matter when choosing the encoding. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.
So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs. 3. SQL Server cannot store all Unicode Characters
You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2. I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:
- SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
- SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
- SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
- SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding. Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters. All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.
In contrast, I also list the UTF-16 thing SQL Server doesn't support:
- SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
- SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
- SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1) supplementary character is rare 2) string function will only break this when it happens be the boundary. For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.
In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation. Please refer to my previous blogs to details.
|
-
In recent days, I am mainly reading test books since I am a Tester, and I need to know more about testing. The good news for people who are interesting in collation topic is that - I will have a SQL Server collation select tool available soon.
- I hope I can start to write UTF-8 related stuff.
So if you are interesting in the above topic, and want to get more information soon. Please ping me by adding comments.
|
-
Today, I will
discuss SQL Server’s Binary collations.
In SQL Server, we have two kinds of binary collations: BIN collation,
which has collation name ending with _BIN and BIN2 collation, which has
collation name ending with _BIN2. In SQL
Server, a string value (either varchar or nvarchar value) is encoded and stored
as a sequence of binaries. For nvarchar
value, we always use the UCS-2 encoding.
For varchar value, we use the codepage of corresponding collation
associated with the value.
For “Binary” collation, we use a sorting
algorithm based on the binary sequence of the string to sort the string data. Now,
let us look the binary sorting algorithm.
For varchar type, the algorithm is quite simple; the sorting result of a
set of varchar values with binary collation is the same as we sort them use the
binary sequence. In T-SQL syntax:
Select *
from T order by varcharC
Equals to
Select *
from T order by convert(varbinary(8000), varcharC)
Note, it is always true for both BIN collation and BIN2
collation.
Now, let us look
at sorting the nvarchar values. In this
case, we can not sort them purely use the binary sequence, the reason is that
the binary sequence represents a sequence of UCS-2 characters, which take two
byte per character. Because SQL Server
internally uses Little-Endian to store the UCS-2 character, the binary sorting
may result undesired result. For
example, character a (We usually use U+0061 to indicate its Unicode Code Point
assigned by Unicode Org) is encoded as 0x6100 in SQL Server, so If we sort
these characters in their binary sequence order, we will not be able to get a Unicode Code
Point ordered sorting result which is desired for many application. In SQL Server’s BIN2 collation, we sort the
nvarchar type according to their Unicode Code Points instead of the binary
sequence (internally, we do comparison per WCHAR based, which is 2 bytes). Now, comes the BIN Collation, the algorithm
is wired. It first compares two characters
based on the first WCHAR values of the two characters, if the values are equal,
the algorithm will do binary based comparison for the reminding binary sequences,
i.e., byte per byte comparison.
Hope your guys
can follow my writing. If not, you can
simply assume that BIN2 is a Unicode Code Point based sorting algorithm, and it
is desired by many customer. But BIN is not.
From the sorting algorithm’s view,
all BIN collation use the same algorithm and all binary2 collation use another
algorithm. Then why we have
latin1_general_bin or Japanese_bin collation?
The reason is that they use different code page for encoding the
characters sorting in the varchar type. A
Latin1_General_Bin collation can not hold any Japanese collation, so even they
use the same algorithm to sort the data, but people still need to choose the
language which the data will stored, such as English, Chinese, Japanese
etc. Note, this only apply to varchar
data type. For nvarchar data types,
these collations have no difference at all.
On the other
hand, SQL Server has many linguistic collations which sort string linguistically
using the language associated with the collation. All collations which are not binary collation
are linguistic collations. For example,
Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm
compatible with several of English language and many Western European languages. Please don’t confuse with the name of
Latin1_General, it actually can sort all Unicode characters defined in Unicode
3.2 characters and it can also sort many other languages correctly as well (if
the language has no sorting conflict with the latin1_general sorting rule). I will start another article to describe the linguistic
collation in deep details.
As a conclusion, binary collations have
better performance than linguistic collations, and that is the main advantage
of using a binary collation. A binary collation is always case sensitive and
assent sensitive. User should try use BIN2 collation instead of BIN collation.
|
-
Today, I looked for SQL Server JDBC Driver's sendStringParametersAsUnicode, this parameter controls how Unicode String (Java only have Unicode string which is UTF-16 encoding) will be sending to SQL Server. As the name of this parameter indicate, when the parameter equals to true, which is the default setting, we always send unicode data, i.e., the string with UTF-16 encoding, to SQL Server. When it is set to false, it also send ANSI data, i.e., the string encoded with a code page to SQL Server. In this case, it uses the code page of the current connected SQL Server database to convert the Unicode String to a codepage encoded string.
I am a Unicode Fan, and always want recommend people using Unicode. So when I knew this parameter, my first question is that WOW, set the value to false has potential data corruption. When we convert the UTF-16 encoded string into a codepage encoded string, we may face the issue of some character is not defined in the code page, and will be replace as question mark (?). Then, after I read this Article, I knew the reason of doing so is to enabling index seek when the target data type in SQL Server is varchar type. Note, the Article ONLY apply when the collation of your column is a SQL Colaltion, but NOT Windows collation. In most of case, people will use Windows collation as the collation, except us-English which still use sql_latin1_general_cp1_ci_as. So If the target type is nvarchar type or your collation is Windows Collation, don't set the parameter to false.
Come to conclusion, for JDBC application, please use nvarchar data type, and you will have full Unicode support, no data lose, no issue of not being able to use index, and better performance. If you are using Windows Collation, don't set the flag to false.
|
-
Today, I got following questions: "The documentation states that Datetime2 is unaware of the time zone.
Does that still guarantee that when the clock gets adjusted for
Daylight Saving Time this will not result in duplicate timestamps? I.e. is Datetime2 internally mapped to UTC?" Here is the answer: - Datetime2 does not guarantee that you will get a unique value you would
have to have a unique constraint on the column. If you have several
batches running at the exact same time on a multiprocessor system you
may end up with duplicate values without a constraint on the column, if
you need guaranteed unique values you need to work with
uniqueidentifier and the NEWID function.
- DateTime2 in SQL Server does not contain any information about
the timezone of the value and whether the value is UTC or local time.
The SYSDATETIME, SYSUTCDATETIME, SYSDATETIMEOFFSET built-in functions
will adjust automatically when the Windows Operating System, which the SQL Server instance running on, adjusts the clock during
daylight saving switch. I.e., the datetime offset returned by
SYSDATETIMEOFFSET will be changed from -08:0 to -07:00 for Pacific Time. The
SYSDATETIME value generated will have one hour gap when entering
daylight saving, i.e., one hour is missing, but SYSUTCDATETIME does not.
- My recommendation is that always store UTC time in datetime2 column or use SYSDATETIMEOFFSET data type, and handle local time issue by calling Windows/.Net API.
|
-
You may wonder why I keep writing collation topics in my blogs. I found our Books Online topics related to collation is not so clearly enough, and sometime it confuse our reader. I intend to present my idea related to collation in a series of blogs, and finally I want group all these topics together to draw a clear picture about collation. My blogs will try to answer following question in details:
- How I should choose a collation?
- What is the different between collation A and B?
- How collation impact the behavior of query, such as string functions, variables assignment?
- Best practices for using SQL Server's collation.
Again, I strong suggest your guys send my comments or collation related question to me, If I can not answer, I will find some language expert to answer this, and I will also share the answer as well.
|
-
The best way to avoid SQL Server's collation conflict issue is that avoid have different collations in your database schema. - If your server and database have the same collation, you will never see the conflict collation issue.
- if all string columns in a database use the same collation, you will see less collation conflict unless you do cross-database query. You don't need explicit set the collation for string collations, they will pick up the database's collation by default.
- Keep in mind, changing existing database's collation is allowed, but it wouldn't changing existing table's collation, so they will have change of mix of different collations.
- I believe the common case of collation conflicts happens when your tempdb collation (which is your server collation) is different with your current database' collation. Objects created on tempdb will automaticlly pick the server collation, not the current database collation. The best way to solve this is this article:http://sqlskills.com/BLOGS/KIMBERLY/post/Changing-Database-Collation-and-dealing-with-TempDB-Objects.aspx
- When you hit collation conflict, you can always use explicit collate clause to resolve the conflict.
- If you have any other concern or suggestion, please send me detail about your issue. You can also vote for letting SQL Server solve this issue at http://connect.microsoft.com/SQLServer/feedback/Validation.aspx?FeedbackID=324910
|
-
Recently, I got several request related to GB18030 Character Set support in SQL Server. Here is one of the requests:
When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error. insert myTable values ('32066326-6E23-4E31-B226-9F8652CE11F3', N'㐀㐁ᠠᠡᠢཌཌྷꀀꀁﺶﺷﺸ', 'AC1', '') insert myTable values ('32066326-6e23-4e31-b226-9f8652ce11f7', N'ꀀꀁﺶﺷﺸ', 'AC1', '')
does anyone know cause / solution / workaround?
Before going into the solution, I need to explain about what is GB18030 Character Set. Here, I copied the Unicode Faq on GB1830: What is GB 18030?
GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980). What is new in GB 18030?
It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK, which itself is a superset of GB 2312-1980. There are about 1.6 million valid byte sequences. It is not possible to determine if a byte sequence is either 2 or 4 bytes long by just examining the lead byte - the second byte must be examined as well. Why is GB 18030 important?
The Chinese Government has mandated that all applications released on or after 2001-Sep-01 must support GB 18030. How does GB 18030 relate to Unicode?
The specification refers directly to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most character assignments. Some characters that used to be mapped for GBK to the PUA (Private Use Area) for Unicode 2.1 are now assigned in Unicode 3.0, and their mappings from GB 18030 use only the Unicode 3.0 code points.
In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points including unassigned and non-character ones, but excluding single surrogates. This makes GB 18030 functionally very similar to a UTF.
So let me try to answer the customer's question. Remember varchar type always associates with a code page, and can only stored the characters defined in the code page. The code page which store Simplified Chinese is 936. Code page 936 corresponds to Chinese GBK standard, which is a subset of GB18030 Encoding. So it means that storing GB18030 characters in varchar type has potent data loss issue. On the other hand, since Unicode 3.0 and GB18030 can be roundtrip, so the best way to store GB18030 characters is use SQL Server's nvarchar type. Most of the new added characters in GB18030 map to the Unicode CJK Extension B block, while the block are all Supplementary characters which are represented with a surrogate pair. So supporting Gb18030 character becomes to the same as support Surrogate character, which already discussed in previous blog.
However, the nvarchar type use UCS-2 encoding, which is not the GB18030 encoding. My next point is related to Gb18030 encoding itself. What if user has GB18030 encoding data/files/scripts, the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.
References:
Ask Dr. International,Column #15, New Chinese Encoding GB-18030
GB 18030 WikiPage
GB18030 Characters in SQL Server
International Features in Microsoft SQL Server 2005
|
-
In this blog, I will give some my recommendation of SQL
Server's Globalization Development.
1.
Use nchar/nvarchar instead of char/varchar
type. Nchar/nvarchar type is SQL
Server's Unicode data type, which can store any characters defined by Unicode
Org. char/varchar type always associates
with a code page, so the number of supported characters is limited. I give you some fact to show why use Unicode
data type is recommendation:
o
Starting from Windows 2000, Windows have full
Unicode support implemented by using UTF-16 Encoding. Windows APIs take WCHAR* as input which
represent a Unicode String.
o
Virtual C++ has WCHAR, which is Unicode char
type.
o
.Net String is Unicode String only encoded in
UTF-16.
o
Java String is Unicode String only encoded in
UTF-16.
o
Char type might have data corruption issue if
your client locale is different with server locale.
So it is time to deprecated the old
code page technique, and use the Unicode in your application, and also in your
SQL Server database. The only benefit
of using char type is the space saving for single byte code page. You can always use the Data Compression
feature in SQL Server 2008 if you care about disk space, which can save more
space than using char type.
2.
Also remember to put N' for your string literal in T-SQL. String literal N'abc' is a Unicode nvarchar
string literal. Literal ‘ab©' is a varchar literal, it always associates
with a codepage (string literal always use current database's collation).
Suppose char © is not in the code
page, you will get a question mark (?) when inserting the literal into a
table, even the column is nvarchar
type.
3.
Use nvarchar instead of nchar. The different between nvarchar and
nchar is
the storage. A column with nchar(30) type
always take 60 bytes to store on the disk, even the value is a single
character. The data size for a nvarchar(30) type column
is not fixed, it varies row by row or
value by value. A single character value
takes 2 bytes to store, and a value with 30 characters long takes 60
bytes to be stored. Another different between nvarchar and nchar
type is the performance. Nchar types
always stored in fixed location in every row, which can be retrieved
faster
than nvarchar type which is stored in different location for different
row. However, I believe the benefit of
less stored space for nvarchar types usually overcomes the cost of
locating the
value in a row.
4. If possible, avoid using a column collation which is different with the database's collation. You will have
less collation conflict. If you want
a query use a special collation's sorting rule, use explicit collate clause in that query.
5. Use Windows Collation
instead of SQL Collation. The only
exception here is the default collation for en_US locale which is
sql_latin1_general_cp1_ci_as.
6.
Never store UTF-8 string in varchar types, you
will get data corruption.
7.
Keep in mind, string comparison always ignore
trailing spaces (Unicode U+0020), no matter what collation you are using.
8.
Keep in mind, LEN function always return the
number of characters exclude the trailing spaces. DataLength function returns the storage size in term of bytes.
9.
using _BIN2 collation, instead of _BIN collation if you want binary, code page
base string comparison behavior.
|
-
Today, I will give
an example of different implementation of a string function. Let us look at function:
SUBSTRING(value_expression
,start_expression , length_expression )
In here, I proposed
four kinds of alternative functions by giving them different name to distinct
the functionality of them.
1.
USC2-SUBSTRING. This function has the same functionality as the SUBSTRING
function.
2.
UTF16-SUBSTRING: This function take UTF-16 string as
input, return type is also UTF-16 string
3.
UTF8-SUBSTRING: This function take UTF-8 string as input,
return type is also UTF-8 string
4.
CHAR_SUBSTRING: This function is a linguistic
version of substring, it wouldn't break the character in the given language,
and the encoding can be either UCS-2/UTF-8 or UTF16.
For the length semantic, I will assume that it is character
length with respected to the corresponding Encoding. I will use following table
to show the differences between these functions when calling function:
Declare
@a nvarchar(10)
select substring(@a, 5, 2)
, suppose the 10 here represents 10
characters
|
Function Name
|
Input
Encoding
|
Output
Encoding
|
Output Type
|
Output Buffer Bytes
|
Surrogate
Aware
|
Collation
Sensitive
|
Algorithm
Complexity
|
|
USC2-SUBSTRING
|
UCS-2
|
UCS-2
|
Nvarchar(2)
|
4
|
No
|
No
|
Simple, and
Fast
|
|
UTF16-SUBSTRING
|
UCS-16
|
UCS-16
|
Nvarchar(2)
|
8
|
Yes
|
No
|
Always scan
from the beginning, need to handle surrogate pair.
|
|
UTF8-SUBSTRING
|
UTF-8
|
UTF-8
|
Nvarchar(2)
|
8
|
Yes
|
No
|
Always scan
from the beginning. See (1)
|
|
CHAR_SUBSTRING
|
Variance
|
=input
|
5, see (2)
|
Depends on encoding
|
Yes, see (3)
|
Yes
|
More complex.
See (4)
|
(1)
For
UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and
call UTF-16 version of the algorithm.
But it will have double conversion since the result need to convert back
to UTF-8.
(2)
It is unknown for how many bytes one linguistic
character will take until the run-time.
So we have to make the result type = input_length- input_index
(3)
A surrogate pair is always treated as one
linguistic character, just it is surrogate aware
(4)
The function itself will be pretty complex. For
looking for one character, it needs search deeper and the move the pointer back.
In addition, I like to give some
alternative solution for UTF16_SUBSTRING:
1)
We call UCS2-SUBSTRING internally, and the last
character it return a leading surrogate character, just removing such character
from the result. The advantage of this approach is that it is fast, and it
wouldn't break surrogate pairs. The drawback
is that it doesn't respect UTF-16 length semantic, which will lead programming
issues.
2)
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(2), and here the 2 means that four UCS-2 characters. I.e., we keep the existing return type, and
semantic as the same. Again, the result
can only hold one UTF-16 character instead of 2.
3)
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(4), and here the 4 means that four UCS-2 characters. I.e., we obey the UTF-16 length semantic for
input parameter, but not for output. In
these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and
UCS-16 encoding. But we still break the
return type from nvarchar(2) to nvarchar(4).
Finally, I like to point out that we have CLR examples of Supplementary-Aware
(Supplementary character refer to a Surrogate Pair) String Manipulation at here. It internally uses StringInfo.ParseCombiningCharacters
Method. It not only handles Supplementary
characters, but according to the API, it handles other combining character
as well. Note, the different between this method and the CHAR_SUBSTRING I
presented here is that there are more language specific rules or character sequences
which cannot be handled by the C# method.
|
|
|
|