In my previous post, I mentioned that Unicode compression will be available in the next SQL Server release, SQL Server 2008 R2. You can download the CTP2 version from http://www.microsoft.com/sqlserver/2008/en/us/r2.aspx to try it out. This is an exciting new addition to the data compression offering.
Let me describe the problem and how it has been addressed in SQL Server.
Problem: As more businesses develop a global customer base, applications are being developed or modified to use Unicode data types such as NCHAR and NVARCHAR. SQL Server uses the UCS-2 encoding scheme, which takes two bytes of storage per character regardless of the locale. For example, a character in the ASCII character set needs only 1 byte of storage, but when stored as NCHAR it takes 2 bytes, with the most significant byte being 0. In fact, most European languages need only 1 byte per character. When an application is either converted or written to use Unicode data types, it can, depending on the size of the strings, increase storage requirements significantly.
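To make the wasted byte concrete, here is a minimal Python sketch. It assumes that Python's "utf-16-le" codec is a fair stand-in for how UCS-2 lays out characters in the Basic Multilingual Plane; the helper name ucs2_bytes and the sample string are made up for illustration and say nothing about SQL Server's on-disk format.

```python
def ucs2_bytes(text: str) -> bytes:
    """Encode text as little-endian UTF-16, which matches UCS-2 for BMP characters."""
    return text.encode("utf-16-le")


sample = "SQL"                    # plain ASCII text (hypothetical sample value)
encoded = ucs2_bytes(sample)
high_bytes = encoded[1::2]        # every second byte is the most significant byte

print(len(sample), "characters ->", len(encoded), "bytes")
print("high bytes all zero:", all(b == 0 for b in high_bytes))
```

Every ASCII character occupies two bytes here, and the second byte of each pair carries no information.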
Solution: SQL Server implements Unicode compression using SCSU (Standard Compression Scheme for Unicode), a standard compression scheme for Unicode data (http://unicode.org/reports/tr6/). The following table shows the compression achieved with SCSU and compares it with UTF-8, because both DB2 and Oracle have implemented UTF-8 encoding for Unicode data. You will notice that for English (i.e. ASCII) we achieve 50% compression; in other words, each character takes 1 byte instead of 2. The same is true for the German locale. For the Korean locale, on the other hand, SCSU takes 2 bytes per character, so there are no compression savings. The compression achieved using SCSU is very comparable to UTF-8.
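For a rough feel of the UCS-2 versus UTF-8 side of that comparison, the sketch below counts bytes for three short strings. Python's standard library has no SCSU codec, so SCSU sizes are not computed here; the sample words and the storage_sizes helper are assumptions chosen for illustration, and "utf-16-le" again stands in for UCS-2.

```python
# Hypothetical sample words; escapes keep the code points unambiguous.
samples = {
    "English": "compression",
    "German": "Gr\u00f6\u00dfe",     # "Groesse" written with o-umlaut and sharp s
    "Korean": "\uc555\ucd95",        # two Hangul syllables
}


def storage_sizes(text: str):
    """Return (UCS-2 bytes, UTF-8 bytes) for a string of BMP characters."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))


for locale, text in samples.items():
    ucs2, utf8 = storage_sizes(text)
    print(f"{locale}: {len(text)} chars, UCS-2 {ucs2} bytes, UTF-8 {utf8} bytes")
```

In this sketch the ASCII string halves in size under UTF-8, the German string shrinks somewhat less because its two non-ASCII letters take 2 bytes each, and the Hangul string grows because each syllable takes 3 bytes in UTF-8.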
Unicode compression is enabled or disabled as part of ROW or PAGE compression. The implications are:
1. Customers who are already using ROW or PAGE compression will get the benefit of Unicode compression, but they will need to rebuild the index or table after upgrading to SQL Server 2008 R2. Interestingly, Unicode compression kicks in when a new row is inserted or an existing Unicode value is modified, even if the corresponding compressed index or table is not rebuilt immediately after the upgrade.
2. Existing scripts used for data compression can be used without modification.
If an NCHAR or NVARCHAR string does not yield any savings with Unicode compression, as in the Korean locale example, Unicode compression is not applied. In this case, the overhead is paid at compression time, to determine whether Unicode compression saves space, but not when the value is read. As before, no application changes are needed to take advantage of Unicode compression.
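The "compress only if it helps" rule can be sketched as follows. This is not SQL Server's implementation and not real SCSU: toy_compress is a hypothetical stand-in that simply drops the high byte when every high byte is zero, and store_value and read_value are made-up names. The point is only the control flow: try compressing on write, keep the result only if it is smaller, and do no extra work when reading an uncompressed value.

```python
from typing import Tuple


def toy_compress(raw: bytes) -> bytes:
    """Hypothetical stand-in for SCSU: drop high bytes when all of them are zero."""
    if all(b == 0 for b in raw[1::2]):
        return raw[0::2]
    return raw  # this toy scheme cannot shrink anything else


def store_value(text: str) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload); compression is kept only if it saves space."""
    raw = text.encode("utf-16-le")
    packed = toy_compress(raw)      # the trial cost is paid here, at write time
    if len(packed) < len(raw):
        return True, packed
    return False, raw


def read_value(is_compressed: bool, payload: bytes) -> str:
    """Decode a stored value; uncompressed values need no extra work."""
    if is_compressed:
        return payload.decode("latin-1")   # one byte per code point below 256
    return payload.decode("utf-16-le")


for text in ("abc", "\uc555\ucd95"):
    flag, payload = store_value(text)
    print(flag, len(payload), read_value(flag, payload) == text)
```

The ASCII value is stored compressed in 3 bytes, while the Hangul value is stored as-is in 4 bytes with the flag cleared.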
At this time, SQL Server does not support Unicode compression for the NVARCHAR(MAX) type, including in-row values, or for NTEXT. This is something we will consider in the future. In my next post, I will provide an example using the AdventureWorksDW database to show the compression benefits achieved with the new, improved ROW compression.