In SQL Server, each string type, such as varchar, nvarchar, text, ntext associate with a collation no matter whether you specified or not during creation time.   If your app just works fine and you never deal with collation explicitly or you even don't know collation, Congratulation! You are really lucky as most of our customers do.  However, as soon as you hit collation issues in your app, you will find the collation in SQL Server is really messy.   I helped several customers to solving their collation issues, and I like to share some knowledge to your guys in a serie of blogs. In this article, I will try to cover the basic concept of collations.  Before I start this, I recommend your guys read a great article: International Features in Microsoft SQL Server 2000 by Michael Kaplan. 

In this post, I will use varchar type to represent SQL Server's non-Unicode character data types, which are char, varchar, varchar(max) and text, and use nvarchar type to represent SQL Server's Unicode character data types, which are nchar, nvarchar, nvarchar(max) and ntext type. So Let us start on what is collation? In SQL Server, a collation is an attribute/property associate with a string type.  It specifies  

  • How to encode strings into byte sequence and store into a varchartype? Each collation has a codepage, which was used for this purpose.
  • The language of the strings stored in the character type, such as us-English, Chinese, etc. Each collation has a Language Identifiers to identify the language in a country.
  • How to sort and compare character data.

For example, there is a collation Chinese_prc_stroke_90_ci_as introduced in SQL Server 2005.   It use code page 936 to encode varchar types, and the language is Simplified Chinese (LCID 0x0804, which is mainly used by Mainland China).  The Chinese have two different sorting methods, one is based on the pronunciation, and another is based the stroke order (A stroke is a movement of a writing instrument, and the stroke order refers to the correct order in which the strokes of a Chinese character are written.

Wait a minute, what is the "90", and "ci_as" mean?  The 90 is the collation version.  We introduce new collation in each release of SQL Server, the 90 level collations are introduced in SQL Server 2005 (the internal version is 9.0).  If you see a collation has name of _100_, which means that it was introduced in SQL Server 2008 (internal version is 10.0).   If the version is missing, which means that it was introduced in SQL Server 2000 or prior release.  "ci_as" is called comparison flag.  Sometime, we want slightly different comparison behavior, such as case insensitive, etc, that is the "ci_as" part comes into play.  I am not going to discuss details about the comparison flags, you can find the information from  Windows Collation Sorting Styles 

Let us end today's topic by answering following question: how can I get a list of collations supported by SQL Server and the properties of a collation.  You can find the list of collations from BOL "Collation Settings in Setup" or function fn_helpcollations.  You can get the details about one collation by calling function CollationProperty.  Tomorrow, I will discuss different types of collations, it will be a very interesting topic since your guys will be confused by these types.