For “Binary” collation, we use a sorting algorithm based on the binary sequence of the string to sort the string data. Now, let us look the binary sorting algorithm. For varchar type, the algorithm is quite simple; the sorting result of a set of varchar values with binary collation is the same as we sort them use the binary sequence. In T-SQL syntax:
Select * from T order by varcharC
Equals to
Select * from T order by convert(varbinary(8000), varcharC)
Note, it is always true for both BIN collation and BIN2 collation.
Now, let us look at sorting the nvarchar values. In this case, we can not sort them purely use the binary sequence, the reason is that the binary sequence represents a sequence of UCS-2 characters, which take two byte per character. Because SQL Server internally uses Little-Endian to store the UCS-2 character, the binary sorting may result undesired result. For example, character a (We usually use U+0061 to indicate its Unicode Code Point assigned by Unicode Org) is encoded as 0x6100 in SQL Server, so If we sort these characters in their binary sequence order, we will not be able to get a Unicode Code Point ordered sorting result which is desired for many application. In SQL Server’s BIN2 collation, we sort the nvarchar type according to their Unicode Code Points instead of the binary sequence (internally, we do comparison per WCHAR based, which is 2 bytes). Now, comes the BIN Collation, the algorithm is wired. It first compares two characters based on the first WCHAR values of the two characters, if the values are equal, the algorithm will do binary based comparison for the reminding binary sequences, i.e., byte per byte comparison.
Hope your guys can follow my writing. If not, you can simply assume that BIN2 is a Unicode Code Point based sorting algorithm, and it is desired by many customer. But BIN is not.
From the sorting algorithm’s view, all BIN collation use the same algorithm and all binary2 collation use another algorithm. Then why we have latin1_general_bin or Japanese_bin collation? The reason is that they use different code page for encoding the characters sorting in the varchar type. A Latin1_General_Bin collation can not hold any Japanese collation, so even they use the same algorithm to sort the data, but people still need to choose the language which the data will stored, such as English, Chinese, Japanese etc. Note, this only apply to varchar data type. For nvarchar data types, these collations have no difference at all.
On the other hand, SQL Server has many linguistic collations which sort string linguistically using the language associated with the collation. All collations which are not binary collation are linguistic collations. For example, Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm compatible with several of English language and many Western European languages. Please don’t confuse with the name of Latin1_General, it actually can sort all Unicode characters defined in Unicode 3.2 characters and it can also sort many other languages correctly as well (if the language has no sorting conflict with the latin1_general sorting rule). I will start another article to describe the linguistic collation in deep details.
As a conclusion, binary collations have better performance than linguistic collations, and that is the main advantage of using a binary collation. A binary collation is always case sensitive and assent sensitive. User should try use BIN2 collation instead of BIN collation.