I thought I would explain a bit more about how surrogates work in Unicode, since this does not seem to be described very well in many places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):

Ok, it is all as clear as mud now, right? :-)

The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.

(Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)

Let's see if we can't do something about that....

(Warning: some MATH content ahead!)

We start with the Basic Multilingual Plane -- it is the code points from U+0000 to U+FFFF. Some of these code points are assigned, and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 10000 hex (0x10000) or just 2^16 code points. Whatever you find easiest, conceptually.

Now what happens with the high surrogate code points (U+d800 - U+dbff) is that the block of 1024 of them is divided into 16 blocks of 64 each, and each one of those blocks is used for a plane:

  • U+d800 - U+d83f (Plane 1, Supplementary Multilingual Plane)
  • U+d840 - U+d87f (Plane 2, Supplementary Ideographic Plane)
  • U+d880 - U+d8bf (Plane 3, Reserved)
  • U+d8c0 - U+d8ff (Plane 4, Reserved)
  • U+d900 - U+d93f (Plane 5, Reserved)
  • U+d940 - U+d97f (Plane 6, Reserved)
  • U+d980 - U+d9bf (Plane 7, Reserved)
  • U+d9c0 - U+d9ff (Plane 8, Reserved)
  • U+da00 - U+da3f (Plane 9, Reserved)
  • U+da40 - U+da7f (Plane 10, Reserved)
  • U+da80 - U+dabf (Plane 11, Reserved)
  • U+dac0 - U+daff (Plane 12, Reserved)
  • U+db00 - U+db3f (Plane 13, Reserved)
  • U+db40 - U+db7f (Plane 14, Supplementary Special-purpose Plane)
  • U+db80 - U+dbbf (Plane 15, Supplementary Private Use Area A)
  • U+dbc0 - U+dbff (Plane 16, Supplementary Private Use Area B)
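The list above boils down to a little integer arithmetic. Here is a minimal sketch in Python, assuming nothing beyond the ranges in the list; the names plane_for_high_surrogate and high_surrogate_block are made up for illustration and are not part of any library.

```python
HIGH_START = 0xD800  # first high surrogate
HIGH_END = 0xDBFF    # last high surrogate
BLOCK_SIZE = 64      # high surrogates set aside per supplementary plane


def plane_for_high_surrogate(high: int) -> int:
    """Return the plane number (1-16) that a high surrogate leads into."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"U+{high:04X} is not a high surrogate")
    return (high - HIGH_START) // BLOCK_SIZE + 1


def high_surrogate_block(plane: int) -> tuple[int, int]:
    """Return the (first, last) high surrogate used for a supplementary plane."""
    if not 1 <= plane <= 16:
        raise ValueError("supplementary planes are numbered 1 through 16")
    first = HIGH_START + (plane - 1) * BLOCK_SIZE
    return first, first + BLOCK_SIZE - 1
```

For example, plane_for_high_surrogate(0xDB40) gives 14, and high_surrogate_block(2) gives (0xD840, 0xD87F), matching the list.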

By convention, the last two code points of each plane, U+[##]FFFE and U+[##]FFFF, are set aside and never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
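If you want to spot these two code points at the end of each plane, a bit mask is enough. This is only a sketch, and is_plane_end_noncharacter is a hypothetical name of my own. It covers just the U+[##]FFFE and U+[##]FFFF values described here, not the separate U+FDD0 - U+FDEF range that Unicode also sets aside.

```python
def is_plane_end_noncharacter(code_point: int) -> bool:
    """True for U+[##]FFFE and U+[##]FFFF in any of the 17 planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Clearing the lowest bit folds xxFFFF onto xxFFFE, and the mask
    # ignores the plane number in the upper bits.
    return (code_point & 0xFFFE) == 0xFFFE
```

So 0xFFFE, 0x1FFFF, and 0x10FFFE all come back True, while 0xFFFD comes back False.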

Now the way things are numbered, each high surrogate is used, serially, combining with every one of the 1024 possible low surrogates before moving on to the next high surrogate. Thus for supplementary characters you see the following order:

  • U+d800 U+dc00 -> U+10000
  • U+d800 U+dc01 -> U+10001
  • U+d800 U+dc02 -> U+10002
  • ...
  • U+d800 U+dffd -> U+103fd
  • U+d800 U+dffe -> U+103fe
  • U+d800 U+dfff -> U+103ff
  • U+d801 U+dc00 -> U+10400
  • U+d801 U+dc01 -> U+10401
  • U+d801 U+dc02 -> U+10402
  • ...
  • U+dbff U+dffd -> U+10fffd
  • U+dbff U+dffe -> U+10fffe
  • U+dbff U+dfff -> U+10ffff

(I skipped some spaces in there for obvious reasons!)
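The ordering in that list follows from one formula: each step of the high surrogate is worth 1024 (0x400) code points, each step of the low surrogate is worth one, and the whole thing starts at U+10000. A minimal Python sketch, where decode_surrogate_pair is a name I picked for illustration:

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into a supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"U+{high:04X} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"U+{low:04X} is not a low surrogate")
    # Each high surrogate spans 1024 code points; the low surrogate
    # picks one of them. Supplementary code points begin at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Feeding it the pairs from the list gives the same results: (0xD800, 0xDFFF) yields 0x103FF, (0xD801, 0xDC00) yields 0x10400, and (0xDBFF, 0xDFFF) yields 0x10FFFF.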

This mechanism allows for many things, such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
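To give a feel for how simple the range checks and the conversion are, here is a rough Python sketch. The function names are my own for illustration, not the names of any existing macros or library calls.

```python
def is_high_surrogate(unit: int) -> bool:
    """Range check: U+D800 through U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """Range check: U+DC00 through U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 through U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000  # a 20-bit value
    # Top 10 bits select the high surrogate, bottom 10 bits the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

For instance, encode_surrogate_pair(0x10401) gives (0xD801, 0xDC01).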

When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):

So when you combine the BMP's 2^16 code points with the 16 supplementary planes of 64 * 1024 code points each (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
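For anyone who would rather check the arithmetic than trust it, here it is spelled out in a few lines of Python. The constant names are just labels I chose.

```python
HIGH_PER_PLANE = 64       # high surrogates per supplementary plane
LOW_SURROGATES = 1024     # every high surrogate pairs with each of these
SUPPLEMENTARY_PLANES = 16

PLANE_SIZE = HIGH_PER_PLANE * LOW_SURROGATES       # 65,536, i.e. 2^16
TOTAL = (1 + SUPPLEMENTARY_PLANES) * PLANE_SIZE    # BMP plus 16 planes

print(PLANE_SIZE == 2 ** 16)    # True
print(f"{TOTAL:,}")             # 1,114,112
print(hex(TOTAL - 1))           # 0x10ffff, the last code point
```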

Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.

And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:

I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)

(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)

Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....

 

This post brought to you by all of the supplementary planes of Unicode