Saturday, September 24, 2005 12:01 AM
Michael S. Kaplan
The basics of supplementary
I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem very well described in a whole lot of places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):
- Basic Multilingual Plane. Plane 0, abbreviated as BMP.
- High-Surrogate Code Point. A Unicode code point in the range U+D800 to U+DBFF. (See definition D25 in Section 3.8, Surrogates.)
- High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D25a in Section 3.8, Surrogates.)
- Leading Surrogate. Synonym for high-surrogate code unit.
- Low-Surrogate Code Point. A Unicode code point in the range U+DC00 to U+DFFF. (See definition D26 in Section 3.8, Surrogates.)
- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D26a in Section 3.8, Surrogates.)
- Plane. A range of 65,536 (10000₁₆) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (10000₁₆). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (10₁₆) is U+100000..U+10FFFF. (Note that ISO/IEC 10646 uses hexadecimal notation for the plane numbers; for example, Plane B instead of Plane 11.) (See Basic Multilingual Plane and supplementary planes.)
- Private Use. Refers to designated code points in the Unicode Standard or other character encoding standards whose interpretations are not specified in those standards and whose use may be determined by private agreement among cooperating users.
- Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D12 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
- Reserved. Refers to undesignated code points, which are set aside for future standardization. (See Section 2.4, Code Points and Characters.)
- Supplementary Character. A Unicode encoded character having a supplementary code point.
- Supplementary Code Point. A Unicode code point between U+10000 and U+10FFFF.
- Supplementary Ideographic Plane. Plane 2, abbreviated as SIP.
- Supplementary Multilingual Plane. Plane 1, abbreviated as SMP.
- Supplementary Special-purpose Plane. Plane 14, abbreviated as SSP.
- Supplementary Planes. Planes 1 through 16, consisting of the supplementary code points.
- Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term. [I talk about this issue here]
- Surrogate Code Point. A Unicode code point in the range U+D800 through U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.
- Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D27 in Section 3.8, Surrogates.)
- Trailing Surrogate. Synonym for low-surrogate code unit.
- Unassigned. Code points that either are reserved for future use or are never to be used.
- Unassigned Character. Synonym for not assigned to an abstract character. This refers to surrogate code points, noncharacters, and reserved code points. (See Section 2.4, Code Points and Characters.)
- Unassigned Code Point. (See undesignated code point.)
- Undesignated Code Point. Synonym for reserved code point. These code points are reserved for future assignment and have no other designated normative function in the standard. (See Section 2.4, Code Points and Characters.)
- Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D28 in Section 3.9, Unicode Encoding Forms.)
Ok, it is all as clear as mud now, right? :-)
The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.
(Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)
Let's see if we can't do something about that....
(Warning: some MATH content ahead!)
We start with the Basic Multilingual Plane -- it covers the code points from U+0000 to U+FFFF. Some of these code points are assigned, and a large subset of those are assigned to characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 10000₁₆ or just 2¹⁶ code points, whichever you find easiest conceptually.
Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each, and each one of those blocks is used for a plane:
- U+d800 - U+d83f (Plane 1, Supplementary Multilingual Plane)
- U+d840 - U+d87f (Plane 2, Supplementary Ideographic Plane)
- U+d880 - U+d8bf (Plane 3, Reserved)
- U+d8c0 - U+d8ff (Plane 4, Reserved)
- U+d900 - U+d93f (Plane 5, Reserved)
- U+d940 - U+d97f (Plane 6, Reserved)
- U+d980 - U+d9bf (Plane 7, Reserved)
- U+d9c0 - U+d9ff (Plane 8, Reserved)
- U+da00 - U+da3f (Plane 9, Reserved)
- U+da40 - U+da7f (Plane 10, Reserved)
- U+da80 - U+dabf (Plane 11, Reserved)
- U+dac0 - U+daff (Plane 12, Reserved)
- U+db00 - U+db3f (Plane 13, Reserved)
- U+db40 - U+db7f (Plane 14, Supplementary Special-purpose Plane)
- U+db80 - U+dbbf (Plane 15, Supplementary Private Use Area A)
- U+dbc0 - U+dbff (Plane 16, Supplementary Private Use Area B)
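The table above is just integer division by 64, so it can be sketched in a few lines of Python. The function and constant names here are made up for illustration and do not come from any particular library:

```python
HIGH_START = 0xD800  # first high-surrogate code unit
HIGH_END = 0xDBFF    # last high-surrogate code unit
BLOCK_SIZE = 64      # high surrogates set aside per supplementary plane


def plane_of_high_surrogate(unit: int) -> int:
    """Return the plane (1..16) that a high-surrogate code unit leads into."""
    if not HIGH_START <= unit <= HIGH_END:
        raise ValueError(f"{unit:#06x} is not a high surrogate")
    return (unit - HIGH_START) // BLOCK_SIZE + 1


def high_surrogates_for_plane(plane: int) -> range:
    """Return the 64 high-surrogate code units used for one supplementary plane."""
    if not 1 <= plane <= 16:
        raise ValueError("supplementary planes are numbered 1 through 16")
    first = HIGH_START + (plane - 1) * BLOCK_SIZE
    return range(first, first + BLOCK_SIZE)


# Reproduce the list above, one line per plane.
for plane in range(1, 17):
    block = high_surrogates_for_plane(plane)
    print(f"U+{block[0]:04x} - U+{block[-1]:04x} (Plane {plane})")
```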
By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned to characters (the standard calls them noncharacters). This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving on to the next high surrogate. Thus for supplementary characters you see the following order:
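Because the two reserved values sit at the very end of every plane, a test for them only needs the low 16 bits of the code point. Here is a minimal sketch in Python; the function name is hypothetical, and the check deliberately covers only the plane-end values described here (Unicode also sets aside U+FDD0..U+FDEF, which this does not test for):

```python
def is_plane_end_noncharacter(code_point: int) -> bool:
    """True for U+xxFFFE and U+xxFFFF in any of the 17 planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Clearing the lowest bit maps both FFFE and FFFF onto FFFE.
    return (code_point & 0xFFFE) == 0xFFFE


# Two such code points per plane, 17 planes.
count = sum(is_plane_end_noncharacter(cp) for cp in range(0x110000))
print(count)
```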
- U+d800 U+dc00 -> U+10000
- U+d800 U+dc01 -> U+10001
- U+d800 U+dc02 -> U+10002
- ...
- U+d800 U+dffd -> U+103fd
- U+d800 U+dffe -> U+103fe
- U+d800 U+dfff -> U+103ff
- U+d801 U+dc00 -> U+10400
- U+d801 U+dc01 -> U+10401
- U+d801 U+dc02 -> U+10402
- ...
- U+dbff U+dffd -> U+10fffd
- U+dbff U+dffe -> U+10fffe
- U+dbff U+dfff -> U+10ffff
(I skipped some spaces in there for obvious reasons!)
This mechanism allows for many things, such as simple range checking and easy conversions between a code point and a surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
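To make that conversion concrete, here is a rough sketch in Python rather than as a C macro. The function names are my own for illustration; the arithmetic is just the serial numbering shown above (subtract 10000₁₆, then split the remaining 20 bits into two groups of 10):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)      # top 10 bits pick the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits pick the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10400)
print(f"U+{high:04x} U+{low:04x}")                # U+d801 U+dc00
print(f"U+{from_surrogate_pair(0xDBFF, 0xDFFF):x}")  # U+10ffff
```

The range checks at the top of each function are the "simple range checking" part: two comparisons are enough to tell a high surrogate from a low one.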
When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):
- U+d800 -- contains Aegean Numbers, Linear B Syllabary, Linear B Ideograms, Ancient Greek Numbers, Old Italic, Gothic, Ugaritic, and Old Persian.
- U+d801 -- contains Deseret, Shavian, Osmanya
- U+d802 -- contains Cypriot
- U+d834 -- contains Byzantine Musical Symbols, Musical Symbols
- U+d835 -- contains Math Alphanumerics
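Each high surrogate in that list stands for a run of exactly 1024 consecutive code points, so the range it covers can be worked out directly. A small sketch in Python, with a hypothetical function name:

```python
def code_points_for_high_surrogate(high: int) -> tuple[int, int]:
    """Return the (first, last) code points reachable through one high surrogate."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    first = 0x10000 + ((high - 0xD800) << 10)  # paired with low surrogate DC00
    return first, first + 0x3FF                # paired with low surrogate DFFF


for high in (0xD800, 0xD801, 0xD802, 0xD834, 0xD835):
    first, last = code_points_for_high_surrogate(high)
    print(f"U+{high:04x} covers U+{first:x}..U+{last:x}")
```

So U+d834 covers U+1d000..U+1d3ff and U+d835 covers U+1d400..U+1d7ff, which is where the musical symbol and mathematical alphanumeric blocks live.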
So when you combine the BMP's 2¹⁶ code points with the 16 planes of 64 * 1024 (which is also 2¹⁶ code points!), you get 17 * 2¹⁶ or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
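For anyone who would rather check the arithmetic than trust it, the sum fits in a few lines of Python (the constant names are just labels for this sketch):

```python
PLANE_SIZE = 2 ** 16             # code points in any one plane
HIGH_PER_PLANE = 64              # high surrogates set aside per supplementary plane
LOW_SURROGATES = 1024            # every high surrogate pairs with all of these
SUPPLEMENTARY_PLANES = 16

# 64 high surrogates times 1024 low surrogates fills exactly one plane.
per_plane = HIGH_PER_PLANE * LOW_SURROGATES

# The BMP plus the 16 supplementary planes.
total = PLANE_SIZE + SUPPLEMENTARY_PLANES * per_plane

print(per_plane == PLANE_SIZE)   # True
print(f"{total:,}")              # 1,114,112
print(hex(total - 1))            # 0x10ffff, the last code point
```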
Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.
And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:
I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)
Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....
This post brought to you by all of the supplementary planes of Unicode