The basics of supplementary

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem very well described in a whole lot of places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):

Ok, it is all as clear as mud now, right? :-)

The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.

(Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)

Let's see if we can't do something about that....

(Warning: some MATH content ahead!)

We start with the Basic Multilingual Plane -- it is the code points from U+0000 to U+FFFF. Some of these code points are assigned; and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 0x10000 (10000 in hexadecimal) or just 2^16 code points. Whatever you find easiest, conceptually.

Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each, and each one of those blocks is used for a plane:

  • U+d800 - U+d83f (Plane 1, Supplementary Multilingual Plane)
  • U+d840 - U+d87f (Plane 2, Supplementary Ideographic Plane)
  • U+d880 - U+d8bf (Plane 3, Reserved)
  • U+d8c0 - U+d8ff (Plane 4, Reserved)
  • U+d900 - U+d93f (Plane 5, Reserved)
  • U+d940 - U+d97f (Plane 6, Reserved)
  • U+d980 - U+d9bf (Plane 7, Reserved)
  • U+d9c0 - U+d9ff (Plane 8, Reserved)
  • U+da00 - U+da3f (Plane 9, Reserved)
  • U+da40 - U+da7f (Plane 10, Reserved)
  • U+da80 - U+dabf (Plane 11, Reserved)
  • U+dac0 - U+daff (Plane 12, Reserved)
  • U+db00 - U+db3f (Plane 13, Reserved)
  • U+db40 - U+db7f (Plane 14, Supplementary Special-purpose Plane)
  • U+db80 - U+dbbf (Plane 15, Supplementary Private Use Area A)
  • U+dbc0 - U+dbff (Plane 16, Supplementary Private Use Area B)
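Since each plane owns a run of exactly 64 high surrogates, the table above boils down to a shift. Here is a minimal sketch of that idea; the function name plane_of_high_surrogate is made up for illustration and is not from any SDK:

```c
#include <stdint.h>

/* Hypothetical helper (name invented for this sketch): returns the plane
 * number (1..16) that a high surrogate belongs to, or 0 if the value is
 * not in the high surrogate range U+D800..U+DBFF. */
unsigned plane_of_high_surrogate(uint16_t hs)
{
    if (hs < 0xD800 || hs > 0xDBFF)
        return 0;
    /* 64 high surrogates per plane, so divide the offset by 64 (>> 6);
     * plane numbering for surrogates starts at 1 because plane 0 is the BMP. */
    return ((unsigned)(hs - 0xD800) >> 6) + 1;
}
```

For example, U+db40 is offset 0x340 into the block, 0x340 / 64 is 13, and so it lands in Plane 14, matching the list.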

By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
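Spotting those two reserved code points at the end of each plane is just a mask on the low 16 bits. A rough sketch, with a function name (is_plane_end_reserved) that is my own invention rather than anything from a real library:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp is U+[##]FFFE or U+[##]FFFF in any of
 * the 17 planes. Values above U+10FFFF are not code points at all, so
 * they are rejected. */
bool is_plane_end_reserved(uint32_t cp)
{
    /* Masking with 0xFFFE ignores the lowest bit, so FFFE and FFFF both match. */
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}
```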

Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving onto the next high surrogate. Thus for supplementary characters you see the following order:

  • U+d800 U+dc00 -> U+10000
  • U+d800 U+dc01 -> U+10001
  • U+d800 U+dc02 -> U+10002
  • ...
  • U+d800 U+dffd -> U+103fd
  • U+d800 U+dffe -> U+103fe
  • U+d800 U+dfff -> U+103ff
  • U+d801 U+dc00 -> U+10400
  • U+d801 U+dc01 -> U+10401
  • U+d801 U+dc02 -> U+10402
  • ...
  • U+dbff U+dffd -> U+10fffd
  • U+dbff U+dffe -> U+10fffe
  • U+dbff U+dfff -> U+10ffff

(I skipped some spaces in there for obvious reasons!)
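The ordering in that list falls out of simple arithmetic: the high surrogate's offset counts in steps of 1024, the low surrogate's offset fills in the rest, and the whole thing starts at U+10000. A minimal sketch, assuming the caller has already checked that both values are in range (the name decode_surrogate_pair is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: combines a high and a low surrogate into the
 * supplementary code point they encode. Assumes hs is in U+D800..U+DBFF
 * and ls is in U+DC00..U+DFFF; no validation is done here. */
uint32_t decode_surrogate_pair(uint16_t hs, uint16_t ls)
{
    uint32_t high_offset = (uint32_t)hs - 0xD800u; /* 0..1023 */
    uint32_t low_offset  = (uint32_t)ls - 0xDC00u; /* 0..1023 */
    /* Each high surrogate covers 1024 (1 << 10) consecutive code points. */
    return 0x10000u + (high_offset << 10) + low_offset;
}
```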

This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
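Going the other way, from code point to surrogate pair, might look something like the sketch below. The function names are invented for illustration, and returning 0 for out-of-range input is just one possible choice:

```c
#include <stdint.h>

/* Hypothetical helpers: split a supplementary code point
 * (U+10000..U+10FFFF) into its high and low surrogates.
 * Both return 0 when cp is outside that range. */
uint16_t high_surrogate_of(uint32_t cp)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    /* The top 10 bits of the 20-bit offset select the high surrogate. */
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

uint16_t low_surrogate_of(uint32_t cp)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    /* The bottom 10 bits of the 20-bit offset select the low surrogate. */
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For instance, U+10400 has offset 0x400, so the high surrogate is U+d801 and the low surrogate is U+dc00.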

When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):

So when you combine the BMP's 2^16 code points with the 16 planes of 64 * 1024 (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
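If you would rather let the compiler do the counting, the same arithmetic can be written out as constants. This is only an illustration; all of the names are made up:

```c
#include <stdint.h>

/* Hypothetical constants that restate the arithmetic in the text. */
enum {
    PLANE_SIZE                = 1 << 16, /* 65,536 code points per plane   */
    HIGH_SURROGATES_PER_PLANE = 64,      /* one block of the 16 blocks     */
    LOW_SURROGATES            = 1024,    /* U+DC00..U+DFFF                 */
    SUPPLEMENTARY_PLANES      = 16       /* planes 1 through 16            */
};

/* BMP plus the 16 supplementary planes. */
uint32_t total_code_points(void)
{
    return (uint32_t)(1 + SUPPLEMENTARY_PLANES) * PLANE_SIZE;
}
```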

Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.

And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:

I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
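For anyone not on Windows, here is a portable sketch of the kind of range checks involved. These are my own hypothetical functions, written from the surrogate ranges described above; they are not copied from winnls.h and make no claim about how the SDK defines its macros:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers based only on the surrogate ranges:
 * high surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF. */
bool is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

bool is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/* A well-formed pair is a high surrogate immediately followed by a low one. */
bool is_surrogate_pair(uint16_t first, uint16_t second)
{
    return is_high_surrogate(first) && is_low_surrogate(second);
}
```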

(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)

Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....

 

This post brought to you by all of the supplementary planes of Unicode

Blog - Comment List
  • OK, so now I'm only a little confused.

    Why can't I use the term 'Surrogate Character' to refer to a character which is encoded as a surrogate pair?

    Why didn't they use the high range of available surrogate code points for the high-surrogates and the low range for the low-surrogates? Are they intentionally trying to confuse us?

    Why did they have to use the term 'Basic Multilingual Plane' (giving us the ambiguous BMP acronym) instead of perhaps General Multilingual Plane or Basic Polylingual Plane?
  • Hi Gabe --

    For the first question, see http://blogs.msdn.com/michkap/archive/2005/07/27/444101.aspx

    For the second, see http://blogs.msdn.com/michkap/archive/2005/07/31/445850.aspx

    For the third, it is not ambiguous to Unicode people. :-)
  • Thank you for casting so much light on a murky area. This post is so good it seems churlish to go pointing out typos, but the last 3 lines of the table of supplementary characters should be:

    # U+dbff U+dffd -> U+10fffd
    # U+dbff U+dffe -> U+10fffe
    # U+dbff U+dfff -> U+10ffff
  • Not at all churlish, Simon -- and now fixed.... :-)
  • Stupid question time...

    1) OK, so this whole "surrogate code point" thing is UTF-16's way of encoding supplementary codepoints > U+FFFF? And this is one of the "private use" ranges, so there's no way to know the desired character properties of code points in this range?

    2) IS_SURROGATE_PAIR(wcH, wcL) == IS_HIGH_SURROGATE(wcH) && IS_LOW_SURROGATE(wcL)

    3) Why did Microsoft standardize on UTF-16 for the .NET framework? Wouldn't it be more space-efficient to standardize on UTF-8 for Western European locales, and UTF-16 for East Asian locales? Or would that cause interop problems for network communications across locale boundaries? Even given the relative ease of switching between UTF-8 and UTF-16 on the fly?

    4) It's kind of strange that 32 bits isn't enough. So UTF-32 really isn't a "flat map" to the Unicode code point system, because of U+10XXXX... Guess we need a UTF-33 encoding? ;)
  • Oh, I see... I was confusing U+10000 with U+100000

    There are seventeen planes (0x0 through 0x10) and only 0xf and 0x10 are specifically private-use. 0x0 is the basic plane but there are well-established characters in other planes, for example OLD ITALIC LETTER A:
    http://www.fileformat.info/info/unicode/char/010300/index.htm

    So considering the CharNext interview question:
    The UTF-16 way of encoding this particular character is with a surrogate pair. So, alas:

    It is not sufficient to unilaterally skip all surrogate pairs (as this character, among others, would be skipped)

    So the two feasible options are:
    * to unilaterally return the first element of all surrogate pairs
    * to come up with further logic to dictate when to return the first element, and when to skip

    "Unilaterally return" is a pretty attractive strategy at this point. :) This would assume that all private-use supplementary characters (in Plane F and Plane... um... "G") are "spacing" characters, which seems a fair assumption.

    And naturally, UTF-33 would not be enough... we'd need UTF-32 + 5 bits to handle the seventeen planes, to wit, UTF-37
  • UTF-32 encoded the same info as UTF-16, but in one flat plane.
  • > This mechanism allows for many things such as simple range checking and easy conversions between code point an surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).

    Hmmm... like this?

    /* Given a surrogate pair, returns the supplementary code point */
    #define SUPPLEMENTARY_CODE_POINT(H, L) \
    ( \
    /* optional checking */ \
    0xd800 <= (H) && (H) < 0xdc00 && \
    0xdc00 <= (L) && (L) < 0xe000 ? \
    /* UTF16 -> Unicode code point (un)encoding */ \
    0x10000 + ((H) - 0xd800) * 0x0400 + ((L) - 0xdc00) \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
    )

    /* Given a supplementary code point, returns the "high" surrogate pair element */
    #define SURROGATE_PAIR_HIGH(U) \
    ( \
    /* optional checking */ \
    0x10000 <= (U) && (U) < 0x110000 ? \
    /* Unicode code point -> UTF16 encoding */ \
    /* Note in this case | does not work for + */ \
    ((((U) - 0x10000) >> 0xa) + 0xd800) \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
    )

    /* Given a supplementary code point, returns the "low" surrogate pair element */
    #define SURROGATE_PAIR_LOW(U) \
    ( \
    /* optional checking */ \
    0x10000 <= (U) && (U) < 0x110000 ? \
    /* Unicode code point -> UTF16 encoding */ \
    /* Note in this case | does not work for + */ \
    (((U) - 0x10000) & 0x03ff) + 0xdc00 \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
    )

    /* Given a supplementary code point, returns the high and low surrogate code pair as an unsigned int */
    #define SURROGATE_PAIR(U) \
    ( \
    (unsigned int)SURROGATE_PAIR_HIGH(U) << 0x10 | SURROGATE_PAIR_LOW(U) \
    )
  • I would not tend to go boom -- easier to just return the original value if the return was not available....
  • >UTF-32 encoded the same info as UTF-16, but in one flat plane.

    I should really think before I post.

    Ah... one plane is 0000-FFFF - sixteen bits
    Seventeen planes - need five bits to determine the plane...

    That's only 21 bits. So UTF-32 is fine with supplementary planes.

    In fact, the first eleven bits of every UTF-32 code point are always zero... so we're only using one 2**12'th of the address space, even with all the reserved planes and whatnot...

    Ah, room to breathe :)
  • Hey, no worries. There are some people who do not even think after they post, let alone before. So you are still one up on many of them.

