There is no such thing as a surrogate character (dammit!)

Sorting it all Out
Michael Kaplan's random stuff of dubious value


The title of this post, including the parenthetical note, is something that people associated with the Unicode Standard have to tell people all the time (of course, they generally say the parenthetical part only to themselves, and really only because they have had to say the rest so many times!).

The issue is clear in both the Unicode Glossary:

Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

and the Unicode FAQ:

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).
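To make the two ranges in the FAQ answer concrete, here is a minimal Python sketch. The function names are hypothetical, chosen only for this illustration; the arithmetic is the standard UTF-16 mapping from a supplementary code point to a pair of surrogate code points.

```python
# Ranges quoted in the Unicode FAQ answer above.
SURROGATE_RANGE = range(0xD800, 0xE000)         # U+D800..U+DFFF
SUPPLEMENTARY_RANGE = range(0x10000, 0x110000)  # U+10000..U+10FFFF


def is_surrogate_code_point(cp):
    """True for a code point reserved for use in UTF-16 pairs."""
    return cp in SURROGATE_RANGE


def is_supplementary_code_point(cp):
    """True for a code point outside the BMP."""
    return cp in SUPPLEMENTARY_RANGE


def to_surrogate_pair(cp):
    """Return the (high, low) UTF-16 code units for a supplementary code point."""
    if not is_supplementary_code_point(cp):
        raise ValueError(f"U+{cp:04X} is not a supplementary code point")
    offset = cp - 0x10000                # 20 bits of payload
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```

Note that no input is ever both kinds at once: the two ranges do not overlap, which is the whole point of the FAQ answer.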

In fact, if you look to the Unicode Roadmap, each plane has its own name:

  • Plane 0: BMP (Basic Multilingual Plane)
  • Plane 1: SMP (Supplementary Multilingual Plane)
  • Plane 2: SIP (Supplementary Ideographic Plane)
  • Plane 14: SSP (Supplementary Special-purpose Plane)

They are supplementary characters, one and all. They are not surrogate characters. Truly.
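As a rough illustration, the plane of a code point is just the code point divided by 0x10000, so anything in plane 1 or higher is supplementary. The sketch below is in Python; `plane_of` and `describe` are made-up helper names, and the table holds only the four abbreviations listed above.

```python
# Abbreviations for the planes named in the list above (other planes omitted).
PLANE_ABBREVIATIONS = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP"}


def plane_of(cp):
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16  # each plane holds 0x10000 code points


def describe(cp):
    """Summarize which plane a code point is in and whether it is supplementary."""
    plane = plane_of(cp)
    kind = "supplementary" if plane > 0 else "not supplementary"
    label = PLANE_ABBREVIATIONS.get(plane, "unlisted")
    return f"U+{cp:04X}: plane {plane} ({label}), {kind}"


for cp in (0x0041, 0xD800, 0x1F600, 0x20000, 0xE0001):
    print(describe(cp))
```

The second sample, U+D800, is worth a look: surrogate code points sit in plane 0, so they are never supplementary.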

This is easy, right?

Of course, even the clearest intention will not always be communicated properly, which is why the documentation for the Char.IsSurrogate method has text like "Indicates whether a Unicode character is categorized as a surrogate character", and why the Windows CE docs say "For sorting, all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the PUA (private user area). Sorting for a standalone surrogate character (that is, either the high or low character is missing) is not supported." I do mind the not-entirely-accurate statement about the collation, but I will talk about that another day!

I do not mind the surrogate character usage in the previous paragraph so much, as it is a more benign error -- when people say surrogate character in this context, they mean surrogate code point. It is a harmless error (a lone surrogate even shows up as a NULL glyph, as if it were a character of some sort), and we can just fix the documentation language at some point (hopefully soon, but I will not lose sleep if they do not).
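The difference between a surrogate code point and an encoded character can be seen from code. This is a small Python sketch, assuming CPython's behavior of letting a `str` hold a lone surrogate code point; `encodes_as_utf8` is a hypothetical helper written for this example.

```python
import unicodedata


def encodes_as_utf8(text):
    """Return True if the string can be serialized as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = chr(0xD800)            # a surrogate code point, all by itself
supplementary = chr(0x10000)  # an encoded supplementary character

# General category "Cs" means "surrogate": a code point, not a character.
print(unicodedata.category(lone), encodes_as_utf8(lone))

# The supplementary character has a name and encodes without complaint.
print(unicodedata.name(supplementary), encodes_as_utf8(supplementary))

# In UTF-16 that one character is carried by two surrogate code units.
print(supplementary.encode("utf-16-be").hex())  # d800dc00
```

The lone U+D800 has a general category but no character name and no valid UTF-8 form, while U+10000 is a single character that merely needs two code units in UTF-16.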

The real problem case is when people try to equate the term surrogate character with the term surrogate pair, and then compound it by naming a method that way. Take the XmlWriter.WriteSurrogateCharEntity method, whose documentation, in addition to the evil method name, says things like:

When overridden in a derived class, generates and writes the surrogate character entity for the surrogate character pair.

This is a bit harder to fix (not the doc portion, but the method name, which obviously cannot be removed).
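For what it is worth, the operation such a method has to perform is easy to state in the right terms: take a surrogate pair, recover the one supplementary code point it represents, and write a numeric character reference for it. The Python sketch below only illustrates that idea under those assumptions. It is not the .NET implementation, and both function names are invented for this example.

```python
def combine_surrogate_pair(high, low):
    """Return the supplementary code point represented by a surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"{high:04X} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"{low:04X} is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def supplementary_char_ref(high, low):
    """Format the pair as one hexadecimal numeric character reference."""
    return f"&#x{combine_surrogate_pair(high, low):X};"


print(supplementary_char_ref(0xD83D, 0xDE00))  # &#x1F600;
```

The output names one supplementary character; neither surrogate appears in it, and a lone or swapped surrogate is rejected rather than written out.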

But we'll figure something out. Eventually.

Until then, please remember what the title of this post is telling you -- there is no such thing as a surrogate character!

 

This post brought to you by U+D800, the first surrogate code point -- not a surrogate character!
(This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)

Blog - Comment List
  • The evil like Char.IsSurrogate comes from the fact that the Char managed type, as well as wchar_t (under Windows), really represents just two bytes in UTF-16 encoding, not a Unicode character. I always mentally translate it, and then Utf16CodePoint.IsSurrogate does make sense :)
  • Well, that is the slightly more benign use (IMHO) -- both types are defined for MS platforms as using UTF-16, and it is just asking if the thing in the Char or WCHAR is a surrogate....
  • Well put. I'll try to always keep this in mind. But looking at XmlWriter WriteSurrogateCharEntity it seems to be named to be consistent with WriteCharEntity which I think is actually incorrect in the first place -- it is a "Numeric Character Reference", not an "Entity Reference" and certainly leaving off the term "Reference" or at least "Ref" is bad usage; it should be WriteCharRef and WriteSupplementaryCharRef for use with UTF-16. They also have WriteEntityRef which is actually an "Entity Reference," so they apparently thought leaving the "Ref" part off would make it a Numeric Character Reference! This whole API shows a real confusion in terms, in particular a craziness around the term "Entity." Dare Obasanjo oversaw this right? He should explain it. But coming back to your point, the Char in the method name may be intending to refer to the output (which is actually a supplementary character) and not the surrogate pair that is input judging from the correlation with WriteCharEntity.