Early in this century, with texting becoming an increasingly popular way to communicate on cell phones, the Japanese created an imaginative new way of conveying an idea or emotion: use cool, maybe colorful, maybe animated, symbols called emoji. Some emoji resembled symbols that were already encoded in Unicode, but most were new. Examples include “red apple” 🍎 and “first quarter moon with face” 🌛. In a way, emoji seem a little like high-tech, modern Kanji. Three Japanese cell phone carriers supported sets of several hundred emoji in an informal, interchangeable way. This post describes how emoji were standardized for the world at large and some of the implementation problems that ensued in adding emoji support to RichEdit 8.

At first, most character encoding experts assumed that emoji should be handled as rich text or embedded images rather than be encoded in plain text. After all, rich text includes color and animation along with a myriad other properties. But the fact remained that the Japanese cell phone carriers transmitted emoji as plain text in spite of their possibly richly formatted display. Meanwhile, Google wanted to be able to search the web and find text that included emoji. Accordingly, in 2007 some folks from Google proposed a dramatic departure from the earlier Unicode encoding principles, which banned rich-text attributes such as color and animation. The new approach simply sought to standardize the characters even if their display might end up static and monochrome. The Japanese cell phone companies agreed to use rich text for additions to the set, so that the existing set would be stable.

Folks from Apple then got excited about emoji and created a plain-text font with glyphs for all the emoji. Later they went further than the Japanese cell phone carriers by inventing an incredible color font for the emoji with scalable images. After many intensive discussions in the Unicode Technical Committee (UTC) and on the Unicode email lists, the emoji were added to the Unicode Standard 6.0. Most emoji were encoded in plane 1, starting at U+1F300. A further 107 emoji were unified with characters that already existed in the BMP (basic multilingual plane). Microsoft added emoji support to the Segoe UI Symbol font in Windows 8 and created a color emoji font as well. It also added emoji support to on-screen keyboards.

With regard to implementation, emoji pose special challenges, mostly due to the unification of 107 emoji with existing BMP characters and to the representation of 11 keycap emoji for #, 0, …, 9 using the U+20E3 keycap combining mark. The original Microsoft plan was to use the Segoe UI Symbol font for all emoji, but this font choice is ambiguous for the unified emoji. RichEdit 8 does use Segoe UI Symbol for all emoji except for the double exclamation mark U+203C (‼), which stays with the current font if the font has this character. Meanwhile, some clients would like to display emoji in an enhanced way, such as pictographic, multicolored, animated, or all three. An important question in plain-text contexts is whether a unified emoji character should be displayed in an enhanced way or with a traditional monochrome, stationary font.
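To make the keycap representation concrete, here is a minimal Python sketch (not RichEdit code) that builds the 11 keycap sequences and finds them in a string. The names `keycap_sequence` and `find_keycaps` are made up for illustration. The point is only that each keycap emoji is two code points, a base character followed by U+20E3, so an implementation cannot treat it as a single character.

```python
# Combining enclosing keycap, the mark named in the text above.
KEYCAP = "\u20E3"

# The 11 base characters: '#' plus the digits 0 through 9.
KEYCAP_BASES = "#0123456789"


def keycap_sequence(base: str) -> str:
    """Return base + U+20E3 for one of the 11 keycap bases."""
    if len(base) != 1 or base not in KEYCAP_BASES:
        raise ValueError(f"not a keycap base: {base!r}")
    return base + KEYCAP


def find_keycaps(text: str) -> list:
    """Return (index, base) for each base immediately followed by U+20E3."""
    return [
        (i, ch)
        for i, ch in enumerate(text[:-1])
        if ch in KEYCAP_BASES and text[i + 1] == KEYCAP
    ]


# All 11 keycap sequences, each two code points long.
sequences = [keycap_sequence(b) for b in KEYCAP_BASES]
```

A scan like `find_keycaps` is one way a layout pass could spot the sequences before choosing a font; how RichEdit actually does this is not described here.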

With hindsight, it’s easy to note that this problem would not have occurred if the UTC had resisted the temptation to unify some of the emoji. Typical current emoji renderings often don’t even look like the characters that they were unified with. The Dingbat parentheses ❪ and ❫ were encoded separately from other parentheses, so why not the emoji that ended up being unified? I think part of the rationale for unifying was to justify encoding the emoji in the first place. The unification helped assuage people’s concerns about adding such unconventional characters to Unicode.

Two ways to resolve the ambiguities exist: use a variation selector or use rich text. The former works at least in part for plain text. Specifically, if an ambiguous emoji character is followed by one of the BMP “emoji” variation selectors, U+FE0E or U+FE0F, it’s definitely an emoji character. U+FE0E specifies that the character should be rendered using a standard emoji-capable font, e.g., Segoe UI Symbol, whereas U+FE0F implies that special emoji rendering should be used. The nature of the special rendering is specified by a higher-order protocol outside the domain of plain text.
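As a rough illustration of the variation-selector approach, the following Python sketch pairs each character with a rendering hint based on the selector that follows it. The function name `presentation_hints` and the labels "standard" and "enhanced" are assumptions chosen to mirror the description above; a character with no selector gets `None`, meaning the choice is left to rich text or some higher-order protocol.

```python
VS15 = "\uFE0E"  # U+FE0E: standard emoji-capable font, per the text above
VS16 = "\uFE0F"  # U+FE0F: special (enhanced) emoji rendering


def presentation_hints(text: str) -> list:
    """Return (character, hint) pairs; hint is 'standard', 'enhanced', or None.

    A variation selector is consumed together with the character it follows.
    """
    result = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == VS15:
            result.append((ch, "standard"))
            i += 2  # skip the selector
        elif nxt == VS16:
            result.append((ch, "enhanced"))
            i += 2  # skip the selector
        else:
            result.append((ch, None))  # ambiguous: no selector present
            i += 1
    return result
```

For example, `presentation_hints("\u203C\uFE0F")` pairs the double exclamation mark with the "enhanced" hint, while a bare `"\u203C"` yields `None`.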

Another implementation problem occurs because the on-screen keyboard programs send each plane-1 emoji character as a pair of WM_CHAR messages, the first for the lead surrogate and the second for the trail surrogate. It would have been simpler to use a single WM_UNICHAR message, which carries the UTF-32 character code. It’s crucial that both the lead and trail surrogates end up in the same character format run. Otherwise they won’t be glyphed together and boxes will be displayed. Similar problems occur with some CJK IMEs when inputting plane-2 surrogate pairs.
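The surrogate arithmetic behind this is standard UTF-16. The sketch below, in Python rather than Win32 code, shows one way to buffer a lead surrogate until its trail arrives and then emit a single code point, which is roughly what a handler receiving two WM_CHAR messages has to do. The class name `SurrogateJoiner` and its error policy (discard an unpaired lead, map an unpaired trail to U+FFFD) are assumptions for illustration, not RichEdit's behavior.

```python
class SurrogateJoiner:
    """Buffers a lead surrogate until its trail surrogate arrives."""

    def __init__(self):
        self._lead = None  # pending lead surrogate, if any

    def feed(self, unit: int):
        """Accept one UTF-16 code unit; return a code point, or None if waiting."""
        if 0xD800 <= unit <= 0xDBFF:
            # Lead surrogate: hold it and wait for the trail.
            self._lead = unit
            return None
        if 0xDC00 <= unit <= 0xDFFF:
            lead, self._lead = self._lead, None
            if lead is None:
                return 0xFFFD  # unpaired trail: replacement character
            # Standard UTF-16 decoding of a surrogate pair.
            return 0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00)
        # Ordinary BMP code unit; any pending unpaired lead is discarded.
        self._lead = None
        return unit


joiner = SurrogateJoiner()
first = joiner.feed(0xD83C)   # lead surrogate of U+1F34E (red apple)
second = joiner.feed(0xDF4E)  # trail surrogate completes the pair
```

Here `first` is `None` because nothing can be inserted yet, and `second` is `0x1F34E`. Inserting the character only once the pair is complete is one way to keep both surrogates in the same format run.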

You might wonder if any of this has anything to do with math. Well, since the emoji are now readily available, it’d be pretty surprising never to find them used in formulas. Predicting which symbols mathematicians may be inspired to use is error prone. In particular, the emoji might be handy in teaching fractions, but then one would probably need additional emoji representing different fractional amounts. There’s an emoji slice-of-pizza symbol 🍕, but at least we’d need a whole pizza symbol to make a useful formula…