Suppose a user pastes some plain text into a document. In principle, that text can contain any Unicode character. That includes virtually all characters used in the current languages of the world along with many from ancient scripts and a plethora of symbols, mathematical and otherwise, that don’t belong to any language in particular. The question arises as to what font(s) to use for the pasted characters. In general the same font cannot be used for all characters, since TrueType glyph indices are 16-bit numbers thereby limiting fonts to 64K characters. Meanwhile Unicode has over 100,000 assigned characters. Furthermore even if a font could contain glyphs for all Unicode characters, it wouldn’t be able to render them all without compromises in quality. East Asian characters, for example, ideally have different baselines from Latin characters. This post describes the way RichEdit chooses fonts for characters not present in the active font. This process is call “font binding”.
The first section describes RichEdit character repertoires. The second section explains how a character is assigned to a character repertoire. The third section describes how to find out what character repertoires are supported by a font. The fourth section shows how these two kinds of information are combined to bind fonts to characters in a context-dependent way.
RichEdit’s approach to font binding is an extension of the GDI CreateFont() functionality that ensures the created font matches a given charset. If the font named in the call supports the charset, then the font is used, but if not, GDI instantiates a font that does. Before Unicode became popular, charsets defined character encodings for character repertoires typically associated with language systems, like Western European languages and Japanese. As such they were used for two purposes: to define the encodings and to define character repertoires supported by fonts. The GDI CreateFont charset functionality addresses the latter purpose. This facility, which is a kind of “font fallback”, is very handy, since it’s usually easy to choose the charsets for characters that have charsets. In contrast it’s hard to choose the correct languages for characters in general.
Charsets correspond to code pages. The Windows code pages are described here. For reference, the Windows charsets supported by RichEdit are
When Windows 2000 added support for Indic and several other South East Asian scripts, the decision was made not to add charsets for new scripts since it was clear that Unicode was the best way to represent characters on computers. Unfortunately that decision limited GDI’s convenient font “fallback” mechanism to character repertoires that have charsets. RichEdit needed to generalize this usage of charset. Accordingly we defined the charrep, a character repertoire index. There’s a charrep for each Windows charset. In addition, there are charreps for
Most of these are described in The Unicode Standard.
RichEdit does a kind of binary range search to find out which charrep a character nominally belongs to. ASCII is treated specially, since almost all fonts support “low ASCII”, the “neutral” range U+0020..U+003F. High ASCII, the range U+0040..U+007F is supported by most fonts as well. The ANSI_CHARSET includes ASCII and the Western European ANSI set contained in Windows code page 1252. For the range U+00A0..U+00FF, there’s a charrep named “high Latin 1”.
East Asian (CJK—Chinese Japanese Korean) fonts all support ASCII and high Latin 1. Chinese characters are used in Japanese, in both simplified and traditional Chinese, and in Korean. The CJK fonts often have a lot of Unicode symbol characters (not SYMBOL_CHARSET discussed shortly), so a CJK charrep may be returned for those. An exception occurs in math zones, where a math charrep is preferred.
The default charrep is assigned to the Unicode Private Use Area characters, U+E000..U+F8FF, since these characters have no standardized semantics. Special attention is given to SYMBOL_CHARSET fonts, which by definition don’t use Unicode. But characters in the range U+F020..U+F0FF are assigned the SYMBOL_CHARSET charrep, since Microsoft TrueType SYMBOL_CHARSET fonts use those locations as aliases for U+0020..U+00FF. These charrep assignments are similar to some assignment models based on scripts. But note that natural language isn’t used. This is because it’s much easier and more reliable to figure out a reasonable charrep for a character than a natural language for a character.
Windows GDI has a handy structure known as the FONTSIGNATURE, which has bits claiming support for various code pages and Unicode ranges. Some fonts don’t have reliable values, so buyers beware. Nevertheless, it’s fast and useful, so RichEdit uses it to fill in a bit mask for supported character repertoires and has some back-up code to handle errant fonts. Some fonts claim to support a given character repertoire, but only partially support it. For example Japanese fonts claim to support Greek and Cyrillic, but they only have glyphs for basic Greek and Cyrillic characters. RichEdit classifies other characters in these repertoires as “extended”, which means that they need “cmap” verification. The cmap is a TrueType font’s character-to-glyph map. If it returns 0 for a character, the character is missing and will display a missing-character glyph, usually an empty box. The cmap approach is valuable for other cases in which the FONTSIGNATURE may be inadequate. For example, a font may not claim to support Latin 1, but it nevertheless supports low and/or high ASCII. By checking the cmap for ‘0’ and ‘a’, respectively, one can find out the amount of ASCII support available. This approach can also be useful for finding out about new Unicode ranges not yet in the FONTSIGNATURE, e.g., emoji.
Given the information in the two preceding sections, we need to ensure appropriate fonts are bound to characters. Let’s start with the basic algorithm and then consider some of many fix-ups.
The simple algorithm is: assign a character flag (bit) to each character repertoire (charrep) and AND the resulting bit mask for a character against the bit mask for the current font. If a nonzero value results, the font claims to support the character’s charrep and the font can continue to be used. If the result is zero, font binding is needed.
In an attempt to keep the current user font choices when font binding, RichEdit scans the text runs backward from the insertion point looking for a font that supports the desired charrep. If one is found, it is used unless it was introduced by font binding. Otherwise the default font for the charrep is used. The RichEdit client can change the default font for a charrep by using the EM_SETCHARFORMAT message with the SCF_ASSOCIATEFONT flag and an LCID (locale ID) that corresponds to the desired charrep.
Special considerations apply to math zones and Chinese characters (among other scenarios). In math zones, a math font is used whenever it can handle the characters. This includes not only Unicode math symbols and math alphanumerics, but also Latin, Greek and Cyrillic text. Many Chinese characters are used in Japanese, but a Chinese font may not look pleasing to a Japanese person. In particular, the simplified Chinese fonts look quite different from the more traditional look of the same characters rendered with a Japanese font. So to bind appropriately, we scan forward and backward around an inserted Chinese character to see if any Hiragana or Katakana characters are present. If so, it’s most likely to be Japanese text and a Japanese font should be used. Similarly if Hangul characters are found, a Korean font should be used. Other heuristics involve noting what the user locale is.
If even with its many heuristics RichEdit still cannot find a reasonable font and a Microsoft Office application is running, RichEdit queries the Office mso.dll for advice. Also various kinds of “font fallback” may kick in at display time to save a character from rendering as a missing-character glyph. Font binding is an area of active research, since new character scripts continue to be added to Windows and Unicode continues to add new characters. In general RichEdit does a good job of it, but it could be better.