When Word 2007 and later versions write an RTF file that includes math alphanumeric symbols (U+1D400..U+1D7FF), they convert the symbols back to ASCII or Greek letters in the BMP (basic multilingual plane) and write the characters out using the appropriate charset: ANSI_CHARSET for Latin letters like a..z and GREEK_CHARSET for Greek letters. In addition, Word writes the relevant math style \mstyN (upright, bold, italic, bold-italic) and math script \mscrN (Roman, script, Fraktur, double-struck, sans-serif, monospace) control words to identify the original math alphanumeric symbols. Unfortunately, a change made in Word 2010 occasionally caused math-italic Greek letters to be written with ANSI_CHARSET instead of GREEK_CHARSET. When the RTF was read back in, those letters were corrupted. The technical publishing industry uses RTF in its publication pipeline, and math-italic Greek letters are popular in technical documents, so you can imagine that this was a real pain in that industry.
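To make the conversion concrete, here is a minimal Python sketch of folding a math alphanumeric symbol back to its BMP letter and picking a charset. It is not Word's code: the function name `fold_math_alphanumeric` is hypothetical, it leans on Unicode compatibility normalization (NFKC) as a stand-in for whatever mapping Word uses, and it doesn't attempt to derive the \mstyN and \mscrN values.

```python
import unicodedata
from typing import Optional, Tuple


def fold_math_alphanumeric(ch: str) -> Tuple[str, Optional[str]]:
    """Fold one math alphanumeric symbol to its BMP base character.

    Returns (base character, charset label). The labels only echo the
    charset names used in the text; this is an illustration, not Word's
    actual logic.
    """
    cp = ord(ch)
    if not 0x1D400 <= cp <= 0x1D7FF:
        raise ValueError("not in the math alphanumeric block")

    # NFKC applies the <font> compatibility decomposition, which maps
    # e.g. MATHEMATICAL ITALIC SMALL ALPHA (U+1D6FC) to U+03B1.
    base = unicodedata.normalize("NFKC", ch)

    if base.isascii():
        charset = "ANSI_CHARSET"
    elif 0x0370 <= ord(base[0]) <= 0x03FF:
        charset = "GREEK_CHARSET"
    else:
        # Symbols such as nabla or dotless i fit neither bucket here.
        charset = None
    return base, charset


if __name__ == "__main__":
    for symbol in ("\U0001D44E", "\U0001D6FC"):  # math italic a, math italic alpha
        print(unicodedata.name(symbol), fold_math_alphanumeric(symbol))
```

The bug described above amounts to the second branch occasionally being skipped, so a Greek letter gets tagged with ANSI_CHARSET and can't be recovered when the file is read back in.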
Eventually I looked at the Word code, although I’m not a Word developer, and noticed that the math Greek letters could be written out using the Unicode \uN control word instead of converting them using a charset. Making this simple change fixed the data corruption. It didn’t actually fix the original bug of occasionally using the wrong charset, but it worked around the bug, arguably with an improved design. RichEdit cannot have such a bug because it writes out the math alphanumeric symbols using the \uN control words directly without converting them to ASCII or Greek. This has the added benefit that the math-style and math-script control words don’t have to be written. RichEdit also uses the Unicode values for math alphanumeric symbols in plain-text copies, while Word converts them back to ASCII or Greek. Since plain text can’t have math-style control words, this conversion loses the math styling. This is too bad since the math alphanumeric symbols were added to Unicode precisely so that they could be preserved in plain text.
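For readers who haven't seen what \uN output looks like, here is a small Python sketch of writing text that way. It is an illustration under stated assumptions rather than RichEdit's or Word's implementation: the function name `rtf_unicode_escape` is hypothetical, the fallback character is a single `?` (matching the default \uc1 skip count), and control characters such as newlines aren't handled. RTF's \uN takes a signed 16-bit decimal value, so characters beyond the BMP are written as a surrogate pair of two \uN control words.

```python
def rtf_unicode_escape(text: str, fallback: str = "?") -> str:
    """Encode text for an RTF body using \\uN control words for non-ASCII.

    Assumes the default \\uc1 setting, so each \\uN is followed by exactly
    one fallback character that Unicode-aware readers skip.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Backslash and braces are RTF syntax and need escaping.
            out.append("\\" + ch if ch in "\\{}" else ch)
            continue

        if cp < 0x10000:
            units = [cp]
        else:
            # Split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

        for u in units:
            # \uN takes a signed 16-bit value, so 0x8000..0xFFFF go negative.
            signed = u - 0x10000 if u > 0x7FFF else u
            out.append(f"\\u{signed}{fallback}")
    return "".join(out)


if __name__ == "__main__":
    # Math italic small alpha, U+1D6FC
    print(rtf_unicode_escape("\U0001D6FC"))
```

For U+1D6FC this produces `\u-10187?\u-8452?`, the surrogate pair D835 DEFC written as signed values. No charset is involved at any point, which is why this approach sidesteps the charset bug.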
Here’s the good news: you can download the fix for Word 2010 from here, and the fix for Word 2013 from here. Windows Update automatically downloaded the fix for Word 2013 on my laptop, so the bug may already be fixed on your computers too. The downloads were published on June 10, 2014.
The fix harkens back to my post on UTF-8 RTF, a Unicode version of RTF. With a Unicode format, charset conversions aren’t used, which simplifies the RTF writer substantially and improves performance. Unicode RTF is a substantially better format than regular RTF, except that most RTF converters don’t support it. It does work with RichEdit, and hopefully more RTF converters will support it in the future.