
What paragraphs are and how they are formatted are questions that continually come up both inside and outside of Microsoft. So this post describes Word/RichEdit paragraphs in general. A subsequent post will describe the “math paragraph”, which is part of a regular paragraph and is used for displayed equations, as distinguished from inline mathematical expressions.

The paragraph is a very important structure in written language. About six years ago, I developed the RichEdit binary format, which shipped with RichEdit 5.0 (Office 2003) as RichEdit’s preferred copy/paste format and was used by OneNote 2003 and 2007. In the design stage I talked with Eliyezer Kohen, of TrueType, OpenType, LineServices, and Page/Table Services fame. I was inclined to have four parallel streams: plain text, character formatting, paragraph formatting, and embedded objects, a format corresponding to the internal RichEdit representation. Eliyezer agreed such parallel streams were important, but insisted that they should be broken up into paragraphs. At the time, this seemed like extra overhead to me and I naturally didn’t want to slow things down. But I followed his advice and it was right on! The rest of this post answers four questions. First, what’s a paragraph? Then what’s paragraph formatting? Then what’s a “soft” paragraph? And finally, what’s the final EOP?

What’s a paragraph?

From a natural language point of view, a paragraph is one or (preferably) more closely related sentences that naturally belong together without becoming too long. From the Word/RichEdit point of view, a paragraph is a string of text in any combination of scripts and inline objects, including possible “soft” line breaks and “math paragraphs”, with uniform “paragraph formatting” up to and including a carriage return. The carriage return (CR) is given by the Unicode character U+000D, which you insert by typing the Enter key. In plain text on PCs, the paragraph is usually terminated by a CRLF (U+000D U+000A) combination, but not ordinarily inside a Word document or RichEdit instance. There, just the CR is used.

<rant> It’s quite convenient to use a single character. It takes up less space than the CRLF and it’s easier to parse/manipulate, since it’s an atomic entity. In fact, Unix already used a single character, the line feed (LF, U+000A), back in 1972, several years before the PC operating systems were developed. Unfortunately, the PC with its DEC heritage preferred CRLF, a holdover from the old teletype days, and Word and the Mac shortened it to CR instead of LF. Windows NotePad still isn’t able to display Unix/Linux LF terminated paragraphs correctly after all these years (note that 2008 > 1972). I’m on a mission to fix that, but please don’t hold your breath! Anyhow I like CR better than LF, mostly because of habit. Clearly it would have been better to have a single standard. In this connection, it’s interesting to note that Word and RichEdit can handle CR, LF, and CRLF terminated paragraphs, even though they prefer CR. </rant>
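To make the CR/LF/CRLF point concrete, here’s a minimal sketch of how a reader might fold all three terminators into the single CR that Word and RichEdit prefer. The function name normalize_eop_to_cr is made up for this illustration; it is not a RichEdit API, and the real implementation surely differs.

```c
#include <stddef.h>

/* Rewrites s in place so that every CR, LF, or CRLF paragraph terminator
 * becomes a single CR (U+000D). Returns the new length of the string.
 * Hypothetical helper for illustration only. */
size_t normalize_eop_to_cr(char *s)
{
    size_t r = 0;   /* read index */
    size_t w = 0;   /* write index, never ahead of r */

    while (s[r] != '\0') {
        if (s[r] == '\r') {
            s[w++] = '\r';
            r++;
            if (s[r] == '\n')       /* CRLF: swallow the LF */
                r++;
        } else if (s[r] == '\n') {
            s[w++] = '\r';          /* lone LF becomes CR */
            r++;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

Because the output is never longer than the input, the rewrite can be done in place, which is one small payoff of having a single-character terminator.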

What’s paragraph formatting?

A key characteristic of a paragraph is its formatting, which is represented by a pretty large set of properties. Most of these properties are settable using a paragraph formatting dialog. In particular, there’s alignment (left, right, center, justify, along with a variety of East Asian options), space before/after, line spacing (single, double, multiple, at least, exactly), left/right margins and wrapped line indent, line/page breaks, tabs (oh, how I wish HTML had tab support!), and bullets/numbering. Internally, paragraphs and their formatting get overloaded with such entities as tables and drop caps, but let’s not get distracted. Using hot keys like Ctrl+E for centering or the paragraph formatting dialog, you can set the formatting for the paragraph(s) in which the current selection occurs. If you just have an insertion point (the blinking caret), only the paragraph containing the insertion point gets the new formatting.

What’s a soft paragraph?

When you create a numbered list, you may want to have an entry with one or more line breaks but no new number or bullet. To insert a line break without ending the paragraph, type Shift+Enter, which inserts a Vertical Tab (VT, U+000B). Even though you get a line break, you don’t end the current paragraph, so no new list number or bullet appears. All the paragraph properties remain the same with the new line and the space-before property doesn’t apply to the new line, since the line is inside the paragraph. Sometimes it’s handy to refer to a sequence of lines terminated by such a line break as a “soft paragraph”. In HTML, these “soft line breaks” are represented by the <BR> tag, whereas “hard” paragraphs are identified by the <P> tag.
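The difference between hard and soft breaks is easy to see in code. The following sketch, which assumes the text has already been normalized so that CR ends every hard paragraph, counts the two kinds of break separately. The type BreakCounts and the function count_breaks are names invented for this example.

```c
#include <stddef.h>

/* Tallies for one run of text. Hypothetical type for illustration. */
typedef struct {
    size_t hard_paragraphs;   /* number of CR (U+000D) terminators */
    size_t soft_breaks;       /* number of VT (U+000B) line breaks */
} BreakCounts;

/* Counts hard paragraph ends (CR) and soft line breaks (VT) in s.
 * A soft break starts a new line but stays inside the current paragraph,
 * so it never increments hard_paragraphs. */
BreakCounts count_breaks(const char *s)
{
    BreakCounts counts = { 0, 0 };

    for (; *s != '\0'; s++) {
        if (*s == '\r')
            counts.hard_paragraphs++;
        else if (*s == '\v')
            counts.soft_breaks++;
    }
    return counts;
}
```

For a list entry such as "Item one\vcontinued\rItem two\r", only the two CRs would produce a new number; the VT just wraps the first entry onto a second line.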

Thinking of numbered entities, you might want to change the character formatting of the number or bullet out in front. For example, you might want to use a larger font size or a different font. To do this, change the appropriate character formatting of the CR that ends the paragraph.

Final EOP

To provide a place to attach paragraph formatting for the last paragraph, every Word document and every RichEdit rich-text instance has a “final EOP” (end of paragraph), represented by a CR (CRLF in RichEdit 1.0). You cannot delete the final EOP, nor can you move the insertion point past it. In the Word and RichEdit object models, the ranges can select up through the final EOP, but they cannot collapse to an insertion point that follows the final EOP. The farthest they can go is up to just before the final EOP. Similarly messages like EM_EXSETSEL cannot make the RichEdit selection go beyond the final EOP.
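The final-EOP rule can be sketched as a small clamping function. This is only my reading of the rule stated above, not RichEdit’s actual code: a selection may extend through the final EOP, but an insertion point may not sit after it. The names Sel and clamp_selection are hypothetical, and cchText is assumed to count every character including the final EOP (so it is at least 1 for rich text).

```c
/* A selection as a pair of character positions, cpMin <= cpMax.
 * Hypothetical type for illustration. */
typedef struct {
    long cpMin;
    long cpMax;
} Sel;

static long clamp_cp(long cp, long lo, long hi)
{
    return cp < lo ? lo : (cp > hi ? hi : cp);
}

/* Clamps sel to a rich-text story of cchText characters, where the last
 * character is the final EOP. Assumes cchText >= 1. */
Sel clamp_selection(Sel sel, long cchText)
{
    sel.cpMin = clamp_cp(sel.cpMin, 0, cchText);
    sel.cpMax = clamp_cp(sel.cpMax, 0, cchText);
    if (sel.cpMin > sel.cpMax) {            /* keep the ends ordered */
        long t = sel.cpMin;
        sel.cpMin = sel.cpMax;
        sel.cpMax = t;
    }
    /* A nondegenerate range may include the final EOP, but an insertion
     * point can go no farther than just before it. */
    if (sel.cpMin == sel.cpMax && sel.cpMin == cchText)
        sel.cpMin = sel.cpMax = cchText - 1;
    return sel;
}
```

With the text “Hello” plus the final EOP (cchText = 6), the range 0..6 survives unchanged, while an attempted insertion point at 6 ends up at 5.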

RichEdit also supports plain-text controls, which are characterized by uniform paragraph formatting and don’t need, or have, a final EOP. An empty plain-text control is really empty, whereas a rich-text control always has at least one character, the final EOP.

An earlier post describes math context menus (right click somewhere in a math zone) for changing the display characteristics of math objects, like fractions and integrals. For example context menus offer options to convert a stacked fraction into a linear fraction and vice versa. Another post describes math context menus for aligning and/or manually breaking equations on binary and relational operators.

In particular, the second post shows how one can align a sequence of equations separated from one another by soft paragraph marks (Shift+Enter, instead of Enter). For this approach, one chooses the “Align at this character” option for the operator to be used for alignment in each equation. This method is quite general in that binary, relational, and punctuation characters can be used as alignment operators, even when inside math objects.

A useful alternative context menu option not described in those posts allows one to align a set of equations with the single menu choice, “Align at =”. This is less general than marking the alignment operators explicitly, since in each equation the first relational operator that’s not inside a math object is used. To access the option, select two or more equations separated from one another by soft paragraph marks. Then right click anywhere on the selected equations and choose the “Align at =” option. Here “=” is the most common choice for aligning multiple equations. But the “=” just stands for the first relational operator, which could be, for example, “≥” instead of “=”. Note that two or more whole equations have to be selected for the “Align at =” option to be offered. If the last equation is only partly selected, the option won’t appear.

The math context menus also include the options “Professional”, “Linear”, and “Save as New Equation…” The “Professional” option converts any linear format text that is selected in the math zone into the corresponding built-up “professional” form. If no text is selected, the whole math zone is built up. Conversely, the “Linear” option converts built-up math objects to the “built-down” linear format. The “Save as New Equation” option saves the selected equation(s) in the Equation drop down list appearing at the left side of the math ribbon. This gives you an easy way to insert them from the math ribbon. Alternatively you can add a Math Autocorrect entry with the linear format for any math expression/equation you’d like to insert via typed entry. To see this last method in action, try typing \quadratic <space> <space> in a math zone. This inserts the solutions to the quadratic equation.

A number of math display properties have document defaults. These are the ones used if you don’t explicitly override them, which you can usually do by invoking a math context-menu option. The properties all pertain to “displayed” math zones, that is, math zones that begin either at the start of the document or at a hard/shift Enter (CR/VT) and end at the following hard/shift Enter. The options determine math indents and things such as whether integral limits are positioned below and above the integral or as subscript and superscript. In Russia, it’s common to see the integral limits below and above the integral, while in the United States the limits are displayed as subscript and superscript.

You can change the default settings to suit your tastes or a publisher’s conventions. In the math ribbon (type Alt+= to insert a math zone and then the math ribbon should appear), click on the Tools button over toward the left side of the ribbon. A dialog will be displayed that shows a variety of math display properties along with buttons to access the math autocorrect and recognized-function dialogs.

 

The document default math properties in this dialog are described in a somewhat technical way in the math section of the RTF specification. The properties belong to the RTF {\mmathPr…} group. They are also children of the <mathPr> OMML element. In this post, I describe the properties in a less technical way. For easy reference to the RTF specification, the relevant RTF control word is listed in parentheses. The dialog also has some options that are not document default math properties, such as “Copy MathML to the clipboard as plain text” instead of “Copy Linear Format to the clipboard as plain text.” Such options do not affect the layout of a document and hence are stored in the system registry rather than in the document.

 

Default font for math zones (\mmathFontN) Gives a drop-down list of math fonts that can be used as the default math font to be used in the document. Currently only Cambria Math has thorough math support, but others such as the STIX fonts are coming soon.

Reduce size of nested fractions in display equations (\msmallFracN) Specifies that nested fractions should be displayed such that the numerator and denominator are written in a script or scriptscript size instead of regular-text size. Specifically characters in the outermost fraction’s numerator and denominator are displayed using the full text size, characters in a nested fraction are displayed in the script size (about 70% as large as the text size), and fractions nested inside a nested fraction are displayed in scriptscript size (about 60% as large as the text size). TeX uses this “small fraction” choice by default, but Word 2007 does not, basically because in all the physics books I’ve read I don’t remember seeing reduced sizes used in display math. But if you prefer them, you can change them.
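As a rough sketch of the small-fraction rule just described, the function below picks a size from the nesting depth. The 70% and 60% factors are the approximate values quoted above; the function name, the use of twips, and the integer rounding are my assumptions for the example, not what Word or TeX literally computes.

```c
/* Returns the size (in twips) for characters in a fraction's numerator
 * and denominator.
 *   nesting_level 0: outermost fraction, full text size
 *   nesting_level 1: fraction inside a fraction, script size (about 70%)
 *   nesting_level 2 or more: scriptscript size (about 60%)
 * If small_fractions is 0 (the Word 2007 default), nesting is ignored.
 * Hypothetical helper for illustration only. */
int fraction_text_size(int text_size_twips, int nesting_level, int small_fractions)
{
    if (!small_fractions || nesting_level <= 0)
        return text_size_twips;
    if (nesting_level == 1)
        return text_size_twips * 70 / 100;
    return text_size_twips * 60 / 100;
}
```

For 12-point text (240 twips), this gives 240, 168, and 144 twips for the three levels when the option is enabled, and 240 throughout when it is not.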

Break lines with binary and relational operators (\mbrkBinN) Document property specifying how binary operators are treated when they coincide with a line break. By default, the line break occurs before the binary operator. That is, the binary operator is the first character on the wrapped line. But you can change it so that a line break occurs after the operator, or so that the operator is duplicated, that is, it appears at the end of the first line and at the start of the second.

Duplicate operators for subtraction as (\mbrkBinSubN) Document property specifying how the minus operator is treated when it coincides with a line break when break operators are duplicated. By default, the minus appears before and after the break, but you can choose a plus before the break and a minus after the break or vice versa.

Place integral limits to the side/centered above and below (\mintLimN) Document setting for default placement of integral limits when converting from linear format to professional (built-up) format in display mode (not inline). Limits can be either centered above and below the integral, or positioned just to the right of the operator. The default setting is to position to the right of the operator (subscript/superscript).

Place n-ary limits to the side/centered above and below (\mnaryLimN) Document setting for default placement of n-ary limits other than integrals when converted from linear format to Professional (built-up) format in display mode. Limits can be either centered above and below the n-ary operator, or positioned just to the right of the operator.  The default setting is above and below the operator.

 

Use the following settings for math on its own line (\mdispDefN) Document property to use the default math paragraph settings for equations, i.e., use values given by \mlMarginN, \mrMarginN, \mdefJcN, \mwrapIndentN, \mwrapRightN, etc. Default is to use the default math settings described below, but you can change it to use the text paragraph settings.

Left margin (\mlMarginN) Document property for the left margin for math. Math margins are added to the paragraph settings for margins.

Right margin (\mrMarginN) Right margin for math.

Justification (\mdefJcN) Document property for the default justification of displayed math zones. Individual equations can overrule the default setting. Displayed math zones can be left justified, right justified, centered, or centered as a group. When a displayed math zone is centered as a group, the equation(s) are ordinarily left aligned within a block, and the entire block is centered with respect to column margins. The user can use a context menu to align equations in more general ways, e.g., on the equal signs.

Indent wrapped lines by (\mwrapIndentN) Indent of wrapped line of an equation. The line or lines of a wrapped equation after the line break can either be indented by a specified amount from the left margin, or right-aligned. The default indent is 1”.

Right align wrapped lines (\mwrapRightN) If enabled, right justify wrapped lines of an equation. If disabled, the line or lines of a wrapped equation after the line break are indented by \mwrapIndentN from the left margin.

 

In addition to the properties above, the math RTF and OMML include four useful displacements for displayed math, which unfortunately didn’t make it into Word 2007. Hopefully they will someday :-) These properties are

Spacing before math paragraph (\mpreSpN).

Intraequation spacing between lines in an equation (\mintraSpN).

Spacing between equations within a display math paragraph (\minterSpN).

Spacing after math paragraph (\mpostSpN).

In addition, two useful, but not yet implemented, document default math properties are 1) math style for differential d and related characters (U+2145..U+2149), and 2) which character to use for invisible times (U+2062) if a line break occurs at the invisible times. Ordinarily one would use the \times (U+00D7) for a visible times character, but a raised dot is another possibility. In the United States, the differential d is almost always displayed as a math italic d, but in Europe, an upright d is fairly standard. The latter choice emphasizes that the differential d is different from regular mathematical variables. Similarly the Napierian logarithm base e (U+2147) and the imaginary unit i (square root of -1, U+2148) are displayed as math italic in the United States and upright in Europe.

The Equations Options dialog also includes buttons to examine math autocorrect entries and recognized functions such as trigonometric functions.

MathML doesn't formalize document defaults for math, but MathML math zones can inherit them depending on the implementation. So such defaults are compatible with MathML, but they need to be expressed outside of MathML.

One subject that seems to come up every other month or so is how RichEdit tables work. So I might as well post the answer. Hopefully RichEdit tables will eventually be described in the Windows SDK. They are not directly related to Math in Office, but I had mathematical expressions in mind when designing RichEdit’s table facility. Both mathematics and tables are recursive. For example you can have a fraction in the numerator of another fraction, and you can have a table in the cell of another table. So implementing tables seemed like a useful project that might also reveal how to implement a WYSIWYG implementation of mathematics. In fact, MathML <mtable>’s have a lot in common with general tables.

 

Most people at the time (1999) were recommending that a table cell should be represented by a whole RichEdit instance, which would give great generality. But I wanted a model that was much smaller and faster, and that worked with the built-in Find/Replace functionality and the RTF file converters. To this end, we needed a model, like Word’s, that was part of a single document instance, and could be overlaid on the existing paragraph structure. Accordingly RichEdit's table implementation is very efficient and fast, in fact, much faster than Word’s (although less general). Improvements have been made over the years, but the discussion that follows applies to RichEdit 4.0, which shipped with Office 2002, and RichEdit 4.1, which ships with Windows XP and Vista to this day. It also applies to later versions that ship with Office 2003 and 2007, which have additional features.

 

Specifically a cell containing a single line of text is represented only by that text, not by some larger structure. An empty cell consists of the single character, the cell mark U+0007. A cell containing multiple lines of text is expressed in terms of a structure that is substantially smaller than a complete edit instance, followed by the CELL mark. Tables can be nested up to 15 levels deep; higher nestings are represented by tab-delimited text. Cells can contain multiple paragraphs of any kind, e.g., bidirectional text, arbitrary tabs and alignments.

 

The Spring of 1999 was shortly after the Unicode Technical Committee added the U+FFF9..U+FFFB delimiter characters for describing ruby text in Japanese. These characters were available for more general use and seemed ideal for RichEdit’s internal table structure. This choice preceded the addition of the internal-use-only U+FDD0..U+FDEF characters that we use for mathematical structure characters, among other things.

 

In the (in-memory) backing store, a table row has the form

 

    {CR...}CR

 

where { stands for the Unicode STARTGROUP character U+FFF9, and CR is the ASCII Carriage Return character U+000D. The delimiter } stands for the Unicode ENDGROUP character U+FFFB and ... stands for a sequence of cells, each consisting of cell text terminated by the CELL mark U+0007. For example, a row with three empty cells has the plain text understructure U+FFF9 U+000D U+0007 U+0007 U+0007 U+FFFB U+000D. The start and end group character pairs are assigned identical PARAFORMAT2 information that describes the row and cell parameters. If rows with different parameters are needed, they may follow one another with appropriate PARAFORMAT2 parameters. A horizontally or vertically merged cell has two characters: NOTACHAR (0xFFFF) followed by CELL (0x7). Any text that appears in a merged cell is stored in the first cell of the set of merged cells.
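The plain-text understructure of a row of empty cells can be generated with a few lines of C. This sketch only produces the characters; it attaches no PARAFORMAT2 information, so by itself it does not yield a working table. The constant and function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Structure characters used in the backing store, as described above. */
enum {
    CH_STARTGROUP = 0xFFF9,   /* { in the row diagram */
    CH_ENDGROUP   = 0xFFFB,   /* } in the row diagram */
    CH_CELL       = 0x0007,   /* CELL mark            */
    CH_CR         = 0x000D    /* carriage return      */
};

/* Writes the understructure {CR CELL...CELL}CR for one row containing
 * cCell empty cells into out, which has room for cap UTF-16 code units.
 * Returns the number of code units written, or 0 if out is too small.
 * Hypothetical helper for illustration only. */
size_t build_empty_row(uint16_t *out, size_t cap, size_t cCell)
{
    size_t need = cCell + 4;   /* two delimiters and two CRs, plus cells */
    size_t i = 0;
    size_t k;

    if (cap < need)
        return 0;
    out[i++] = CH_STARTGROUP;
    out[i++] = CH_CR;
    for (k = 0; k < cCell; k++)
        out[i++] = CH_CELL;
    out[i++] = CH_ENDGROUP;
    out[i++] = CH_CR;
    return i;
}
```

Calling build_empty_row with cCell = 3 yields the seven code units U+FFF9 U+000D U+0007 U+0007 U+0007 U+FFFB U+000D given in the example above.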

 

One way to insert tables is to copy/paste tables from Word. RichEdit reads and writes table RTF. For more programmatic purposes, RichEdit 4.0 introduced the message EM_INSERTTABLEROW, which acts similarly to EM_REPLACESEL but inserts one or more table rows with empty cells instead of plain text. Specifically it deletes the text (if any) currently selected by the selection and then inserts empty table row(s) with the row and cell parameters given by wparam and lparam, respectively, as defined below. It leaves the selection pointing to the start of the first cell in the first row. The client can then populate the table cells by pointing the selection at the various cell end marks and inserting and formatting the desired text. Such text can include nested table rows, etc. Since wparam and lparam point at row and cell parameter structures, this API isn't compatible with Visual Basic and can't be easily added to RichEdit’s object model TOM, although TOM2 does have a general set of table interfaces.

 

The TABLEROWPARMS and TABLECELLPARMS structures are defined as

 

typedef struct _tableRowParms
{                           // EM_INSERTTABLEROW wparam is a (TABLEROWPARMS *)
    BYTE    cbRow;          // Count of bytes in this structure
    BYTE    cbCell;         // Count of bytes in TABLECELLPARMS
    BYTE    cCell;          // Count of cells
    BYTE    cRow;           // Count of rows
    LONG    dxCellMargin;   // Cell left/right margin (\trgaph)
    LONG    dxIndent;       // Row left (right if fRTL) indent (similar to \trleft)
    LONG    dyHeight;       // Row height (\trrh)
    DWORD   nAlignment:3;   // Row alignment (like PARAFORMAT::bAlignment,
                            //  \trql, \trqr, \trqc)
    DWORD   fRTL:1;         // Display cells in RTL order (\rtlrow)
    DWORD   fKeep:1;        // Keep row together (\trkeep)
    DWORD   fKeepFollow:1;  // Keep row on same page as following row (\trkeepfollow)
    DWORD   fWrap:1;        // Wrap text to right/left (depending on nAlignment)
                            // (see \tdfrmtxtLeftN, \tdfrmtxtRightN)
    DWORD   fIdentCells:1;  // lparam points at single struct valid for all cells
} TABLEROWPARMS;

 

typedef struct _tableCellParms
{                           // EM_INSERTTABLEROW lparam is a (TABLECELLPARMS *)
    LONG    dxWidth;        // Cell width (\cellx)
    WORD    nVertAlign:2;   // Vertical alignment (0/1/2 = top/center/bottom
                            //  \clvertalt (def), \clvertalc, \clvertalb)
    WORD    fMergeTop:1;    // Top cell for vertical merge (\clvmgf)
    WORD    fMergePrev:1;   // Merge with cell above (\clvmrg)
    WORD    fVertical:1;    // Display text top to bottom, right to left (\cltxtbrlv)
    WORD    wShading;       // Shading in .01% (\clshdng) e.g., 10000 flips fore/back

    SHORT   dxBrdrLeft;     // Left border width (\clbrdrl\brdrwN) (in twips)
    SHORT   dyBrdrTop;      // Top border width  (\clbrdrt\brdrwN)
    SHORT   dxBrdrRight;    // Right border width (\clbrdrr\brdrwN)
    SHORT   dyBrdrBottom;   // Bottom border width (\clbrdrb\brdrwN)
    COLORREF crBrdrLeft;    // Left border color (\clbrdrl\brdrcf)
    COLORREF crBrdrTop;     // Top border color (\clbrdrt\brdrcf)
    COLORREF crBrdrRight;   // Right border color (\clbrdrr\brdrcf)
    COLORREF crBrdrBottom;  // Bottom border color (\clbrdrb\brdrcf)
    COLORREF crBackPat;     // Background color (\clcbpat)
    COLORREF crForePat;     // Foreground color (\clcfpat)
} TABLECELLPARMS;

 

Note that paragraph-format information containing the TABLEROWPARMS and TABLECELLPARMS information is attached to the table-row delimiters as set up by the EM_INSERTTABLEROW message, so merely duplicating the plain-text table structure in the backing store isn't enough to insert a working table. In fact, methods like ITextRange::SetText() convert the special delimiters U+FFF9..U+FFFB to spaces (U+0020). Note also that this table structure is nestable.

 

The definition of EM_INSERTTABLEROW is extensible, since in the future we'll probably have to support more parameters for table rows and cells. The API also inserts a consistent table row all at once, so that no illegal table parts are present on return. Hence if the document is saved after such an insertion, valid Word-compatible RTF will be written. lparam points at the TABLECELLPARMS structure for the first cell in an array of TABLECELLPARMS structures.  It's important that cbCell = sizeof(TABLECELLPARMS).  That way RichEdit knows how much cell information the client is specifying.  In particular, in the future if more cell parameters are defined, older clients can get away with specifying less and the new RichEdit can assign default values for the new parameters.  Similarly cbRow says how many bytes are defined by the client for TABLEROWPARMS, in case RichEdit is revised to support more row parameters that the client doesn't know about.
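The byte-count trick used by cbRow and cbCell is a common way to keep a structure extensible. Here is a minimal sketch of the idea using two made-up structures, CellParmsOld and CellParmsNew, in which the newer layout extends the older one. The field dxPadding and its default value are pure inventions for the example and do not exist in TABLECELLPARMS.

```c
#include <stddef.h>
#include <string.h>

/* Layout known to an older client (hypothetical). */
typedef struct {
    long dxWidth;
} CellParmsOld;

/* Layout known to a newer library: same leading fields, one new one
 * (hypothetical). */
typedef struct {
    long dxWidth;
    long dxPadding;
} CellParmsNew;

/* Reads the cb bytes that the client says it filled in. Fields the client
 * doesn't know about keep their defaults, and bytes beyond what this
 * library knows about are ignored. */
CellParmsNew read_cell_parms(const void *client, size_t cb)
{
    CellParmsNew parms = { 0, 15 };   /* made-up default values */

    if (cb > sizeof parms)
        cb = sizeof parms;
    memcpy(&parms, client, cb);
    return parms;
}
```

An old client passes sizeof(CellParmsOld) and gets the default dxPadding; a new client passes sizeof(CellParmsNew) and controls both fields. This only works if new fields are always appended, never inserted in the middle.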

 

To make simple tables easier to define, if fIdentCells = 1, lparam points at a single TABLECELLPARMS structure that is valid for all cells in the row. Note that a nonzero cell border width is guaranteed to give at least a one-pixel border.

 

The colors are limited to the standard 16 colors defined by

 

      RGB(  0,   0,   0),     // \red0\green0\blue0

      RGB(  0,   0, 255),     // \red0\green0\blue255

      RGB(  0, 255, 255),     // \red0\green255\blue255

      RGB(  0, 255,   0),     // \red0\green255\blue0

      RGB(255,   0, 255),     // \red255\green0\blue255

      RGB(255,   0,   0),     // \red255\green0\blue0

      RGB(255, 255,   0),     // \red255\green255\blue0

      RGB(255, 255, 255),     // \red255\green255\blue255

      RGB(  0,   0, 128),     // \red0\green0\blue128

      RGB(  0, 128, 128),     // \red0\green128\blue128

      RGB(  0, 128,   0),     // \red0\green128\blue0

      RGB(128,   0, 128),     // \red128\green0\blue128

      RGB(128,   0,   0),     // \red128\green0\blue0

      RGB(128, 128,   0),     // \red128\green128\blue0

      RGB(128, 128, 128),     // \red128\green128\blue128

      RGB(192, 192, 192),     // \red192\green192\blue192

 

plus two custom colors. The border widths are limited to the range 0 to 255 twips.

 

If the color index is not in the range 1..18, then autocolor is used, which usually ends up being the system Text or Background colors.
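The color-index rule might look something like the following sketch. It assumes that indices 1..16 select the standard colors in the order listed above and that 17 and 18 select the two custom colors; that ordering is my assumption for the example, as are the names MAKE_RGB, kStdColors, and resolve_color. The color values use the same byte layout as the Windows RGB macro.

```c
#include <stdint.h>

/* Same layout as a Windows COLORREF: 0x00BBGGRR. */
#define MAKE_RGB(r, g, b) \
    ((uint32_t)(r) | ((uint32_t)(g) << 8) | ((uint32_t)(b) << 16))

/* The 16 standard colors, in the order listed above. */
static const uint32_t kStdColors[16] = {
    MAKE_RGB(  0,   0,   0), MAKE_RGB(  0,   0, 255),
    MAKE_RGB(  0, 255, 255), MAKE_RGB(  0, 255,   0),
    MAKE_RGB(255,   0, 255), MAKE_RGB(255,   0,   0),
    MAKE_RGB(255, 255,   0), MAKE_RGB(255, 255, 255),
    MAKE_RGB(  0,   0, 128), MAKE_RGB(  0, 128, 128),
    MAKE_RGB(  0, 128,   0), MAKE_RGB(128,   0, 128),
    MAKE_RGB(128,   0,   0), MAKE_RGB(128, 128,   0),
    MAKE_RGB(128, 128, 128), MAKE_RGB(192, 192, 192)
};

/* Resolves a color index. Returns 1 and stores the color in *out for
 * indices 1..18; returns 0 for any other index, meaning "use autocolor".
 * Hypothetical helper for illustration only. */
int resolve_color(int index, const uint32_t custom[2], uint32_t *out)
{
    if (index >= 1 && index <= 16) {
        *out = kStdColors[index - 1];
        return 1;
    }
    if (index == 17 || index == 18) {
        *out = custom[index - 17];
        return 1;
    }
    return 0;   /* caller falls back to system Text or Background color */
}
```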

No, this isn’t about some kind of science fiction. It’s about five Unicode characters that are useful for mathematics, but are generally invisible or should be. The characters are the zero-width space (U+200B), function apply (U+2061), invisible times (U+2062), invisible comma (U+2063), and the new invisible plus (U+2064). This post discusses each one in the context of mathematical text.

The zero-width space is a handy character that has no glyph “ink” and hence no ascent (height above the base line), no descent (depth below the baseline) and no width. In Word 2007 math zones you can insert it (type 200B <Alt+x>) into an empty argument if you don’t want a dotted box character to appear. RichEdit uses it for optional empty arguments to suppress the dotted box except when the insertion point resides inside an empty argument.

The function-apply character (U+2061) is used in the linear format as a binary operator that builds into a math function object. For example in a math zone, if you type sin2061<Alt+x> x and click on “Professional”, you get the math function object sin x. Naturally it’s easier just to type sin<space>x and have formula autobuildup do this for you, but underneath it’s the function apply character that’s controlling the build up process.

The invisible times (U+2062) is a bona fide binary operator and you can break on it and align to it. Unfortunately we didn’t have enough time to develop the uses for invisible times, so it’s not currently very useful. Unlike in Word 2007, it shouldn’t display a glyph, except for a thin space if at the end of a math zone. With it you could then effectively break an equation before any character, not just on binary, relational and some other operators. It would be nice to be able to have it display a multiplication times symbol × if it ends up being the best point for an automatic break. Word 2007 displays the invisible times as a dotted box surrounding a times sign, which is the glyph for it in the Cambria Math font.

The invisible comma (or separator) is supposed to convey the semantic of separating two variables or indices. For example the indices ij on a matrix element a_ij could be separated by the invisible comma to emphasize that ij isn’t the product of i and j. Word 2007 displays the invisible comma as a dotted box surrounding a comma, which is the glyph for it in the Cambria Math font.

The invisible plus (U+2064) is new with Unicode 5.1 and is supposed to carry the semantic of connecting a whole number like 3 with a fraction like ½ to give a quantity 3½ that has the value 3.5, not 1.5 (3/2). The invisible plus is well intended, but it’s also tricky to use. For one thing in ordinary arithmetic, addition is considered to have lower precedence than multiplication. So the value of the expression 4×3 + 1/2 is 12.5, not 14 (4×3.5). But 4×3<invisible plus>1/2 has the value 14. In this usage, the invisible plus has a higher precedence than multiplication.
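Written out, the two readings of the example are as follows (a LaTeX rendering of the arithmetic above, assuming the amsmath package for align* and \tfrac):

```latex
\begin{align*}
4 \times 3 + \tfrac{1}{2} &= 12 + \tfrac{1}{2} = 12.5
  && \text{(ordinary plus: multiply first)} \\
4 \times 3\tfrac{1}{2} &= 4 \times \left(3 + \tfrac{1}{2}\right) = 14
  && \text{(invisible plus: add first)}
\end{align*}
```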

Some more discussion of the invisible operators is given in Section 2.14 of Unicode Technical Report #25.

 

Two very interesting developments are happening that will improve Word 2007’s MathML support. The first is key for getting Word 2007 math text into the scientific and technical publisher workflows, and the second may help in this regard too. Specifically, new transforms are now available in beta versions enabling Word to read and write MathML. These XSLT files are responsible for converting between Word’s native math format OMML and MathML 2.0. If you’d like to try out the new files (omml2mml.xsl and mml2omml.xsl), you can download them from the Microsoft Connect site using the invitation code: 0707-84P4-DPWT. Once you’ve downloaded the files, copy them to the C:\Program Files\Microsoft Office\Office12 subdirectory, or wherever winword.exe is. Before doing so, you might want to rename the current omml2mml.xsl and mml2omml.xsl files to omml2mml.xsl.bak and mml2omml.xsl.bak, respectively, in case you want to back out the update at a later date. But I doubt you will. The new ones are significantly better.

 

The second development is that Word 2007 will have a service pack release that enables it to read and write the ISO standard ODF files as well as the native ISO standard OOXML files. In the ODF standard, math zones are represented by MathML 2.0. So when Word converts to and from ODF, it will use MathML 2.0 for all math zones. And it will use the files above to do the translations.

Subscript and Superscript Bases

For proper math typography, it’s important to know the base of a subscript or superscript expression. For example, in Einstein’s equation E = mc², the superscript expression c² appears and c is the base, not mc. Knowing what the base is allows proper kerning of the base relative to the script (superscript or subscript) as well as providing more accurate semantics in interoperating with mathematical calculation engines.

This post describes the subscript/superscript base rules used by Word 2007 and RichEdit 6 in building up math text from the linear format. The rules are good, but not infallible, and users can overrule them either directly in the linear format or after they are built up into the Professional format.

Unicode math alphabetics: Ordinarily when a user types an ASCII letter or a Greek lower case letter α..ω (along with some variants), the letter is automatically converted to the corresponding Unicode math italic letter. These special mathematical letters, along with the basic set of Latin letters in Fraktur, script, and open-face math styles, are reserved for mathematical variables. Accordingly, if a subscript or superscript follows such a letter, that letter is considered to be the base. In linear format, if you type E=mc^2<space>, you get E = mc², where the letters are given by math italic characters (not used here in this blog post). In particular, c would be given by the math italic c, U+1D450, rather than by the ASCII c, U+0063. This single math italic c is the base of the superscript expression c². For more information on the math alphabetics, please see Section 2.1 of the Unicode Technical Report #25.

Numbers: A consecutive string of ASCII digits is treated as a base. So in the expression 100², the 100 is the base of the superscript expression and has the mathematical meaning of “one hundred squared”. This quantity is typed in as 100^2.

ASCII letter strings: Since mathematical variables are almost always represented by math alphabetics, a consecutive string of ASCII letters is treated as a base. So in the superscript expression sin⁻¹, the base is “sin”. Actually this case is usually handled by the function name mechanism described next. You can enter an ASCII letter string by turning off the italic button before you type or by selecting the corresponding math italic letters and then turning off the italic button. Be sure to turn the italic button back on if you want to enter math italic variables.

Function names: When a consecutive string of English alphabetics is typed followed by a space or bracket of some kind, the resulting math italic string is “folded” down to the corresponding ASCII letter string and compared to entries in a mathematical function dictionary. If found, the folded version of the string is used followed by the function-apply operator U+2061. The dictionary includes trigonometric functions like sin, cos, tan, etc., along with many other famous math function names. Users can modify this dictionary. If the function-apply operator is then followed by a subscript or superscript, that script is transferred to the function name, and the function name becomes the base of the script expression. This is handy for typing in expressions like sin⁻¹x.

Embellished operators: If an operator character precedes a subscript or superscript, the operator is the base. For example, in the expression +2 (a + followed by a script 2), the + is the base.

Built-up math objects: If a built-up math object such as a stacked fraction precedes a subscript or superscript, that object is the base.

Superscript a subscript object: Exceptions to the rule above occur for superscripting a subscript object and subscripting a superscript object. In both of these cases, the combination is turned into a subsup object, which has special typography, typically placing the superscript over the subscript.

Opaque strings: Opaque strings are whatever is inside a \begin \end expression. Such strings are bases if followed by a subscript or superscript. This is the catch-all method of letting most any mathematical text be a subscript/superscript base. The user is cautioned to use reasonable choices so that the result is understandable to readers.

Complex script characters: In Indic scripts like Devanagari, a number of Unicode characters may be combined to form a character “cluster”. If such a cluster is followed by a subscript or superscript, the cluster becomes the base. However, this doesn’t occur for Arabic ligatures, for which only the last character is treated as the base. One can force the whole ligature to be the base by putting it inside a \begin \end expression, i.e., by making it an opaque string.

Ordinary text: Expressions resulting from the linear format “rate” are called ordinary text and are useful as variables when you want to spell out the variables’ names. Such ordinary text strings are treated as bases.
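To make a few of the rules above concrete, here is a minimal Python sketch. It is not the Word 2007 or RichEdit code, just an illustration under simplifying assumptions: it covers only the math italic conversion, the function-name folding, and the digit-run, ASCII-letter-run, and single-character base rules. The function names and the small FUNCTION_NAMES set are hypothetical stand-ins; the real function dictionary is larger and user-editable.

```python
from typing import Optional

# Hypothetical subset of the math function dictionary (the real one is larger).
FUNCTION_NAMES = {"sin", "cos", "tan", "log", "exp"}
FUNCTION_APPLY = "\u2061"  # function-apply operator


def to_math_italic(ch: str) -> str:
    """Map an ASCII letter or Greek lower-case letter to Unicode math italic."""
    if ch == "h":
        return "\u210e"  # math italic h is encoded as U+210E (Planck constant)
    if "A" <= ch <= "Z":
        return chr(0x1D434 + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(0x1D44E + ord(ch) - ord("a"))
    if "\u03b1" <= ch <= "\u03c9":
        return chr(0x1D6FC + ord(ch) - 0x03B1)
    return ch


def fold_to_ascii(ch: str) -> str:
    """Fold a Latin math italic letter back to ASCII; pass others through."""
    if ch == "\u210e":
        return "h"
    cp = ord(ch)
    if 0x1D434 <= cp <= 0x1D44D:
        return chr(cp - 0x1D434 + ord("A"))
    if 0x1D44E <= cp <= 0x1D467:
        return chr(cp - 0x1D44E + ord("a"))
    return ch


def function_name(run: str) -> Optional[str]:
    """Return the folded name plus U+2061 if the run is in the dictionary."""
    folded = "".join(fold_to_ascii(c) for c in run)
    return folded + FUNCTION_APPLY if folded in FUNCTION_NAMES else None


def find_base(prefix: str) -> str:
    """Return the trailing part of prefix that acts as the script base.

    Simplified: a run of ASCII digits, a run of ASCII letters, or else the
    single last character (math alphabetic, operator, and so on).
    """
    if not prefix:
        return ""
    last = prefix[-1]
    if last.isascii() and last.isdigit():
        same_kind = lambda c: c.isascii() and c.isdigit()
    elif last.isascii() and last.isalpha():
        same_kind = lambda c: c.isascii() and c.isalpha()
    else:
        return last
    i = len(prefix)
    while i > 0 and same_kind(prefix[i - 1]):
        i -= 1
    return prefix[i:]
```

For example, find_base applied to the text before the ^ in E=mc^2 (with math italic letters) returns just the math italic c, while applied to 100 it returns the whole digit string.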

 

The science and technology publishing industry uses Word 2003 in processing a significant portion of manuscript submissions. The industry hasn’t yet been able to accept manuscripts in which the mathematical text (math zones) is created using Word 2007’s new math facility, since the infrastructure currently only works with math zones encoded in the Design Science MathType format. To help generalize the infrastructure, the present post describes how the Word 2007 math zone content can be extracted from Word doc files converted from Word 2007 docx format. This post is pretty technical, so most people probably won’t read any further :-)

 

More specifically, this post shows how one can extract the Office 2007 MathML (OMML) from math-zone images stored in doc files that have been converted for use in Word 2003 and earlier versions of Word. The main reason for having this information in the doc file is so that if Word 2003 is used to edit the file, the math zones remain alive and intact when reopened in Word 2007. But the information is also useful if you want to extract the OMML using Word 2003, as we see here.

 

The basic idea is to read the doc file into Word 2003 and save it in the RTF format. The image data in this RTF contains the OMML in the new wzEquationXML shape property value. Shapes are described in the section on Word 97 Through Word 2007 RTF for Drawing Objects (Shapes) in the RTF Specification 1.9.1.

 

For convenience, here is a quick summary of how the image RTF works. Images are represented by RTF of the form

 

{\*\shppict {\pict …}}{\nonshppict {\pict …}}

 

The full information is available in the {\*\shppict {\pict …}} group, which is where the OMML is stored. Readers that don’t understand the \shppict group skip it and use the {\nonshppict {\pict …}} group instead, which represents the image in metafile format. The \shppict {\pict…} group contains the shape properties for the image in a {\*\picprop…} group followed by some image control words and the binary data for the image itself in the png format.

 

Each shape property is represented by RTF of the form

 

                  {\sp{\sn PropertyName}{\sv PropertyValueInfo}}

 

For example, consider the Word 2003 RTF for an image of x², in which the wzEquationXML property name is displayed in red. The wzEquationXML value group contains a bunch of XML including the OMML, which is given by the <m:oMathPara> …</m:oMathPara> XML.

 

{\*\shppict{\pict{\*\picprop\shplid1025

{\sp{\sn shapeType}{\sv 75}}

{\sp{\sn fFlipH}{\sv 0}}

{\sp{\sn fFlipV}{\sv 0}}

{\sp{\sn pictureTransparent}{\sv 16777215}}

{\sp{\sn fLine}{\sv 0}}

{\sp{\sn wzEquationXML}

{\sv <?xml version="1.0" encoding="UTF-8" standalone="yes"?>\'0d\'0a<?mso-application progid="Word.Document

<…bunch of XML describing document environment…>

<m:oMathPara><m:oMath><m:sSup><m:sSupPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:sSupP

r><m:e><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr><m:t>x</m:t></m:r></m:e><m:sup><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/>

<w:i/></w:rPr><m:t>2</m:t></m:r></m:sup></m:sSup></m:oMath></m:oMathPara></w:p><w:sectPr wsp:rsidR="00000000"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/></w:sectPr></w:body></w:wordDocument>}}

{\sp{\sn fLayoutInCell}{\sv 1}}}

\picscalex100\picscaley100\piccropl0\piccropr0\piccropt0\piccropb0

\picw397\pich529\picwgoal225\pichgoal300

\pngblip\bliptag-207549586

{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}89504e470d0a1a0a0000000d494844520000000f000000140802000000dda5f0450000000373424954050605330b8d80000000017352474200aece1ce9000000

097048597300000ec400000ec401952b0e1b0000007049444154384fb592db0ac0300843e7feff9f6d4a405b078b74ac0fa596438c1773f7ab7dee36394193da

66464590421b28a0501474c9dcf2cd0c30a3e94093c6575453de9b1906e56665c534c2ec20b5df1baa7dafe33ba2d7c2a3dce752e454e71a28eb7a4f3efb6eeeed514f7ed11eb9504504dfa6f6850000000049454e44ae426082}}

{\nonshppict{\pict\picscalex100\picscaley100\piccropl0\piccropr0\piccropt0\piccropb0\picw397\pich529\picwgoal225\pichgoal300\wmetafile8\bliptag-207549586\blipupi96{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}<…hexadecimal string with metafile data…>

 

You need to use Word 2003 to get this RTF, since Word 2003 has been patched to write the wzEquationXML property. Word 2007 doesn’t write this property when it writes RTF for the png math zone images, since it writes math zones using math RTF (see the Mathematics section of the RTF Specification 1.9.1).

An updated RTF Specification is available for downloading here. I already blogged about the new version in the MS Word blog, but wanted to add a few words about math in the Math in Office blog.

The RTF specification includes a thorough discussion of the Office 2007 math format. The format syntax is naturally RTF syntax, but the relationship to OMML (Office MathML) is straightforward, as discussed in the specification and in an earlier blog post. Reading the math section of the RTF specification is a great way to learn about the Office math model.

It’s fairly easy to read and write math RTF, so it provides an alternative, potentially easier way to interchange technical documents with Word 2007, especially if the target application already supports an earlier version of RTF.

If you have improvements you’d like to see incorporated into the RTF specification, please send them to me.

One handy way to edit mathematical text is to use math context menus. These menus are displayed when you depress the right mouse button with the mouse pointing inside a math zone. In addition to the usual Font and Paragraph options, in a math zone you see options relevant for the math object the mouse is pointing at. For example, if the mouse points at a stacked fraction, the context menu includes entries to change to a skewed or linear fraction as well as to remove the fraction bar. If the mouse points at an accented character, there’s an option to remove the accent.

You could make such changes by building down to the linear format, making appropriate edits, and then building back up to the professional format. This is a general way of making many kinds of changes that are hard to make in a WYSIWYG way. But if the desired change can be accomplished via a context menu option, that option is faster.

The following table summarizes the math context menus that appear using Word 2007.

 

| Object | Context menu options |
| --- | --- |
| Accent | Remove accent |
| Bar | Switch between overbar and underbar; Remove bar |
| Box | Increase/decrease argument size |
| BorderBox | Hide/show (top/left/bottom/right) border; Add/remove (horizontal/vertical/top-left-diagonal/bottom-left-diagonal) strike |
| Brackets (delimiters) | Insert/delete argument before/after; Stretch delimiters, match delimiters; Hide/show left/right delimiter |
| Equation array | Insert row before/after; Delete row; Align array at top row, center, or bottom row |
| Fraction | Change to skewed/linear/stacked; Remove/replace fraction bar |
| LeftSubSup | Make into subsup; Increase/decrease argument size |
| Limit | Switch between upper and lower limit; Remove limit; Change limit size |
| Math zone | Build down/up, i.e., Linear or Professional |
| Matrix | Insert row/column before/after; Delete row/column; Show/hide empty-argument placeholders; Set row/column spacing; Align matrix at top row, center, or bottom row; Align column left/center/right |
| n-ary | Change limit location; Hide/show upper/lower empty limit place holder; Grow with content |
| Radical | Remove radical; Hide/show empty degree place holder |
| Group character | Display group (horizontal stretch) character above/below; Remove group character |
| Subscript/superscript | Delete script; Increase/decrease script size |
| SubSup | Align subscript and superscript; Make into left subsup; Increase/decrease script size |

 

In addition if the Microsoft Math graphing calculator add-in is installed, right clicking on a formula gives you context menu options that Microsoft Math is able to perform on the formula. Select an option and a window appears with the results and offers the possibility to insert them into your document. We hope to extend this approach so that other math engines can be used to manipulate and graph formulas in Word.

Okay, the Math In Office blog isn't about advertising. But just in case you're someone who really likes RichEdit and editing and wants to work on it (as I did and do :-)) and related text processing, here's a pretty fine opportunity. If you're not interested, please skip this post.

 

So here goes. Want to work on components that are used by millions of users every day in apps like Word, OneNote, PowerPoint, and Excel, as well as in platforms like Windows Mobile and .NET? The RichEdit team is looking for energetic testers that love to code, take pride in their work, and enjoy solving problems in innovative ways.

 

RichEdit is used in many places throughout the Office applications, from high level edit controls to low level measuring and layout APIs. The test team's goal is to ensure that we provide robust and high quality code that meets the functionality that applications need. An example of a feature we recently helped deliver was the new Equations feature in Word 2007.

 

Because we provide APIs intended to solve a variety of application requirements, testers on our team have a deep understanding of the code our developers have written and of the requirements and code in the various client applications that use us, and they write extensive automation code (most of it in C++). We work closely with our development team, as well as with client application developers, to solve customer scenarios, prioritize our testing, help with integration, and, most important, find bugs in our code.

 

Since text is pretty much everywhere in computing, our team works with many other teams across Microsoft, and gets a unique perspective that comes from having such a wide scope. We also work with many different application requirements, and come up with new test approaches to deal wit