The science and technology publishing industry uses Word 2003 to process a significant portion of manuscript submissions. The industry hasn’t yet been able to accept manuscripts in which the mathematical text (math zones) is created using Word 2007’s new math facility, since the infrastructure currently works only with math zones encoded in the Design Science MathType format. To help generalize the infrastructure, the present post describes how the Word 2007 math zone content can be extracted from Word doc files converted from Word 2007 docx format. This post is pretty technical, so most people probably won’t read any further :-)

 

More specifically, this post shows how one can extract the Office 2007 MathML (OMML) from math-zone images stored in doc files that have been converted for use in Word 2003 and earlier versions of Word. The main reason for having this information in the doc file is so that if Word 2003 is used to edit the file, the math zones remain alive and intact when reopened in Word 2007. But the information is also useful if you want to extract the OMML using Word 2003, as we see here.

 

The basic idea is to read the doc file into Word 2003 and save it in the RTF format. The image data in this RTF contains the OMML in the new wzEquationXML shape property value. Shapes are described in the section on Word 97 Through Word 2007 RTF for Drawing Objects (Shapes) in the RTF Specification 1.9.1.

 

For convenience, here is a quick summary of how the image RTF works. Images are represented by RTF of the form

 

{\*\shppict {\pict …}}{\nonshppict {\pict …}}

 

The full information is available in the {\*\shppict {\pict …}} group, which is where the OMML is stored. Readers that don’t understand the \shppict group skip it and use the {\nonshppict {\pict …}} group instead, which represents the image in metafile format. The {\*\shppict {\pict …}} group contains the shape properties for the image in a {\*\picprop…} group, followed by some image control words and the binary data for the image itself in PNG format.

 

Each shape property is represented by RTF of the form

 

                  {\sp{\sn PropertyName}{\sv PropertyValueInfo}}
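
As a rough illustration of that form, here is a minimal Python sketch that collects the name/value pairs from an RTF fragment. The function name shape_properties is hypothetical, and the regular expression assumes that property values contain no unescaped braces; a real RTF reader would track group nesting instead.

```python
import re

# Matches one {\sp{\sn Name}{\sv Value}} group. Assumes the value holds no
# unescaped braces, which a full RTF parser would not need to assume.
SHAPE_PROP = re.compile(r"\{\\sp\s*\{\\sn ([^{}]+)\}\s*\{\\sv ([^{}]*)\}\s*\}")

def shape_properties(rtf: str) -> dict:
    """Return the shape properties found in an RTF fragment as name -> value."""
    return {name.strip(): value for name, value in SHAPE_PROP.findall(rtf)}

# Hypothetical fragment modeled on the \picprop group described in this post.
sample = r"{\*\picprop\shplid1025{\sp{\sn shapeType}{\sv 75}}{\sp{\sn fFlipH}{\sv 0}}}"
props = shape_properties(sample)
```

The values come back as strings; interpreting them (numbers, flags, XML) is left to the caller.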

 

For example, consider the following Word 2003 RTF for an image of x², which includes the wzEquationXML property. The wzEquationXML value group contains a bunch of XML including the OMML, which is given by the <m:oMathPara> …</m:oMathPara> XML.

 

{\*\shppict{\pict{\*\picprop\shplid1025

{\sp{\sn shapeType}{\sv 75}}

{\sp{\sn fFlipH}{\sv 0}}

{\sp{\sn fFlipV}{\sv 0}}

{\sp{\sn pictureTransparent}{\sv 16777215}}

{\sp{\sn fLine}{\sv 0}}

{\sp{\sn wzEquationXML}

{\sv <?xml version="1.0" encoding="UTF-8" standalone="yes"?>\'0d\'0a<?mso-application progid="Word.Document

<…bunch of XML describing document environment…>

<m:oMathPara><m:oMath><m:sSup><m:sSupPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:sSupP

r><m:e><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr><m:t>x</m:t></m:r></m:e><m:sup><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/>

<w:i/></w:rPr><m:t>2</m:t></m:r></m:sup></m:sSup></m:oMath></m:oMathPara></w:p><w:sectPr wsp:rsidR="00000000"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/></w:sectPr></w:body></w:wordDocument>}}

{\sp{\sn fLayoutInCell}{\sv 1}}}

\picscalex100\picscaley100\piccropl0\piccropr0\piccropt0\piccropb0

\picw397\pich529\picwgoal225\pichgoal300

\pngblip\bliptag-207549586

{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}89504e470d0a1a0a0000000d494844520000000f000000140802000000dda5f0450000000373424954050605330b8d80000000017352474200aece1ce9000000

097048597300000ec400000ec401952b0e1b0000007049444154384fb592db0ac0300843e7feff9f6d4a405b078b74ac0fa596438c1773f7ab7dee36394193da

66464590421b28a0501474c9dcf2cd0c30a3e94093c6575453de9b1906e56665c534c2ec20b5df1baa7dafe33ba2d7c2a3dce752e454e71a28eb7a4f3efb6eeeed514f7ed11eb9504504dfa6f6850000000049454e44ae426082}}

{\nonshppict{\pict\picscalex100\picscaley100\piccropl0\piccropr0\piccropt0\piccropb0\picw397\pich529\picwgoal225\pichgoal300\wmetafile8\bliptag-207549586\blipupi96{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}<…hexadecimal string with metafile data…>

 

You need to use Word 2003 to get this RTF, since Word 2003 has been patched to write the wzEquationXML property. Word 2007 doesn’t write this property when it writes RTF for the png math zone images, since it writes math zones using math RTF (see the Mathematics section of the RTF Specification 1.9.1).
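
The extraction described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name extract_omml is made up for illustration, the XML is assumed to contain no unescaped braces, and only \'hh escapes in the ASCII range (such as \'0d\'0a) are decoded. Raw line breaks inside the RTF are dropped first, since they are just file wrapping.

```python
import re

def extract_omml(rtf: str) -> list:
    """Return the <m:oMathPara> elements stored in wzEquationXML properties."""
    results = []
    # Find each wzEquationXML value; assumes the XML has no unescaped braces.
    for value in re.findall(r"\{\\sn wzEquationXML\}\s*\{\\sv ([^{}]*)\}", rtf):
        # Raw CR/LF characters in RTF are file wrapping, not content.
        xml = value.replace("\r", "").replace("\n", "")
        # Decode \'hh escapes such as \'0d\'0a (treated here as Latin-1 bytes).
        xml = re.sub(r"\\'([0-9a-fA-F]{2})",
                     lambda m: chr(int(m.group(1), 16)), xml)
        results += re.findall(r"<m:oMathPara>.*?</m:oMathPara>", xml,
                              flags=re.DOTALL)
    return results

# Hypothetical, heavily shortened stand-in for the RTF shown above.
sample = (
    r"{\sp{\sn wzEquationXML}" "\n"
    r"{\sv <?xml version='1.0'?>\'0d\'0a<w:p><m:oMathPara><m:oMath><m:r>" "\n"
    r"<m:t>x</m:t></m:r></m:oMath></m:oMathPara></w:p>}}"
)
omml = extract_omml(sample)
```

To use it on a real file, save the document as RTF from Word 2003, read the file as text, and pass the contents to the function.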

An updated RTF Specification is available for downloading here. I already blogged about the new version in the MS Word blog, but wanted to add a few words about math in the Math in Office blog.

The RTF specification includes a thorough discussion of the Office 2007 math format. The format syntax is naturally RTF syntax, but the relationship to OMML (Office MathML) is straightforward, as discussed in the specification and in an earlier blog post. Reading the math section of the RTF specification is a great way to learn about the Office math model.

It’s fairly easy to read and write math RTF, so it provides an alternative, potentially easier way to interchange technical documents with Word 2007, especially if the target application already supports an earlier version of RTF.

If you have improvements you’d like to see incorporated into the RTF specification, please send them to me.

One handy way to edit mathematical text is to use math context menus. These menus are displayed when you depress the right mouse button with the mouse pointing inside a math zone. In addition to the usual Font and Paragraph options, in a math zone you see options relevant for the math object the mouse is pointing at. For example, if the mouse points at a stacked fraction, the context menu includes entries to change to a skewed or linear fraction as well as to remove the fraction bar. If the mouse points at an accented character, there’s an option to remove the accent.

You could make such changes by building down to the linear format, making appropriate edits, and then building back up to the professional format. This is a general way of making many kinds of changes that are hard to make in a WYSIWYG way. But if the desired change can be accomplished via a context menu option, that option is faster.

The following table summarizes the math context menus that appear using Word 2007.

 

Object                   Context menu options
Accent                   Remove accent
Bar                      Switch between overbar and underbar
                         Remove bar
Box                      Increase/decrease argument size
BorderBox                Hide/show (top/left/bottom/right) border
                         Add/remove (horizontal/vertical/top-left-diagonal/bottom-left-diagonal) strike
Brackets (delimiters)    Insert/delete argument before/after
                         Stretch delimiters, match delimiters
                         Hide/show left/right delimiter
Equation array           Insert row before/after
                         Delete row
                         Align array at top row, center, or bottom row
Fraction                 Change to skewed/linear/stacked
                         Remove/replace fraction bar
LeftSubSup               Make into subsup
                         Increase/decrease argument size
Limit                    Switch between upper and lower limit
                         Remove limit
                         Change limit size
Math zone                Build down/up, i.e., Linear or Professional
Matrix                   Insert row/column before/after
                         Delete row/column
                         Show/hide empty-argument placeholders
                         Set row/column spacing
                         Align matrix at top row, center, or bottom row
                         Align column left/center/right
n-ary                    Change limit location
                         Hide/show upper/lower empty limit placeholder
                         Grow with content
Radical                  Remove radical
                         Hide/show empty degree placeholder
Group character          Display group (horizontal stretch) character above/below
                         Remove group character
Subscript/superscript    Delete script
                         Increase/decrease script size
SubSup                   Align subscript and superscript
                         Make into left subsup
                         Increase/decrease script size

 

In addition if the Microsoft Math graphing calculator add-in is installed, right clicking on a formula gives you context menu options that Microsoft Math is able to perform on the formula. Select an option and a window appears with the results and offers the possibility to insert them into your document. We hope to extend this approach so that other math engines can be used to manipulate and graph formulas in Word.

Okay, the Math In Office blog isn't about advertising. But just in case you're someone who really likes RichEdit and editing and wants to work on it (as I did and do :-)) and related text processing, here's a pretty fine opportunity. If you're not interested, please skip this post.

 

So here goes. Want to work on components that are used by millions of users every day in apps like Word, OneNote, PowerPoint, and Excel, as well as in platforms like Windows Mobile and .NET? The RichEdit team is looking for energetic testers that love to code, take pride in their work, and enjoy solving problems in innovative ways.

 

RichEdit is used in many places throughout the Office applications, from high level edit controls to low level measuring and layout APIs. The test team's goal is to ensure that we provide robust and high quality code that meets the functionality that applications need. An example of a feature we recently helped deliver was the new Equations feature in Word 2007.

 

Because we provide APIs intended to solve a variety of application requirements, testers on our team have a deep understanding of the code our developers have written, the requirements and the code in the various client applications that use us, and write extensive automation code (most of it in C++). We work closely with our development team, as well as client application developers to solve customer scenarios, prioritize our testing, help with integration, and most important, to find bugs in our code.

 

Since text is pretty much everywhere in computing, our team works with many other teams across Microsoft, and gets a unique perspective that comes from having such a wide scope. We also work with many different application requirements, and come up with new test approaches to deal with threading, performance, security, and other issues.

 

Requirements include 2 years experience with C/C++, and/or C#, 2 years product testing or similar development experience, and a bachelor’s degree in Computer Science or other engineering/science discipline. Experience with Win32, TeX, typography, or international language issues is a plus. The successful candidate will also have an aptitude for testing and a desire to innovate in test approaches, as well as be energetic, communicate and work well with others as well as independently, and have good problem solving skills. More info is available here.

 

I'd like to add that there's simply no way Word 2007's math feature could have shipped without the help of our incredible test team. If you're interested, please feel free to email me or a-prameh@microsoft.com.

 

Thanks!

 

This post discusses aspects of Word’s first math editing and display facility: the EQ field. This field is still used today for some East Asian formatting constructs. To have a built-up fraction a/b, one could (and still can) enter an EQ field with the contents \f(a,b). To try this in Word 2007, go to the Insert tab, click on Quick Parts, select Field..., scroll down to and click on Eq, click on Field Codes, type \f(a,b) after the EQ in the Field Code text box, click OK, and you see the built-up fraction. Well, it’s not really OK; it looks pretty awful by today’s standards, but back in the mid-1980s it seemed cool, at least if you didn’t know about TeX or my PS Technical Word Processor :-) For one thing, as with my PS Technical WP, the user was responsible for all horizontal spacing, e.g., inserting spaces around the + in a + b. In contrast, TeX usually chooses the ideal spacing automatically for you.

 

In the early 1990s, people at Microsoft realized that the EQ field wasn’t very good typographically and that it was really hard to edit, partly because it didn’t have WYSIWYG editing. And so they contracted with Design Science to ship the Equation Editor with Office. And later on, Word 2007 shipped its own fine math facility. So you’d think the EQ field would be long lost and forgotten.

 

Nevertheless, three uses of the EQ field persist to this day: the East Asian formatting constructs that Word calls "phonetic guide", "combine", and "enclose". This post discusses how these constructs are created with the Word EQ field using the function \o(<this>,<that>), which displays <this> over <that>. The major difference between the three constructs is the displacement of <this> relative to <that>.

 

Consider first the phonetic guide, which is often called ruby. This displays a ruby text annotation (<ruby>) in a smaller type size above, below, or to the side of a base text (<base>). The ruby text is used to clarify the base text in some way, typically how the base text is pronounced. When Japanese text is displayed from left to right (instead of vertically), the ruby text is displayed above or below the base text. The ruby text can have various justifications. Back in 1996 or so when Word added the ruby feature, the developers realized they could get the old EQ facility to do a fairly reasonable job. Specifically the EQ field contains the information

 

\* jcN  \* "Font:MS Mincho" \* hpsN \o\ad(\s\upN(<ruby>),<base>)

 

Here the N of the jcN switch specifies the kind of ruby justification as defined in the table

 

N    Meaning
0    Center <ruby> with respect to <base>
1    Distribute difference in space between longer and shorter text in the latter, evenly between each character
2    Distribute difference in space between longer and shorter text in the latter using a ratio of 1:2:1 which corresponds to lead : inter-character : end
3    Align <ruby> with the left of <base>
4    Align <ruby> with the right of <base>
5    Display <ruby> vertically to the right of <base>, regardless of the <base> alignment

 

The \* "Font:..." specifies the font and the \* hpsN specifies the number of half points to use for the ruby text size. The \ad switch for the \o function says to use the distributed justification defined by the jcN entry (I think). The \s\upN(...) is the EQ shift function that shifts its argument up if the \upN switch is used and down if the \doN switch is used. Here N is the number of points to shift. Note that (half) points don’t scale with the text size.
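
To make the pieces of the ruby field concrete, here is a small Python sketch that assembles the field contents from its parts. The function name ruby_eq_field and the default sizes (10 half points, a shift of 9 points) are assumptions chosen for illustration, not values Word necessarily uses, and the sketch does not escape commas or parentheses that might appear in the ruby or base text.

```python
def ruby_eq_field(ruby: str, base: str, jc: int = 0, font: str = "MS Mincho",
                  half_points: int = 10, shift_up: int = 9) -> str:
    r"""Build EQ field contents following the pattern
    \* jcN \* "Font:..." \* hpsN \o\ad(\s\upN(<ruby>),<base>)."""
    if jc not in range(6):
        raise ValueError("jc must be in the range 0..5")
    # jc: justification code, hps: ruby size in half points,
    # \s\upN: shift the ruby text up by N points over the base text.
    return (f'\\* jc{jc} \\* "Font:{font}" \\* hps{half_points} '
            f'\\o\\ad(\\s\\up{shift_up}({ruby}),{base})')

# Hypothetical example: centered ruby text "kana" over base text "kanji".
field = ruby_eq_field("kana", "kanji")
```

The returned string is what would follow EQ in the field code.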

 

An interesting note is that starting with Word 2000, the ruby construct is displayed by a special LineServices ruby handler. It was this handler that I started with when I wrote a preliminary math handler for LineServices. Basically I figured that the ruby construct was just a mathematical fraction in disguise and naturally my RichEdit/LineServices implementation of ruby could nest ruby structures as deeply as desired, just as one may nest fractions deeply in a continued-fraction expression. In fact, part of the HTML ruby standard allows ruby text to appear above and below the base text, and this is most easily accomplished by using one ruby construct as the <base> of another.

 

On to the East Asian formatting constructs that Word calls "combine" and "enclose". For "combine", the characters to be combined are split into two groups, <above> and <below>. The corresponding Word EQ field contains

 

\o(\s\up6(<above>),\s\do2(<below>))

 

where the font size is chosen to be 6 pts (\fs12). This construct displays <above> over <below>, sort of the way ruby displays <ruby> over <base>, but for "combine", <above> isn’t shifted up so far and <below> is shifted down a bit. As for the ruby construct, since the shifts are in points, the "combine" structure doesn’t scale with text size correctly.

 

For the "enclose" construct, which looks like a Q inside a circle, the EQ field can contain

 

\o\ac(\uc0\u9675,Q)

 

where 9675₁₀ = 25CB₁₆, i.e., a white circle. Here the \ac switch means center align one argument over the other (note that there’s no \s() object) and we include \uc0 to get rid of the multibyte translation that would otherwise follow \u9675.

 

When encoding these EQ fields in RTF, one has to duplicate every backslash, so that the backslash is taken literally instead of the start of a control word. For example, the "enclose" EQ field above could be represented by the RTF

 

{\field{\*\fldinst EQ \\o\\ac(\\fs24\\uc0\\u9675,\\fs16 Q)}{\fldrslt}}

 

This structure also doesn’t scale with font size, since the white circle and the Q have to have appropriate relative font sizes. Also EQ fields always have a null field result (empty \fldrslt), so if a reader of the RTF doesn’t understand the EQ \fldinst, it displays nothing for the field.
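
The backslash doubling is easy to get wrong by hand, so here is a minimal Python sketch of the wrapping step. The function name eq_field_to_rtf is hypothetical, and the sketch handles only backslashes; braces in the field contents would need RTF escaping as well.

```python
def eq_field_to_rtf(eq_contents: str) -> str:
    r"""Wrap EQ field contents in an RTF \field group with an empty result."""
    # Double every backslash so RTF reads it literally, not as a control word.
    escaped = eq_contents.replace("\\", "\\\\")
    return "{\\field{\\*\\fldinst EQ " + escaped + "}{\\fldrslt}}"

# The "enclose" field from above, with the font sizes used in the RTF example.
rtf = eq_field_to_rtf(r"\o\ac(\fs24\uc0\u9675,\fs16 Q)")
```

Applied to the "enclose" field contents, this reproduces the RTF line shown above.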

 

You can see from this that Word needs a better way to represent these three East Asian formatting constructs, a way that is compatible with the past and allows proper scaling with text size. An appropriate way to accomplish these things is to use a format that allows older readers to use the current EQ field approach and newer ones to use a proper description of the constructs. Two other East Asian formatting constructs, two-in-one (Warichu) and horizontal in vertical (tatenakayoko) cannot be rendered with the EQ field. They were first implemented in Word 2000, which used dedicated LineServices handlers and represented them with dedicated RTF control words.  They do scale correctly with text size.

Alex Ioffe emailed me

 

Hi Murray,

I realize you probably get this often by why can't someone (pleeease!) publish some official documentation of Word2007 Equation editor features? I have seen all of the MSN videos regarding it features and they barely scratch the surface. People like Dataninja  (http://dataninja.files.wordpress.com/2007/09/word07shortcuts.pdf) spent a great deal of time finding some very powerful features that seem to be entirely undocumented. LaTeX provides some help but some of the *most* interesting features of Equation Editor are not standard LaTeX (e.g. the stuff Dataninga has found). A slightly related example is the fact that Equation Editor for some reason does not contain the logical-not (¬ which is \lnot in LaTeX), fortunately this is an ascii character so I was able to add it myself.

 

Chiefly I have three questions about equation editor:

1) Is it possible to delete placeholders and how?

This is probably my most frequent annoyance, to delete an exponent I have to delete the base as well!

2) Is it possible to insert columns into a matrix in professional format via shortcut key.

I discovered that Enter will add a row, can you add a column?

3) Are there any plans for better HCI when it comes to shortcuts in Equation Editor. My dream is to be able to set some checkbox in the Options menu and see the shortcut keys for symbols appear their in the respective popup windows (e.g. - For All (Shortcut: \forall) when I mouseover them in the equation editor menus.

 

Yes, Word 2007’s new math editing and display are sort of a secret feature :-) Hopefully someday they’ll be better documented. Most of your wishes are already included in Word 2007, but they’re not immediately obvious as your email reveals.

 

To add an AutoCorrect entry for any Unicode character, go to the AutoCorrect dialog (the math AutoCorrect dialog for entries into math zones), and in the Replace text box type the AutoCorrect name you want to use and in the With text box type the Unicode hex code of the desired character followed by Alt+x. The Alt+x converts the code to the character. Click OK and you’ve added the entry. Or you can copy/paste the desired character into the With text box.

 

The linear format math input method used in Word 2007 is similar to TeX, but differs in significant ways. Thorough documentation for it is given in Unicode Technical Note #28, including the default AutoCorrect keywords for symbols.

 

To delete placeholders and insert/delete matrix columns, use the context menus available with a right mouse click on the math object of interest. These context menus enable you to perform many other operations as well, all in built-up form.

 

It wouldn’t be hard to add the shortcut values to the tooltips you see when mousing over the symbol displays. We’ve thought about this. The only trick is that they should correspond to the user’s math AutoCorrect choices which may differ from the default set. The latter are essentially the same as TeX’s.

People have been inquiring about Word RTF’s occasional use of the Unicode Private Use Area (PUA) characters in the range U+F020..U+F0FF. These codes are also used in WordProcessingML defined by the ECMA-376 standard. This post explains what Word means by those characters. But first note a couple of things:

 

1)      Unicode assigns no meaning to characters in the PUA, that is, those in the range U+E000..U+F8FF. So it’s up to a higher-level protocol to define the meaning. In general it’s a really bad idea to use the PUA if you’re interested in data interchange, because the program that reads such data may well display nothing or display completely different characters than you intended. That’s why it’s called “private use”, something for you and your friends who are in cahoots with you.

 

2)      The original syntax of an RTF control word defines the numeric parameter to be a signed 16-bit decimal number. For most control words that have a numeric parameter, Word does use a signed 16-bit decimal number. In particular, for the \uN Unicode control word, N has this format. If the high bit of a 16-bit number is 1, the number is negative and this is true for all codes in the range U+8000..U+FFFF. To get the RTF 16-bit signed decimal values, convert Unicode hex values to decimal and if greater than 32767, subtract 65536. Accordingly U+F020 is represented by \u-4064 and U+F0FF by \u-3841. It’s true that later on Word learned that 32-bit numbers exist and so some more recent RTF control words like \rsid (revision save IDs) have parameters much larger than 65536, let alone 32767 (the most positive 16-bit signed number). RichEdit even supports reading \uN with N being the decimal UTF-32 value corresponding to a surrogate pair (now isn’t that cool?!)
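
The signed 16-bit conversion in point 2 can be written out as a short Python sketch. The function names are made up for illustration, and the sketch covers only 16-bit code points, not the UTF-32 values that RichEdit can also read.

```python
def rtf_u_param(code_point: int) -> int:
    """Signed 16-bit decimal parameter N for \\uN, given a 16-bit code point."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("expected a code point in U+0000..U+FFFF")
    # Values above 32767 have the high bit set, so they wrap to negative.
    return code_point - 65536 if code_point > 32767 else code_point

def code_point_from_u_param(n: int) -> int:
    """Inverse mapping: add 65536 back to a negative \\uN parameter."""
    return n + 65536 if n < 0 else n

low = rtf_u_param(0xF020)    # -4064, i.e. \u-4064
high = rtf_u_param(0xF0FF)   # -3841, i.e. \u-3841
```

Code points below U+8000 pass through unchanged, so U+25CB stays 9675.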

 

Given the strong recommendation not to use the PUA, why would Word nevertheless go ahead and use it? If the choice were made today, I seriously doubt that Word would, but back in 1995 when Word started switching to Unicode, it wasn’t so obvious. Furthermore it solved a pesky problem with special non-Unicode fonts known as “symbol fonts”, or more precisely symbol-charset fonts. By their very definition, these fonts do not use Unicode code points. So while U+0041 stands for ‘A’ in a Unicode font, in a symbol-charset font like Wingdings, it stands for whatever character has hex code 0041, namely the Wingdings victory hand (✌). You must agree that ✌ looks nothing like ‘A’, so the Word 97 folks decided to give it a distinct value, namely F000 + 41 = F041. This is also the value that Microsoft TrueType symbol-charset fonts use in the Unicode cmap (character-to-glyph mapping table). Often a symbol-charset character is defined by a SYMBOL field with a character code in the range 20 to FF.

 

A key point here is that Word RTF may treat any symbol-charset character this way, so merely getting a character in the range U+F020..U+F0FF does not mean you know which symbol-charset font is involved. For that you need to find the last symbol-charset font control word \fN, look up font N in the font table and find its face name. The charset is specified by the \fcharsetN control word and the symbol-charset is N = 2. In contrast, RichEdit does not use U+F020..U+F0FF for characters in symbol-charset fonts; it uses the native values 0020 through 00FF, and both RichEdit and Word read the resulting RTF just fine. In many cases Word, too, uses the range 0020 through 00FF for symbol-charset font characters, so Word's use of F020 through F0FF isn't exclusive.
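
As a small illustration of the mapping between the two ranges, the following Python sketch converts a character in U+F020..U+F0FF back to its native symbol-font code. The function name symbol_char_code is hypothetical, and the sketch does not identify the font; as noted above, that still requires looking up the current \fN in the font table.

```python
from typing import Optional

def symbol_char_code(ch: str) -> Optional[int]:
    """Native 0x20..0xFF code for a U+F020..U+F0FF character, else None."""
    cp = ord(ch)
    # Word's convention is F000 + native code, so subtract F000 to undo it.
    return cp - 0xF000 if 0xF020 <= cp <= 0xF0FF else None

native = symbol_char_code("\uf041")   # 0x41, the slot 'A' occupies in the font
plain = symbol_char_code("A")         # None: already outside the PUA range
```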

 

For math probably the most relevant symbol-charset font is the Symbol font itself, since it has most of the Greek letters used in math along with some useful math operators and operator pieces. But since Unicode has nearly 100 times as many math characters and includes all 224 characters in the Symbol font, the Symbol font is basically useless for math at this point in time. Read: avoid it if you can :-)

 

This post summarizes what I said at the retirement ceremony for my long time collaborator and good friend Dr. Rick Shoemaker, Associate Dean, College of Optical Sciences, and Professor of Optical Sciences.

 

I’ll talk a bit on Rick and his love for microcomputers. Back in the 1970s, Rick regularly performed magic in nonlinear spectroscopy. For example, he used to pull beautiful photon echoes out of the noise. Photon echoes had been observed before, but with relative difficulty. Photon echoes are useful for measuring the lifetime of induced dipole coherence. How did he observe them? For a single pulse, pulse, echo measurement, the oscilloscope time trace would show nothing but noise. But Rick performed the measurement many, many times, adding the results. Since the noise was a random function of time and the echo was deterministic, the echo rose up out of a sea of noise.

 

Ordinarily such techniques involved expensive Tektronix or HP lab equipment. But Rick had a better, more fun way. He assembled a Z80 microcomputer and taught it to do the adding, any time night or day! It worked in spite of the fact that Rick had defied all odds and packed a whopping 64 KB RAM in his computer! At the time I had built a Z80 microcomputer to computerize my home and had been collaborating with Rick on some debugger software. We decided to pool our thoughts and write a book on interfacing microcomputers to the real world. We used the fledgling word processor I had coauthored (with Mike Aronson) in Z80 assembly language and typeset it on our microcomputers using a daisy-wheel printer. At that point, Rick started teaching his popular course on microcomputer interfacing based on the book.

 

Then in 1981, IBM faked us out with its remarkable, very non-IBM-like PC, and naturally we had to write a book on it and typeset the book on our new IBM PCs. The PCs included a Microsoft 32 KB ROM Basic, configured to graph things on a 320x200 color graphics adapter, but not on the nicer 720x350 monochrome adapter. Shortly thereafter, a start-up company named Hercules came out with a compatible monochrome card that included graphics functionality, but no support for Basic. Not to be discouraged, Rick and Chris Koliopoulos (the Ko of Wyko Corporation) fixed this lamentable limitation by copying the ROM Basic into the unused upper half of the Hercules card’s 64 KB video RAM, patching some locations, and bingo, they could graph things using Basic on the Hercules card. Naturally Hercules wanted this software really badly, paid the inventors accordingly, and bundled it with their card. We all had many fun times together at Comdex Shows and elsewhere.

 

Twelve years later Rick and I wrote an updated book on PCs of the time. Yet another 12 years have passed since then, but Rick and I have never lost our enthusiasm for microcomputers, old or new, although admittedly Rick has succeeded in controlling his computer addiction better than I.

Perfection is to be strived for but not attained

A software version of this saying is “shipping is a feature”. In general one wants to do the best possible job, but getting something accomplished is usually better than having nothing but unfinished work. I learned this lesson by watching my PhD advisor as he got older. He had become accustomed to producing stunning, nearly perfect physical theories, one of which led to a Nobel Prize (the Lamb Shift) and two of which were also of Nobel Prize quality (the theory of the Mössbauer effect, published 12 years before Mössbauer published his experimental observations, and the Lamb Dip). But as he got older he couldn’t bring himself to publish anything that didn’t meet his criteria for quality. As a result, the world was largely deprived of his latter-day insights. A software example is the math editing and display facility in Word 2007. It’s tantalizingly close to perfection and we think we know how to finish the job. But it’s missing some important ingredients such as the optimal line breaking algorithm, equation numbering, math Find/Replace, and OpenType enhancements such as ligatures. We hope to incorporate these features at some point; meanwhile the math facility in Word 2007 is fantastic, even though it’s not perfect. A related saying: life is a series of compromises.

At first it seems like magic, but it’s really just plain logic

My coauthor Rick Shoemaker and I came up with this saying when writing about microcomputers and digital circuitry. What microcomputers accomplished back in the late 1970s, let alone today, seemed really magical. These small computers (monsters compared to today’s laptops) could do so many amazing things. Imagine, we even wrote our microcomputer books on them along with some physics books on the microcomputers of the 1980s! Today most books and published writings in general are written on personal computers. But underneath, it’s all just plain logic.

Intuition is necessary but not sufficient

I came up with this saying watching some of my fellow physicists justify their latest theories. The phenomenon involved is “wishful thinking” and goes way beyond theoretical physics. The wording is a play on the standard mathematical statement, “A necessary and sufficient condition for … is …”. The point is that while intuition is incredibly valuable in the development of a new theory, before that theory can be trusted, it has to be subjected to merciless testing against observation and other known relationships. The same is true of computer programs as well as of theories that don’t involve the physical sciences. The trap is that one may feel a beautiful theory ought to be valid even if closer examination would reveal that it isn’t. Published invalid theories are ultimately discarded due to subsequent scientific scrutiny, but they remain in the scientific literature.

The art of the physicist is the use of approximation

It’s often said that mathematics is the queen of the sciences and it’s sometimes said that physics is the king who raped her. The king part is a terse way of summing up a basic fallacy in what some people think of physics. Many natural phenomena are describable by mathematics, but the descriptions are inevitably approximate. Too many factors enter to ever allow a physical law such as F = ma to apply exactly. So it’s up to the physicist to figure out which mathematical formalisms apply to physical phenomena and how accurately they apply. I came up with this saying in replying to a comment about a paper Marlan Scully and I wrote on the concept of the photon.

Every day things get better and by the time you die, they’ll be fantastic!

This is another saying Rick and I came up with in writing about microcomputers and their logic. Even though next year’s gadgets will be cooler and more powerful than today’s, if you keep waiting for the next great thing you’ll miss out on a wealth of experiences.

A good notation has a subtlety and suggestiveness which at times make it seem almost like a live teacher…and a perfect notation would be a substitute for thought

Bertrand Russell wrote this in his Introduction to Tractatus Logico-Philosophicus by Ludwig Wittgenstein. It’s kind of long for a saying, but it says it so well. While this saying is valuable advice for developing and documenting physical theories, I’ve used it also in attempting to make computer programs more understandable. Specifically if you’re programming a mathematical expression, it’s much clearer to use the original mathematical notation provided you can teach the computer how to understand it. The linear format for mathematics used in Word 2007 was inspired by this saying. Recently exciting steps along these lines have been made in the Fortress programming language.

[Some other favorite (and famous) sayings: give credit where credit is due (and the Golden Rule in general); make habit work for you; haste makes waste; those who have not studied history are destined to relive it; make sure you have something to show for your efforts; make everything as simple as possible, but not simpler; don’t bite the hand that feeds you; always tell the truth, but don’t go around telling it; don’t look a gift horse in the mouth; think positively; just do it; focus; enjoy!]

The Microsoft Math graphing calculator folks have created a Word 2007 add-in that lets you simplify, solve, calculate, and graph your equations in 2-D or 3-D. With it installed, your technical paper becomes alive. For example, your paper may have graphs of the formulae, but a reader wants graphs for different sets of parameters. She just right-clicks on a formula and a context menu appears listing the operations that Microsoft Math is able to perform on the formula. Select an option and a window appears with the results and offers the possibility to insert them into your Word document.

A side note is that the math displayed in the window is rendered by RichEdit. The math displayed in Word is rendered by Word. Of course, both programs use the LineServices math handler, so the rendering quality is almost identical.

This certainly isn’t the first time such a “smart canvas” was created with computational capabilities. Mathematica works in a similar way and offers considerably greater computational and graphical power. But it’s relatively specialized and expensive and the mathematical typography isn’t as good. Note that the approach used by Microsoft Math could be used by a Mathematica, MapleSoft, MathCad, etc., add-in, and hopefully such add-ins will be created. Then you could have the power of a Mathematica with the math typography and environment of Word 2007. It would be cool to generalize the approach so that any math engine could work seamlessly with a variety of Microsoft Office applications and Internet Explorer.

It’s very exciting. With the math add-in, students can learn math interactively as well as document their results using Word 2007. Professionals can carry out their R&D on screen in Word and ship off their publications as active documents. People reading the publications can click on formulas and obtain graphs for the scenarios they’re interested in or have steps in derivations filled in. Smart canvasses promise to revolutionize the way people learn mathematics as well as to streamline the production of technical results by professionals.

The STIX (Scientific and Technical Information eXchange) folks have a beta version of their math font. There are more math characters in the STIX fonts than in Cambria Math. The primary typeface is Times Roman. This post describes how you can examine the fonts and gives some reasons why they aren’t quite ready to use with Word 2007 and RichEdit 6. To get the fonts, go to http://www.stixfonts.org/ and follow the download instructions.

On Windows, you can use Asmus Freytag’s nifty Unibook program to look at the fonts. Unibook is the program that typesets the Unicode Standard’s code charts and it’s also great for checking out character properties. Here are Asmus’s instructions on how to use it to see the fonts and where the characters are located:

1) download Stix fonts, unpack, and drag/copy to Windows/Fonts folder to install
2) download Unibook (beta) from http://www.unicode.org/unibook
3) create a StixBeta.cfl file (see below)
4) run Unibook, open that cfl file using File / Open
5) Select Index View in View / Show As..
6) Check "Show Private Use Area" in the View / Show As.. dialog

StixBeta.cfl is a text file with these lines

STIXIntegralsDisplay,22
STIXSize1Symbols,22
STIXVariants,22
STIXNonUnicode,22
STIXGeneral,22
STIXGeneral,22,I
STIXGeneral,22,B
STIXGeneral,22,BI

Let’s examine the font STIXGeneral.otf. It has a math table, so Word 2007 does recognize it as a math font. But as soon as you type an English letter in a math zone, Word 2007 switches to Cambria Math because STIXGeneral.otf doesn’t have any math italic characters. If you don’t load the cfl file above (which shows all the characters), Unibook reveals that STIXGeneralItalic.otf does have math italic, but not math bold italic, or math bold.

The Unicode Technical Committee added the math alphanumerics primarily because without them plain text can destroy the semantics of mathematical expressions. The Hamiltonian example appearing in Sec. 2.2 of Unicode Technical Report #25 illustrates that plain text without math alphanumerics converts the Hamiltonian into an integral equation. For plain text to display such characters faithfully, they must all be in the same font, since plain text ignores rich-text attributes like bold and italic. Accordingly, Office 2007 software assumes all the math alphanumerics belong to a single font. So the first thing to fix is to put all the math alphanumerics into STIXGeneral.otf.
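
To make the point about plain text concrete, here is a minimal C sketch of how a styled Latin letter corresponds to a single math alphanumeric code point, so that the style survives without any rich-text attributes. The function name MathAlphanumeric and the MathStyle enum are made up for this illustration and are not part of any Office or RichEdit API. The code points come from the Unicode Mathematical Alphanumeric Symbols block, and only the bold, italic, and bold-italic Latin letters are covered.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical style selector for this sketch only */
typedef enum { MATH_BOLD, MATH_ITALIC, MATH_BOLD_ITALIC } MathStyle;

/* Map an ASCII letter plus a style to one Unicode math alphanumeric.
   Characters other than A-Z and a-z are returned unchanged. */
uint32_t MathAlphanumeric(char ch, MathStyle style)
{
    /* First code point of each run, indexed by MathStyle */
    static const uint32_t upperBase[] = { 0x1D400, 0x1D434, 0x1D468 };
    static const uint32_t lowerBase[] = { 0x1D41A, 0x1D44E, 0x1D482 };

    assert((int)style >= MATH_BOLD && (int)style <= MATH_BOLD_ITALIC);

    if (ch >= 'A' && ch <= 'Z')
        return upperBase[style] + (uint32_t)(ch - 'A');

    if (ch >= 'a' && ch <= 'z')
    {
        /* Math italic h is the Planck constant U+210E;
           the slot U+1D455 is reserved */
        if (style == MATH_ITALIC && ch == 'h')
            return 0x210E;
        return lowerBase[style] + (uint32_t)(ch - 'a');
    }
    return (uint32_t)(unsigned char)ch;
}
```

For instance, MathAlphanumeric('H', MATH_ITALIC) gives U+1D43B. Since the italic, bold, and bold-italic letters are all distinct characters, a font that lacks one of these runs cannot display the corresponding plain text faithfully.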

In addition, the various size glyph variants need to be accessible in this font. There are also some little errors. For example, U+2145..U+2149 (including the differential d, U+2146) have upright glyphs in STIXGeneral.otf. This error is probably related to the choice of putting all italic into an italic font, upright into an upright font, etc., which doesn’t agree with the Unicode Standard for math characters.

For nonmathematical text, such a separation is standard practice and the italic, bold, and bold-italic fonts are needed for such text. But these fonts shouldn’t have any math alphanumerics, since the latter belong in the math font.

Needless to say, we’re all very excited to see this excellent font family running with our software. It’d be great to have a choice of two math fonts and the STIX fonts have many less common math operators that are missing in Cambria Math.

A number of readers have asked how to use the RichEdit 6.0 shipped with Office 2007 to edit and display mathematical text. This post explains one way to do so. The code assumes that you already have an application that knows how to instantiate a RichEdit control with a window identified by hwndRE. The function ToggleMathZone() is designed to be hooked up to the Alt+= hot key, which Word 2007 uses to toggle a math zone on and off.

#include <windows.h>
#include <richedit.h>
#include <wchar.h>                      // For wcscpy()

#define CFE_MATH                0x10000000
#define CFM_MATH                CFE_MATH
#define MATH_LCID               0x0001007F
#define SCF_ONLYCFEFFECTS       0x0200  // Only set/get CF
#define EM_SETADJUSTTEXTPROC    (WM_USER + 234)

void ToggleMathZone(HWND hwndRE)
{
    CHARFORMAT2W cf = {0};

    cf.cbSize = sizeof(CHARFORMAT2W);
    SendMessage(hwndRE, EM_GETCHARFORMAT,
                SCF_SELECTION | SCF_ONLYCFEFFECTS, (LPARAM)&cf);
    cf.dwEffects ^= CFE_MATH;               // Toggle math zone

    if (cf.dwEffects & CFE_MATH)            // Turn on math zone
    {
        // Turn on LineServices
        SendMessage(hwndRE, EM_SETTYPOGRAPHYOPTIONS,
                    TO_ADVANCEDTYPOGRAPHY, TO_ADVANCEDTYPOGRAPHY);

        // Enable built-in math autocorrect
        SendMessage(hwndRE, EM_SETADJUSTTEXTPROC, 1, 0);

        // Setup math font and lcid
        cf.dwMask = CFM_FACE | CFM_LCID;
        wcscpy(cf.szFaceName, L"Cambria Math");
        cf.lcid = MATH_LCID;
        SendMessage(hwndRE, EM_SETCHARFORMAT, SCF_SELECTION, (LPARAM)&cf);
    }
    cf.dwMask = CFM_MATH;                   // Now set/clear the math effect itself
    SendMessage(hwndRE, EM_SETCHARFORMAT,
                SCF_SELECTION | SCF_ONLYCFEFFECTS, (LPARAM)&cf);
}

There is a somewhat strange message to control math transformations, namely EM_TRANSFORMMATH (WM_USER + 264). The wparam bits are defined by

 

#define XM_LINEARIZE                0
#define XM_BUILDUP                  1

#define XM_LINEARFORMAT             2
#define XM_TEX                      3

#define XM_NEEDTERMOP               4

#define XM_BMPALPHABETICS           8
#define XM_MATHALPHABETICS          9

#define XM_BUILDMATHZONES           16

#define XM_NOMATHAUTOCORRECT        32
#define XM_MATHAUTOCORRECT          33

So you should be able to select the range you want and then send EM_TRANSFORMMATH with wparam = XM_BUILDUP to build it up. The XM_BUILDUP/XM_LINEARIZE, XM_LINEARFORMAT/XM_TEX, XM_MATHALPHABETICS/XM_BMPALPHABETICS, and XM_NOMATHAUTOCORRECT/XM_MATHAUTOCORRECT pairs have to be handled in separate messages, since they all depend on bit 0. That’s why the message is somewhat strange. TeX isn’t supported in RichEdit 6.0, but the XM_TEX definition is included for completeness.
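
Here is a small C sketch of why the pairs can’t be combined in one wparam. It only does arithmetic on the values listed above, so it compiles without the Windows headers. The helper PlanTransformMessages is a made-up name for this illustration. It returns one wparam per pair, and the caller would send each in its own EM_TRANSFORMMATH message. The lparam to pass isn’t described in this post, so it is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XM_LINEARIZE            0
#define XM_BUILDUP              1
#define XM_LINEARFORMAT         2
#define XM_TEX                  3
#define XM_BMPALPHABETICS       8
#define XM_MATHALPHABETICS      9
#define XM_NOMATHAUTOCORRECT    32
#define XM_MATHAUTOCORRECT      33

// ORing values from two pairs is ambiguous because both use bit 0.
// Returns true if a | b is indistinguishable from some other request.
bool XmCombinationIsAmbiguous(unsigned a, unsigned b)
{
    unsigned merged = a | b;
    // After merging, bit 0 can no longer be attributed to a or b alone
    return (a & 1u) != (b & 1u) || merged == XM_TEX;
}

// Hypothetical helper: produce one wparam per pair, each meant for
// its own EM_TRANSFORMMATH message. Returns the number of wparams.
size_t PlanTransformMessages(bool buildUp, bool mathAlphabetics,
                             bool autoCorrect, unsigned wparams[3])
{
    assert(wparams != NULL);
    wparams[0] = buildUp         ? XM_BUILDUP         : XM_LINEARIZE;
    wparams[1] = mathAlphabetics ? XM_MATHALPHABETICS : XM_BMPALPHABETICS;
    wparams[2] = autoCorrect     ? XM_MATHAUTOCORRECT : XM_NOMATHAUTOCORRECT;
    return 3;
}
```

For example, XM_BUILDUP | XM_LINEARFORMAT equals 3, which is XM_TEX, and XM_LINEARIZE | XM_MATHALPHABETICS equals 9, so the linearize request would silently vanish. Sending the three values one at a time avoids both problems.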

 

TOM2 has better methods, but it requires a new tom.h, more documentation, etc., and we don’t have a current plan for releasing this info.

 

The Office 2007 RichEdit 6.0 shipped without having math RTF conversions enabled. So if you want to save a file with math in it, you need to use RichEdit’s binary file format or convert the math zones to linear format. The binary format is used by the EM_STREAMIN and EM_STREAMOUT messages if SF_RTF is replaced by SF_BINARY defined by

 

#define SF_BINARY       0x0008

Only RichEdit 5.0 and later versions understand RichEdit’s binary file format and RichEdit 5.0 doesn’t understand RichEdit 6.0’s math properties. The next version of RichEdit understands math RTF as well.

 

Paul Libbrecht commented that there’s more to selection in math text than discussed in my first post on this subject. As usual, Paul is right. That post explains how one or more characters and/or math objects are selected. In addition the topic of selection includes insertion-point behavior, which by definition selects no characters or objects. The insertion point simply indicates where the next character you type or the next thing you paste will be inserted.

The present post describes how the left and right arrow keys work in and next to math zones. These keys traverse the logical backing store, generally moving one character at a time. For example, typing the right arrow key successively starting at the left of the equation E = mc² moves past (1) E, (2) =, (3) m, (4) superscript-object start delimiter, (5) c, (6) superscript-object separator, (7) 2, and (8) superscript-object end delimiter. Successive Shift right arrows starting at the left of E = mc² also move past the first three characters while including them in the current selection. Step (4) is different for Shift right arrow, since it selects the whole object, i.e., up through (8).
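
The traversal just described can be modeled in a few lines of C. This is only a sketch of the behavior, not RichEdit’s code. The backing store is an int array, the delimiter codes OBJ_START, OBJ_SEP, and OBJ_END are placeholders chosen for this example, and the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Placeholder delimiter codes used only in this sketch
enum { OBJ_START = 1, OBJ_SEP = 2, OBJ_END = 3 };

// Return the position just past the math object that starts at
// store[cp], allowing for nested objects. If the matching end
// delimiter is missing, return len.
size_t SkipObject(const int *store, size_t len, size_t cp)
{
    size_t depth = 0;

    assert(cp < len && store[cp] == OBJ_START);
    for (; cp < len; cp++)
    {
        if (store[cp] == OBJ_START)
            depth++;
        else if (store[cp] == OBJ_END && --depth == 0)
            return cp + 1;
    }
    return len;
}

// Return the new active-end position after a right arrow at cp.
// Plain right arrow moves one character. Shift right arrow at an
// object start delimiter takes in the whole object.
size_t RightArrow(const int *store, size_t len, size_t cp, bool shift)
{
    if (cp >= len)
        return len;
    if (shift && store[cp] == OBJ_START)
        return SkipObject(store, len, cp);
    return cp + 1;
}
```

With the eight-element store E, =, m, OBJ_START, c, OBJ_SEP, 2, OBJ_END, a plain right arrow at position 3 moves to position 4, whereas Shift right arrow at position 3 moves to position 8.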

Now consider a substantially different behavior, namely what these arrow keys do at the edge of a math zone. Math zones are represented by a character formatting attribute like bold. Math zones don’t begin and end with delimiter characters the way math objects do. Nevertheless, it’s desirable to make it seem as though they do, so that you can easily insert something immediately before and after a math zone, and at the beginning and end of a math zone.

To illustrate how this works, suppose the insertion point (IP) is at the start of a line immediately following a displayed equation. Typing the left arrow key moves the IP before the ASCII CR (U+000D) that terminates the displayed equation. But the nice acetate rectangle that surrounds an equation when the IP is inside the equation does not appear. Even though the IP immediately follows the math zone, it’s not in the math zone. At this point if you type a character, it won’t be in the math zone and the displayed equation will be converted to an inline equation with correspondingly compressed typography.

Typing the left arrow key another time moves the IP into the math zone and the acetate rectangle appears. This left arrow key didn’t bypass any character; it only changed the selection IP to have the math zone character formatting property. Consequently if you now type a character it will be inside the displayed equation and be formatted according to math-zone rules. In summary, to move the IP from outside a math zone to inside the zone or vice versa, type the appropriate arrow key, but no character is bypassed. Only the selection’s math-zone property is changed. So it feels as though math zones have start and end delimiters even though they don’t.
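
Here is a minimal C model of this edge behavior. It’s a sketch under simplifying assumptions and not the actual RichEdit logic. Each character has a single math flag, math objects are ignored, and the InsertionPoint struct and function names are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical insertion point: a position plus the math-zone
// formatting property that the next typed character would get
typedef struct
{
    size_t cp;      // IP sits just before character cp
    bool   inMath;  // true if the IP is inside a math zone
} InsertionPoint;

// Left arrow. isMath[i] is true if character i is in a math zone.
void MoveLeft(const bool *isMath, InsertionPoint *ip)
{
    bool before = ip->cp > 0 && isMath[ip->cp - 1];

    if (ip->inMath != before)
    {
        ip->inMath = before;    // Cross the zone edge; bypass no character
        return;
    }
    if (ip->cp > 0)
        ip->cp--;               // Ordinary move past one character
}

// Right arrow, the mirror image of MoveLeft
void MoveRight(const bool *isMath, size_t len, InsertionPoint *ip)
{
    bool after = ip->cp < len && isMath[ip->cp];

    if (ip->inMath != after)
    {
        ip->inMath = after;     // Cross the zone edge; bypass no character
        return;
    }
    if (ip->cp < len)
        ip->cp++;               // Ordinary move past one character
}
```

Take four math characters followed by a CR and one more character, with the IP before that last character. The first MoveLeft passes the CR, the second only turns on inMath, and the third moves past the last math character. That matches the feel of start and end delimiters even though no such characters are stored.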

Implementation advantages to using the math-zone character formatting effect include automatic merging of adjacent math zones and prevention of nested math zones. On the other hand, it complicates the arrow-key logic. I suspect the code might have been simpler if math zones had start and end delimiters, just the way the math objects do.
