If you read Part 1 of the Word XML Introduction, you saw the basics behind a Word document, as well as how basic formatting can be applied. The Word XML schemas were designed to closely map the structures that Word uses internally to represent a document. A Word document is essentially a collection of text runs. Each text run has a collection of properties that describe how that text should be displayed. Often a text run is very long and is broken only where the paragraph ends. If the formatting changes partway through the text, though, the run has to be broken at that point to account for the new formatting. The reason for this is that Word doesn't apply formatting in a cascading way, as is done in HTML. In Word, for the most part, the formatting applied to text comes either from properties assigned to the paragraph or from properties assigned to the text directly.
Let's take an example to try to make this clear. As we saw in Part 1 of the Word XML Introduction, simple text like this: "My name is Brian Jones" would look like this in WordprocessingML:
<w:p>
  <w:r>
    <w:t>My name is Brian Jones</w:t>
  </w:r>
</w:p>
I also showed how you could apply bold formatting to that entire run of text. What if we wanted to apply bold formatting to just a couple of words, though, so that only "name is" appears in bold? This is where you'll see differences between Word XML and HTML. In HTML, there would just be a <b> tag thrown around the text "name is". In Word, by applying that formatting, we've now created three separate runs of text. The WordprocessingML for this will look like:
<w:p>
  <w:r>
    <w:t>My</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>name is</w:t>
  </w:r>
  <w:r>
    <w:t>Brian Jones</w:t>
  </w:r>
</w:p>
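If you'd rather generate this markup than type it, here is a minimal Python sketch of the same idea: one w:r per stretch of text that shares the same formatting. The helper names (`w`, `build_paragraph`) and the (text, bold) pair format are my own illustrative assumptions, not part of any Word API. The namespace is the Word 2003 one used later in this post.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Word 2003 XML format.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def w(tag):
    """Return the qualified name of a tag in the w: namespace (hypothetical helper)."""
    return f"{{{W}}}{tag}"


def build_paragraph(segments):
    """Build a w:p element from (text, bold) pairs, creating one w:r per pair."""
    p = ET.Element(w("p"))
    for text, bold in segments:
        r = ET.SubElement(p, w("r"))
        if bold:
            # Direct formatting lives in the run's own w:rPr element.
            rpr = ET.SubElement(r, w("rPr"))
            ET.SubElement(rpr, w("b"))
        t = ET.SubElement(r, w("t"))
        t.text = text
    return p


# Three stretches of formatting means three runs.
paragraph = build_paragraph([("My", False), ("name is", True), ("Brian Jones", False)])
print(ET.tostring(paragraph, encoding="unicode"))
```

The point to notice is that the bold flag never wraps other runs the way a <b> tag would in HTML. Each run carries its own properties, so a change in formatting always means a new run.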
Go ahead and try opening this in Word. Make sure you also include the wordDocument and body tags, as we did in Part 1 of the Word XML Introduction.
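If you're assembling the file with a script, the wrapping step might look something like the sketch below. The function name `wrap_in_document` is hypothetical, and the wrapper only adds the wordDocument and body elements with the Word 2003 namespace. Parsing the result back is just a quick check that the file is well-formed XML.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Word 2003 XML format.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def wrap_in_document(paragraph_xml):
    """Wrap w:p markup in w:wordDocument and w:body (hypothetical helper)."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<w:wordDocument xmlns:w="{W}">\n'
        "<w:body>\n"
        f"{paragraph_xml}\n"
        "</w:body>\n"
        "</w:wordDocument>\n"
    )


fragment = (
    "<w:p>"
    "<w:r><w:t>My</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>name is</w:t></w:r>"
    "<w:r><w:t>Brian Jones</w:t></w:r>"
    "</w:p>"
)
document = wrap_in_document(fragment)

# Parse the bytes back to confirm the result is well-formed.
root = ET.fromstring(document.encode("utf-8"))
print(root.tag)
```

From there you could write `document` to a file with a .xml extension and open it in Word.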
Did you notice any problems when you opened that file in Word? If you look closely (or maybe you see it clearly), there are no spaces between the runs, so the text looks like this: "Myname isBrian Jones". We just need to add some trailing whitespace to the first two runs, and also specify that the XML parser should preserve our whitespace. Update the XML so that it looks like this:
<w:p>
  <w:r>
    <w:t xml:space="preserve">My </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t xml:space="preserve">name is </w:t>
  </w:r>
  <w:r>
    <w:t>Brian Jones</w:t>
  </w:r>
</w:p>
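You can set xml:space="preserve" on just the w:t elements where it matters, or you can declare it once globally on an enclosing element such as w:wordDocument, in which case it applies to everything inside that element. It's up to you.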
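For the per-run approach, a small script can decide where the attribute is needed. The sketch below is one way to do it in Python, assuming the rule "mark any w:t whose text starts or ends with whitespace". That rule and the function name `mark_significant_whitespace` are my own illustration, not something Word requires.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Word 2003 XML format.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
# The xml: prefix is always bound to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def mark_significant_whitespace(root):
    """Set xml:space="preserve" on each w:t with leading or trailing whitespace.

    Returns the number of elements that were marked.
    """
    changed = 0
    for t in root.iter(f"{{{W}}}t"):
        text = t.text or ""
        if text != text.strip():
            t.set(XML_SPACE, "preserve")
            changed += 1
    return changed


source = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>My </w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>name is </w:t></w:r>"
    "<w:r><w:t>Brian Jones</w:t></w:r>"
    "</w:p>"
)
paragraph = ET.fromstring(source)
marked = mark_significant_whitespace(paragraph)
print(marked)
print(ET.tostring(paragraph, encoding="unicode"))
```

Only the first two runs end in a space, so only those two get the attribute. The last run is left alone.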
That's how you apply formatting directly to text. Another way of applying formatting is by creating a style and referencing that style from the paragraph or from the run of text. Let's create a paragraph style that has red font coloring, and we'll then reference that style from the paragraph by updating the paragraph properties:
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="myCustomStyle">
      <w:name w:val="myCustomStyle" />
      <w:rPr>
        <w:color w:val="FF0000" />
      </w:rPr>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="myCustomStyle" />
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">My </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">name is </w:t>
      </w:r>
      <w:r>
        <w:t>Brian Jones</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
Now if you open that file in Word, you should see "My name is Brian Jones" in red, with "name is" still in bold. We're moving in baby steps here, folks...
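To make the two sources of formatting concrete, here is a rough Python sketch that reads a document like the one above and works out, for each run, the properties it gets from the paragraph style plus its own direct formatting. This is a deliberate simplification and not how Word actually resolves formatting: it ignores default styles, style inheritance, and character styles, and the helper names `props` and `effective_runs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Word 2003 XML format.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

DOCUMENT = f"""<w:wordDocument xmlns:w="{W}">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="myCustomStyle">
      <w:name w:val="myCustomStyle"/>
      <w:rPr><w:color w:val="FF0000"/></w:rPr>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="myCustomStyle"/></w:pPr>
      <w:r><w:t xml:space="preserve">My </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">name is </w:t></w:r>
      <w:r><w:t>Brian Jones</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def props(rpr):
    """Turn a w:rPr element into {local name: w:val, or True when there is no w:val}."""
    if rpr is None:
        return {}
    return {child.tag.split("}")[1]: child.get(f"{{{W}}}val", True) for child in rpr}


def effective_runs(root):
    """Return (text, properties) per run: paragraph style first, then direct formatting."""
    styles = {
        s.get(f"{{{W}}}styleId"): props(s.find("w:rPr", NS))
        for s in root.findall("w:styles/w:style", NS)
    }
    result = []
    for p in root.findall("w:body/w:p", NS):
        pstyle = p.find("w:pPr/w:pStyle", NS)
        base = styles.get(pstyle.get(f"{{{W}}}val"), {}) if pstyle is not None else {}
        for r in p.findall("w:r", NS):
            # Direct run properties are layered on top of the style's properties.
            merged = {**base, **props(r.find("w:rPr", NS))}
            t = r.find("w:t", NS)
            result.append((t.text if t is not None else "", merged))
    return result


runs = effective_runs(ET.fromstring(DOCUMENT))
for text, formatting in runs:
    print(repr(text), formatting)
```

All three runs pick up the red color from myCustomStyle, and only the middle run adds bold on top of it, which matches what you see when you open the file.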
Here are a couple final things you should notice from this example:
That's it for now. I'm still trying to decide what to talk about next. Most likely it will be either another approach to opening XML data in Excel (different from Part 2) or working with custom-defined schemas in Word. I've been focusing a lot on Office 2003 since that's what is available today, but if there are other topics you'd like to hear more about in relation to the Office 12 formats, let me know.
-Brian