Custom XML markup

Published 26 March 07 01:16 AM

I recently covered an example of how to implement custom XML parts in Open XML documents, and in this post we're going to take a look at another way Open XML supports custom schemas: custom XML markup. With custom XML parts, you have an XML instance embedded in the document as a discrete part, but with custom XML markup there is no custom XML part -- instead, the visual content of the document is tagged with elements from your schema.

Custom XML parts and custom XML markup both support any schema, enabling interoperability with other systems through the schemas they already use. But there are some differences as well, which developers should consider when deciding which approach to use.

Custom XML parts can be bound to structured document tags ("content controls"), which provides for 2-way binding between presentation elements and the underlying nodes in a custom XML instance. This provides a very clean separation of presentation and data, and for that reason custom XML parts are the preferred approach in general.

Custom XML markup, by contrast, is a mechanism for tagging content within the body of a document according to an arbitrary schema. This interleaves the business semantics with the presentation data, which -- although the norm in many document formats -- isn't ideal from an architectural point of view because it doesn't separate presentation and data.

There is one common type of scenario where custom XML markup is a good fit: an existing document whose content needs to be tagged with business semantics. For example, consider a document that may have been created years ago, which needs to be logged into a content management system that will read metadata such as a document abstract from the document. In that scenario, custom XML markup could be used to identify the abstract, and the content management system could then use a simple XPath expression to find the abstract within the body of the document.
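
As a minimal sketch of that lookup, the Python below runs an XPath query over a WordprocessingML body using only the standard library. The element name "abstract" and the sample fragment are assumptions made up for illustration; a real content management system would read word/document.xml out of the package instead.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Open XML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical document body; "abstract" is an assumed element name.
SAMPLE_BODY = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Quarterly Report</w:t></w:r></w:p>
    <w:customXml w:element="abstract">
      <w:p><w:r><w:t>Sales rose in </w:t></w:r><w:r><w:t>every region.</w:t></w:r></w:p>
    </w:customXml>
  </w:body>
</w:document>
""".strip()

def find_tagged_text(document_xml, element_name):
    """Return the text of every customXml node tagged with element_name."""
    root = ET.fromstring(document_xml)
    xpath = f".//w:customXml[@w:element='{element_name}']"
    results = []
    for node in root.findall(xpath, NS):
        # Only w:t elements carry visible text; join the runs under the tag.
        results.append("".join(t.text or "" for t in node.iter(f"{{{W_NS}}}t")))
    return results

abstracts = find_tagged_text(SAMPLE_BODY, "abstract")
print(abstracts)
```

The query matches on the w:element attribute rather than on an element name, because the custom schema's element names live in attribute values of customXml.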

Content Tagging through the Word UI

The Word user interface allows for content tagging, and if you're going to write code that implements custom XML markup it's a good idea to play around with this concept through the Word UI first to get a feel for how it works. Here are the steps to follow:

  1. Make sure you have the Developer tab visible on the ribbon. (It's not there by default, so if you don't see a Developer tab, click the Office logo button, then Word Options, then make sure "Show Developer tab in the Ribbon" is selected.)
  2. Open or create a document -- anything will do, you just need some content.
  3. Click the Developer tab, then Schema (in the XML group), and add/attach a schema to your document.

Now you're ready to tag content. Select some text in your document, and then click on an element in the structure pane to the right to tag the selected text. Here's an example of what it looks like in action:

The structure pane and the document content both show the current selection, and there is live 2-way linking between them: click on a tag in the document and the node in the schema gets highlighted, or vice versa. This makes it easy to see the structure of the semantics that you're adding to the document.

Under the Hood

From a user perspective, that's all there is to custom XML markup. From a developer perspective, however, we need to understand the underlying details of how the content is being tagged so that we can write code to retrieve the nodes of our custom schema.

To illustrate those details, I've attached the simple sample Open XML document shown in the diagram above. Here's what you'll find within the body element (in document.xml in the word folder, since that's where Word puts the "start part" by default):

Note how the customXml element is used to wrap the runs that were selected and tagged. This allows the document body to remain 100% valid WordprocessingML, since customXml is a valid element in WordprocessingML. The custom schema's elements are added as "element" attribute values of the customXml element. If you're familiar with how microformats allow for tagging of content within an existing XML document without adding new elements, this approach will look quite familiar. :-)
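
As a rough illustration of that wrapping (not the attached sample itself, and not how Word does it internally), the Python sketch below tags runs in a paragraph by moving them into a new customXml element. The element name "author", the sample text, and the helper names w and tag_runs are all hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the conventional w: prefix on output

def w(name):
    """Expand a local name into the fully qualified WordprocessingML name."""
    return f"{{{W_NS}}}{name}"

def tag_runs(paragraph, start, end, element_name):
    """Wrap the children paragraph[start:end] in a customXml element."""
    # Simplification: assumes every child of the paragraph is a run.
    runs = list(paragraph)[start:end]
    wrapper = ET.Element(w("customXml"), {w("element"): element_name})
    for run in runs:
        paragraph.remove(run)
        wrapper.append(run)
    # The wrapper takes the position the first moved run used to occupy.
    paragraph.insert(start, wrapper)
    return wrapper

# Hypothetical two-run paragraph; only the second run gets tagged.
paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t xml:space="preserve">Written by </w:t></w:r>'
    "<w:r><w:t>Jane Doe</w:t></w:r>"
    "</w:p>"
)
tag_runs(paragraph, 1, 2, "author")
tagged_xml = ET.tostring(paragraph, encoding="unicode")
print(tagged_xml)
```

The runs themselves are untouched; only their parent changes, which is why the result is still ordinary WordprocessingML.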

A Messy Little Detail

If you tag content through the Word UI as described above, you must use a schema. And in most applications, you're going to have a schema -- why would you want to tag content if you don't have a schema? But the schema is technically optional, and I've not used a schema in the attached sample document, to illustrate something you should be aware of if you're programmatically generating documents that have tagged content but no custom schema.

When there is an attached schema for custom XML markup, two things are added to the document: an attribute of the customXml elements refers to the schema, and the schema is also listed in an attachedSchema element within the document settings part. Word expects to find that attachedSchema element in the document settings any time it sees custom XML markup. If there's no schema, the attachedSchema value can be blank, but the element still needs to appear or Word won't open the document. (Yes, this is a bug, and the product team is aware of it.)

That means in order to have Word 2007 open a document with custom XML markup and no schema, you need to have a document settings part with this element in it:

<w:attachedSchema w:val="" />

You can see the other implementation details, such as the relationship type and content type for the settings part, in the attached document. I created that document manually with Notepad and WinZip, and Word will open it just fine.
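
If you'd rather script that than use Notepad and WinZip, here is a minimal Python sketch that assembles a comparable package with the standard library. The part names, content types and relationship types are the standard Open XML ones; the document text and the "abstract" element name are made up for illustration. The sketch has not been checked against Word, so treat it as a starting point rather than a reproduction of the attached file.

```python
import io
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PKG_RELS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_RELS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
CT = "application/vnd.openxmlformats-officedocument.wordprocessingml"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

PARTS = {
    "[Content_Types].xml": (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        f'<Override PartName="/word/document.xml" ContentType="{CT}.document.main+xml"/>'
        f'<Override PartName="/word/settings.xml" ContentType="{CT}.settings+xml"/>'
        "</Types>"
    ),
    # Package-level relationship pointing at the start part.
    "_rels/.rels": (
        f'<Relationships xmlns="{PKG_RELS}">'
        f'<Relationship Id="rId1" Type="{DOC_RELS}/officeDocument" Target="word/document.xml"/>'
        "</Relationships>"
    ),
    # Relationship from the main document to its settings part.
    "word/_rels/document.xml.rels": (
        f'<Relationships xmlns="{PKG_RELS}">'
        f'<Relationship Id="rId1" Type="{DOC_RELS}/settings" Target="settings.xml"/>'
        "</Relationships>"
    ),
    "word/document.xml": (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        '<w:customXml w:element="abstract">'  # hypothetical name, no schema
        "<w:r><w:t>Tagged text.</w:t></w:r>"
        "</w:customXml></w:p></w:body></w:document>"
    ),
    "word/settings.xml": (
        f'<w:settings xmlns:w="{W_NS}">'
        '<w:attachedSchema w:val=""/>'  # present even though it is empty
        "</w:settings>"
    ),
}

def build_package():
    """Zip the parts into an in-memory Open XML package and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml in PARTS.items():
            package.writestr(name, XML_DECL + xml)
    return buffer.getvalue()

docx_bytes = build_package()
print(len(docx_bytes), "bytes")
```

Writing docx_bytes to a file with a .docx extension gives you something to try opening in Word.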

Programmatically generating custom XML markup with no schema attached isn't a very realistic scenario, as I mentioned earlier, but I just wanted to point out this detail so you'll know how to work around it if needed.
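
If you do generate documents this way, one way to catch the problem before Word does is a small pre-flight check. The Python sketch below is only an illustration: the function name missing_attached_schema and the sample strings are hypothetical, and it assumes you already have the document and settings parts as XML strings.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def missing_attached_schema(document_xml, settings_xml):
    """True if the body has customXml markup but settings lacks attachedSchema."""
    body = ET.fromstring(document_xml)
    has_markup = next(body.iter(f"{{{W_NS}}}customXml"), None) is not None
    if not has_markup:
        return False  # nothing tagged, so the element is not needed
    if settings_xml is None:
        return True  # no settings part at all
    settings = ET.fromstring(settings_xml)
    return settings.find(f"{{{W_NS}}}attachedSchema") is None

# Hypothetical parts for demonstration.
doc = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:customXml w:element="abstract"><w:r><w:t>x</w:t></w:r></w:customXml>'
    "</w:p></w:body></w:document>"
)
bare_settings = f'<w:settings xmlns:w="{W_NS}"/>'
fixed_settings = f'<w:settings xmlns:w="{W_NS}"><w:attachedSchema w:val=""/></w:settings>'

print(missing_attached_schema(doc, bare_settings))
print(missing_attached_schema(doc, fixed_settings))
```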

by dmahugh
Attachment(s):CustomXmlMarkup.docx

Comments

# Krishna said on April 9, 2007 7:54 PM:

Word 2003 had a built-in option to save the XML data only. Is there a similar option available in Word 2007? I was searching for it but couldn't find one. If I remember correctly, Word 2003 did that with an XSL; a similar XSL would sure be helpful.

# gedw99 said on April 13, 2007 11:08 AM:

Hey,

When I inspect the document there is nowhere that the schema is defined.

Where is it? It has to be somewhere.

I also can't find it in the UI of Office.

There has to be an XSD somewhere.

gedw

# Doug Mahugh said on May 19, 2007 11:21 PM:

I've seen some signs of confusion about custom schema support lately. For example, I've seen a vendor

# Blog de Neodante (Julien Chable) said on May 22, 2007 3:50 AM:

I would have liked to write a full post on this subject, but time is in short supply at the moment ... because of

# Open XML said on June 7, 2007 9:11 AM:

Open XML, like many document formats, allows the use of extensible metadata. We

# Blog de Neodante (Julien Chable) said on October 16, 2007 5:14 PM:

Here is the latest post on the 10 questions put to Microsoft France. You are wondering why

# Blog de Neodante (Julien Chable) said on October 25, 2007 3:56 AM:

Here is the latest post on the 10 questions put to Microsoft France. You will surely wonder why

# Doug Mahugh said on March 31, 2008 6:22 PM:

Like many people, I thought we'd know the official outcome of the DIS 29500 process today, but it looks

# cnblogs.com said on April 1, 2008 11:30 AM:

Open XML Resources for Developers Published 31 March 08 03:20 PM Like many people, I thought we'd
