Accessing Open XML Document Parts with the Open XML SDK

Published 30 July 08 07:30 PM

About a month ago the Open XML SDK 1.0 (June 08 update) was released after spending a while as a CTP. The SDK provides strongly typed document part access to Word 2007, Excel 2007 and PowerPoint 2007 documents. I installed this baby last week, started playing around with it, and found it really easy to use after briefly looking at the documentation. The How Do I section is a great place to start.

Upgrading the Letter Generator

I decided to upgrade my Word 2007 letter generator program to use the SDK to manipulate the packages. Remember that Office 2007 documents are really just ZIP archives, so if you rename one to .zip you can take a look at the contents of the package. The Open XML package spec defines a set of XML files that hold the content and define the relationships between all of the document parts stored in a single package. To manipulate them programmatically you can use the raw System.IO.Packaging namespace, but the SDK's DocumentFormat.OpenXml.Packaging namespace is much easier to work with.
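To get a feel for what is inside a package without renaming it, here is a minimal sketch that lists every part using the raw System.IO.Packaging API. It assumes a reference to WindowsBase and a file named MyDocument.docx in the current directory; the file name and module name are just placeholders for illustration.

```vb
Imports System.IO
Imports System.IO.Packaging

Module ListPackageParts
    Sub Main()
        'Placeholder file name; point this at any .docx you have on hand
        Dim docFile = Path.Combine(Directory.GetCurrentDirectory(), "MyDocument.docx")

        'Open read-only so the document cannot be modified by accident
        Using p As Package = Package.Open(docFile, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In p.GetParts()
                'Each part has a URI inside the archive and a content type
                Console.WriteLine("{0} ({1})", part.Uri, part.ContentType)
            Next
        End Using
    End Sub
End Module
```

For a simple Word document you should see /word/document.xml in the list, along with the relationship (.rels) parts that tie everything together.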

My mail merge program uses XML literals to construct XML for the document part of a Word 2007 file based on data in the Northwind database. The LINQ query was a piece of cake compared to figuring out how to manipulate the .docx package in order to replace the document.xml (called the MainDocument) part. Not that the final code is particularly long; it was just a pain to figure out. The SDK not only saved me a few lines of code, it made the code much more readable and took only a few minutes to write. (I updated the code for the WordMailMerge program on Code Gallery.)

Getting Started with the Open XML SDK

Let's take another simple example that constructs a MainDocument part using XML literals and then replaces it in a .docx package using the SDK. This time I'll focus on the code that manipulates the Open XML package with the SDK, not on the particulars of XML literals. The first thing I recommend is to install the VSTO Power Tools so you can open Office 2007 documents and manipulate the parts directly in the Visual Studio IDE using the Open XML Package Editor, like I showed in my last post.

Of course you'll also need to install the SDK, which places the DocumentFormat.OpenXml.dll assembly into your GAC. Add a reference to this assembly in your project. As an aside, when XCOPY deploying to a machine that already has the .NET Framework on it, just make sure you deploy the DocumentFormat.OpenXml.dll assembly alongside your application to avoid having to install the SDK on the target machine. The easiest thing to do is select "Show All Files" in the Solution Explorer, expand the References, and on the Properties for the DocumentFormat.OpenXml reference set "Copy Local" = True. This will place a private copy of the assembly next to your application when it's built.

Now create a new Word 2007 document with some simple text in it, for instance, type: "This is my document" then save it and add the .docx file to your Visual Basic project. Double-click on it and that opens the Open XML Package Editor.

We can manipulate the parts through this editor if we want to, but what I really want to do is replace the document.xml with one we create ourselves using XML literals and embedded expressions. Double-click on the document.xml to open the MainDocument part in the XML Editor. (If the XML editor opens and the XML is all on one line with no breaks, just select all the contents, cut, then paste it back into the editor and it will put the proper line breaks in there for you: Ctrl+A, Ctrl+X, Ctrl+V.)

For this simple example, let's place the executing user's name into the document. Create the XML Literal and an embedded expression by pasting the document.xml into the VB Editor and adding an expression to print out the executing user's name:

Dim myDoc = <?xml version="1.0" encoding="utf-8" standalone="yes"?>
            <w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
               xmlns:o="urn:schemas-microsoft-com:office:office"
               xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
               xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
               xmlns:v="urn:schemas-microsoft-com:vml"
               xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
               xmlns:w10="urn:schemas-microsoft-com:office:word"
               xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
               xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
               <w:body>
                   <w:p w:rsidR="00DD17EB" w:rsidRDefault="00361264">
                       <w:r>
                           <w:t>This is <%= Environment.UserName %>'s document</w:t>
                       </w:r>
                   </w:p>
                   <w:sectPr w:rsidR="00DD17EB" w:rsidSect="00DD17EB">
                       <w:pgSz w:w="12240" w:h="15840"/>
                       <w:pgMar w:top="1440" w:right="1440" w:bottom="1440"
                           w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
                       <w:cols w:space="720"/>
                       <w:docGrid w:linePitch="360"/>
                   </w:sectPr>
               </w:body>
           </w:document>

Replacing the MainDocument Part

Before the SDK, to replace the MainDocument part in the package we had to figure out the right content type and write the code that deleted the old part and then added the new one. We also needed to add a reference to WindowsBase (a 3.0 assembly) in order to access the System.IO.Packaging namespace.

Imports System.IO.Packaging
Imports System.IO
...
'**** Without OpenXML SDK
Dim uri As New Uri("/word/document.xml", UriKind.Relative)
Dim contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
Dim docFile = CurDir() & "\MyDocument.docx"

Using p As Package = Package.Open(docFile)
    'Delete the current document.xml file
    p.DeletePart(uri)

    'Replace that part with our XDocument
    Dim replace As PackagePart = p.CreatePart(uri, contentType)
    Using sw As New StreamWriter(replace.GetStream())
        myDoc.Document.Save(sw)
    End Using
End Using

For this example it's pretty easy; however, if you add or remove parts it's up to you to update the relationships in the package, and this isn't an easy task using this raw API. Enter the Open XML SDK. Now we don't need to add a reference to WindowsBase, only to DocumentFormat.OpenXml, and import the Packaging namespace contained within. Then our code can access the parts of the document in a strongly-typed way:

Imports DocumentFormat.OpenXml.Packaging
Imports System.IO
...
'***** Use the OpenXML SDK for easier access to parts
Dim docFile = CurDir() & "\MyDocument.docx"

Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Using wordDoc
    'Replace the document part with our XML
    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        myDoc.Document.Save(sw)
    End Using
End Using

After we run this code you'll see that the MainDocument part now has the user name in the document body as described by our XML literal.
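One detail worth calling out is the FileMode.Create argument passed to GetStream in the code above. The sketch below illustrates why it matters, using an ordinary temporary file as a stand-in for the part stream. That stand-in is an assumption for illustration only; the part stream is not literally a FileStream, but the truncate-versus-overwrite behavior it demonstrates is the point.

```vb
Imports System.IO

Module FileModeDemo
    Sub Main()
        'Scratch file standing in for a package part (illustration only)
        Dim tempFile = Path.GetTempFileName()
        File.WriteAllText(tempFile, "<root>original, longer content</root>")

        'FileMode.Open keeps the existing bytes, so a shorter write leaves a stale tail
        Using sw As New StreamWriter(New FileStream(tempFile, FileMode.Open, FileAccess.Write))
            sw.Write("<a/>")
        End Using
        Console.WriteLine("Open:   " & File.ReadAllText(tempFile))

        'FileMode.Create truncates first, so only the new content remains
        Using sw As New StreamWriter(New FileStream(tempFile, FileMode.Create, FileAccess.Write))
            sw.Write("<a/>")
        End Using
        Console.WriteLine("Create: " & File.ReadAllText(tempFile))

        File.Delete(tempFile)
    End Sub
End Module
```

The first line printed still carries the tail end of the old content after the new markup, which is no longer well-formed XML. The second contains only the new content. If the XML you save back is shorter than what was there before, leaving out FileMode.Create can corrupt the part in the same way.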

Using LINQ with the Open XML SDK

Using the SDK we can also write LINQ queries over the part collections. For instance if we want to select all the top level parts and any of their sub-parts, we can write a query like so:

Using wordDoc
  Dim parts = From part In wordDoc.Parts _
            Select part.OpenXmlPart, _
                   part.RelationshipId, _
                   part.OpenXmlPart.RelationshipType, _
                   SubParts = _
                   ( _
                    From subPart In part.OpenXmlPart.Parts _
                    Select subPart.OpenXmlPart, _
                           subPart.RelationshipId, _
                           subPart.OpenXmlPart.RelationshipType _
                   ).ToList
End Using

This query returns similar information to what you get with the Open XML Package Editor if we look at the same document. If we display the query results in two related DataGridViews we'll see that the MainDocument part contains additional parts for things like themes, styles and settings.

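If we want to access the actual XML content for each of the OpenXmlParts, we can call the GetStream method on the OpenXmlPart we want and wrap the returned stream in a StreamReader, which we can then use to load an XDocument object. Note that this assumes every part contains XML; XDocument.Load will throw on a binary part such as an image.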

Using wordDoc

 Dim parts = From part In wordDoc.Parts _
             Select Doc = XDocument.Load(New StreamReader(part.OpenXmlPart.GetStream())), _
                    part.OpenXmlPart, _
                    part.RelationshipId, _
                    part.OpenXmlPart.RelationshipType, _
                    SubParts = _
                    ( _
                     From subPart In part.OpenXmlPart.Parts _
                     Select Doc = XDocument.Load(New StreamReader(subPart.OpenXmlPart.GetStream())), _
                            subPart.OpenXmlPart, _
                            subPart.RelationshipId, _
                            subPart.OpenXmlPart.RelationshipType _
                    ).ToList
End Using

Loading and Querying the XDocument from the Package

Let's say we have a case where we can't use XML literals and embedded expressions; instead we want to pull out the MainDocument part and find and replace text inside it. We can do this using XML axis properties. This can get pretty tricky because there may be a lot of formatting information in the document. An easier way may be to use content controls, which you can alias so that they're easier to query, but for this example it's a pretty simple query to find our body text and replace the word "my" with the user name.

Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
...
Dim docFile = CurDir() & "\MyDocument.docx"
Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Dim myDoc As XDocument

Using wordDoc
    Using xr As New StreamReader(wordDoc.MainDocumentPart.GetStream())
        'Load the MainDocument part's XML
        myDoc = XDocument.Load(xr)
    End Using

    'Find the only line of text in this document
    Dim element = (From item In myDoc...<w:t>)(0)

    'Replace the value of the element
    element.Value = <s>This is <%= Environment.UserName %>'s document</s>.Value

    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        'Save the modified XML back to the MainDocument part
        myDoc.Save(sw)
    End Using

End Using
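For the content control approach mentioned earlier, a query might look something like the sketch below. It runs against a hand-written body fragment rather than a real package, and the alias "UserName" is a made-up example, so treat it as a starting point under those assumptions rather than tested production code.

```vb
Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
Imports System.Linq
Imports System.Xml.Linq

Module ContentControlDemo
    Sub Main()
        'Hand-written body fragment; the alias "UserName" is a made-up example
        Dim body = <w:body>
                       <w:sdt>
                           <w:sdtPr>
                               <w:alias w:val="UserName"/>
                           </w:sdtPr>
                           <w:sdtContent>
                               <w:p>
                                   <w:r>
                                       <w:t>placeholder</w:t>
                                   </w:r>
                               </w:p>
                           </w:sdtContent>
                       </w:sdt>
                   </w:body>

        'Find the first text element inside the content control with that alias
        Dim target = (From sdt In body...<w:sdt> _
                      Where sdt.<w:sdtPr>.<w:alias>.@w:val = "UserName" _
                      From t In sdt.<w:sdtContent>...<w:t> _
                      Select t).FirstOrDefault()

        If target IsNot Nothing Then
            target.Value = Environment.UserName
        End If

        Console.WriteLine(body.ToString())
    End Sub
End Module
```

Because the query keys off the alias instead of the position of the text, it keeps working when paragraphs or formatting runs are added around the content control.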

One of the cool things about using the Open XML SDK is that you don't have to have Office installed to run any of this code. So it's a great alternative to using slow COM automation to manipulate documents.

As I explore Open XML in Office 2007 more and more I'll post more realistic business examples using LINQ to XML and Visual Basic. For now, you may want to sink your teeth into Ken Getz's Advanced Basics March 2008 article in MSDN Magazine: Office 2007 Files and LINQ. This article also shows off some important XML namespace features of Visual Basic.

Enjoy!

Comments

# Andrew said on July 2, 2009 11:55 AM:

Hi Beth,

Once again a great article.  I know this is a bit after the fact, but I got caught out today on a problem that took a bit of time to resolve.

It might be worth pointing out the subtlety of FileMode.Create when getting the stream for saving.  By default (i.e. without specifying the FileMode) the stream writer merely overwrites the existing stream, which in my case was corrupting the XML saved back to a customXml part.

About Beth Massi

Beth is a Program Manager on the Visual Studio Community Team at Microsoft and is responsible for producing and managing content for business application developers, driving community features and team participation onto MSDN Developer Centers (http://msdn.com), and helping make Visual Studio one of the best developer tools in the world. She also produces regular content on her blog (http://blogs.msdn.com/bethmassi), Channel 9, and a variety of other developer sites and magazines. As a community champion and a long-time member of the Microsoft developer community she also helps with the San Francisco East Bay .NET user group and is a frequent speaker at various software development events. Before Microsoft, she was a Senior Architect at a health care software product company and a Microsoft Solutions Architect MVP. Over the last decade she has worked on distributed applications and frameworks, web and Windows-based applications using Microsoft development tools in a variety of businesses. She loves teaching, hiking, mountain biking, and driving really fast.