Accessing Open XML Document Parts with the Open XML SDK

About a month ago the Open XML SDK 1.0 (June 08 update) was released after a long stretch as a CTP. The SDK provides strongly typed access to the document parts of Word 2007, Excel 2007 and PowerPoint 2007 files. I installed this baby last week, started playing around with it, and found it really easy to use after briefly looking at the documentation. The How Do I section is a great place to start.

Upgrading the Letter Generator

I decided to upgrade my Word 2007 letter generator program to use the SDK to manipulate the packages. Remember that Office 2007 documents are really just ZIP archives, so if you rename one to .ZIP you can take a look at the contents of the package. The Open XML package spec defines a set of XML files that hold the content and describe the relationships between all of the document parts stored in a single package. To manipulate them programmatically you can use the raw System.IO.Packaging namespace, but the SDK's DocumentFormat.OpenXml.Packaging namespace is much easier to work with.
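If you would rather peek inside the package from code than rename the file, a minimal sketch with the raw System.IO.Packaging API might look like the following. The module name, the DescribeParts helper and the MyDocument.docx file name are made up for illustration, and the project is assumed to reference WindowsBase.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.IO.Packaging   ' needs a reference to WindowsBase

Module ListPackageParts
    ' Hypothetical helper: returns "uri  (content type)" for every part in the package.
    Function DescribeParts(ByVal packagePath As String) As List(Of String)
        Dim result As New List(Of String)
        ' Open read-only so the document cannot be changed by accident.
        Using p As Package = Package.Open(packagePath, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In p.GetParts()
                ' Uri is the part's path inside the archive, for example /word/document.xml
                result.Add(String.Format("{0}  ({1})", part.Uri, part.ContentType))
            Next
        End Using
        Return result
    End Function

    Sub Main()
        ' Assumed file name: any .docx in the current directory will do.
        Dim docFile = Path.Combine(Directory.GetCurrentDirectory(), "MyDocument.docx")
        For Each line In DescribeParts(docFile)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Each line pairs a part's location with its content type, which is the same content type you have to supply yourself when you create a part through this raw API.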

My mail merge program uses XML literals to construct the XML for the document part of a Word 2007 file from data in the Northwind database. The LINQ query was a piece of cake compared to figuring out how to manipulate the .docx package in order to replace the document.xml part (called the MainDocument part). The final code isn't particularly long; it was just a pain to figure out. The SDK not only saved me a few lines of code, it made the code much more readable and took only a few minutes to write. (I updated the code for the WordMailMerge program on Code Gallery.)

Getting Started with the Open XML SDK

Let's take another simple example that constructs a MainDocument part using XML literals and then replaces it in a .docx package using the SDK. This time I'll focus on the code that manipulates the Open XML package with the SDK, not on the particulars of XML literals. The first thing I recommend is installing the VSTO Power Tools so you can open Office 2007 documents and manipulate the parts directly in the Visual Studio IDE with the Open XML Package Editor, like I showed in my last post.

Of course you'll also need to install the SDK, which places the DocumentFormat.OpenXml.dll assembly into your GAC. Add a reference to this assembly in your project. As an aside, when XCOPY deploying to a machine that already has the .NET Framework on it, just make sure you deploy the DocumentFormat.OpenXml.dll assembly alongside your application so you don't have to install the SDK on the target machine. The easiest way to do this is to select "Show All Files" in the Solution Explorer, expand the References node, and in the Properties for the DocumentFormat.OpenXml reference set "Copy Local" = True. This places a private copy of the assembly next to your application when it's built.

Now create a new Word 2007 document with some simple text in it (for instance, type "This is my document"), then save it and add the .docx file to your Visual Basic project. Double-clicking the file opens it in the Open XML Package Editor.

We can manipulate the parts through this editor if we want to, but what I really want to do is replace document.xml with one we create ourselves using XML literals and embedded expressions. Double-click document.xml to open the MainDocument part in the XML Editor. If the XML shows up all on one line with no breaks, select all the contents, cut, and paste them back into the editor (Ctrl+A, Ctrl+X, Ctrl+V) and it will put the proper line breaks in for you.

For this simple example, let's place the executing user's name into the document. Create the XML Literal and an embedded expression by pasting the document.xml into the VB Editor and adding an expression to print out the executing user's name:

Dim myDoc = <?xml version="1.0" encoding="utf-8" standalone="yes"?>
            <w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
               xmlns:o="urn:schemas-microsoft-com:office:office"
               xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
               xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
               xmlns:v="urn:schemas-microsoft-com:vml"
               xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
               xmlns:w10="urn:schemas-microsoft-com:office:word"
               xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
               xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
               <w:body>
                   <w:p w:rsidR="00DD17EB" w:rsidRDefault="00361264">
                       <w:r>
                           <w:t>This is <%= Environment.UserName %>'s document</w:t>
                       </w:r>
                   </w:p>
                   <w:sectPr w:rsidR="00DD17EB" w:rsidSect="00DD17EB">
                       <w:pgSz w:w="12240" w:h="15840"/>
                       <w:pgMar w:top="1440" w:right="1440" w:bottom="1440"
                           w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
                       <w:cols w:space="720"/>
                       <w:docGrid w:linePitch="360"/>
                   </w:sectPr>
               </w:body>
           </w:document>

Replacing the MainDocument Part

Before the SDK, to replace the MainDocument part in the package we had to figure out the right content type and write code that deleted the old part and then added the new one. We also needed to add a reference to WindowsBase (a 3.0 assembly) in order to access the System.IO.Packaging namespace.

Imports System.IO.Packaging
Imports System.IO
...
'**** Without OpenXML SDK
Dim uri As New Uri("/word/document.xml", UriKind.Relative)
Dim contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
Dim docFile = CurDir() & "\MyDocument.docx"

Using p As Package = Package.Open(docFile)
    'Delete the current document.xml file
    p.DeletePart(uri)

    'Replace that part with our XDocument
    Dim replace As PackagePart = p.CreatePart(uri, contentType)
    Using sw As New StreamWriter(replace.GetStream())
        myDoc.Document.Save(sw)
    End Using
End Using

For this example it's pretty easy. However, if you add or remove parts, it's up to you to update the relationships in the package, and that isn't an easy task with this raw API. Enter the Open XML SDK. Now we don't need to add a reference to WindowsBase, only to DocumentFormat.OpenXml, and we import the Packaging namespace contained within. Then our code can access the parts of the document in a strongly typed way:

Imports DocumentFormat.OpenXml.Packaging
Imports System.IO
...
'***** Use the OpenXML SDK for easier access to parts
Dim docFile = CurDir() & "\MyDocument.docx"

Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Using wordDoc
    'Replace the document part with our XML
    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        myDoc.Document.Save(sw)
    End Using
End Using

After we run this code you'll see that the MainDocument part now has the user name in the document body as described by our XML literal.
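Why pass FileMode.Create to GetStream? A stream opened in a non-truncating mode writes over the existing bytes from the start and leaves whatever it does not cover, so shorter replacement XML ends up followed by the tail of the old XML. The sketch below shows that effect with ordinary files rather than a package part. The Overwrite helper and the sample strings are made up for illustration, and treating a part stream opened without a FileMode as non-truncating is an assumption here, not something checked against the SDK.

```vb
Imports System
Imports System.IO

Module FileModeSketch
    ' Hypothetical helper: writes longText to a temporary file, writes shortText over it
    ' using the given mode, and returns what the file contains afterwards.
    Function Overwrite(ByVal mode As FileMode, ByVal longText As String, ByVal shortText As String) As String
        Dim tempFile = Path.GetTempFileName()
        Try
            File.WriteAllText(tempFile, longText)
            Using fs As New FileStream(tempFile, mode, FileAccess.Write)
                Using sw As New StreamWriter(fs)
                    sw.Write(shortText)
                End Using
            End Using
            Return File.ReadAllText(tempFile)
        Finally
            File.Delete(tempFile)
        End Try
    End Function

    Sub Main()
        Dim oldXml = "<doc>This is my document</doc>"
        Dim newXml = "<doc>Hi</doc>"

        ' OpenOrCreate does not truncate: the end of the old content survives after the new content.
        Console.WriteLine(Overwrite(FileMode.OpenOrCreate, oldXml, newXml))

        ' Create truncates first, so only the new content remains.
        Console.WriteLine(Overwrite(FileMode.Create, oldXml, newXml))
    End Sub
End Module
```

The first result is no longer well-formed XML, which is the kind of corruption that is easy to miss until Word refuses to open the file.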

Using LINQ with the Open XML SDK

Using the SDK we can also write LINQ queries over the part collections. For instance if we want to select all the top level parts and any of their sub-parts, we can write a query like so:

Using wordDoc
  Dim parts = From part In wordDoc.Parts _
            Select part.OpenXmlPart, _
                   part.RelationshipId, _
                   part.OpenXmlPart.RelationshipType, _
                   SubParts = _
                   ( _
                    From subPart In part.OpenXmlPart.Parts _
                    Select subPart.OpenXmlPart, _
                           subPart.RelationshipId, _
                           subPart.OpenXmlPart.RelationshipType _
                   ).ToList
End Using

This query returns similar information to what you get with the Open XML Package Editor if we look at the same document. If we display the query results in two related DataGridViews we'll see that the MainDocument part contains additional parts for things like themes, styles and settings.

If we want to access the actual XML content of each OpenXmlPart, we can call the GetStream method on the part we want and wrap the returned stream in a StreamReader, which we can then use to load an XDocument object.

Using wordDoc

 Dim parts = From part In wordDoc.Parts _
             Select Doc = XDocument.Load(New StreamReader(part.OpenXmlPart.GetStream())), _
                    part.OpenXmlPart, _
                    part.RelationshipId, _
                    part.OpenXmlPart.RelationshipType, _
                    SubParts = _
                    ( _
                     From subPart In part.OpenXmlPart.Parts _
                     Select Doc = XDocument.Load(New StreamReader(subPart.OpenXmlPart.GetStream())), _
                            subPart.OpenXmlPart, _
                            subPart.RelationshipId, _
                            subPart.OpenXmlPart.RelationshipType _
                    ).ToList
End Using

Loading and Querying the XDocument from the Package

Let's say we have a case where we can't use XML literals and embedded expressions. Instead, we want to pull out the MainDocument part and find and replace text inside it. We can do this using XML axis properties. This can get pretty tricky because there may be a lot of formatting information in the document. An easier way may be to use content controls, which you can alias so that they are easier to query. For this example, though, it's a pretty simple query to find our body text and replace the word "my" with the user name.

Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
...
Dim docFile = CurDir() & "\MyDocument.docx"
Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Dim myDoc As XDocument

Using wordDoc
    Using xr As New StreamReader(wordDoc.MainDocumentPart.GetStream())
        'Load the MainDocument part's XML
        myDoc = XDocument.Load(xr)
    End Using

    'Find the only line of text in this document
    Dim element = (From item In myDoc...<w:t>)(0)

    'Replace the value of the element
    element.Value = <s>This is <%= Environment.UserName %>'s document</s>.Value

    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        'Save the modified XML back to the MainDocument part
        myDoc.Save(sw)
    End Using

End Using
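The content control alternative mentioned earlier could be sketched roughly like this. In WordprocessingML a content control is a w:sdt element, and the Title you give it in Word is stored as w:alias inside w:sdtPr. The FillControl helper, the "UserName" alias and the hand-written body fragment are made up for illustration; a real document would load the MainDocument part as shown above and will usually carry more markup around the control.

```vb
Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module ContentControlSketch
    ' Hypothetical helper: sets the text of the first content control whose alias matches.
    ' Returns False when no such control (or no text element inside it) is found.
    Function FillControl(ByVal root As XElement, ByVal controlAlias As String, ByVal text As String) As Boolean
        Dim control = (From sdt In root...<w:sdt> _
                       Where sdt.<w:sdtPr>.<w:alias>.@w:val = controlAlias).FirstOrDefault()
        If control Is Nothing Then Return False

        ' The visible text lives in the first w:t under the control's content.
        Dim target = control.<w:sdtContent>...<w:t>.FirstOrDefault()
        If target Is Nothing Then Return False

        target.Value = text
        Return True
    End Function

    Sub Main()
        ' Hand-written stand-in for the body of a document containing one content control.
        Dim body = <w:body>
                       <w:sdt>
                           <w:sdtPr>
                               <w:alias w:val="UserName"/>
                           </w:sdtPr>
                           <w:sdtContent>
                               <w:p>
                                   <w:r>
                                       <w:t>placeholder</w:t>
                                   </w:r>
                               </w:p>
                           </w:sdtContent>
                       </w:sdt>
                   </w:body>

        If FillControl(body, "UserName", Environment.UserName) Then
            Console.WriteLine(body)
        End If
    End Sub
End Module
```

Querying by alias keeps the code independent of where the control sits in the document, so the lookup does not depend on paragraph order or formatting runs.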

One of the cool things about using the Open XML SDK is that you don't have to have Office installed to run any of this code. That makes it a great alternative to using slow COM automation to manipulate documents.

As I explore Open XML in Office 2007 more and more I'll post more realistic business examples using LINQ to XML and Visual Basic. For now, you may want to sink your teeth into Ken Getz's Advanced Basics March 2008 article in MSDN Magazine: Office 2007 Files and LINQ. This article also shows off some important XML namespace features of Visual Basic.

Enjoy!

Comments

  • Hi Beth,

    Once again a great article.  I know this is a bit after the fact but, I got caught out today on a problem that took a bit of time to resolve.

    It might be worth pointing out the subtlety of FileMode.Create when getting the stream for saving.  By default (i.e. without specifying the FileMode) the stream writer merely overwrites the existing stream, which in my case was corrupting the XML saved back to a customXml part.

  • Beth Hello

    I am new to Open XML programming.

    I have a Word document with content controls; I use plain and rich text controls.

    I would like to add a line break when inserting the data into the control, but I am not able to.

    I tried '/r/n' but it is not working.

    I checked the checkbox to "allow carriage returns" on the property box of the text content control, but still no luck.

    i am using:

    document.maindocumentpart.document.body.descendant.....text  =  "abc /r/n def"

    what am i doing wrong?

    another question:

    If i give the content control a name/ title what is the

    right way to insert the data by the content's control name?

    thank you very much

    Karen

  • Hi

    could you give the example for this line in your article:

    "An easier way may be to use content controls which you can alias "

    thank you

  • Great post.  Search and replace gets a bit trickier when you have a document that has been edited and the XML consequently includes multiple <w:r><w:t> parent/child elements to illustrate changes.

    I'm not sure what the best approach would be, but I think perhaps getting a collection of <w:p> and flattening the child elements (in memory) might be the best approach for searching?

    Any thoughts on how you would tackle such a thing?  Keep up the great work.

  • Nevermind... :)

    It looks like I can do something like this:

    Dim pgs = wordDoc.MainDocumentPart.Document.Descendants(Of Paragraph)()

    Then, you have the InnerText property of each paragraph, which nicely concatenates all of the various runs.

    This gives me the find portion.  Then it's just a matter of iterating through the ChildElements of each paragraph to keep the XML structure in place.  Or I suppose, I could RemoveAllChildren, then reconstruct a single Child element based upon the paragraph's InnerText.  I suppose this could mess with formatting, so the former approach may be the best.

  • Everything I see here really is horrible practice.  

    For starters, who in god's name hard codes XML documents in source?  Where is the real solution to this gap in the technology?

    Second, if VBA / Office Interop is a 1 out of 10 for complexity, this is at least a 6.  The fact that accessors which formerly were simple are now encapsulated in some of the most convoluted non-intuitive LINQ makes me pull out my hair.  WTF is wrong with a proper OBJECT MODEL!!!

    When the XML structure of my document changes (EX// a user edits the template) nothing will work because everything you've done here is hardcoded.  

    Where are the real world examples of how to build real applications out of this technology?

    I've read every article on your blog about this technology and guarantee you we will not be going this route.

  • Hi Matt,

    You can ABSOLUTELY load the XML from the file system. Just use XElement.Load. This example was meant to be a purely data-driven example. I have the ability to create the document from a database here and replace any of the XML with simple embedded expressions.

    Take a look at the Open XML SDK. It DOES provide an object model for you to work with the document formats. The advantage of working with the format directly instead of using the Office object model is that you can provide a much more scalable document generation solution. This would be something you would want in a server-side implementation, dynamically creating documents from data, without having to have Office installed on the server at all. That is what I'm showing in these posts. If you need/want user interaction then I suggest you stay with VBA or look at VSTO. Those solutions are meant to extend the Office applications themselves and serve a much different purpose than this.

    -B
