Binary to Open XML Translator, and Freeform Shapes in the Office Drawing Format

The interoperability team here at Microsoft has posted about an open source C# project on SourceForge that converts binary documents to Open XML.  The blog post indicates that the code works with Mono, so it provides some level of portability across operating systems.  The post also has a good explanation of the architecture of the project.  There’s more good information on the SourceForge site, where the Developer’s Corner has a number of good links to Open XML resources: the binary file format specs, the Open XML specs, and the Implementation Notes site.

What caught my eye are the well-written papers that contain detailed information about some of the issues involved in conversion.  Of particular note is the guide that provides a very nice explanation of Freeform Shapes in the Office Drawing Format.

Also:

How to Retrieve Text from a Binary .doc File

A Guide to Table Formatting

The Storage of Macros and OLE Objects

Links – July 1, 2009

Frank Rice published an article on MSDN: Programmatically Update Multiple External Data Connections in Excel 2007 by Using Open XML

Hadley Pettigrew published an article on OpenXmlDeveloper.org: Use Ruby on Rails to modify an Open XML Document

Wouter van Vugt blogged on copying a chart from a spreadsheet to a presentation.

After posting on splitting runs in word processing documents, I realized that I had neglected to account for content controls and custom XML in the transform.  I’ve updated that post with the corrected code.  I also included an explanation of the markup that requires a recursive approach.

Splitting Runs in Open XML Word Processing Document Paragraphs


(July 1, 2009 - Updated TransformRun to be recursive)

In Open XML word processing document markup, paragraphs contain runs, and runs contain text elements.  Sometimes when transforming a document, we may want to split runs differently than in the original document.  This post presents a couple of small functions that help us deal with paragraphs and runs: one determines the split locations of runs, and the other splits runs.

Word 2007 has a neat feature where you can lock a document and disallow editing of the content, yet allow the user to add comments.  You can send this document for review to a number of users, and after the reviewers return the documents, it would be handy to have some code that merges comments from all documents into a single document.  I’m currently working on a blog post that shows how to do this.  However, adding a comment to a paragraph can cause runs to be split, which adds a bit of complexity.

Paragraphs, Runs, and Text Elements

The following markup shows a very simple paragraph.  We can see the paragraph element, the run element, and the text element.

<w:p>

  <w:r>

    <w:t>abcdefghi</w:t>

  </w:r>

</w:p>

 

If we select “def” in the above text, and add a comment, the markup changes to look like this:

<w:p>

  <w:r>

    <w:t>abc</w:t>

  </w:r>

  <w:commentRangeStart w:id="0"/>

  <w:r>

    <w:t>def</w:t>

  </w:r>

  <w:commentRangeEnd w:id="0"/>

  <w:r>

    <w:rPr>

      <w:rStyle w:val="CommentReference"/>

    </w:rPr>

    <w:commentReference w:id="0"/>

  </w:r>

  <w:r>

    <w:t>ghi</w:t>

  </w:r>

</w:p>

 

In this paragraph, we can see the commentRangeStart and commentRangeEnd elements, which mark the commented text.  In addition, we can see a special run that contains a commentReference element; its run properties apply the CommentReference style.

If we want to programmatically insert a comment into a document, we need to split runs as appropriate so that we can insert commentRangeStart, commentRangeEnd, and the special run that contains commentReference into the paragraph.
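Once the runs are split at the right places, adding the markers is a matter of inserting elements around the target run.  The following is a minimal sketch of that step only, assuming the run to comment on is a direct child of the paragraph and is already a run of its own.  The names CommentMarkupSketch and MarkRun are made up for illustration, and the sketch does not touch the comments part, where the text of the comment itself is stored.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CommentMarkupSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns a copy of the paragraph with comment markers placed around the
    // child run at runIndex.  The source paragraph is not modified.
    // (Hypothetical helper, not part of the attached code.)
    public static XElement MarkRun(XElement paragraph, int runIndex, string commentId)
    {
        // deep clone, so that the edits below happen on the copy only
        XElement copy = new XElement(paragraph);
        XElement target = copy.Elements(w + "r").ElementAt(runIndex);

        // the range start goes immediately before the commented run
        target.AddBeforeSelf(
            new XElement(w + "commentRangeStart",
                new XAttribute(w + "id", commentId)));

        // the range end and the comment reference run go immediately after it
        target.AddAfterSelf(
            new XElement(w + "commentRangeEnd",
                new XAttribute(w + "id", commentId)),
            new XElement(w + "r",
                new XElement(w + "rPr",
                    new XElement(w + "rStyle",
                        new XAttribute(w + "val", "CommentReference"))),
                new XElement(w + "commentReference",
                    new XAttribute(w + "id", commentId))));

        return copy;
    }
}
```

Calling MarkRun(p, 1, "0") on a paragraph with three runs (abc, def, ghi) produces the same sequence of child elements as the commented markup shown above.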

Note that a paragraph can be split into runs for a variety of reasons, and that there are a number of other valid child elements of the paragraph element.  For example, because the above text isn’t a correctly spelled word, and isn’t a sentence with proper grammar, the markup can include w:proofErr elements:

<w:p>

  <w:proofErr w:type="spellStart"/>

  <w:proofErr w:type="gramStart"/>

  <w:r>

    <w:t>abc</w:t>

  </w:r>

  <w:commentRangeStart w:id="0"/>

  <w:r>

    <w:t>def</w:t>

  </w:r>

  <w:commentRangeEnd w:id="0"/>

  <w:proofErr w:type="gramEnd"/>

  <w:r>

    <w:rPr>

      <w:rStyle w:val="CommentReference"/>

    </w:rPr>

    <w:commentReference w:id="0"/>

  </w:r>

  <w:r>

    <w:t>ghi</w:t>

  </w:r>

  <w:proofErr w:type="spellEnd"/>

</w:p>

 

When splitting runs, we want to honor those existing run splits, and make sure that we don’t disturb those other elements.

As Open XML developers know, content controls and custom XML markup are very powerful features of Open XML.  They enable a vast number of scenarios – we can make our documents smarter.  However, they add an interesting twist to markup.  The element for content controls is w:sdt, which contains another element, w:sdtContent, which contains the contents.  This means that runs that we potentially want to split occur at different levels of the XML hierarchy:

<w:p>

  <w:r>

    <w:t>123</w:t>

  </w:r>

  <w:sdt>

    <w:sdtContent>

      <w:r>

        <w:t>4567</w:t>

      </w:r>

    </w:sdtContent>

  </w:sdt>

  <w:r>

    <w:t>890</w:t>

  </w:r>

</w:p>

 

Custom XML markup has the same issue.  The following schema defines some custom XML markup:

<?xml version="1.0" encoding="utf-8"?>

<xs:schema attributeFormDefault="unqualified"

           elementFormDefault="qualified"

           xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="Root">

    <xs:complexType>

      <xs:sequence>

        <xs:element name="Child"

                    type="xs:string" />

      </xs:sequence>

    </xs:complexType>

  </xs:element>

</xs:schema>

 

When we use this custom schema to add structure to a document, the markup looks like this:

<w:p>

  <w:customXml w:uri="http://northwind.com"

               w:element="Root">

    <w:r>

      <w:t>12</w:t>

    </w:r>

    <w:customXml w:uri="http://northwind.com"

                 w:element="Child">

      <w:r>

        <w:t>34</w:t>

      </w:r>

    </w:customXml>

    <w:r>

      <w:t>56</w:t>

    </w:r>

  </w:customXml>

  <w:r>

    <w:t>7890</w:t>

  </w:r>

</w:p>

 

We may need to split runs at any level: as a child of the paragraph, as content in a content control, or within custom XML markup.  A recursive transform handles this nicely, because it processes runs wherever they occur in the hierarchy.
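To make the nesting concrete, here is a small sketch that uses the content control markup from above.  It compares a query over the direct children of the paragraph with a query over all descendants.  The class and method names (NestedRunsSketch and so on) are invented for this illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NestedRunsSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // the content control example from above, with the namespace declared
    public static XElement SampleParagraph()
    {
        return XElement.Parse(
@"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:r><w:t>123</w:t></w:r>
  <w:sdt>
    <w:sdtContent>
      <w:r><w:t>4567</w:t></w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r><w:t>890</w:t></w:r>
</w:p>");
    }

    // counts only the runs that are direct children of the paragraph
    public static int CountChildRuns(XElement paragraph)
    {
        return paragraph.Elements(w + "r").Count();
    }

    // counts runs at any depth, including those inside w:sdtContent or w:customXml
    public static int CountAllRuns(XElement paragraph)
    {
        return paragraph.Descendants(w + "r").Count();
    }

    // concatenates the text of the paragraph from runs at every level
    public static string ParagraphText(XElement paragraph)
    {
        return string.Join("",
            paragraph.Descendants(w + "t").Select(t => (string)t).ToArray());
    }
}
```

For the sample paragraph, CountChildRuns returns 2, CountAllRuns returns 3, and ParagraphText returns "1234567890".  A transform that looks only at child elements would miss the run inside the content control.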

Determining Run Split Locations

The first piece of functionality that we need is a method to return an array of integers indicating where run splits are.  If we are moving comments from one document to another, then we want to find out where the run splits are in the source document so that we can create the same run splits in the destination document.

Here’s the prototype of a simple method to do so:

static int[] RunSplitLocations(XElement paragraph)

 

The following paragraph markup contains three runs of text, plus one run that holds only the comment reference:

<w:p>

  <w:r>

    <w:t>abc</w:t>

  </w:r>

  <w:commentRangeStart w:id="0"/>

  <w:r>

    <w:t>def</w:t>

  </w:r>

  <w:commentRangeEnd w:id="0"/>

  <w:r>

    <w:rPr>

      <w:rStyle w:val="CommentReference"/>

    </w:rPr>

    <w:commentReference w:id="0"/>

  </w:r>

  <w:r>

    <w:t>ghi</w:t>

  </w:r>

</w:p>

 

If we call RunSplitLocations for this paragraph, it returns an array that contains:

0

3

6
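The calculation behind this result can be sketched as a single pass that keeps a running total of text length.  This is not the implementation from the attached code, just one way to arrive at the same numbers; the names RunSplitSketch, SplitLocations, and SampleParagraph are made up for illustration.  Like the method described in this post, it ignores runs that have no text and runs whose parent is w:del or w:moveFrom.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RunSplitSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the starting offset of each text-bearing run in the paragraph.
    public static int[] SplitLocations(XElement paragraph)
    {
        var locations = new List<int>();
        int offset = 0;
        foreach (XElement run in paragraph.Descendants(w + "r"))
        {
            // skip runs inside deleted or moved-from content
            if (run.Parent.Name == w + "del" || run.Parent.Name == w + "moveFrom")
                continue;

            // skip runs with no text, such as the comment reference run
            if (!run.Descendants(w + "t").Any())
                continue;

            locations.Add(offset);
            offset += run.Descendants(w + "t").Sum(t => ((string)t).Length);
        }
        return locations.ToArray();
    }

    // the commented paragraph from above, with the namespace declared
    public static XElement SampleParagraph()
    {
        return XElement.Parse(
@"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:r><w:t>abc</w:t></w:r>
  <w:commentRangeStart w:id='0'/>
  <w:r><w:t>def</w:t></w:r>
  <w:commentRangeEnd w:id='0'/>
  <w:r>
    <w:rPr><w:rStyle w:val='CommentReference'/></w:rPr>
    <w:commentReference w:id='0'/>
  </w:r>
  <w:r><w:t>ghi</w:t></w:r>
</w:p>");
    }
}
```

SplitLocations(SampleParagraph()) returns 0, 3, and 6.  The comment reference run contributes nothing because it contains no w:t element.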

 

Splitting Runs

If we have another document that contains no comments in this paragraph, and we want to split runs so that we can insert a comment on the middle three characters, we can call another method that takes an array of integers to do the splitting:

public static XElement SplitRunsInParagraph(XElement p, int[] positions)

 

If we have a paragraph with this markup:

<w:p>

  <w:r>

    <w:t>abcdefghi</w:t>

  </w:r>

</w:p>

 

And we call SplitRunsInParagraph passing an array that contains 0, 3, and 6, it returns a paragraph that looks like this:

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <w:r>

    <w:t>abc</w:t>

  </w:r>

  <w:r>

    <w:t>def</w:t>

  </w:r>

  <w:r>

    <w:t>ghi</w:t>

  </w:r>

</w:p>

 

As I previously mentioned, the paragraph may contain child elements other than runs.  SplitRunsInParagraph will leave those other elements in place.  Also, a run can contain styling information, which we also want to leave in place.
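To illustrate what leaving the styling in place means, here is a simplified sketch that splits one run on its own and copies the non-text children (such as w:rPr) into each piece.  It assumes that the offsets are relative to the run's text, are in ascending order, and start at 0.  SplitSingleRunSketch and SplitRun are hypothetical names, and the sketch leaves out the paragraph-level bookkeeping.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SplitSingleRunSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Splits one run into several runs at the given offsets.
    // Assumes offsets are ascending, relative to the run's text, and begin with 0.
    public static XElement[] SplitRun(XElement run, int[] offsets)
    {
        string text = string.Join("",
            run.Elements(w + "t").Select(t => (string)t).ToArray());

        return offsets
            .Select((start, i) =>
            {
                // each piece ends where the next one starts, or at the end of the text
                int end = i < offsets.Length - 1 ? offsets[i + 1] : text.Length;
                string piece = text.Substring(start, end - start);

                return new XElement(w + "r",
                    // children other than w:t (w:rPr, for example) are cloned into each piece
                    run.Elements().Where(e => e.Name != w + "t"),
                    new XElement(w + "t",
                        // keep leading or trailing spaces significant
                        piece.StartsWith(" ") || piece.EndsWith(" ")
                            ? new XAttribute(XNamespace.Xml + "space", "preserve")
                            : null,
                        piece));
            })
            .ToArray();
    }
}
```

Splitting a bold run that contains "abcdefghi" at offsets 0, 3, and 6 yields three runs with the text "abc", "def", and "ghi", each with its own copy of the w:rPr element.  Because LINQ to XML clones elements that already have a parent when they are added to a new tree, the original run is left untouched.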

Now that we have some methods to determine where run splits are, and to create run splits, it will be pretty simple to write a pure functional transform to move comments from one document to another (if the documents contain the exact same content, with the exception of comments).

The Code

The following example contains RunSplitLocations and SplitRunsInParagraph.  This code uses a node cloning technique similar to what I presented in this post.  In addition, the code uses the pre-atomization approach that I showed in this post.  This code implements a pure functional transformation - no side effects anywhere, which will make it easy to use when writing the next transformation.

Here’s the code (also attached):

using System;

using System.Collections.Generic;

using System.IO;

using System.Linq;

using System.Text;

using System.Xml;

using System.Xml.Linq;

using DocumentFormat.OpenXml.Packaging;

 

public static class Extensions

{

    public static XDocument GetXDocument(this OpenXmlPart part)

    {

        XDocument xdoc = part.Annotation<XDocument>();

        if (xdoc != null)

            return xdoc;

        using (StreamReader streamReader = new StreamReader(part.GetStream()))

            xdoc = XDocument.Load(XmlReader.Create(streamReader));

        part.AddAnnotation(xdoc);

        return xdoc;

    }

 

    public static string StringConcatenate(this IEnumerable<string> source)

    {

        StringBuilder sb = new StringBuilder();

        foreach (string s in source)

            sb.Append(s);

        return sb.ToString();

    }

}

 

public static class W

{

    public static XNamespace w =

        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

 

    public static XName t = w + "t";

    public static XName r = w + "r";

    public static XName del = w + "del";

    public static XName body = w + "body";

    public static XName p = w + "p";

    public static XName moveFrom = w + "moveFrom";

}

 

class Program

{

    static int GetRunLength(XElement e)

    {

        return e

            .Descendants(W.t)

            .Select(t => (string)t)

            .StringConcatenate()

            .Length;

    }

 

    // return the run split locations for all runs in the paragraph

    static int[] RunSplitLocations(XElement paragraph)

    {

        // find the runs that don't have w:del or w:moveFrom as parent elements

        var runElements = paragraph

            .Descendants(W.r)

            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&

                e.Descendants(W.t).Any());

 

        // determine the run length of each run

        var runs = runElements

            .Select(r => new

            {

                RunElement = r,

                RunLength = GetRunLength(r)

            });

 

        // determine the split locations

        var runSplits = runs

            .Select(r => runs

                .TakeWhile(a => a.RunElement != r.RunElement)

                .Select(z => z.RunLength)

                .Sum());

 

        return runSplits.ToArray();

    }

 

    // if value starts or ends with a space, return xml:space="preserve" attribute

    // else return null

    static XAttribute XmlSpacePreserved(string value)

    {

        if (value.Length > 0 && (value[0] == ' ' || value[value.Length - 1] == ' '))

            return new XAttribute(XNamespace.Xml + "space", "preserve");

        else

            return null;

    }

 

    private class RunSplits

    {

        public XElement RunElement { get; set; }

        public int RunLength { get; set; }

        public int RunLocation { get; set; }

    }

 

    private static object RunTransform(XElement element,

        int[] positions, IEnumerable<RunSplits> runSplits)

    {

        // split runs that have child text elements

        if (element.Name == W.r && element.Descendants(W.t).Any())

        {

            // get text of run

            string text = element

                .Descendants(W.t)

                .Select(t => (string)t).StringConcatenate();

 

            // find run in runSplits

            RunSplits rs = runSplits.First(r => r.RunElement == element);

 

            // find list of splits in this run

            var splitsInThisRun = positions

                .Where(p => p >= rs.RunLocation && p < rs.RunLocation + rs.RunLength);

 

            // adjust splits so that split locations are relative to this run instead of

            // relative to the beginning of the paragraph

            var splitsIntext = splitsInThisRun

                .Select(p => p - rs.RunLocation)

                .ToArray();

 

            // project collection of strings that will be in the new, split runs

            var splitText = splitsIntext

                .Select((p, i) =>

                    i != splitsIntext.Length - 1 ?

                    text.Substring(p, splitsIntext[i + 1] - p) :

                    text.Substring(p)

            );

 

            // project collection of runs that will replace the original run

            return splitText.Select(r =>

                new XElement(W.r,

                    rs.RunElement.Elements().Where(e => e.Name != W.t),

                    new XElement(W.t,

                        XmlSpacePreserved(r),

                        r)));

        }

 

        // clone elements other than runs

        // must be recursive to handle custom XML markup and content controls

        return new XElement(element.Name,

            element.Attributes(),

            element.Nodes().Select(n =>

            {

                XElement e = n as XElement;

                if (e != null)

                    return RunTransform(e, positions, runSplits);

                return n;

            })

        );

    }

 

    public static XElement SplitRunsInParagraph(XElement p, int[] positions)

    {

        // find the runs that don't have w:del or w:moveFrom as parent elements

        var runElements = p

            .Descendants(W.r)

            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&

                e.Descendants(W.t).Any());

 

        // calculate the run length of each run

        var runs = runElements

            .Select(r => new

            {

                RunElement = r,

                RunLength = GetRunLength(r)

            });

 

        // calculate the location of each split

        var runSplits = runs

            .Select(r => new RunSplits

            {

                RunElement = r.RunElement,

                RunLength = r.RunLength,

                RunLocation = runs

                    .TakeWhile(a => a.RunElement != r.RunElement)

                    .Select(z => z.RunLength)

                    .Sum()

            });

 

        // the positions argument contains a list of locations where splits will be added

        // to the paragraph.  In addition, runs may already be split at various places, and

        // we want those splits to remain, so we need to create the complete list of

        // locations where we want run splits.

 

        // create ordered union of desired splits and existing splits

        int[] allSplits = runSplits

            .Select(rs => rs.RunLocation)

            .Concat(positions)

            .OrderBy(s => s)

            .Distinct()

            .ToArray();

 

        // transform the paragraph to a new paragraph with new splits in runs

        return new XElement(W.p,

            p.Elements().Select(e => RunTransform(e, allSplits, runSplits))

        );

    }

 

    static void Main(string[] args)

    {

        using (WordprocessingDocument doc1 =

            WordprocessingDocument.Open("Test.docx", true))

        {

            XDocument doc = doc1.MainDocumentPart.GetXDocument();

            XElement p = doc.Root.Element(W.body).Element(W.p);

            //XElement newPara = SplitRunsInParagraph(p, new[] { 12, 15 });

            XElement newPara = SplitRunsInParagraph(p, new[] { 10 });

            Console.WriteLine(newPara);

        }

    }

}

 

Querying LINQ to XML Nodes in Reverse Document Order with Better Performance


(Update June 25, 2009 - fixed bugs in event handlers associated with deleting last node and inserting node at beginning of list)

Occasionally I need to query LINQ to XML nodes in reverse document order.  I’m currently writing some LINQ to XML queries over Open XML documents where I need to select paragraph nodes based on content in the immediately preceding paragraph.  However, nodes in LINQ to XML are forward-linked only.  We can see evidence of this in the XNode.NodesBeforeSelf and XNode.ElementsBeforeSelf methods: they return collections of nodes in document order, not reverse document order.  This was by design.  LINQ to XML was designed to provide great performance for the vast majority of scenarios with the minimum memory footprint possible.  The need to process nodes in reverse document order is rare, so the designers of LINQ to XML decided that it was more important to reduce memory footprint than to allow for good performance in the few scenarios that require it.  That was a good decision, but the need does exist.

In my scenario (a functional transform that processes Open XML document revisions), it is possible that I would need to process 80,000 (or more) paragraphs.  If we use the XNode.PreviousNode property, we won’t have acceptable performance.  There is an easy work-around that provides us the ability to query in reverse document order in a way that performs well.

  • We define a new class, PreviousNodeAnnotation, that contains one public field: public XNode PreviousNode.
  • We add instances of this class as annotations on nodes that we need to query in reverse document order.

In the following small example, I select nodes based on the value of the previous node, for a document of 50,000 nodes.  The slow version exhibits O(n²) performance: because nodes are forward-linked only, each access to XNode.PreviousNode has to walk the sibling list from the beginning.  When I increased the document size to 80,000 nodes (the size of one of the documents that I need to query), the execution time of the slow version exceeded my patience.  In any case, it is clear that I can’t use XNode.PreviousNode for my scenario.

using System;

using System.Linq;

using System.Xml.Linq;

 

class PreviousNodeAnnotation

{

    public XNode PreviousNode;

    public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }

}

 

class Program

{

    static int DocumentSize = 50000;

 

    static void SlowPreviousNodeAccess()

    {

        // create a tree with lots of nodes

        XElement root = new XElement("Root",

            Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

 

        // query for all elements where the previous element has a value of 1000

        DateTime start = DateTime.Now;

        var q = root

            .Elements()

            .Where(e =>

            {

                XElement p = e.PreviousNode as XElement;

                return (string)p == "1000";

            });

        var q2 = q.ToList();  // force iteration

        TimeSpan duration = DateTime.Now - start;

        Console.WriteLine(duration);

    }

 

    static void FastPreviousNodeAccess()

    {

        // create a tree with lots of nodes

        XElement root = new XElement("Root",

            Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

 

        // initialize previous node annotations

        XElement prev = null;

        foreach (var item in root.Elements())

        {

            item.AddAnnotation(new PreviousNodeAnnotation(prev));

            prev = item;

        }

 

        // query for all elements where the previous element has a value of 1000

        DateTime start = DateTime.Now;

        var q = root

            .Elements()

            .Where(e =>

            {

                XElement p = e

                    .Annotation<PreviousNodeAnnotation>()

                    .PreviousNode as XElement;

                return (string)p == "1000";

            });

        var q2 = q.ToList();  // force iteration

        TimeSpan duration = DateTime.Now - start;

        Console.WriteLine(duration);

    }

 

    static void Main(string[] args)

    {

        FastPreviousNodeAccess();

        SlowPreviousNodeAccess();

    }

}

 

On my old slow laptop, the execution time of these two queries is .015 seconds and 30 seconds, respectively.

We can expand on this technique a bit by declaring three extension methods:

public static XNode PreviousNodeFast(this XNode node)

public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)

public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)

It’s convenient to use these methods in queries.
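As an illustration of the kind of query this enables, here is a minimal sketch that finds the nearest preceding heading for each paragraph-like element.  The element names Heading and Para are hypothetical stand-ins, not Open XML markup, and the class and method names are invented so that they do not clash with the listings in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// same shape as the annotation class used in this post, under a different name
class PrevAnnotation
{
    public XNode PreviousNode;
    public PrevAnnotation(XNode prev) { PreviousNode = prev; }
}

public static class ReverseQuerySketch
{
    // annotate each child node with its previous sibling in one forward pass
    public static void AnnotateChildren(XElement parent)
    {
        XNode prev = null;
        foreach (XNode node in parent.Nodes())
        {
            node.AddAnnotation(new PrevAnnotation(prev));
            prev = node;
        }
    }

    // follow the annotations backwards, yielding elements in reverse document order
    public static IEnumerable<XElement> ElementsBeforeSelfReversed(XElement element)
    {
        XNode current = element.Annotation<PrevAnnotation>().PreviousNode;
        while (current != null)
        {
            XElement e = current as XElement;
            if (e != null)
                yield return e;
            current = current.Annotation<PrevAnnotation>().PreviousNode;
        }
    }

    // for each Para element, return the text of the nearest preceding Heading,
    // or null if no Heading precedes it
    public static string[] NearestHeadings(XElement root)
    {
        return root.Elements("Para")
            .Select(p => ElementsBeforeSelfReversed(p)
                .Where(e => e.Name == "Heading")
                .Select(e => (string)e)
                .FirstOrDefault())
            .ToArray();
    }
}
```

For a root that contains Heading A, two Para elements, Heading B, and one more Para, NearestHeadings returns "A", "A", and "B" once AnnotateChildren has been called on the root.  Each backward walk stops at the first match, so it never has to start from the beginning of the sibling list.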

In addition, while I’m partial to pure transforms with no side effects, we can declare two event handlers that keep the previous node annotations in sync when adding or deleting nodes.  Here’s an example that includes these extension methods and event handlers:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Xml.Linq;

 

class PreviousNodeAnnotation

{

    public XNode PreviousNode;

    public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }

}

 

public static class Extensions

{

    public static XNode PreviousNodeFast(this XNode node)

    {

        return node.Annotation<PreviousNodeAnnotation>().PreviousNode;

    }

 

    public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)

    {

        XNode currentNode = node;

        while (true)

        {

            XNode prevNode = currentNode.PreviousNodeFast();

            if (prevNode == null)

                yield break;

            else

                yield return prevNode;

            currentNode = prevNode;

        }

    }

 

    public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)

    {

        return NodesBeforeSelfFast(element).OfType<XElement>();

    }

}

 

class Program

{

    static void ValidatePreviousNodes(XElement element)

    {

        XNode prev = null;

        foreach (XNode node in element.Nodes())

        {

            if (node.PreviousNodeFast() != prev)

            {

                Console.WriteLine("ERROR: previous nodes are invalid");

                Environment.Exit(0);

            }

            prev = node;

        }

        Console.WriteLine("Validated");

    }

 

    static void Main(string[] args)

    {

        // create a tree with lots of nodes

        XElement root = new XElement("Root",

            Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));

 

        // setup previous nodes after tree creation

        XElement prev = null;

        foreach (var item in root.Elements())

        {

            item.AddAnnotation(new PreviousNodeAnnotation(prev));

            prev = item;

        }

 

        // add event handlers to take care of adding / deleting nodes

        root.Changed += new EventHandler<XObjectChangeEventArgs>((o, e) =>

        {

            if (e.ObjectChange == XObjectChange.Add)

            {

                Console.WriteLine("Add");

                XNode node = o as XNode;

 

                // o could be an XAttribute, in which case it's not applicable

                if (node != null)

                {

                    node.AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));

                    if (node.NextNode != null)

                    {

                        node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();

                        node.NextNode.AddAnnotation(new PreviousNodeAnnotation(node));

                    }

                }

            }

        });

        root.Changing += new EventHandler<XObjectChangeEventArgs>((o, e) =>

        {

            if (e.ObjectChange == XObjectChange.Remove)

            {

                Console.WriteLine("Remove");

                XNode node = o as XNode;

 

                // o could be an XAttribute, in which case it's not applicable

                if (node != null)

                {

                    if (node.NextNode != null)

                    {

                        node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();

                        node.NextNode

                            .AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));

                    }

                }

            }

        });

 

        ValidatePreviousNodes(root);

 

        root.Elements().ElementAt(3).AddAfterSelf(

            new XElement("NewChild", 999)

        );

        ValidatePreviousNodes(root);

 

        root.Nodes().ElementAt(2).Remove();

        ValidatePreviousNodes(root);

 

        root.Nodes().ElementAt(3).NodesBeforeSelfFast().Remove();

        ValidatePreviousNodes(root);

 

        ValidatePreviousNodes(root);

        root.AddFirst(

            new XElement("ANode", 1)

        );

        ValidatePreviousNodes(root);

        root.Add(

            new XElement("ANode", 2)

        );

        ValidatePreviousNodes(root);

        root.Nodes().Last().NodesBeforeSelfFast().Remove();

        ValidatePreviousNodes(root);

        root.Add(

            new XElement("ANode", 2)

        );

        ValidatePreviousNodes(root);

        root.Add(

            new XElement("ANode", 2)

        );

        root.Add(

            new XElement("ANode", 2)

        );

        root.Add(

            new XElement("ANode", 2)

        );

        ValidatePreviousNodes(root);

        root.Nodes().First().Remove();

        ValidatePreviousNodes(root);

        root.Add(

            new XElement("ANode", 2)

        );

        ValidatePreviousNodes(root);

        root.Nodes().Last().Remove();

 

        root.Add(Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));

 

        XElement last = root.Elements().ElementAt(50);

        foreach (var item in last.NodesBeforeSelfFast())

        {

            Console.WriteLine(item);

        }

    }

}

  

I would personally only define these extension methods in the module where I need good performance of reverse document order queries.  It would be messy to have these extension methods in scope for modules that don't set up the annotations.

Code is attached.

Two Interesting Open XML Articles on OpenXmlDeveloper.org

Darcy Thomas has written two interesting Open XML articles on OpenXmlDeveloper.org:

Converting a Facebook stream into an Open XML spreadsheet

This sample is an app that pulls down a Facebook stream and puts it into a nicely formatted Open XML spreadsheet using PHPExcel. The spreadsheet reorganizes posts into a time log with rows showing the person’s photo, their name (with a link back to their profile), the actual message from the post, and the time and date of the post.

Accessing the C# code from PowerTools for Open XML in a .NET application

This is an interesting article that shows how you can take advantage of the C# code that sits behind the PowerTools for Open XML.  He recreates in C# the PowerShell application that Lawrence Hodson wrote for this article.  The primary purpose of the PowerTools for Open XML isn’t to provide a C# library for developers to use.  Its main purpose is to provide example code and guidance for the types of things you commonly want to do with Open XML.  That said, this is an interesting article that shows how to look around inside the PowerTools source code, and to use the DocumentBuilder class (from C#) to generate a nicely formatted word processing document.

Office Developer Conference moving to SharePoint Conference 2009

Gray Knowlton (Group Product Manager for Office) has posted news that Office Developer Conference (ODC) will not take place this year.  Instead, the ODC content will be included within the SharePoint Conference.

From his post:

As you may have seen at PDC, TechEd or elsewhere, Office 2010 is on its way. To help you get ready, Office 2010 for Developers will be highlighted at the upcoming SharePoint Conference (October 2009, Las Vegas, NV) and TechEd conferences around the world in 2009 and 2010.

NET: Office Developer Conference will not take place this year; instead we are including the Office Developer Conference content within the SharePoint Conference.  If you are an attendee of Office Developer Conference in the past, we strongly recommend you come see us at the SharePoint Conference in October, where we’ll cover Office client development in depth. Be sure to sign up for the Technical Preview as well!

We are optimizing our show presence for developers seeking opportunities to build on the Office platform, which includes Office client applications, SharePoint, Exchange and Communicator. By adding the ODC track to the 2009 SharePoint conference, we can provide better exposure to those seeking to develop solutions across the platform.

Links – May 28, 2009

Open XML Developer

There are some great new articles:

Introduction to the Open XML SDK 2.0

XSLT transforming XML to Open XML using Java

OPC Team Blog

The Open Packaging Conventions (OPC) team here at Microsoft has started a new blog.  They’ve started off a series of posts with Adventures in Packaging Episode 1.  I see more and more opportunities to take advantage of OPC, and it’s clear that other folks do too.

LoBand and PDA Views of MSDN Content

If you haven’t seen this before, the Lo Band and PDA views of MSDN content ROCK!  I spend a fair amount of time on the bus.  I use the PDA view on my phone to fill in gaps in my knowledge on the .NET framework.  Check out the LINQ to XML docs on your phone:  http://msdn.microsoft.com/en-us/library/bb387098(pda).aspx  The Library Experience team (LEX) has a great blog post that details the new views of MSDN content.

MSDN Code Search

The MSDN Code Search Preview lets you search for code in the MSDN Library, MSDN Code Gallery, and CodePlex.  This is a much-needed addition to developers’ toolkits.

Links – May 15, 2009

Doug Mahugh and a bunch of the standards crew (both in and out of Microsoft) have been having a great discussion on document format interoperability.  Their posts (and the posts they reference) are worth reading.

Michael Kiselman and his crew have been publishing some great case studies on the use of Open XML.  These case studies are hard evidence of the uptake of the format.  I really love seeing people putting the document formats to good use.  Folks are building innovative solutions that really wouldn’t be possible without using Open XML.

  • Microsoft has a group (The Microsoft IT Business Intelligence Center of Excellence Core Scorecard Team) that delivers key information to more than 3,000 people worldwide, sometimes in the form of automatically generated macro-enabled spreadsheets.  They wanted to deliver these through Excel Services in SharePoint, but SharePoint doesn’t allow web delivery of macro-enabled spreadsheets.  Well, they did something cool: before delivering from SharePoint to the user, they add (using Open XML) digitally signed VBA code that contains the necessary macros.  This is an interesting read.
  • Darwin Information Typing Architecture (DITA) is an XML based solution for topic-based authoring.  Content Technologies has built DITA Exchange, which uses some of the cool features of Open XML to allow people to author DITA XML in Word.  They translate DITA XML to Open XML and back again.
  • Savo Group provides an on-demand collaborative Sales Enablement solution called SAVO. Customers wanted documents delivered in Open XML.  Using the Open XML SDK, SAVO generates customized documents for download to end users.

Look for these among the studies that you find here.

Working with Optional Elements and Attributes in LINQ to XML Queries

Often XML schemas allow for optional elements and attributes.  When you write queries on these elements or attributes, you may be tempted to write code that does lots of testing for null.  There is a better way to do this, laid out in this post.  I covered this idiom in a previous post, but the main purpose of that post wasn’t to explain this idiom.  I’m speaking on using LINQ with Open XML tomorrow at TechEd 2009, and need a better example.

The following XML document is a simplified variation of markup that you can find in Open XML word processing documents:

<document>
  <body>
    <p>
      <t>Text of first para.</t>
    </p>
    <p>
      <pPr>
        <pStyle val="Heading1"/>
      </pPr>
      <t>Text of second para.</t>
    </p>
  </body>
</document>

 

The first paragraph doesn’t have a <pPr> element, whereas the second does.  This is allowable in Open XML word processing documents.  The first paragraph has the default style.

Our task is to write a query that returns the style name for each paragraph, but if the paragraph has no style name, then the paragraph has the default style.  The code projects a collection of an anonymous type that contains the style name and the text.  If the paragraph has the default style, the StyleName is set to null.

The approach where the code tests for null looks like this:

using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static string GetStyleName(XElement p)
    {
        XElement pPr = p.Element("pPr");
        if (pPr != null)
        {
            XElement pStyle = pPr.Element("pStyle");
            if (pStyle != null)
                return (string)pStyle.Attribute("val");
        }
        return null;
    }

    static void Main(string[] args)
    {
        XElement root = XElement.Parse(
@"<document>
  <body>
    <p>
      <t>Text of first para.</t>
    </p>
    <p>
      <pPr>
        <pStyle val='Heading1'/>
      </pPr>
      <t>Text of second para.</t>
    </p>
  </body>
</document>");

        var paragraphs = root
            .Element("body")
            .Elements("p")
            .Select(p => new
            {
                StyleName = GetStyleName(p),
                Text = (string)p.Element("t")
            });

        foreach (var item in paragraphs)
            Console.WriteLine(item);
    }
}

 

This works just fine, and yields the expected results:

{ StyleName = , Text = Text of first para. }
{ StyleName = Heading1, Text = Text of second para. }

 

Beyond making the code harder to read, this approach introduces two additional points of possible failure.  If I had neglected to write the code to test for null, my code would throw an exception.

There is another way to write this query, which is to use the Elements and Attributes extension methods that operate on IEnumerable<XElement>.

using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main(string[] args)
    {
        XElement root = XElement.Parse(
@"<document>
  <body>
    <p>
      <t>Text of first para.</t>
    </p>
    <p>
      <pPr>
        <pStyle val='Heading1'/>
      </pPr>
      <t>Text of second para.</t>
    </p>
  </body>
</document>");

        var paragraphs = root
            .Element("body")
            .Elements("p")
            .Select(p => new
            {
                StyleName = (string)p.Elements("pPr").Elements("pStyle")
                    .Attributes("val").FirstOrDefault(),
                Text = (string)p.Element("t")
            });

        foreach (var item in paragraphs)
            Console.WriteLine(item);
    }
}

 

This also yields the same results, and doesn’t contain the two points of possible failure.

Here’s how this code works.  In the snippet below, the expression p.Elements("pPr") evaluates to a collection of XElement objects.  Notice that I used the Elements method, not the Element method, even though I know that there can only be zero or one <pPr> elements.  The expression returns a collection of either zero or one items.

StyleName = (string)p.Elements("pPr").Elements("pStyle")
    .Attributes("val").FirstOrDefault(),

 

The Elements extension method yields all child elements with the given name for each and every element in the source collection.  In the snippet below, the call to .Elements("pStyle") returns either a collection containing one XElement object (if there was a <pPr> element with a <pStyle> child), or an empty collection if there wasn’t:

StyleName = (string)p.Elements("pPr").Elements("pStyle")
    .Attributes("val").FirstOrDefault(),

 

Next, the code ‘dots’ into the Attributes extension method (the .Attributes("val") call in the snippet below).  Again, the Attributes extension method is happy to take a collection of elements as source.  If an empty collection is passed to the Attributes extension method, it also returns an empty collection:

StyleName = (string)p.Elements("pPr").Elements("pStyle")
    .Attributes("val").FirstOrDefault(),

 

The FirstOrDefault extension method either returns the first element in a collection, or it returns the default value for the type of items in the collection.  The default value for all reference types (which XAttribute and XElement are) is null.  In this case, FirstOrDefault will either return the one “val” attribute, or it will return null.

Finally, we cast this one value (either null or an XAttribute) to string.  String is a reference type, so it can hold null, and the explicit conversion operator (the cast operator for XAttribute or XElement) is happy to take null and return null.  If there is no <pPr> element, StyleName will be set to null.  If there is a pPr element, a pStyle element, and a val attribute, then StyleName will be set to the value of the attribute.

This is where the other nullable CLR types (int?, bool?, double?, etc.) come in handy.  If we want to get the value of an optional element or attribute, and we know that the value is an integer or double, or whatever, instead of casting to string, we can cast to any of the other nullable types.  The same semantics apply.
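As a rough sketch of the same idiom with other nullable types, the helpers below read an optional integer and an optional Boolean.  The spacing and keepNext elements and their attribute names are hypothetical simplified markup in the spirit of the sample document above; they are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class OptionalValues
{
    // Hypothetical markup: <pPr><spacing before='240'/></pPr> is optional.
    public static int? GetSpacingBefore(XElement p)
    {
        // FirstOrDefault yields null when any link in the chain is missing;
        // casting a null XAttribute to int? returns null instead of throwing.
        return (int?)p.Elements("pPr").Elements("spacing")
            .Attributes("before").FirstOrDefault();
    }

    // Hypothetical markup: <pPr><keepNext val='true'/></pPr> is optional.
    public static bool? GetKeepNext(XElement p)
    {
        return (bool?)p.Elements("pPr").Elements("keepNext")
            .Attributes("val").FirstOrDefault();
    }
}
```

Callers can then distinguish "not specified" (null) from an explicit value, or supply a default with the ?? operator, for example OptionalValues.GetSpacingBefore(p) ?? 0.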

Interestingly, this is also efficient.  Here’s why:

  • The LINQ to XML axes use deferred execution.
  • FirstOrDefault starts the process of materialization, requesting the first item in the collection from the Attributes extension method.
  • It, in turn, requests the first item from the Elements("pStyle") call.
  • It, in turn, requests the first element from the XElement.Elements method, which finally yields up an XElement (or returns an empty collection if there isn’t one).
  • This one element is yielded up to the Elements("pStyle") extension method.
  • This one element is yielded up to the Attributes("val") extension method.
  • The one attribute is yielded up to the FirstOrDefault extension method, which due to its semantics, “short circuits” the query, and never requests another item from its source.
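The short-circuit behavior in the steps above can be observed with a small sketch that uses plain LINQ to Objects rather than LINQ to XML.  The Numbers source and the ItemsProduced counter are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Counts how many items the lazy source was actually asked to produce.
    public static int ItemsProduced;

    // A lazy source: each item is produced only when a consumer requests it.
    public static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 1000; i++)
        {
            ItemsProduced++;
            yield return i;
        }
    }

    public static int FirstEvenSquare()
    {
        ItemsProduced = 0;
        // Nothing runs until FirstOrDefault pulls on the chain, and it stops
        // pulling as soon as it has one item.
        return Numbers()
            .Where(n => n % 2 == 0)
            .Select(n => n * n)
            .FirstOrDefault();
    }
}
```

Calling DeferredDemo.FirstEvenSquare() returns 4, and ItemsProduced is 2 afterwards: the source was asked for only two of its 1000 items.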

The net result is that this idiom is efficient, reduces points of possible failure, and is shorter.  I started using it once one of the LINQ architects described this idiom; after I had used it regularly for a while, it became quite natural, and I now mostly don't write code in the other style.  So I’m curious – a question for all you LINQ users out there: do you use this idiom?

Comparison of Navigating Parts between System.IO.Packaging and the Open XML SDK

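Both methods below navigate from the package to the main document part, and then from the main document part to the styles part.  With System.IO.Packaging, each hop means finding a relationship by type, resolving its target URI, and getting the part.  With the Open XML SDK, each hop is a property access (MainDocumentPart, then StyleDefinitionsPart).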

using System;
using System.Linq;
using System.IO;
using System.IO.Packaging;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void SystemIoPackaging()
    {
        const string fileName = "Test.docx";

        const string documentRelationshipType =
          "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
        const string stylesRelationshipType =
          "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles";

        XDocument xDocMainDocument = null;
        XDocument xDocStyleDocument = null;

        using (Package wdPackage = Package.Open(fileName, FileMode.Open, FileAccess.Read))
        {
            //  Navigate from the package to the main document part.
            PackageRelationship docPackageRelationship = wdPackage
                .GetRelationshipsByType(documentRelationshipType)
                .FirstOrDefault();
            if (docPackageRelationship != null)
            {
                Uri documentUri = PackUriHelper.ResolvePartUri(
                       new Uri("/", UriKind.Relative), docPackageRelationship.TargetUri);
                PackagePart documentPart = wdPackage.GetPart(documentUri);

                //  Load the document XML in the part into an XDocument instance.
                //  The reader uses the stream opened above, so that only one
                //  stream is opened and it is disposed when we are done.
                using (Stream documentPartStream = documentPart.GetStream())
                using (XmlReader documentPartXmlReader =
                       XmlReader.Create(documentPartStream))
                    xDocMainDocument = XDocument.Load(documentPartXmlReader);

                //  Navigate from the main document part to the styles part.
                //  There will only be one.
                PackageRelationship styleRelation = documentPart
                    .GetRelationshipsByType(stylesRelationshipType)
                    .FirstOrDefault();
                if (styleRelation != null)
                {
                    Uri styleUri = PackUriHelper.ResolvePartUri(documentUri,
                        styleRelation.TargetUri);
                    PackagePart stylePart = wdPackage.GetPart(styleUri);

                    //  Load the style XML in the part into an XDocument instance.
                    using (Stream stylePartStream = stylePart.GetStream())
                    using (XmlReader stylePartXmlReader =
                           XmlReader.Create(stylePartStream))
                        xDocStyleDocument = XDocument.Load(stylePartXmlReader);
                }
            }
        }
        Console.WriteLine("The main document part has {0} nodes.",
            xDocMainDocument.DescendantNodes().Count());
        Console.WriteLine("The style part has {0} nodes.",
            xDocStyleDocument.DescendantNodes().Count());
    }

    static void OpenXmlSdk()
    {
        const string filename = "Test.docx";

        //  The document is only read here, so open it read-only (isEditable = false).
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, false))
        {
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;
            StyleDefinitionsPart styleDefinitionsPart = mainPart.StyleDefinitionsPart;
            Console.WriteLine("The main document part has {0} nodes.",
                mainPart.RootElement.Descendants().Count());
            Console.WriteLine("The style part has {0} nodes.",
                styleDefinitionsPart.RootElement.Descendants().Count());
        }
    }

    static void Main(string[] args)
    {
        SystemIoPackaging();
        OpenXmlSdk();
    }
}

Using PHP with Open XML Spreadsheet Documents

OpenXmlDeveloper.org has just posted an interesting new article on using PHP to create Open XML spreadsheet documents.  It includes a sample that shows how to create an Excel spreadsheet from various formats created by financial applications and online banking websites.  The code takes advantage of Maarten Balliauw's PHPExcel SDK on CodePlex, which enables the reports to include some basic functions and conditional formatting.

LINQ / Open XML at TechEd 2009

I’ll be presenting a talk on LINQ and Open XML at TechEd 2009.  The session is “OFC403 Developing Office Client Solutions Using LINQ and Open XML”.  I’ll be presenting on Thursday, 5/14, at 4:30PM in Room 411.  Here is the abstract for the talk:

In this session, learn how to leverage LINQ with version 1 and 2 of the Open XML SDK to create, modify, and query Open XML documents. Version 1 works well with LINQ to XML, and Version 2 provides a strongly-typed document object model (also LINQ friendly) that makes it easier to work with XML parts. Both approaches benefit from the conciseness of LINQ and the power of functional construction. In addition, we talk about and demonstrate tools that you can use to develop Open XML solutions faster and better. This session is very technical; knowledge of generics and LINQ are prerequisites to get the most from this session.

I’ve presented this talk before, at TechEd in Barcelona last fall, and at an internal Microsoft conference.  In general, folks like this talk, and have given me high scores.  However, in both previous presentations of this talk, one or two attendees criticized the talk because I spent some amount of time talking about LINQ and why you want to use it.

I feel conflicted about how to present the material in this talk.  I always take an informal survey at the beginning of the talk, and 50% of the attendees haven’t really used LINQ and don’t know much about functional programming.  I feel that without a little introduction to LINQ, these people might not appreciate the benefits that LINQ and functional programming offer to us.  I certainly understand the sentiments of the LINQ / functional programming experts who attend – they basically want to see the tricks and patterns that I’ve developed for working with Open XML, and feel that the time spent on LINQ is wasted for them.

After a fair amount of reflection, I feel that more people are better served if I speak about LINQ for a few minutes before we dive into its application with Open XML, so I am again going to do the short intro to LINQ – I’ll limit this intro to 15 minutes, I promise!  My goal in this intro is not to bring people up to speed on LINQ, but instead point out the basic principles that underlie it – specifically, how to craft pure functional transforms.

In the LINQ portion of the talk, as well as when I talk about using it with Open XML, I will occasionally point out pages on my blog that contain more information.  My goal here isn’t to increase traffic on my blog :)  Instead, this lets me talk about concepts in as short an amount of time as possible, and attendees can refer to my blog to fill in the technical details.  I want to cover as much information as possible, and I believe this is an effective way to do it.

Staffing the Office / SharePoint Booth

I’ll also be at the Office / SharePoint booths at the following times:

  • Wednesday, May 13, 3:00 PM – 6:00 PM
  • Thursday, May 14, 10:45 AM – 1:45 PM
  • Friday, May 15, 10:00 AM – 1:15 PM

I really enjoy meeting customers and Microsoft employees who are using Office development technologies, including Open XML and SharePoint.  Please feel free to come by the booth at the above times and introduce yourself!  I’d love to hear about your development efforts, and am happy to help in any way that I can.  I'll have a little time available for 1:1 consultations - if you'd like to chat in a side meeting about your development efforts, email me through the EMAIL link above.

Links – April 29, 2009

Kirk Evans, an architect evangelist at Microsoft, has posted five cool videos on SharePoint development:

Johann Granados, from Staff DotNet, has written some interesting posts on using the Open XML SDK v2:

The Use of Extension Methods to Manage Open XML Document Changes in PowerTools for Open XML

There is an interesting approach that we use in PowerTools for Open XML that makes it easy to write cmdlets that modify Open XML documents.  This approach isn’t very complicated, but aspects of this approach need some explanation so that developers who are extending the PowerTools can understand what’s going on.  This approach is based on the techniques detailed in Technical Improvements in the Open XML SDK and Using LINQ to XML Events and Annotations to Track if an XML Tree has Changed.  This post explains the approach that we took in PowerTools in detail.

Note from Eric White:  This is another guest post by Bob McClellan.  I’ve met and worked with an awful lot of developers in my career, and Bob is one of the best developers that I’ve ever worked with.  He’s an expert in more areas than is possible to list here, but a few relevant areas are C++, Windows development, C#, .NET, LINQ, Open XML, WPF, and SQL.  His contributions to the Open XML conversation are well documented – he wrote this C code implementing the legacy hashing algorithm in word processing ML and the KParts proof-of-concept of an implementation of embedding linked objects from an Open XML document.  He also wrote the code that enables document composability in v1.1 of PowerTools for Open XML.  Bob, as you can see from the various guest posts on my blog, is also a good writer.  Find out more about Bob here.

The Open XML SDK makes it very easy to create an XDocument object from a part by using a Stream.  It does not, however, have any way to keep track of changes to that object.  The PowerTools for Open XML uses a couple of simple extensions to the Open XML SDK classes to manage XDocument objects and changes to those documents.

Design Requirements

We wanted to meet certain requirements with these extensions.  First, we didn’t want to add another class or new member variables and so on.  We wanted the addition to be lightweight.  Second, a part should only need to be read once for as long as the package is open.  Third, changes should not be written out until we are ready to close the package (or when a logical group of changes is complete).

Synchronization Issues

In theory, it is a bad idea to represent the same data multiple times in a program.  If the multiple values that should be the same become different for some reason, then the program will most certainly not work correctly.  The best way to avoid that kind of problem is to store information just once and then there is nothing to synchronize.  However, we often break this rule for various reasons and performance is one of them.  When we are dealing with an Open XML package, we don’t want to have to keep reading and writing parts during a series of changes to that document.  In particular, if the part is very large, many reads and writes could be time-consuming.  As you will see below, the compromise is to keep the in-memory version of the document in an XDocument that is tightly coupled with the part so that it is unlikely that we will lose synchronization between the two.

Extension Methods

Extension methods are methods that are defined to appear as if they are member methods of an existing class.  The limitation is that extension methods cannot add any member variables to the class, but there are ways around that limitation, as you will see below.  The two extension methods that we will be examining are:

public static XDocument GetXDocument(this OpenXmlPart part)
public static void FlushParts(this OpenXmlPackage doc)

 

The first method will be used to get an XDocument for a particular part.  The keyword “this” signifies that it is an extension method for the OpenXmlPart class.  This method will look to see if an XDocument has already been created for that part.  Otherwise, it creates the XDocument.

The second method is used to write out any changes to the XDocuments that have been created.  It is an extension to the OpenXmlPackage class because changes to all parts will be written out when this method is called.

Caching XDocuments

As mentioned above, we want to make sure that an XDocument is only read once no matter how many times we might examine it or modify it while the package is open.  An easy way to do this without creating new objects is by using Annotations on the part in the Open XML SDK.  Annotations are simply objects that can be attached to other objects.  We can avoid reading XDocuments more than once by attaching the XDocument to the part as an annotation.  Here is the GetXDocument extension method:

public static XDocument GetXDocument(this OpenXmlPart part)
{
    XDocument xdoc = part.Annotation<XDocument>();
    if (xdoc != null)
        return xdoc;
    try
    {
        using (StreamReader sr = new StreamReader(part.GetStream()))
        using (XmlReader xr = XmlReader.Create(sr))
        {
            xdoc = XDocument.Load(xr);
            xdoc.Changed += ElementChanged;
            xdoc.Changing += ElementChanged;
        }
    }
    catch (XmlException)
    {
        xdoc = new XDocument();
        xdoc.AddAnnotation(new ChangedSemaphore());
    }
    part.AddAnnotation(xdoc);
    return xdoc;
}

 

The parts that deal with ElementChanged and the ChangedSemaphore will be explained in the next section.  The first line calls a method that tries to retrieve an annotation that is an XDocument object.  If there is no XDocument object for that part, the method returns null.  The next line checks to see if the XDocument was returned.  If so, it was read in a previous call and there’s nothing else we need to do except return that XDocument.

If there is no annotation for the XDocument, then we need to read it in.  The process for reading an XDocument from a part is next.  A StreamReader class is needed for XmlReader and then that can be used for the static Load method that creates the XDocument.  If that process fails, then we can assume that the part doesn’t have any content yet, so we create an empty XDocument.  In either case, once we have the new XDocument object, we then add an annotation with that object so that this part will not be read again.
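The check-annotation, compute, attach-annotation pattern is not specific to the Open XML SDK.  As a minimal sketch of the same idea, the extension method below caches a computed value on an XElement by using LINQ to XML annotations as a stand-in for part annotations.  The WordCount class, the GetWordCount method, and the ComputeCount counter are hypothetical names invented for this illustration.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical wrapper type for the cached value. Using a dedicated type
// keeps it from colliding with annotations added by other code.
class WordCount
{
    public int Value;
    public WordCount(int value) { Value = value; }
}

static class AnnotationCacheExtensions
{
    // Counts how often the value is actually computed (for illustration only).
    public static int ComputeCount;

    public static int GetWordCount(this XElement element)
    {
        // First look for a previously attached annotation.
        WordCount cached = element.Annotation<WordCount>();
        if (cached != null)
            return cached.Value;

        // Not cached yet: compute the value, then attach it to the element.
        ComputeCount++;
        int count = element.Value
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Length;
        element.AddAnnotation(new WordCount(count));
        return count;
    }
}
```

Calling GetWordCount twice on the same element computes the count only once.  Note that this sketch never invalidates the cached value if the element's text changes later; it illustrates only the caching half of the design.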

Tracking Changes

There were a few lines of code in the GetXDocument method that are used to track changes to that XDocument.  The basic approach to tracking changes is to use the ChangedSemaphore object to “tag” the XDocuments that have changed.

private class ChangedSemaphore { }

The class has no content; it is just used to identify which XDocuments have changed.  The other two lines of code from that method that are used to track changes set event handlers that will be called when the XDocument is changed by any other method calls.  Here is the code for the event handler:

private static EventHandler<XObjectChangeEventArgs> ElementChanged = new
    EventHandler<XObjectChangeEventArgs>(ElementChangedHandler);

private static void ElementChangedHandler(object sender,
    XObjectChangeEventArgs e)
{
    XDocument xDocument = ((XObject)sender).Document;
    if (xDocument != null)
    {
        xDocument.Changing -= ElementChanged;
        xDocument.Changed -= ElementChanged;
        xDocument.AddAnnotation(new ChangedSemaphore());
    }
}

This method is called when the XDocument is changed.  It starts by removing itself as an event handler.  Once we have detected a change, there is no reason to have the event handler called again upon subsequent changes.  Next is the addition of the ChangedSemaphore as an annotation for the XDocument.  As you will see in the next section, we will use that annotation to determine which parts need to be written.  If you look back to the GetXDocument method, you will also see that the ChangedSemaphore object is added as an annotation for a new XDocument because we assume that a newly created empty XDocument will always be changed.
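The tracking mechanism can be tried in isolation with LINQ to XML alone.  The following is a minimal sketch of the same approach without the Open XML SDK; ChangeTracking, Track, HasChanged, and ChangedMarker are hypothetical names used only here.

```csharp
using System;
using System.Xml.Linq;

static class ChangeTracking
{
    // Marker type: its presence as an annotation means "this document changed".
    private class ChangedMarker { }

    public static void Track(XDocument doc)
    {
        // Subscribe to both events. For a removal, the sender is already
        // detached by the time Changed fires; for an addition, the sender is
        // not yet attached when Changing fires. Between them, one of the two
        // events always has a sender that still belongs to the document.
        doc.Changing += OnChange;
        doc.Changed += OnChange;
    }

    public static bool HasChanged(XDocument doc)
    {
        return doc.Annotation<ChangedMarker>() != null;
    }

    private static void OnChange(object sender, XObjectChangeEventArgs e)
    {
        XDocument doc = ((XObject)sender).Document;
        if (doc == null)
            return;

        // One notification is enough, so unsubscribe and tag the document.
        doc.Changing -= OnChange;
        doc.Changed -= OnChange;
        doc.AddAnnotation(new ChangedMarker());
    }
}
```

After ChangeTracking.Track(doc), HasChanged(doc) is false until the first modification, for example doc.Root.Add(new XElement("child")), and true from then on.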

Writing Changes

The process of writing out the changes is handled by the FlushParts method and its helper method shown below:

public static void FlushParts(this OpenXmlPackage doc)
{
    HashSet<OpenXmlPart> visited = new HashSet<OpenXmlPart>();
    foreach (IdPartPair item in doc.Parts)
        FlushPart(item.OpenXmlPart, visited);
}

private static void FlushPart(OpenXmlPart part, HashSet<OpenXmlPart> visited)
{
    visited.Add(part);
    XDocument xdoc = part.Annotation<XDocument>();
    if (xdoc != null && xdoc.Annotation<ChangedSemaphore>() != null)
    {
        // XmlWriter.Create(Stream) does not close the stream by default,
        // so dispose of the part stream explicitly.
        using (Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write))
        using (XmlWriter xw = XmlWriter.Create(partStream))
        {
            xdoc.Save(xw);
        }
        xdoc.RemoveAnnotations<ChangedSemaphore>();
        xdoc.Changing += ElementChanged;
        xdoc.Changed += ElementChanged;
    }
    foreach (IdPartPair item in part.Parts)
        if (!visited.Contains(item.OpenXmlPart))
            FlushPart(item.OpenXmlPart, visited);
}

 

The FlushParts method calls its helper method for each part in the package.  The FlushPart helper method checks the XDocument for that part to see if it has changed and then writes it, if it has.  It then recursively calls all the related parts for that part.  The HashSet collection is used to keep track of which parts have already been checked.  It is needed because the parts of a package can be referenced from multiple parts.  It is even possible that there could be “loops” of references that would cause the method to enter an infinite recursion.  Instead, each part is added to the “visited” collection as it is checked and then that collection is checked to be sure the part is not processed a second time.

Writing the document is just a matter of getting the stream for the part and then using the XmlWriter object to write it out.  Once it is written, the method also removes the ChangedSemaphore annotation and sets the event handlers for changes again.  This is done because the package remains open after the FlushParts call.  If additional changes are made, we want to be sure they are detected.
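The role of the “visited” collection is easier to see on a small graph that contains a loop.  The sketch below uses a hypothetical Node class as a stand-in for a part and its related parts; it is an illustration of the traversal pattern only, not code from PowerTools.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a part that references other parts.
class Node
{
    public string Name;
    public List<Node> Related = new List<Node>();
    public Node(string name) { Name = name; }
}

static class GraphWalk
{
    // Returns the names of all reachable nodes, each exactly once.
    public static List<string> VisitAll(IEnumerable<Node> roots)
    {
        var visited = new HashSet<Node>();
        var order = new List<string>();
        foreach (Node root in roots)
            if (!visited.Contains(root))
                Visit(root, visited, order);
        return order;
    }

    private static void Visit(Node node, HashSet<Node> visited, List<string> order)
    {
        // Mark the node before recursing so that a loop back to it is skipped.
        visited.Add(node);
        order.Add(node.Name);
        foreach (Node next in node.Related)
            if (!visited.Contains(next))
                Visit(next, visited, order);
    }
}
```

If node a references b and c, and b references a and c, then VisitAll starting from a visits a, b, and c once each and terminates, even though a and b reference each other.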

Summary

I hope this shows how a very small amount of carefully designed code can create very powerful functionality.  Although we have a little bit of a compromise by storing a copy of the part in memory, that risk is reduced by the simplicity of the code that handles it.  There is still a risk, though.  If any part of the code using these methods makes direct calls to load, process and write a part, then they become out of sync and we don’t have any way to detect if that happened.  It’s all or nothing with this approach, but as long as you always use GetXDocument to get a part and use FlushParts before you close the package, your code will work properly.

-Bob McClellan

Using DocumentBuilder with Content Controls for Document Assembly

DocumentBuilder is an example class that’s part of the PowerTools for Open XML project that enables you to assemble new documents from existing documents.  One of the problems to solve when moving markup from one document to another is that of interrelated markup – markup in one paragraph often has dependencies with markup in other paragraphs, or other parts of the Open XML package.  DocumentBuilder fixes up interrelated markup when assembling a new document from existing documents.  This post shows how to use DocumentBuilder in concert with content controls to control the document assembly.

Zeyad Rajabi wrote a blog post on using content controls to control document assembly.  His post uses the altChunk approach for document assembly.  This post presents code that mirrors the code in his post, except that this code uses DocumentBuilder.  I’ve covered altChunk also, in How to Use altChunk for Document Assembly.

The updated post Inserting / Deleting / Moving Paragraphs in Open XML Wordprocessing Documents documents interrelationships in paragraph markup in detail.

The post Move/Insert/Delete Paragraphs in Word Processing Documents using the Open XML SDK introduces the DocumentBuilder class.

See Comparison of altChunk to the DocumentBuilder Class for more information about both approaches to document assembly.

The gist of the approach is that you insert content controls in the ‘template’ document, setting the tag of each content control to the name of the document that you want inserted at the point of the content control.  For example, in the following document, SolarOverview.docx will replace the content control in the assembled document:

Example Code

The example takes a ‘template’ document, solar-system.docx, and inserts eleven documents into it.  As I mentioned, each inserted document replaces a content control.  This example demonstrates one approach to coding document assembly using content controls and DocumentBuilder:

static void Main(string[] args)
{
    using (WordprocessingDocument solarSystem =
        WordprocessingDocument.Open("solar-system.docx", false))
    {
        XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // get children elements of the <w:body> element
        var q1 = solarSystem
            .MainDocumentPart
            .GetXDocument()
            .Root
            .Element(w + "body")
            .Elements();

        // project collection of tuples containing element and type
        var q2 = q1
            .Select(
                e =>
                {
                    string keyForGroupAdjacent = ".NonContentControl";
                    if (e.Name == w + "sdt")
                        keyForGroupAdjacent = e.Element(w + "sdtPr")
                            .Element(w + "tag")
                            .Attribute(w + "val")
                            .Value;
                    if (e.Name == w + "sectPr")
                        keyForGroupAdjacent = null;
                    return new
                    {
                        Element = e,
                        KeyForGroupAdjacent = keyForGroupAdjacent
                    };
                }
            ).Where(e => e.KeyForGroupAdjacent != null);

        // group by type
        var q3 = q2.GroupAdjacent(e => e.KeyForGroupAdjacent);

        // validate existence of files referenced in content controls
        foreach (var f in q3.Where(g => g.Key != ".NonContentControl"))
        {
            string filename = f.Key + ".docx";
            FileInfo fi = new FileInfo(filename);
            if (!fi.Exists)
            {
                Console.WriteLine("{0} doesn't exist.", filename);
                Environment.Exit(0);
            }
        }

        // project collection with opened WordprocessingDocument;
        // ToList materializes the query so that each document is opened only
        // once, and the dispose loop below closes those same instances
        var q4 = q3
            .Select(g => new
            {
                Group = g,
                Document = g.Key != ".NonContentControl" ?
                    WordprocessingDocument.Open(g.Key + ".docx", false) :
                    solarSystem
            }).ToList();

        // project collection of OpenXml.PowerTools.Source
        var sources = q4
            .Select(
                g =>
                {
                    if (g.Group.Key == ".NonContentControl")
                        return new Source(
                            g.Document,
                            g.Group
                                .First()
                                .Element
                                .ElementsBeforeSelf()
                                .Count(),
                            g.Group
                                .Count(),
                            false);
                    else
                        return new Source(g.Document, false);
                }
            ).ToList();

        DocumentBuilder.BuildDocument(sources, "solar-system-new.docx");

        // dispose of the opened WordprocessingDocument objects
        foreach (var g in q4)
            if (g.Group.Key != ".NonContentControl")
                g.Document.Dispose();
    }
}

 

 

How the Code Works

The code consists of chained queries that eventually build up a list of OpenXml.PowerTools.Source objects, which is what we pass to DocumentBuilder.BuildDocument to specify the sources for the document assembly.

When building up the list of source objects, there are two cases.  Where the ‘template’ document contains paragraphs or tables, we need to include a source object with the source document set to the ‘template’ document, and the source range set to the range of those paragraphs and tables.  Where the ‘template’ document contains a content control, we need to include a source object with the source document set to the document being imported.  In that case we don’t need to set a range; we simply import the entire document.

In other words, we need to group together all adjacent paragraphs and tables that aren’t content controls, and we need to process each content control separately.  This is a job for the GroupAdjacent extension method.  If we create a key such that all non-content-control elements have the same key, and each content control has its own unique key, then we’ll end up with groups of elements to import from the template document, and separate groups that contain one content control each.  As I develop the query, I’ll show intermediate results so that you can see exactly what I mean.
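To make the grouping idea concrete before getting to the real queries, here is a minimal sketch that is separate from the example program.  It is not the GroupAdjacent implementation from this post; the AdjacentGroupingSketch class and the CountRuns method are hypothetical names, and the method only counts how many adjacent items share a key.

```csharp
using System;
using System.Collections.Generic;

public static class AdjacentGroupingSketch
{
    // Hypothetical helper (not the post's GroupAdjacent): walks the sequence once and
    // starts a new run whenever the key differs from the key of the previous item.
    public static List<KeyValuePair<string, int>> CountRuns(IEnumerable<string> keys)
    {
        var runs = new List<KeyValuePair<string, int>>();
        foreach (string key in keys)
        {
            int last = runs.Count - 1;
            if (last >= 0 && runs[last].Key == key)
                runs[last] = new KeyValuePair<string, int>(key, runs[last].Value + 1);
            else
                runs.Add(new KeyValuePair<string, int>(key, 1));
        }
        return runs;
    }

    public static void Main()
    {
        var keys = new[]
        {
            ".NonContentControl", ".NonContentControl", "Sun", "Mercury", ".NonContentControl"
        };
        foreach (var run in CountRuns(keys))
            Console.WriteLine("{0}:  {1}", run.Key, run.Value);
    }
}
```

Unlike GroupBy, which would merge every item that has the same key into one group, this keeps the two runs of .NonContentControl separate, which is what preserves document order.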

The result of the first query is a collection of the child elements of the <w:body> element:

// get child elements of the <w:body> element

var q1 = solarSystem

    .MainDocumentPart

    .GetXDocument()

    .Root

    .Element(w + "body")

    .Elements();

 

This is pretty simple – no need to show the output from this query.

Here is the second query:

// project collection of tuples containing element and type

var q2 = q1

    .Select(

        e =>

        {

            string keyForGroupAdjacent = ".NonContentControl";

            if (e.Name == w + "sdt")

                keyForGroupAdjacent = e.Element(w + "sdtPr")

                    .Element(w + "tag")

                    .Attribute(w + "val")

                    .Value;

            if (e.Name == w + "sectPr")

                keyForGroupAdjacent = null;

            return new

            {

                Element = e,

                KeyForGroupAdjacent = keyForGroupAdjacent

            };

        }

    ).Where(e => e.KeyForGroupAdjacent != null);

 

// temporary code to dump q2

foreach (var item in q2)

    Console.WriteLine(item.KeyForGroupAdjacent);

Environment.Exit(0);

 

If the child element of the <w:body> element is a content control (<w:sdt>), then the KeyForGroupAdjacent member of the anonymous type is set to the tag value of the content control, which is read from the w:val attribute of the <w:tag> element under <w:sdtPr>.

If the child element is not a content control, then KeyForGroupAdjacent is set to “.NonContentControl”.  This is a sentinel value that is very unlikely to conflict with the tag values of the content controls, because those tags are the names of the documents to import.

If the child element is a section marker (<w:sectPr>), then we want to ignore that child element.  Setting the KeyForGroupAdjacent to null, and then filtering out those null items takes care of that.

When we dump out q2 to the console, we see:

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

.NonContentControl

SolarOverview

Sun

Mercury

Venus

Earth

.NonContentControl

Mars

Jupiter

Saturn

Uranus

Neptune

Pluto
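The same key selection can be tried without a .docx file.  The following is a minimal sketch that runs the three cases against a small hand-built <w:body> element; the KeySelectionSketch class, the KeyFor and SampleBody methods, and the sample markup are made up for illustration, and the namespace URI is the standard WordprocessingML main namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class KeySelectionSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Same three cases as the query in the post:
    // content control, section marker, and anything else.
    public static string KeyFor(XElement e)
    {
        if (e.Name == w + "sdt")
            return e.Element(w + "sdtPr").Element(w + "tag").Attribute(w + "val").Value;
        if (e.Name == w + "sectPr")
            return null;
        return ".NonContentControl";
    }

    // Hypothetical body: paragraph, content control tagged "Sun", table, section marker.
    public static XElement SampleBody()
    {
        return new XElement(w + "body",
            new XElement(w + "p"),
            new XElement(w + "sdt",
                new XElement(w + "sdtPr",
                    new XElement(w + "tag", new XAttribute(w + "val", "Sun"))),
                new XElement(w + "sdtContent", new XElement(w + "p"))),
            new XElement(w + "tbl"),
            new XElement(w + "sectPr"));
    }

    public static void Main()
    {
        var keys = SampleBody().Elements().Select(e => KeyFor(e)).Where(k => k != null);
        foreach (string key in keys)
            Console.WriteLine(key);
    }
}
```

This prints .NonContentControl, Sun, and .NonContentControl; the section marker is filtered out.  Note that a content control without a <w:tag> element would cause a NullReferenceException here, just as it would in the query above.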

 

Next, we use the GroupAdjacent extension method to group adjacent .NonContentControl items together:

// group by type

var q3 = q2.GroupAdjacent(e => e.KeyForGroupAdjacent);

 

// temporary code to dump q3

foreach (var g in q3)

    Console.WriteLine("{0}:  {1}", g.Key, g.Count());

Environment.Exit(0);

 

When we run this, we see:

.NonContentControl:  21

SolarOverview:  1

Sun:  1

Mercury:  1

Venus:  1

Earth:  1

.NonContentControl:  1

Mars:  1

Jupiter:  1

Saturn:  1

Uranus:  1

Neptune:  1

Pluto:  1

 

Next, the code validates that the .DOCX files referenced by the content controls exist:

// validate existence of files referenced in content controls

foreach (var f in q3.Where(g => g.Key != ".NonContentControl"))

{

    string filename = f.Key + ".docx";

    FileInfo fi = new FileInfo(filename);

    if (!fi.Exists)

    {

        Console.WriteLine("{0} doesn't exist.", filename);

        Environment.Exit(1);  // non-zero exit code signals the failure

    }

}

 

Then, the code projects a collection of anonymous types that include the group, as well as the open WordprocessingDocument objects:

// project collection with opened WordprocessingDocument objects

var q4 = q3

    .Select(g => new

    {

        Group = g,

        Document = g.Key != ".NonContentControl" ?

            WordprocessingDocument.Open(g.Key + ".docx", false) :

            solarSystem

    }).ToList();  // materialize the query so that each document is opened only once

 

The observant will notice that opening these documents very definitely introduces state to this very not-pure query.  We’ll need to close/dispose of those documents later.  I’ve been fermenting an idea about wrappers around the Open XML SDK that give true functional composability to Open XML documents.  This approach would eliminate this issue of classes that implement IDisposable.  If when I open that bottle it hasn’t turned to vinegar, I’ll blog it.
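One consequence of that state is worth spelling out.  LINQ queries are evaluated lazily, so a projection that has a side effect runs again each time the query is enumerated.  Here is a minimal sketch of the effect, separate from the example program; FakeOpen is a hypothetical stand-in for WordprocessingDocument.Open that only counts how often it is called, and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    public static int OpenCount;

    // Stand-in for WordprocessingDocument.Open: it only counts its calls.
    static string FakeOpen(string name)
    {
        OpenCount++;
        return name + ".docx";
    }

    // Enumerates the same query twice and reports how often FakeOpen ran.
    public static int CountOpens(bool materialize)
    {
        OpenCount = 0;
        IEnumerable<string> query = new[] { "Sun", "Mercury" }.Select(n => FakeOpen(n));
        if (materialize)
            query = query.ToList();   // run the projection once and keep the results
        foreach (var d in query) { }  // first pass, e.g. building the sources
        foreach (var d in query) { }  // second pass, e.g. disposing the documents
        return OpenCount;
    }

    public static void Main()
    {
        Console.WriteLine(CountOpens(false));  // 4: the projection ran on both passes
        Console.WriteLine(CountOpens(true));   // 2: the projection ran only once
    }
}
```

Applied to a query that opens documents, this suggests materializing it (for example, with ToList) before enumerating it more than once, so that the documents that get disposed are the same ones that were opened.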

Finally, we’re ready to project the list of OpenXml.PowerTools.Source objects:

// project collection of OpenXml.PowerTools.Source

var sources = q4

    .Select(

        g =>

        {

            if (g.Group.Key == ".NonContentControl")

                return new Source(

                    g.Document,

                    g.Group

                        .First()

                        .Element

                        .ElementsBeforeSelf()

                        .Count(),

                    g.Group

                        .Count(),

                    false);

            else

                return new Source(g.Document, false);

        }

    ).ToList();
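To see where the start position passed to the Source constructor comes from, here is a small sketch that is separate from the example program.  ElementsBeforeSelf returns the preceding sibling elements, so counting them gives the zero-based position of an element within its parent.  The RangeSketch class, the IndexInParent method, and the un-namespaced element names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RangeSketch
{
    // Zero-based position of an element among its sibling elements.
    public static int IndexInParent(XElement e)
    {
        return e.ElementsBeforeSelf().Count();
    }

    public static void Main()
    {
        // Hypothetical body: two paragraphs, a content control, one more paragraph.
        var body = new XElement("body",
            new XElement("p", "one"),
            new XElement("p", "two"),
            new XElement("sdt"),
            new XElement("p", "three"));

        Console.WriteLine(IndexInParent(body.Elements().First()));  // 0
        Console.WriteLine(IndexInParent(body.Elements().Last()));   // 3
    }
}
```

In the query above, the first element of each .NonContentControl group supplies this start position, and the size of the group supplies the count.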

 

The last step calls DocumentBuilder.BuildDocument, and then disposes of all of the opened WordprocessingDocument objects (except the ‘template’ document, which is disposed when execution leaves the scope of the using statement).

DocumentBuilder.BuildDocument(sources, "solar-system-new.docx");

 

// dispose of the opened WordprocessingDocument objects

foreach (var g in q4)

    if (g.Group.Key != ".NonContentControl")

        g.Document.Dispose();

 

The entire example, including the implementation of the GroupAdjacent extension method and the GetXDocument extension method, follows.  I’ve attached the source file and the sample documents to this post.  This code works with version 1.1.1 of DocumentBuilder (and not with prior versions).  You can download DocProc.zip, which contains DocumentBuilder, from http://www.CodePlex.com/PowerTools.  It’s under the ‘Downloads’ tab.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.IO;

using System.Xml;

using System.Xml.Linq;

using DocumentFormat.OpenXml.Packaging;

using OpenXml.PowerTools;

 

public class GroupOfAdjacent<TSource, TKey> : IEnumerable<TSource>, IGrouping<TKey, TSource>

{

    public TKey Key { get; set; }

    private List<TSource> GroupList { get; set; }

 

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()

    {

        return ((System.Collections.Generic.IEnumerable<TSource>)this).GetEnumerator();

    }

 

    System.Collections.Generic.IEnumerator<TSource> System.Collections.Generic.IEnumerable<TSource>.GetEnumerator()

    {

        foreach (var s in GroupList)

            yield return s;

    }

 

    public GroupOfAdjacent(List<TSource> source, TKey key)

    {

        GroupList = source;

        Key = key;

    }

}

 

public static class LocalExtensions

{

    public static XDocument GetXDocument(this OpenXmlPart part)

    {

        XDocument xdoc = part.Annotation<XDocument>();

        if (xdoc != null)

            return xdoc;

        // dispose of both the part stream and the reader once the XDocument is loaded

        using (Stream stream = part.GetStream())

        using (XmlReader reader = XmlReader.Create(stream))

            xdoc = XDocument.Load(reader);

        part.AddAnnotation(xdoc);

        return xdoc;

    }

 

    public static IEnumerable<