Splitting Runs in Open XML Word Processing Document Paragraphs

(July 1, 2009 - Updated RunTransform to be recursive)

In Open XML word processing document markup, paragraphs contain runs, and runs contain text elements.  Sometimes when transforming a document, we want to split runs differently than in the original document.  This post presents two small functions that help us work with paragraphs and runs: one determines where the runs in a paragraph are split, and the other splits runs at specified locations.

Word 2007 has a neat feature where you can lock a document and disallow editing of the content, yet still allow the user to add comments.  You can send this document for review to a number of users, and after the reviewers return the documents, it would be handy to have some code that merges the comments from all documents into a single document.  I’m currently working on a blog post that shows how to do this.  However, adding a comment to a paragraph can cause runs to be split, which adds a bit of complexity.

Paragraphs, Runs, and Text Elements

The following markup shows a very simple paragraph.  We can see the paragraph element, the run element, and the text element.

<w:p>
  <w:r>
    <w:t>abcdefghi</w:t>
  </w:r>
</w:p>

If we select “def” in the above text, and add a comment, the markup changes to look like this:

<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>

In this paragraph, we can see the commentRangeStart and commentRangeEnd elements, which mark the beginning and end of the commented text.  In addition, we can see a special run that follows commentRangeEnd.  This run is styled with the CommentReference character style and contains a commentReference element instead of text.

If we want to programmatically insert a comment into a document, we need to split runs as appropriate so that we can insert commentRangeStart, commentRangeEnd, and the special run that contains commentReference into the paragraph.
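To make the target shape concrete, here is a minimal sketch of adding those three pieces once the runs are already split.  It uses plain LINQ to XML.  MarkRun and its parameters are hypothetical names that are not part of the code in this post, and the sketch assumes the run to be commented is a direct child of the paragraph.  A real comment also needs a matching entry in the comments part, which this sketch does not create.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class CommentMarkerSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns a copy of the paragraph with comment markers
    // placed around the direct child run at position runIndex.
    public static XElement MarkRun(XElement paragraph, int runIndex, int commentId)
    {
        // deep clone, so the caller's paragraph is left untouched
        XElement copy = new XElement(paragraph);
        XElement run = copy.Elements(w + "r").ElementAt(runIndex);

        run.AddBeforeSelf(
            new XElement(w + "commentRangeStart", new XAttribute(w + "id", commentId)));

        // the range end and the reference run go immediately after the run, in this order
        run.AddAfterSelf(
            new XElement(w + "commentRangeEnd", new XAttribute(w + "id", commentId)),
            new XElement(w + "r",
                new XElement(w + "rPr",
                    new XElement(w + "rStyle",
                        new XAttribute(w + "val", "CommentReference"))),
                new XElement(w + "commentReference",
                    new XAttribute(w + "id", commentId))));
        return copy;
    }

    static void Main()
    {
        XElement p = new XElement(w + "p",
            new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
            new XElement(w + "r", new XElement(w + "t", "abc")),
            new XElement(w + "r", new XElement(w + "t", "def")),
            new XElement(w + "r", new XElement(w + "t", "ghi")));

        // mark the middle run ("def") as the range of comment 0
        Console.WriteLine(MarkRun(p, 1, 0));
    }
}
```

The helper clones the paragraph before changing it, so the caller's tree is not modified.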

Note that a paragraph can be split into runs for a variety of reasons, and that there are a number of other valid child elements of the paragraph element.  For example, because the above text isn’t a correctly spelled word, and isn’t a sentence with proper grammar, the markup can include w:proofErr elements:

<w:p>
  <w:proofErr w:type="spellStart"/>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
</w:p>

When splitting runs, we want to honor those existing run splits, and make sure that we don’t disturb those other elements.

As Open XML developers know, content controls and custom XML markup are very powerful features of Open XML.  They enable a vast number of scenarios – we can make our documents smarter.  However, they add an interesting twist to markup.  The element for content controls is w:sdt, which contains another element, w:sdtContent, which contains the contents.  This means that runs that we potentially want to split occur at different levels of the XML hierarchy:

<w:p>
  <w:r>
    <w:t>123</w:t>
  </w:r>
  <w:sdt>
    <w:sdtContent>
      <w:r>
        <w:t>4567</w:t>
      </w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r>
    <w:t>890</w:t>
  </w:r>
</w:p>

Custom XML markup has the same issue.  The following schema defines some custom XML markup:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Child"
                    type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

When we use this custom schema to add structure to a document, the markup looks like this:

<w:p>
  <w:customXml w:uri="http://northwind.com"
               w:element="Root">
    <w:r>
      <w:t>12</w:t>
    </w:r>
    <w:customXml w:uri="http://northwind.com"
                 w:element="Child">
      <w:r>
        <w:t>34</w:t>
      </w:r>
    </w:customXml>
    <w:r>
      <w:t>56</w:t>
    </w:r>
  </w:customXml>
  <w:r>
    <w:t>7890</w:t>
  </w:r>
</w:p>

We may need to split runs at any level: as a child of the paragraph, as content in a content control, or within custom XML markup.  A recursive transform handles all of these cases nicely.
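As a small illustration of why the level matters, the following sketch compares the LINQ to XML Elements axis, which sees only direct children, with the Descendants axis, which sees runs at every depth.  The class and method names are hypothetical, and the input is the content control paragraph shown earlier.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RunDepthSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // text of the runs that are direct children of the paragraph
    public static string[] DirectRunText(XElement paragraph)
    {
        return paragraph.Elements(w + "r")
            .Select(r => (string)r.Element(w + "t"))
            .ToArray();
    }

    // text of the runs at any depth, in document order
    public static string[] AllRunText(XElement paragraph)
    {
        return paragraph.Descendants(w + "r")
            .Select(r => (string)r.Element(w + "t"))
            .ToArray();
    }

    static void Main()
    {
        XElement p = XElement.Parse(
            @"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:r><w:t>123</w:t></w:r>
                <w:sdt><w:sdtContent><w:r><w:t>4567</w:t></w:r></w:sdtContent></w:sdt>
                <w:r><w:t>890</w:t></w:r>
              </w:p>");

        Console.WriteLine(string.Join("|", DirectRunText(p)));  // prints "123|890"
        Console.WriteLine(string.Join("|", AllRunText(p)));     // prints "123|4567|890"
    }
}
```

Finding nested runs is only half of the problem.  The replacement runs also have to be written back at the same depth, which is why the transform itself is recursive.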

Determining Run Split Locations

The first piece of functionality that we need is a method to return an array of integers indicating where run splits are.  If we are moving comments from one document to another, then we want to find out where the run splits are in the source document so that we can create the same run splits in the destination document.

Here’s the prototype of a simple method to do so:

static int[] RunSplitLocations(XElement paragraph)

 

The following paragraph markup contains four runs, three of which contain text:

<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>

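If we call RunSplitLocations for this paragraph, it returns an array that contains the starting character offset of each run that has text.  The run that holds the commentReference element has no text, so it does not contribute a split:

0
3
6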
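Each value is the sum of the text lengths of the runs before it.  The sketch below shows that arithmetic on plain strings with a single-pass running total.  StartOffsets is a hypothetical helper, not a method from the listing in this post, which computes the same sums with TakeWhile and Sum over the run elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RunningSumSketch
{
    // Hypothetical helper: the start offset of each run is the total length
    // of the text in all runs that precede it.
    public static int[] StartOffsets(IEnumerable<string> runTexts)
    {
        var offsets = new List<int>();
        int position = 0;
        foreach (string text in runTexts)
        {
            offsets.Add(position);
            position += text.Length;
        }
        return offsets.ToArray();
    }

    static void Main()
    {
        int[] offsets = StartOffsets(new[] { "abc", "def", "ghi" });
        Console.WriteLine(
            string.Join(", ", offsets.Select(o => o.ToString()).ToArray()));  // prints "0, 3, 6"
    }
}
```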

Splitting Runs

If we have another document that contains no comments in this paragraph, and we want to split runs so that we can insert a comment on the middle three characters, we can call another method that takes an array of integers to do the splitting:

public static XElement SplitRunsInParagraph(XElement p, int[] positions)

 

If we have a paragraph with this markup:

<w:p>
  <w:r>
    <w:t>abcdefghi</w:t>
  </w:r>
</w:p>

And we call SplitRunsInParagraph passing an array that contains 0, 3, and 6, it returns a paragraph that looks like this:

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>

As I previously mentioned, the paragraph may contain child elements other than runs.  SplitRunsInParagraph will leave those other elements in place.  Also, a run can contain styling information, which we also want to leave in place.
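To illustrate the point about styling, here is a minimal sketch that splits one run and copies its non-text children, such as w:rPr, into every piece.  SplitRun is a hypothetical helper that takes offsets relative to the start of the run.  For brevity it assumes the text has no leading or trailing spaces, so it does not add xml:space="preserve".

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SplitOneRunSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: splits the text of one run at the given offsets and
    // copies every non-text child of the original run into each new run.
    public static XElement[] SplitRun(XElement run, params int[] offsets)
    {
        string text = string.Concat(run.Elements(w + "t").Select(t => (string)t));
        if (text.Length == 0)
            return new[] { new XElement(run) };   // nothing to split; return a copy

        // every piece starts at 0 or at one of the requested offsets inside the text
        int[] starts = new[] { 0 }.Concat(offsets)
            .Where(o => o >= 0 && o < text.Length)
            .Distinct()
            .OrderBy(o => o)
            .ToArray();

        return starts.Select((start, i) =>
        {
            int end = i + 1 < starts.Length ? starts[i + 1] : text.Length;
            return new XElement(w + "r",
                run.Elements().Where(e => e.Name != w + "t"),   // cloned on add
                new XElement(w + "t", text.Substring(start, end - start)));
        }).ToArray();
    }

    static void Main()
    {
        XElement run = new XElement(w + "r",
            new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
            new XElement(w + "rPr", new XElement(w + "b")),
            new XElement(w + "t", "abcdefghi"));

        foreach (XElement piece in SplitRun(run, 3, 6))
            Console.WriteLine(piece);
    }
}
```

Each of the three resulting runs carries its own copy of the bold run properties, because LINQ to XML clones an element that already has a parent when it is added to a new tree.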

Now that we have some methods to determine where run splits are, and to create run splits, it will be pretty simple to write a pure functional transform to move comments from one document to another (if the documents contain the exact same content, with the exception of comments).

The Code

The following example contains RunSplitLocations and SplitRunsInParagraph.  This code uses a node cloning technique similar to one that I presented in an earlier post.  In addition, the code uses the pre-atomization approach that I showed in another post.  This code implements a pure functional transformation, with no side effects anywhere, which will make it easy to use when writing the next transformation.

Here’s the code (also attached):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class Extensions
{
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    public static string StringConcatenate(this IEnumerable<string> source)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }
}

public static class W
{
    public static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static XName t = w + "t";
    public static XName r = w + "r";
    public static XName del = w + "del";
    public static XName body = w + "body";
    public static XName p = w + "p";
    public static XName moveFrom = w + "moveFrom";
}

class Program
{
    static int GetRunLength(XElement e)
    {
        return e
            .Descendants(W.t)
            .Select(t => (string)t)
            .StringConcatenate()
            .Length;
    }

    // return the run split locations for all runs in the paragraph
    static int[] RunSplitLocations(XElement paragraph)
    {
        // find the runs that don't have w:del or w:moveFrom as parent elements
        var runElements = paragraph
            .Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                e.Descendants(W.t).Any());

        // determine the run length of each run
        var runs = runElements
            .Select(r => new
            {
                RunElement = r,
                RunLength = GetRunLength(r)
            });

        // determine the split locations
        var runSplits = runs
            .Select(r => runs
                .TakeWhile(a => a.RunElement != r.RunElement)
                .Select(z => z.RunLength)
                .Sum());

        return runSplits.ToArray();
    }

    // if value starts or ends with a space, return xml:space="preserve" attribute
    // else return null (a null or empty value has no space to preserve)
    static XAttribute XmlSpacePreserved(string value)
    {
        if (string.IsNullOrEmpty(value))
            return null;
        if (value[0] == ' ' || value[value.Length - 1] == ' ')
            return new XAttribute(XNamespace.Xml + "space", "preserve");
        return null;
    }

    private class RunSplits
    {
        public XElement RunElement { get; set; }
        public int RunLength { get; set; }
        public int RunLocation { get; set; }
    }

    private static object RunTransform(XElement element,
        int[] positions, IEnumerable<RunSplits> runSplits)
    {
        // split runs that have child text elements
        if (element.Name == W.r && element.Descendants(W.t).Any())
        {
            // find run in runSplits
            RunSplits rs = runSplits.FirstOrDefault(r => r.RunElement == element);

            // runs under w:del or w:moveFrom are not tracked in runSplits, and a run
            // with no text has nothing to split, so copy those runs unchanged
            if (rs == null || rs.RunLength == 0)
                return new XElement(element);

            // get text of run
            string text = element
                .Descendants(W.t)
                .Select(t => (string)t).StringConcatenate();

            // find list of splits in this run
            var splitsInThisRun = positions
                .Where(p => p >= rs.RunLocation && p < rs.RunLocation + rs.RunLength);

            // adjust splits so that split locations are relative to this run instead of
            // relative to the beginning of the paragraph
            var splitsIntext = splitsInThisRun
                .Select(p => p - rs.RunLocation)
                .ToArray();

            // project collection of strings that will be in the new, split runs
            var splitText = splitsIntext
                .Select((p, i) =>
                    i != splitsIntext.Length - 1 ?
                    text.Substring(p, splitsIntext[i + 1] - p) :
                    text.Substring(p)
            );

            // project collection of runs that will replace the original run
            return splitText.Select(r =>
                new XElement(W.r,
                    rs.RunElement.Elements().Where(e => e.Name != W.t),
                    new XElement(W.t,
                        XmlSpacePreserved(r),
                        r)));
        }

        // clone elements other than runs
        // must be recursive to handle custom XML markup and content controls
        return new XElement(element.Name,
            element.Attributes(),
            element.Nodes().Select(n =>
            {
                XElement e = n as XElement;
                if (e != null)
                    return RunTransform(e, positions, runSplits);
                return n;
            })
        );
    }

    public static XElement SplitRunsInParagraph(XElement p, int[] positions)
    {
        // find the runs that don't have w:del or w:moveFrom as parent elements
        var runElements = p
            .Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                e.Descendants(W.t).Any());

        // calculate the run length of each run
        // (materialized once so the lengths are not recomputed for every run)
        var runs = runElements
            .Select(r => new
            {
                RunElement = r,
                RunLength = GetRunLength(r)
            })
            .ToList();

        // calculate the location of each split
        List<RunSplits> runSplits = runs
            .Select(r => new RunSplits
            {
                RunElement = r.RunElement,
                RunLength = r.RunLength,
                RunLocation = runs
                    .TakeWhile(a => a.RunElement != r.RunElement)
                    .Select(z => z.RunLength)
                    .Sum()
            })
            .ToList();

        // the positions argument contains a list of locations where splits will be added
        // to the paragraph.  In addition, runs may already be split at various places, and
        // we want those splits to remain, so we need to create the complete list of
        // locations where we want run splits.

        // create ordered union of desired splits and existing splits
        int[] allSplits = runSplits
            .Select(rs => rs.RunLocation)
            .Concat(positions)
            .OrderBy(s => s)
            .Distinct()
            .ToArray();

        // transform the paragraph to a new paragraph with new splits in runs,
        // keeping any attributes that are on the original paragraph element
        return new XElement(W.p,
            p.Attributes(),
            p.Elements().Select(e => RunTransform(e, allSplits, runSplits))
        );
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc1 =
            WordprocessingDocument.Open("Test.docx", true))
        {
            XDocument doc = doc1.MainDocumentPart.GetXDocument();
            XElement p = doc.Root.Element(W.body).Element(W.p);
            //XElement newPara = SplitRunsInParagraph(p, new[] { 12, 15 });
            XElement newPara = SplitRunsInParagraph(p, new[] { 10 });
            Console.WriteLine(newPara);
        }
    }
}

Posted: Monday, June 29, 2009 11:49 PM by EricWhite
Attachment(s): Program.cs

Comments

Syed Qadri said:

How to insert a comment in a paragraph after finding a specific text.

# September 11, 2009 12:22 PM

EricWhite said:

Hi Syed,

This would be a process of splitting nodes at the point where you want to attach the comment.  In other words, if you want the comment attachment point to start at some point in the paragraph, and end at another point, runs must be split at those points.  This would make a good blog post - I'll add it to my list.

-Eric

# September 11, 2009 1:48 PM