June, 2009

  • Eric White's Blog

    Splitting Runs in Open XML Word Processing Document Paragraphs

    In Open XML word-processing document markup, paragraphs contain runs, and runs contain text elements. Sometimes when transforming a document, we may want to split runs differently than in the original document. This post presents a couple of small functions that help us work with paragraphs and runs: one determines the split locations of runs, and the other splits runs at given locations.

    Note: I no longer recommend this approach. Instead, I recommend breaking up runs into multiple runs, each with a single character. Then you can search for text, not with a method that finds a string in a string, but with a custom method that matches up runs (each one character long) with the characters in a string. Then you can replace the runs with the new content. Finally, you can coalesce adjacent runs with identical formatting, so that the end result is neat and clean markup. You can find a screencast that discusses this in detail, as well as sample code to do this, here.
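
The first step of that approach, breaking a run into single-character runs, might look roughly like the sketch below. This is not the sample code referred to above. The names `SingleCharacterRunsSketch` and `ToSingleCharacterRuns` are hypothetical, and the sketch assumes a run that holds only `w:rPr` and `w:t` children (tabs, breaks, and other run content are ignored).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SingleCharacterRunsSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Return one new run per character of the run's text. Each new run gets a
    // copy of the original run properties (w:rPr), if there are any.
    public static List<XElement> ToSingleCharacterRuns(XElement run)
    {
        XElement runProperties = run.Element(w + "rPr");
        string text = string.Concat(run.Elements(w + "t").Select(t => (string)t));

        return text
            .Select(c => new XElement(w + "r",
                // LINQ to XML clones an element that already has a parent,
                // and skips null content
                runProperties,
                new XElement(w + "t",
                    // mark a run that holds a single space so the space is kept
                    c == ' ' ? new XAttribute(XNamespace.Xml + "space", "preserve") : null,
                    c.ToString())))
            .ToList();
    }
}
```

Matching the search text against these runs and coalescing adjacent runs with identical formatting afterwards are separate steps that this sketch does not cover.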

    Word 2007 has a neat feature where you can lock a document and disallow editing of the content, yet still allow the user to add comments. You can send this document for review to a number of users, and after the reviewers return the documents, it would be handy to have some code that merges the comments from all of the documents into a single document. I’m currently working on a blog post that shows how to do this. However, adding a comment to a paragraph can cause runs to be split, which adds a bit of complexity.

    Paragraphs, Runs, and Text Elements

    The following markup shows a very simple paragraph.  We can see the paragraph element, the run element, and the text element.

    <w:p>
      <w:r>
        <w:t>abcdefghi</w:t>
      </w:r>
    </w:p>
     
    If we select “def” in the above text, and add a comment, the markup changes to look like this:
    <w:p>
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:t>def</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
    </w:p>
     
    In this paragraph, we can see the commentRangeStart and commentRangeEnd elements, which bracket the commented text. In addition, we can see a special run that follows the commented range: its run properties apply the CommentReference style, and it contains a commentReference element.
    If we want to programmatically insert a comment into a document, we need to split runs as appropriate so that we can insert commentRangeStart, commentRangeEnd, and the special run that contains commentReference into the paragraph.
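
Once the runs are split so that the commented text sits in a run of its own, adding the three pieces of markup could be sketched as follows. This is a minimal illustration, not the code from this post: `CommentMarkupSketch` and `MarkRunAsCommented` are hypothetical names, the method modifies the paragraph in place (so the run must already have a parent), and it does not add the comment itself to the comments part.

```csharp
using System.Xml.Linq;

public static class CommentMarkupSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Surround an existing run with the comment range markers and add the
    // comment reference run after it. Modifies the run's parent in place.
    public static void MarkRunAsCommented(XElement run, int commentId)
    {
        run.AddBeforeSelf(
            new XElement(w + "commentRangeStart", new XAttribute(w + "id", commentId)));

        run.AddAfterSelf(
            new XElement(w + "commentRangeEnd", new XAttribute(w + "id", commentId)),
            new XElement(w + "r",
                new XElement(w + "rPr",
                    new XElement(w + "rStyle",
                        new XAttribute(w + "val", "CommentReference"))),
                new XElement(w + "commentReference",
                    new XAttribute(w + "id", commentId))));
    }
}
```

Applied to the middle run of a paragraph whose runs contain "abc", "def", and "ghi", this yields the same sequence of child elements as the markup shown above.
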
    Note that a paragraph can be split into runs for a variety of reasons, and that there are a number of other valid child elements of the paragraph element.  For example, because the above text isn’t a correctly spelled word, and isn’t a sentence with proper grammar, the markup can include w:proofErr elements:

    <w:p>
      <w:proofErr w:type="spellStart"/>
      <w:proofErr w:type="gramStart"/>
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:t>def</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:proofErr w:type="gramEnd"/>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd"/>
    </w:p>
     
    When splitting runs, we want to honor those existing run splits, and make sure that we don’t disturb those other elements.
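
If run boundaries are described as character offsets from the start of the paragraph (as the methods later in this post do), honoring existing splits amounts to taking the union of the existing offsets and the new ones. A minimal sketch, with the hypothetical names `SplitMergeSketch` and `MergeSplits`:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SplitMergeSketch
{
    // Combine the split positions a paragraph already has with the ones we
    // want to add, dropping duplicates and sorting in ascending order.
    public static int[] MergeSplits(IEnumerable<int> existingSplits, IEnumerable<int> newSplits)
    {
        return existingSplits
            .Concat(newSplits)
            .Distinct()
            .OrderBy(position => position)
            .ToArray();
    }
}
```

For example, merging existing splits at 0, 3, and 6 with a new split at 4 gives 0, 3, 4, and 6.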

    As Open XML developers know, content controls are very powerful features of Open XML.  They enable a vast number of scenarios – we can make our documents smarter.  However, they add an interesting twist to markup.  The element for content controls is w:sdt, which contains another element, w:sdtContent, which contains the contents.  This means that runs that we potentially want to split occur at different levels of the XML hierarchy:

    <w:p>
      <w:r>
        <w:t>123</w:t>
      </w:r>
      <w:sdt>
        <w:sdtContent>
          <w:r>
            <w:t>4567</w:t>
          </w:r>
        </w:sdtContent>
      </w:sdt>
      <w:r>
        <w:t>890</w:t>
      </w:r>
    </w:p>

    We may need to split runs at any level, either as a child of the paragraph or as content in a content control. A recursive transform handles this nicely, because it visits runs wherever they occur in the hierarchy.
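
The difference between the two levels shows up directly in LINQ to XML: the `Elements` axis sees only child runs, while the `Descendants` axis also finds runs nested in content controls. The sketch below illustrates this; `RunLookupSketch`, `ChildRuns`, and `AllRuns` are hypothetical names chosen for the example.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class RunLookupSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Runs that are direct children of the paragraph only.
    public static IEnumerable<XElement> ChildRuns(XElement paragraph)
    {
        return paragraph.Elements(w + "r");
    }

    // Runs at any depth, including those inside w:sdt/w:sdtContent.
    public static IEnumerable<XElement> AllRuns(XElement paragraph)
    {
        return paragraph.Descendants(w + "r");
    }
}
```

For the paragraph above, `ChildRuns` returns two runs (123 and 890), whereas `AllRuns` returns all three.
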
    Determining Run Split Locations

    The first piece of functionality that we need is a method to return an array of integers indicating where run splits are.  If we are moving comments from one document to another, then we want to find out where the run splits are in the source document so that we can create the same run splits in the destination document.

    Here’s the prototype of a simple method to do so:

    static int[] RunSplitLocations(XElement paragraph)


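    The following paragraph markup contains three runs that contain text (the run that holds the commentReference element has no text, so it is not counted):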

    <w:p>
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:t>def</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
    </w:p>
     
    If we call RunSplitLocations for this paragraph, it returns an array that contains:
    0
    3
    6
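
Each split location is simply the sum of the text lengths of the runs that come before it. The sketch below shows that arithmetic on its own, with a running total instead of the LINQ query used in the full listing; `RunOffsetSketch` and `StartOffsets` are hypothetical names.

```csharp
using System.Collections.Generic;

public static class RunOffsetSketch
{
    // Given the text length of each run in document order, return the
    // offset at which each run starts.
    public static int[] StartOffsets(IEnumerable<int> runLengths)
    {
        var offsets = new List<int>();
        int position = 0;
        foreach (int length in runLengths)
        {
            offsets.Add(position);
            position += length;
        }
        return offsets.ToArray();
    }
}
```

The three runs above each hold three characters, so `StartOffsets(new[] { 3, 3, 3 })` gives 0, 3, and 6.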

    Splitting Runs

    If we have another document that contains no comments in this paragraph, and we want to split runs so that we can insert a comment on the middle three characters, we can call another method that takes an array of integers to do the splitting:

    public static XElement SplitRunsInParagraph(XElement p, int[] positions)


    If we have a paragraph with this markup:

    <w:p>
      <w:r>
        <w:t>abcdefghi</w:t>
      </w:r>
    </w:p>
     
    And we call SplitRunsInParagraph passing an array that contains 0, 3, and 6, it returns a paragraph that looks like this:
    <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:r>
        <w:t>def</w:t>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
    </w:p>
     
    As I previously mentioned, the paragraph may contain child elements other than runs.  SplitRunsInParagraph will leave those other elements in place.  Also, a run can contain styling information, which we also want to leave in place.
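
At the heart of the split is cutting a run's text at a list of offsets. Stripped of the XML handling, that step can be sketched like this (`TextSplitSketch` and `SplitAt` are hypothetical names, and the offsets are assumed to be sorted, distinct, within the text, and to start with 0):

```csharp
public static class TextSplitSketch
{
    // Cut text into pieces that start at the given offsets. Each piece runs
    // up to the next offset; the last piece runs to the end of the text.
    public static string[] SplitAt(string text, int[] positions)
    {
        var pieces = new string[positions.Length];
        for (int i = 0; i < positions.Length; i++)
        {
            int start = positions[i];
            int end = i + 1 < positions.Length ? positions[i + 1] : text.Length;
            pieces[i] = text.Substring(start, end - start);
        }
        return pieces;
    }
}
```

Calling `SplitAt("abcdefghi", new[] { 0, 3, 6 })` returns "abc", "def", and "ghi", which become the text of the three new runs.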


    Now that we have some methods to determine where run splits are, and to create run splits, it will be pretty simple to write a pure functional transform to move comments from one document to another (if the documents contain the exact same content, with the exception of comments).

    The Code

    The following example contains RunSplitLocations and SplitRunsInParagraph.  This code uses a node cloning technique similar to what I presented in this post.  In addition, the code uses the pre-atomization approach that I showed in this post.  This code implements a pure functional transformation - no side effects anywhere, which will make it easy to use when writing the next transformation.

    Here’s the code (also attached):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Xml;
    using System.Xml.Linq;
    using DocumentFormat.OpenXml.Packaging;
     
    public static class Extensions
    {
        public static XDocument GetXDocument(this OpenXmlPart part)
        {
            XDocument xdoc = part.Annotation<XDocument>();
            if (xdoc != null)
                return xdoc;
            using (StreamReader streamReader = new StreamReader(part.GetStream()))
                xdoc = XDocument.Load(XmlReader.Create(streamReader));
            part.AddAnnotation(xdoc);
            return xdoc;
        }

        public static string StringConcatenate(this IEnumerable<string> source)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in source)
                sb.Append(s);
            return sb.ToString();
        }
    }
     
    public static class W
    {
        public static XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        public static XName t = w + "t";
        public static XName r = w + "r";
        public static XName del = w + "del";
        public static XName body = w + "body";
        public static XName p = w + "p";
        public static XName moveFrom = w + "moveFrom";
    }
     
    class Program
    {
        static int GetRunLength(XElement e)
        {
            return e
                .Descendants(W.t)
                .Select(t => (string)t)
                .StringConcatenate()
                .Length;
        }
     
       
        // return the run split locations for all runs in the paragraph
        static int[] RunSplitLocations(XElement paragraph)
        {
            // find the runs that don't have w:del or w:moveFrom as parent elements
            var runElements = paragraph
                .Descendants(W.r)
                .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                    e.Descendants(W.t).Any());

            // determine the run length of each run
            var runs = runElements
                .Select(r => new
                {
                    RunElement = r,
                    RunLength = GetRunLength(r)
                });

            // determine the split locations
            var runSplits = runs
                .Select(r => runs
                    .TakeWhile(a => a.RunElement != r.RunElement)
                    .Select(z => z.RunLength)
                    .Sum());

            return runSplits.ToArray();
        }
     
       
        // if value starts or ends with a space, return xml:space="preserve" attribute
        // else return null
        static XAttribute XmlSpacePreserved(string value)
        {
            // the length check avoids an exception when value is an empty string
            if (value.Length > 0 && (value[0] == ' ' || value[value.Length - 1] == ' '))
                return new XAttribute(XNamespace.Xml + "space", "preserve");
            else
                return null;
        }
     
       
        private class RunSplits
        {
            public XElement RunElement { get; set; }
            public int RunLength { get; set; }
            public int RunLocation { get; set; }
        }
     
       
        private static object RunTransform(XElement element,
            int[] positions, IEnumerable<RunSplits> runSplits)
        {
            // find run in runSplits; only runs that have text and that are not
            // children of w:del or w:moveFrom are listed there, so any other
            // run falls through and is cloned unchanged below
            RunSplits rs = element.Name == W.r ?
                runSplits.FirstOrDefault(r => r.RunElement == element) :
                null;

            // split runs that have child text elements
            if (rs != null)
            {
                // get text of run
                string text = element
                    .Descendants(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();

                // find list of splits in this run
                var splitsInThisRun = positions
                    .Where(p => p >= rs.RunLocation && p < rs.RunLocation + rs.RunLength);

                // adjust splits so that split locations are relative to this run instead of
                // relative to the beginning of the paragraph
                var splitsIntext = splitsInThisRun
                    .Select(p => p - rs.RunLocation)
                    .ToArray();

                // project collection of strings that will be in the new, split runs
                var splitText = splitsIntext
                    .Select((p, i) =>
                        i != splitsIntext.Length - 1 ?
                        text.Substring(p, splitsIntext[i + 1] - p) :
                        text.Substring(p)
                    );

                // project collection of runs that will replace the original run
                return splitText.Select(r =>
                    new XElement(W.r,
                        rs.RunElement.Elements().Where(e => e.Name != W.t),
                        new XElement(W.t,
                            XmlSpacePreserved(r),
                            r)));
            }

            // clone elements other than runs
            // must be recursive to handle custom XML markup and content controls
            return new XElement(element.Name,
                element.Attributes(),
                element.Nodes().Select(n =>
                {
                    XElement e = n as XElement;
                    if (e != null)
                        return RunTransform(e, positions, runSplits);
                    return n;
                })
            );
        }
     
       
        public static XElement SplitRunsInParagraph(XElement p, int[] positions)
        {
            // find the runs that don't have w:del or w:moveFrom as parent elements
            var runElements = p
                .Descendants(W.r)
                .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                    e.Descendants(W.t).Any());

            // calculate the run length of each run
            var runs = runElements
                .Select(r => new
                {
                    RunElement = r,
                    RunLength = GetRunLength(r)
                });

            // calculate the location of each split
            var runSplits = runs
                .Select(r => new RunSplits
                {
                    RunElement = r.RunElement,
                    RunLength = r.RunLength,
                    RunLocation = runs
                        .TakeWhile(a => a.RunElement != r.RunElement)
                        .Select(z => z.RunLength)
                        .Sum()
                });

            // the positions argument contains a list of locations where splits will be added
            // to the paragraph.  In addition, runs may already be split at various places, and
            // we want those splits to remain, so we need to create the complete list of
            // locations where we want run splits.

            // create ordered union of desired splits and existing splits
            int[] allSplits = runSplits
                .Select(rs => rs.RunLocation)
                .Concat(positions)
                .OrderBy(s => s)
                .Distinct()
                .ToArray();

            // transform the paragraph to a new paragraph with new splits in runs
            return new XElement(W.p,
                p.Elements().Select(e => RunTransform(e, allSplits, runSplits))
            );
        }
     
       
        static void Main(string[] args)
        {
            using (WordprocessingDocument doc1 =
                WordprocessingDocument.Open("Test.docx", true))
            {
                XDocument doc = doc1.MainDocumentPart.GetXDocument();
                XElement p = doc.Root.Element(W.body).Element(W.p);
                //XElement newPara = SplitRunsInParagraph(p, new[] { 12, 15 });
                XElement newPara = SplitRunsInParagraph(p, new[] { 10 });
                Console.WriteLine(newPara);
            }
        }
    }
     



