Updated DocumentBuilder to work with Dec09 CTP of Open XML SDK V2

DocumentBuilder is a small API (part of PowerTools for Open XML, an open source project on CodePlex) that lets you merge the contents of documents while retaining document integrity and resolving issues of markup interdependence.  Separate posts cover the background: one contains detailed information on the interdependence of Open XML WordprocessingML markup, another introduces DocumentBuilder and gives a few examples of its use, and a third discusses how to control sections (and headers) when using DocumentBuilder.

DocumentBuilder is licensed under the Microsoft Public License (Ms-PL), which gives you wide latitude in how you use the code.  To get DocumentBuilder, go to PowerTools for Open XML, click on the Downloads tab, and download DocumentBuilder.zip.

I've updated DocumentBuilder with a couple of minor bug fixes:

  • The Dec09 CTP changed the way that you add a custom XML part.  It also changed the way that you query for and add hyperlinks.  Updated DocumentBuilder to use the changed API.
  • Fixed a bug where images were copied one byte too short.
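
The post doesn't show the code behind the image bug, so the following is only a hypothetical sketch of one way to copy a part's stream without truncating it. The names `PartCopy` and `CopyAll` are illustrative and are not part of DocumentBuilder. The idea is to loop until `Read` returns 0 rather than computing a byte count up front, which leaves no length arithmetic to get wrong by one.

```csharp
using System;
using System.IO;

public static class PartCopy
{
    // Hypothetical helper (not DocumentBuilder code): copies every byte
    // from source to destination and returns the number of bytes copied.
    public static long CopyAll(Stream source, Stream destination)
    {
        var buffer = new byte[4096];
        long total = 0;
        int read;

        // Read returns 0 only at end of stream, so the loop cannot stop
        // one byte early the way a miscalculated length can.
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Comparing the returned count with the length of the source stream is a cheap check that nothing was dropped.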

There are upcoming tasks for DocumentBuilder, and the PowerTools for Open XML in general:

  • Validate with Office 2010, and ISO/IEC 29500.  I want to revisit each of the cmdlets, and make sure that they work properly for ISO/IEC 29500.
  • Fix issue with duplicate IDs.  This is not a serious issue, as the resulting documents load in Word properly.  I believe that the spec indicates that an implementation is free to load even though IDs are not unique.
  • Enhance DocumentBuilder so that if multiple documents contain the same header, reuse the header instead of duplicating headers in the destination document.
  • Build new test harness/suite for DocumentBuilder, CommentMerger, and RevisionAccepter.
  • Build a cmdlet for CommentMerger.
  • There are a few cmdlets in PowerTools for Open XML that I want to revisit, and make more general.
  • Some of the cmdlets and supporting code can be refactored.  I plan on making a utility module for code that is specific to Open XML, and a utility module for code not related to Open XML, such as some of my favorite functional programming extension methods and classes.
  • Incorporate the new transform from WordprocessingML into XHtml.  (I haven't yet blogged on the final version of the transform.)
  • Validate with PowerShell 2.0.
  • Hi, I've run into a problem with your tool when parsing my master document. At the body level, I only have pr and tbl.

    For the paragraphs, I don't have any trouble and it's very fast (thanks for that), but for the tables, my sub-doc usually has to be placed in a cell. After the replacement, the whole table is replaced instead of just the cell content. Can you help me, please?

  • // project collection of tuples containing element and type
    var q2 = q1
        .Select(
            e =>
            {
                string keyForGroupAdjacent = ".NonContentControl";
                if (e.Name == w + "tbl")
                {
                    keyForGroupAdjacent = null;
                    if (e.Value.StartsWith("{$C") && mokeDic.ContainsKey(e.Value))
                        keyForGroupAdjacent = e.Value;
    ...

  • I tried a more specialized LINQ query to find the cell, but I think I have to create a new collection (e.g. q2_table)?

  • Hi Ernest,

    At this point, you configure DocumentBuilder by specifying as sources elements that are children of the w:body element.  This means that, as currently written, DocumentBuilder can't do what you want.  I would like to update DocumentBuilder so that you can specify a list of block-level content elements, block-level content containers, or content controls, which would allow you to do what you are trying to do.

    One possible work-around right now is to import your sub-doc as ordinary paragraphs, and then go in after the fact and surround those paragraphs with a table, placing them in a cell.  This lets DocumentBuilder do all of the resolution of interrelated markup.

    -Eric
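
The work-around described in the reply above could be sketched roughly as follows with LINQ to XML. This is not DocumentBuilder code: `TableWrapper` and `WrapInSingleCellTable` are hypothetical names, and the sketch assumes the imported paragraphs are adjacent `w:p` siblings that are already attached to the `w:body` element. It builds a minimal one-row, one-cell `w:tbl`, copies the paragraphs into the cell, and replaces the originals with the table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableWrapper
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: wraps already-imported, adjacent paragraphs
    // in a single-cell table and returns the new w:tbl element.
    public static XElement WrapInSingleCellTable(IEnumerable<XElement> paragraphs)
    {
        // Snapshot the sequence so the tree can be modified safely below.
        var content = paragraphs.ToList();
        if (content.Count == 0)
            throw new ArgumentException("At least one paragraph is required.");

        var table = new XElement(w + "tbl",
            new XElement(w + "tblPr",
                new XElement(w + "tblW",
                    new XAttribute(w + "w", "0"),
                    new XAttribute(w + "type", "auto"))),
            new XElement(w + "tblGrid",
                new XElement(w + "gridCol")),
            new XElement(w + "tr",
                new XElement(w + "tc",
                    // Clone each paragraph into the cell.
                    content.Select(p => new XElement(p)))));

        // Put the table where the first paragraph was, then drop the originals.
        content[0].AddBeforeSelf(table);
        foreach (var p in content)
            p.Remove();

        return table;
    }
}
```

Because the cell content is made up of paragraphs, the cell ends with a `w:p`, which WordprocessingML requires of a table cell. Real documents would likely also need cell widths and borders, which this sketch leaves out.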

  • Thanks for your prompt reply. I'm really looking forward to your update because it's incredibly faster than mine. Furthermore, I'm using altChunk, but with deeply nested chunks, and Word 2003 with the 2007 plug-in is unable to display them. What I do is find some tokens and keep their positions (I wasn't using XDocument), then I call a method that does this:

    using (WordprocessingDocument myDoc =
        WordprocessingDocument.Open(destinationFile, true))
    {
        string altChunkId = "AltChunkId" + Guid.NewGuid();
        MainDocumentPart mainPart = myDoc.MainDocumentPart;
        AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
            AlternativeFormatImportPartType.WordprocessingML, altChunkId);
        bool finish = false;
        int I = 0;
        DateTime dtStart = DateTime.Now;
        while (!finish)
        {
            try
            {
                using (FileStream fileStream = File.Open(sourceFile, FileMode.Open))
                {
                    chunk.FeedData(fileStream);
                }
                finish = true;
            }
            catch
            {
                //System.Threading.Thread.Sleep(100);
            }
            I++;
        }
        // insert after
        AltChunk altChunk = new AltChunk();
        altChunk.Id = altChunkId;
        OpenXmlElement xWorking = null;
        if (String.IsNullOrEmpty(positionLabel) || "???" == positionLabel)
        {
            xWorking = mainPart.Document.Body.Elements<OpenXmlElement>().Last();
        }
        else
        {
            xWorking = mainPart.Document.Descendants<OpenXmlElement>().Where(o => o.InnerText == positionLabel).FirstOrDefault();
            if (null != xWorking)
            {
                if (xWorking is Body)
                {
                    xWorking.Descendants<Paragraph>().Where(o => o.InnerText == positionLabel).FirstOrDefault().InsertAfterSelf(altChunk);
                    xWorking.Descendants<Paragraph>().Where(o => o.InnerText == positionLabel).FirstOrDefault().Remove();
                }
                else if (xWorking is Paragraph)
                {
                    /*
                    mainPart.Document
                        .Body
                        .InsertAfter(altChunk, xWorking);
                    mainPart.Document.Body.RemoveChild<OpenXmlElement>(xWorking);
                    */
                    if (xWorking.Ancestors<TableCell>().Count() > 0)
                    {
                        xWorking.Ancestors<TableCell>().First().ChildElements.OfType<TableCellProperties>().Last().InsertAfterSelf(altChunk);
                    }
                    else if (xWorking.Ancestors<Body>().Count() > 0)
                    {
                        xWorking.InsertBeforeSelf(altChunk);
                        //xWorking.Ancestors<Body>().First().InsertAfterSelf(altChunk);
                    }
                    xWorking.Remove();
                }
                else if (xWorking is Table)
                {
                    /*
                    mainPart.Document
                        .Body
                        .InsertAfter(altChunk, xWorking);
                    mainPart.Document.Body.RemoveChild<OpenXmlElement>(xWorking);
                    */
                    if (xWorking.Descendants<TableCell>().Count() > 0)
                    {
                        xWorking.Descendants<TableCell>().First().ChildElements.OfType<TableCellProperties>().Last().InsertAfterSelf(altChunk);
                    }
                    else if (xWorking.Ancestors<Body>().Count() > 0)
                    {
                        xWorking.InsertBeforeSelf(altChunk);
                        //xWorking.Ancestors<Body>().First().InsertAfterSelf(altChunk);
                    }
                    xWorking.Remove();
                }
                //else if (xWorking is TableCell)
                //{
                //    xWorking.Descendants<TableCellProperties>().Last().InsertAfterSelf(altChunk);
                //    xWorking.Descendants<Paragraph>().Last().Remove();
                //}
                else if (xWorking is TableRow || xWorking is TableCell)
                {
                    //
                    xWorking.Descendants<Text>().Where(o => o.InnerText == positionLabel).FirstOrDefault().Ancestors<TableCell>().First().ChildElements.OfType<TableCellProperties>().Last().InsertAfterSelf(altChunk);
                    xWorking.Descendants<Paragraph>().Where(o => o.InnerText == positionLabel).FirstOrDefault().Remove();
                }
                else
                {
                    throw new NotImplementedException(xWorking.ToString());
                }
            }
        }

  • I was wondering if you could help. I see you noted these changes at the top:

    Fixed a bug where images were copied one byte too short.

    Enhance DocumentBuilder so that if multiple documents contain the same header, reuse the header instead of duplicating headers in the destination document.

    We are using PowerTools version 2.0.8.0. When we try to open the document, we get an error stating that there are problems with the content:

    Parts are missing or invalid: Location /word/headerc.xml Line 1, Column 1597

    This happens when the headers contain images: the first image is pulled through, but in any subsequent documents that contain the same header, the images are removed.

    All keep-section attributes are set to True.

    Are there any known issues with images placed in the header sections?
