Transforming Open XML Documents to Flat OPC Format

Transforming Open XML documents using XSLT is an interesting scenario, but before we can do so, we need to convert the Open XML document into the Flat OPC format.  We then perform the XSLT transform, producing a new file in the Flat OPC format, and then convert back to Open XML (OPC) format.  This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

Transforming Open XML Documents using XSLT

Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

Transforming Open XML Documents to Flat OPC Format (This Post)

This post describes how to convert an Open XML (OPC) document into a Flat OPC document, and presents the C# function OpcToFlatOpc.

Transforming Flat OPC Format to Open XML Documents

This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

The Flat OPC Format

Presents a description and examples of the Flat OPC format.

About the Code

The code presented in this post uses LINQ to XML and System.IO.Packaging to perform the conversion to Flat OPC.

The signature of the function to convert from an Open XML document to Flat OPC is:

static XDocument OpcToFlatOpc(string path);

You pass as an argument the path to the Open XML document.  The method returns an XDocument object, which you can then modify as necessary, transform using XSLT, serialize to the standard output, or save to a file.
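As one illustration of the "transform using XSLT" option, here is a minimal sketch that applies a stylesheet to the returned XDocument entirely in memory. The class and method names (FlatOpcXslt, Transform) are hypothetical and are not part of the attached code, and the sketch assumes the stylesheet is itself available as an XDocument.

```csharp
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical helper for illustration; not part of the original post's code.
public static class FlatOpcXslt
{
    // Applies an XSLT stylesheet to a Flat OPC document and returns the
    // result as a new XDocument, without touching the file system.
    public static XDocument Transform(XDocument flatOpc, XDocument stylesheet)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = stylesheet.CreateReader())
            xslt.Load(styleReader);

        // Write the transform output directly into an empty XDocument.
        var result = new XDocument();
        using (XmlReader input = flatOpc.CreateReader())
        using (XmlWriter output = result.CreateWriter())
            xslt.Transform(input, output);
        return result;
    }
}
```

A call might look like FlatOpcXslt.Transform(OpcToFlatOpc("Test.docx"), XDocument.Load("MyTransform.xslt")), where MyTransform.xslt is a placeholder file name.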

The code to convert a binary part to a base 64 string uses the System.Convert.ToBase64String method.  The base 64 string needs to be broken up into lines of 76 characters (see The Flat OPC Format for more detail).  The code uses the technique described in Chunking a Collection into Groups of Three to do the chunking.

If you are not familiar with this style of programming, I recommend that you read this Functional Programming Tutorial.
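For comparison, the 76-character chunking described above can also be written as a plain loop. This is only a sketch of an alternative, not the code used in this post; the class and method names (Base64Chunker, ToChunkedBase64) are hypothetical. Like the functional version, it ends every line, including the last, with Environment.NewLine.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original post's code.
public static class Base64Chunker
{
    // Encodes the bytes as base 64 and breaks the result into lines of at most
    // lineLength characters, each terminated by Environment.NewLine.
    public static string ToChunkedBase64(byte[] bytes, int lineLength = 76)
    {
        string base64 = Convert.ToBase64String(bytes);
        var sb = new StringBuilder();
        for (int i = 0; i < base64.Length; i += lineLength)
        {
            // The last line may be shorter than lineLength.
            int count = Math.Min(lineLength, base64.Length - i);
            sb.Append(base64, i, count).Append(Environment.NewLine);
        }
        return sb.ToString();
    }
}
```

The framework also provides Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks), which inserts a line break after every 76 characters. Its handling of the final line may differ from the code in this post, so compare the output before swapping it in.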

The conversion code adds the appropriate XML processing instruction to the resulting Flat OPC XML document based on the filename of the source Open XML document.  If the source document has the .docx extension, then the code adds the XML processing instruction for Word.  If the source document has the .pptx extension, then the code adds the XML processing instruction for PowerPoint.

Here is the code to perform the transform (also attached):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;
using System.IO.Packaging;
using System.Xml;
using System.Xml.Schema;

class Program
{
    static XElement GetContentsAsXml(PackagePart part)
    {
        XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

        if (part.ContentType.EndsWith("xml"))
        {
            using (Stream str = part.GetStream())
            using (StreamReader streamReader = new StreamReader(str))
            using (XmlReader xr = XmlReader.Create(streamReader))
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XElement(pkg + "xmlData",
                        XElement.Load(xr)
                    )
                );
        }
        else
        {
            using (Stream str = part.GetStream())
            using (BinaryReader binaryReader = new BinaryReader(str))
            {
                int len = (int)binaryReader.BaseStream.Length;
                byte[] byteArray = binaryReader.ReadBytes(len);
                // the following expression creates the base64String, then chunks
                // it to lines of 76 characters long
                string base64String = (System.Convert.ToBase64String(byteArray))
                    .Select
                    (
                        (c, i) => new
                        {
                            Character = c,
                            Chunk = i / 76
                        }
                    )
                    .GroupBy(c => c.Chunk)
                    .Aggregate(
                        new StringBuilder(),
                        (s, i) =>
                            s.Append(
                                i.Aggregate(
                                    new StringBuilder(),
                                    (seed, it) => seed.Append(it.Character),
                                    sb => sb.ToString()
                                )
                            )
                            .Append(Environment.NewLine),
                        s => s.ToString()
                    );
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XAttribute(pkg + "compression", "store"),
                    new XElement(pkg + "binaryData", base64String)
                );
            }
        }
    }

    static XProcessingInstruction GetProcessingInstruction(string path)
    {
        if (path.ToLower().EndsWith(".docx"))
            return new XProcessingInstruction("mso-application",
                        "progid=\"Word.Document\"");
        if (path.ToLower().EndsWith(".pptx"))
            return new XProcessingInstruction("mso-application",
                        "progid=\"PowerPoint.Show\"");
        return null;
    }

    static XDocument OpcToFlatOpc(string path)
    {
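        // Open read-only so that a missing file raises an error instead of
        // being created, and so that the source document is never modified.
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))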
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

            XDeclaration declaration = new XDeclaration("1.0", "UTF-8", "yes");
            XDocument doc = new XDocument(
                declaration,
                GetProcessingInstruction(path),
                new XElement(pkg + "package",
                    new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                    package.GetParts().Select(part => GetContentsAsXml(part))
                )
            );
            return doc;
        }
    }

    static void Main(string[] args)
    {
        XDocument doc;
        doc = OpcToFlatOpc("Test.docx");
        doc.Save("Test.xml", SaveOptions.DisableFormatting);
        doc = OpcToFlatOpc("Test2.pptx");
        doc.Save("Test2.xml", SaveOptions.DisableFormatting);
    }
}
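
One way to sanity-check the output is to list the parts that ended up in the Flat OPC document. The following is a minimal sketch under the assumption that the document has the pkg:package / pkg:part shape produced by the code above; the class and method names (FlatOpcInspector, DescribeParts) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original post's code.
public static class FlatOpcInspector
{
    static readonly XNamespace pkg =
        "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Returns one "name (contentType)" string per pkg:part element.
    public static List<string> DescribeParts(XDocument flatOpc)
    {
        return flatOpc.Root
            .Elements(pkg + "part")
            .Select(p => (string)p.Attribute(pkg + "name") +
                         " (" + (string)p.Attribute(pkg + "contentType") + ")")
            .ToList();
    }
}
```

For instance, calling FlatOpcInspector.DescribeParts(OpcToFlatOpc("Test.docx")) and writing each entry to the console shows which parts, including relationship parts, were flattened.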

Attachment: OpcToFlat.zip
  • how can we tranform it directly to html file?

  • Important Safety Tip for Office Open XML - Flatten Your Package!

  • I think that code for chunking a string into 76-character lengths is nuts. It does give an example of how interesting the new language features are, but the example is inappropriate. A simple loop would be easier to read and take less code to write.

  • Hi Romeok,

    You're right, in this scenario, you could re-write with a loop.  Because of the design of the Open XML SDK (and System.IO.Packaging underneath), we're forced into an imperative approach here.  But in other scenarios, when writing pure functional transforms, writing a loop would move the code from expression context to statement context, and would mean either refactoring, having a locally impure method, or would make the code impure, which leads to a bunch of problems, including possibilities of bugs introduced by mixing imperative/declarative code, and eliminates the easy use of multiple processors.  I guess that I'm in the habit of chunking using the functional approach, and used it even though in this scenario, the code wouldn't suffer from using a loop.

    -Eric

  • Why no XLSX for GetProcessingInstruction?

  • Hi David, the reason I included the processing instructions that I did is that you can directly open these XML files in Word and PowerPoint.  The processing instruction enables opening by double-clicking.  If your only purpose is to transform via XSLT, then sure, replace the processing instruction with one for XLSX.

    -Eric

  • BTW - thank you a ton for these posts. Incredibly helpful!!!

  • Hi Eric,

    Thank you very much for the post and your blog as a whole.

    The processing above is OK for a “standard” docx file. How would one go about processing a docx file that is embedded in another one?

    Here is a scenario:

    1. Generating SubDoc1.docx.

    2. Merging SubDoc1.docx into MainDoc.docx.

    3. “Flatten” the MainDoc.docx to an OPC Format.

    The problem is that SubDoc1.docx is part of the package. I see that it has to be processed recursively, but which parts of SubDoc1.docx need to be extracted?

    Thank you !

  • I have a similar problem to Bukabi's: after merging the docx (via altChunk), I'm having problems converting the merged docx to Flat OPC format (OpcToFlat). Is there any workaround for this?

    Thanks. Your blog has been of great help to me.

  • Hi LMK and Bukabi,

    It depends on what you're trying to do, but after creating the document with altChunk elements, you need to merge the imported documents/html/etc. into the original document so that the document contains ordinary WordprocessingML markup (i.e. paragraphs, runs, text).  There are two ways to do this: either use Word to open and save the document (perhaps using automation), or use Word Automation Services (msdn.microsoft.com/.../ff742315.aspx).  Then you can flatten the document, or process it in a variety of other ways.

    -Eric

  • Can I write new content into an Open XML file using C# code?

    I want to replace the old path of a video with a new path in a PowerPoint presentation Open XML file. How can I do this?
