September, 2008

  • Eric White's Blog

    Transforming Open XML Documents to Flat OPC Format

    Transforming Open XML documents using XSLT is an interesting scenario, but before we can do so, we need to convert the Open XML document into the Flat OPC format.  We then perform the XSLT transform, producing a new file in the Flat OPC format, and then convert back to Open XML (OPC) format.  This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

    Transforming Open XML Documents using XSLT

    Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

    Transforming Open XML Documents to Flat OPC Format (This Post)

    This post describes how to convert an Open XML (OPC) document into a Flat OPC document, and presents the C# function OpcToFlatOpc.

    Transforming Flat OPC Format to Open XML Documents

    This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

    The Flat OPC Format

    Presents a description and examples of the Flat OPC format.

    About the Code

    The code presented in this post uses LINQ to XML and System.IO.Packaging to perform the conversion to Flat OPC.
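
    As a rough illustration of what System.IO.Packaging provides, the sketch below builds a tiny package in memory and lists its parts with their content types. The class name PackagePartsSketch, the part URI, and the content type are made up for the example; a real .docx contains many more parts.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;

static class PackagePartsSketch
{
    // Builds a one-part package in memory (hypothetical content) and
    // returns a "uri (content type)" string for each part it contains.
    public static string[] ListParts()
    {
        using (var stream = new MemoryStream())
        using (Package package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            Uri partUri = PackUriHelper.CreatePartUri(
                new Uri("/word/document.xml", UriKind.Relative));
            PackagePart part = package.CreatePart(partUri, "application/xml");

            // Write some placeholder XML into the part.
            using (var writer = new StreamWriter(part.GetStream(FileMode.Create)))
                writer.Write("<root/>");

            // GetParts enumerates every part, which is what the
            // conversion code in this post iterates over.
            return package.GetParts()
                .Select(p => p.Uri + " (" + p.ContentType + ")")
                .ToArray();
        }
    }
}
```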

    The signature of the function to convert from an Open XML document to Flat OPC is:

    static XDocument OpcToFlatOpc(string path);

    You pass as an argument the path to the Open XML document.  The method returns an XDocument object, which you can then modify as necessary, transform using XSLT, serialize to the standard output, or save to a file.
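
    To make the "transform using XSLT" option concrete, here is a minimal sketch that runs an XDocument through XslCompiledTransform and returns the result as a new XDocument. The class name FlatOpcUsageSketch and the identity stylesheet are assumptions for illustration; a real transform would replace the stylesheet with its own templates.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

static class FlatOpcUsageSketch
{
    // Identity stylesheet: copies every node unchanged. A real transform
    // would add templates that override this default for specific elements.
    const string IdentityXslt =
        @"<xsl:stylesheet version='1.0'
              xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
            <xsl:template match='@*|node()'>
              <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
            </xsl:template>
          </xsl:stylesheet>";

    public static XDocument Transform(XDocument flatOpc)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader stylesheet = XmlReader.Create(new StringReader(IdentityXslt)))
            xslt.Load(stylesheet);

        // Read from the source tree and write straight into a new tree.
        var result = new XDocument();
        using (XmlReader input = flatOpc.CreateReader())
        using (XmlWriter output = result.CreateWriter())
            xslt.Transform(input, output);
        return result;
    }
}
```

    With the function from this post, the call would look like FlatOpcUsageSketch.Transform(OpcToFlatOpc("Test.docx")), followed by Save on the returned document.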

    The code to convert a binary part to a base 64 string uses the System.Convert.ToBase64String method.  The base 64 string needs to be broken up into lines of 76 characters (see The Flat OPC Format for more detail).  The code uses the technique described in Chunking a Collection into Groups of Three to do the chunking.
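
    The chunking step can also be written more directly. The sketch below is not the GroupBy technique used in this post; it swaps in Substring-based slicing, and also shows the framework's own Base64FormattingOptions.InsertLineBreaks option, which breaks lines at 76 characters. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;

static class Base64ChunkSketch
{
    // Splits a string into consecutive pieces of at most `size` characters.
    public static string[] Chunk(string s, int size)
    {
        return Enumerable.Range(0, (s.Length + size - 1) / size)
            .Select(i => s.Substring(i * size, Math.Min(size, s.Length - i * size)))
            .ToArray();
    }

    // Base64-encodes the bytes and ends every 76-character line,
    // including the last one, with Environment.NewLine.
    public static string ToChunkedBase64(byte[] bytes)
    {
        string base64 = Convert.ToBase64String(bytes);
        return string.Concat(Chunk(base64, 76).Select(line => line + Environment.NewLine));
    }

    // Lets the framework insert the breaks. It always uses "\r\n" and
    // adds no break after the final line.
    public static string ToChunkedBase64BuiltIn(byte[] bytes)
    {
        return Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
    }
}
```

    The two methods differ only in the trailing newline and in the line terminator on platforms where Environment.NewLine is not "\r\n".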

    If you are not familiar with this style of programming, I recommend that you read this Functional Programming Tutorial.

    The conversion code adds the appropriate XML processing instruction to the resulting Flat OPC XML document based on the filename of the source Open XML document.  If the source document has the .docx extension, then the code adds the XML processing instruction for Word.  If the source document has the .pptx extension, then the code adds the XML processing instruction for PowerPoint.  For any other extension, the code adds no processing instruction.
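
    The following sketch isolates the extension check so it is easy to see what ends up in the document. It is a variant, not the code from this post: it compares Path.GetExtension case-insensitively instead of calling ToLower, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

static class ProcessingInstructionSketch
{
    // Returns the mso-application processing instruction for the file
    // type, or null when the extension is neither .docx nor .pptx.
    public static XProcessingInstruction ForPath(string path)
    {
        string ext = Path.GetExtension(path);
        if (string.Equals(ext, ".docx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"Word.Document\"");
        if (string.Equals(ext, ".pptx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"PowerPoint.Show\"");
        return null;
    }

    // Null content is ignored by the XDocument constructor, so a null
    // result simply yields a document without the processing instruction.
    public static XDocument Wrap(string path, XElement root)
    {
        return new XDocument(ForPath(path), root);
    }
}
```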

    Here is the code to perform the transform (also attached):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Xml.Linq;
    using System.IO;
    using System.IO.Packaging;
    using System.Xml;
    using System.Xml.Schema;

    class Program
    {
        static XElement GetContentsAsXml(PackagePart part)
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

            if (part.ContentType.EndsWith("xml"))
            {
                using (Stream str = part.GetStream())
                using (StreamReader streamReader = new StreamReader(str))
                using (XmlReader xr = XmlReader.Create(streamReader))
                    return new XElement(pkg + "part",
                        new XAttribute(pkg + "name", part.Uri),
                        new XAttribute(pkg + "contentType", part.ContentType),
                        new XElement(pkg + "xmlData",
                            XElement.Load(xr)
                        )
                    );
            }
            else
            {
                using (Stream str = part.GetStream())
                using (BinaryReader binaryReader = new BinaryReader(str))
                {
                    int len = (int)binaryReader.BaseStream.Length;
                    byte[] byteArray = binaryReader.ReadBytes(len);
                    // Base64-encode the bytes, then break the string into lines of
                    // 76 characters, ending each line with a newline
                    string base64String = (System.Convert.ToBase64String(byteArray))
                        .Select
                        (
                            (c, i) => new
                            {
                                Character = c,
                                Chunk = i / 76
                            }
                        )
                        .GroupBy(c => c.Chunk)
                        .Aggregate(
                            new StringBuilder(),
                            (s, i) =>
                                s.Append(
                                    i.Aggregate(
                                        new StringBuilder(),
                                        (seed, it) => seed.Append(it.Character),
                                        sb => sb.ToString()
                                    )
                                )
                                .Append(Environment.NewLine),
                            s => s.ToString()
                        );
                    return new XElement(pkg + "part",
                        new XAttribute(pkg + "name", part.Uri),
                        new XAttribute(pkg + "contentType", part.ContentType),
                        new XAttribute(pkg + "compression", "store"),
                        new XElement(pkg + "binaryData", base64String)
                    );
                }
            }
        }

        static XProcessingInstruction GetProcessingInstruction(string path)
        {
            if (path.ToLower().EndsWith(".docx"))
                return new XProcessingInstruction("mso-application",
                            "progid=\"Word.Document\"");
            if (path.ToLower().EndsWith(".pptx"))
                return new XProcessingInstruction("mso-application",
                            "progid=\"PowerPoint.Show\"");
            return null;
        }

        static XDocument OpcToFlatOpc(string path)
        {
            using (Package package = Package.Open(path))
            {
                XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

                XDeclaration declaration = new XDeclaration("1.0", "UTF-8", "yes");
                XDocument doc = new XDocument(
                    declaration,
                    GetProcessingInstruction(path),
                    new XElement(pkg + "package",
                        new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                        package.GetParts().Select(part => GetContentsAsXml(part))
                    )
                );
                return doc;
            }
        }

        static void Main(string[] args)
        {
            XDocument doc;
            doc = OpcToFlatOpc("Test.docx");
            doc.Save("Test.xml", SaveOptions.DisableFormatting);
            doc = OpcToFlatOpc("Test2.pptx");
            doc.Save("Test2.xml", SaveOptions.DisableFormatting);
        }
    }
