MSDN Blogs
Transforming Open XML Documents to Flat OPC Format

Transforming Open XML documents using XSLT is an interesting scenario, but before we can do so, we need to convert the Open XML document into the Flat OPC format.  We then perform the XSLT transform, producing a new file in the Flat OPC format, and then convert back to Open XML (OPC) format.  This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

Transforming Open XML Documents using XSLT

Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

Transforming Open XML Documents to Flat OPC Format (This Post)

This post describes how to convert an Open XML (OPC) document into a Flat OPC document, and presents the C# function OpcToFlatOpc.

Transforming Flat OPC Format to Open XML Documents

This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

The Flat OPC Format

Presents a description and examples of the Flat OPC format.

 

About the Code

The code presented in this post uses LINQ to XML and System.IO.Packaging to perform the conversion to Flat OPC.

The signature of the function to convert from an Open XML document to Flat OPC is:

static XDocument OpcToFlatOpc(string path);

 

You pass as an argument the path to the Open XML document.  The method returns an XDocument object, which you can then modify as necessary, transform using XSLT, serialize to the standard output, or save to a file.
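
As a minimal sketch of the "transform using XSLT" option, the helper below applies a stylesheet to an XDocument and returns the result as a new XDocument. The class name XsltSketch, the method name ApplyXslt, and the choice to pass the stylesheet as a string are illustrative assumptions and are not part of the code in this post. The XDocument returned by OpcToFlatOpc would be passed as the input argument.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

static class XsltSketch
{
    // Hypothetical helper: applies the stylesheet in xsltMarkup to input
    // and returns the transformed tree as a new XDocument.
    public static XDocument ApplyXslt(XDocument input, string xsltMarkup)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader stylesheet = XmlReader.Create(new StringReader(xsltMarkup)))
            xslt.Load(stylesheet);

        XDocument output = new XDocument();
        // Disposing the writer completes the output tree.
        using (XmlWriter writer = output.CreateWriter())
        using (XmlReader reader = input.CreateReader())
            xslt.Transform(reader, writer);
        return output;
    }
}
```

A call might look like XsltSketch.ApplyXslt(OpcToFlatOpc("Test.docx"), File.ReadAllText("transform.xslt")), where transform.xslt is a placeholder file name.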

The code to convert a binary part to a base 64 string uses the System.Convert.ToBase64String method.  The base 64 string needs to be broken up into lines of 76 characters (see The Flat OPC Format for more detail).  The code uses the technique described in Chunking a Collection into Groups of Three to do the chunking.

If you are not familiar with this style of programming, I recommend that you read this Functional Programming Tutorial.

The conversion code adds the appropriate XML processing instruction to the resulting Flat OPC XML document based on the filename of the source Open XML document.  If the source document has the .docx extension, then the code adds the XML processing instruction for Word.  If the source document has the .pptx extension, then the code adds the XML processing instruction for PowerPoint.  For any other extension, the code adds no processing instruction.
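
To illustrate the extension check, here is a small sketch that builds the same two processing instructions. The class and method names are hypothetical, and it uses Path.GetExtension with a case-insensitive comparison rather than ToLower and EndsWith. That is one possible variation, not the code in this post.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

static class ProcessingInstructionSketch
{
    // Hypothetical variant of the extension check: returns null when the
    // extension is neither .docx nor .pptx.
    public static XProcessingInstruction ForPath(string path)
    {
        string ext = Path.GetExtension(path);
        if (string.Equals(ext, ".docx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"Word.Document\"");
        if (string.Equals(ext, ".pptx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"PowerPoint.Show\"");
        return null;
    }
}
```

For a path ending in .docx, calling ToString() on the result gives <?mso-application progid="Word.Document"?>, which is the line that appears near the top of the Flat OPC file.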

Here is the code to perform the transform (also attached):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;
using System.IO.Packaging;
using System.Xml;
using System.Xml.Schema;

class Program
{
    static XElement GetContentsAsXml(PackagePart part)
    {
        XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

        if (part.ContentType.EndsWith("xml"))
        {
            using (Stream str = part.GetStream())
            using (StreamReader streamReader = new StreamReader(str))
            using (XmlReader xr = XmlReader.Create(streamReader))
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XElement(pkg + "xmlData",
                        XElement.Load(xr)
                    )
                );
        }
        else
        {
            using (Stream str = part.GetStream())
            using (BinaryReader binaryReader = new BinaryReader(str))
            {
                int len = (int)binaryReader.BaseStream.Length;
                byte[] byteArray = binaryReader.ReadBytes(len);
                // the following expression creates the base64String, then chunks
                // it to lines of 76 characters long
                string base64String = (System.Convert.ToBase64String(byteArray))
                    .Select
                    (
                        (c, i) => new
                        {
                            Character = c,
                            Chunk = i / 76
                        }
                    )
                    .GroupBy(c => c.Chunk)
                    .Aggregate(
                        new StringBuilder(),
                        (s, i) =>
                            s.Append(
                                i.Aggregate(
                                    new StringBuilder(),
                                    (seed, it) => seed.Append(it.Character),
                                    sb => sb.ToString()
                                )
                            )
                            .Append(Environment.NewLine),
                        s => s.ToString()
                    );
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XAttribute(pkg + "compression", "store"),
                    new XElement(pkg + "binaryData", base64String)
                );
            }
        }
    }

    static XProcessingInstruction GetProcessingInstruction(string path)
    {
        if (path.ToLower().EndsWith(".docx"))
            return new XProcessingInstruction("mso-application",
                        "progid=\"Word.Document\"");
        if (path.ToLower().EndsWith(".pptx"))
            return new XProcessingInstruction("mso-application",
                        "progid=\"PowerPoint.Show\"");
        return null;
    }

    static XDocument OpcToFlatOpc(string path)
    {
        using (Package package = Package.Open(path))
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

            XDeclaration declaration = new XDeclaration("1.0", "UTF-8", "yes");
            XDocument doc = new XDocument(
                declaration,
                GetProcessingInstruction(path),
                new XElement(pkg + "package",
                    new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                    package.GetParts().Select(part => GetContentsAsXml(part))
                )
            );
            return doc;
        }
    }

    static void Main(string[] args)
    {
        XDocument doc;
        doc = OpcToFlatOpc("Test.docx");
        doc.Save("Test.xml", SaveOptions.DisableFormatting);
        doc = OpcToFlatOpc("Test2.pptx");
        doc.Save("Test2.xml", SaveOptions.DisableFormatting);
    }
}
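
For comparison, the chunking expression in GetContentsAsXml can also be sketched as a plain loop. The class name Base64ChunkSketch and the method name ToChunkedBase64 are hypothetical. The sketch is meant to produce the same string as the functional expression, with every line, including the last, followed by Environment.NewLine.

```csharp
using System;
using System.Text;

static class Base64ChunkSketch
{
    const int LineLength = 76;

    // Hypothetical helper: encodes bytes as base 64 and ends every line of
    // up to 76 characters with Environment.NewLine.
    public static string ToChunkedBase64(byte[] bytes)
    {
        string base64 = Convert.ToBase64String(bytes);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < base64.Length; i += LineLength)
        {
            // The final line may be shorter than LineLength.
            int count = Math.Min(LineLength, base64.Length - i);
            sb.Append(base64, i, count).Append(Environment.NewLine);
        }
        return sb.ToString();
    }
}
```

The framework also offers Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks), which inserts line breaks after every 76 characters and may be sufficient if exact equivalence with the expression above is not required.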

 

Posted: Monday, September 29, 2008 1:50 PM by EricWhite

Attachment(s): OpcToFlat.zip

Comments

unruledboy said:

How can we transform it directly to an HTML file?

# October 8, 2008 11:15 PM

John Holliday, MVP Office SharePoint Server 2007 said:

Important Safety Tip for Office Open XML - Flatten Your Package!

# October 25, 2008 12:36 AM

romeok said:

I think that code for chunking a string into 76-character lengths is nuts. It does give an example of how interesting the new language features are, but the example is inappropriate. A simple loop would be easier to read and take less code to write.

# September 30, 2009 9:15 PM

EricWhite said:

Hi Romeok,

You're right: in this scenario, you could rewrite the chunking with a loop.  Because of the design of the Open XML SDK (and System.IO.Packaging underneath), we're forced into an imperative approach here anyway.  But in other scenarios, when writing pure functional transforms, a loop would move the code from expression context to statement context.  That would mean either refactoring, having a locally impure method, or making the code impure, which leads to a number of problems, including bugs introduced by mixing imperative and declarative code, and the loss of easy use of multiple processors.  I'm in the habit of chunking using the functional approach, and used it here even though the code wouldn't suffer from using a loop.

-Eric

# October 1, 2009 7:13 AM