Transforming Open XML documents using XSLT is an interesting scenario. However, Open XML documents are stored using the Open Packaging Conventions (OPC): they are essentially ZIP files that contain XML and binary parts, and XSLT processors can’t open and transform such files. But if we first convert the document to a different format, the Flat OPC format, which holds all the parts of the package in a single XML file, we can then transform the document using XSLT.
Perhaps the most compelling reason to use XSLT on Open XML documents is document generation. You can take a source ‘template’ Open XML document and a source XML data document, and produce a finished, formatted Open XML document with content derived from the source XML data document.
This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT. The four posts are:
Transforming Open XML Documents using XSLT (This Post)
Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important. Also presents the ‘Hello World’ XSLT transform of an Open XML document.
Transforming Open XML Documents to Flat OPC Format
Describes how to convert an Open XML (OPC) document into a Flat OPC document, and presents the C# function, OpcToFlat.
Transforming Flat OPC Format to Open XML Documents
Describes how to convert a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.
The Flat OPC Format
Presents a description and examples of the Flat OPC format.
This approach is particularly important in SharePoint – it allows us to write and install a SharePoint feature that can transform Open XML documents in a general way using XSL style sheets stored in document libraries. XSLT developers can then create a variety of XSL transforms of Open XML documents without writing and installing server-side code for each type of transform. I’ll be writing about this powerful technique in the near future.
As you can see in the code in the linked posts, the conversion to and from the Flat OPC format is simple: fewer than 100 lines of code for each type of conversion.
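To give a feel for what the conversion involves, here is a minimal sketch in Python using only the standard library. It is not the C# code from the linked posts. The function name opc_to_flat_opc, the rule that any content type ending in "xml" is stored as XML, and the "application/octet-stream" fallback are assumptions made for illustration. ElementTree also rewrites namespace prefixes and the sketch omits the mso-application processing instruction, so treat the result as something to inspect rather than something Word is guaranteed to open.

```python
import base64
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def opc_to_flat_opc(docx):
    """Return a pkg:package element holding every part of the OPC file.

    `docx` may be a path or a binary file-like object.
    """
    ET.register_namespace("pkg", PKG_NS)
    package = ET.Element(f"{{{PKG_NS}}}package")
    with zipfile.ZipFile(docx) as archive:
        # [Content_Types].xml maps part names and extensions to content types.
        types = ET.fromstring(archive.read("[Content_Types].xml"))
        defaults = {
            d.get("Extension").lower(): d.get("ContentType")
            for d in types.findall(f"{{{CT_NS}}}Default")
        }
        overrides = {
            o.get("PartName"): o.get("ContentType")
            for o in types.findall(f"{{{CT_NS}}}Override")
        }
        for name in archive.namelist():
            # The content-types file becomes attributes, not a part of its own.
            if name == "[Content_Types].xml" or name.endswith("/"):
                continue
            part_name = "/" + name
            extension = name.rsplit(".", 1)[-1].lower()
            # Assumption: unknown parts fall back to a generic binary type.
            content_type = overrides.get(part_name) or defaults.get(
                extension, "application/octet-stream"
            )
            part = ET.SubElement(package, f"{{{PKG_NS}}}part", {
                f"{{{PKG_NS}}}name": part_name,
                f"{{{PKG_NS}}}contentType": content_type,
            })
            data = archive.read(name)
            if content_type.endswith("xml"):
                # XML parts are embedded as child elements of pkg:xmlData.
                xml_data = ET.SubElement(part, f"{{{PKG_NS}}}xmlData")
                xml_data.append(ET.fromstring(data))
            else:
                # Binary parts (images and so on) are stored as base64 text.
                binary = ET.SubElement(part, f"{{{PKG_NS}}}binaryData")
                binary.text = base64.b64encode(data).decode("ascii")
    return package
```

Calling ET.tostring(opc_to_flat_opc("source.docx"), encoding="unicode") would give the single XML document that an XSLT processor can read.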
The program OpcXsltTransform (attached) uses the code in the above posts, and the classes in System.Xml.Xsl to perform a transform using a supplied XSL style sheet.
To run OpcXsltTransform, you supply as arguments the source Open XML document, the destination Open XML document, and the name of the XSL style sheet. You can optionally supply a fourth argument, -outputIntermediate. If you supply it, OpcXsltTransform saves two intermediate files to disk: the Flat OPC file produced from the source Open XML document, and the new Flat OPC file produced by the XSL transform. This can be helpful in debugging the XSL style sheet. Each intermediate file has the same name as the corresponding document, but with a file extension of ‘.xml’: one is named after the source document, the other after the destination document. Here is the usage of DocXslTransform:
DocXslTransform -source source.docx -destination dest.docx -xsl transform.xsl [-outputIntermediate]
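The argument handling and the intermediate file naming rule can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the attached program's code; the names parse_args and intermediate_path are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse the three required arguments and the optional debugging switch."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-source", required=True)
    parser.add_argument("-destination", required=True)
    parser.add_argument("-xsl", required=True)
    parser.add_argument("-outputIntermediate", dest="output_intermediate",
                        action="store_true")
    return parser.parse_args(argv)


def intermediate_path(document_path):
    """Same name as the document, but with a '.xml' file extension."""
    return Path(document_path).with_suffix(".xml")
```

With the usage line above, the intermediate files would be source.xml and dest.xml.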
Here is an artificially simplistic XSL style sheet that works with the Flat OPC format. It finds every text element (w:t) in the main document body whose content is exactly ‘Hello World’ and replaces it with a new text element that contains ‘Goodbye World’. Everything else is copied through unchanged by the identity transform.
<?xml version='1.0'?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    version='1.0'>
  <xsl:template match="w:document/w:body/w:p/w:r/w:t[node()='Hello World']">
    <w:t>Goodbye World</w:t>
  </xsl:template>
  <!-- The following transform is the identity transform -->
  <xsl:template match="/|@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
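To make clear which nodes the match pattern selects, here is the same edit written as a direct tree walk in Python. This is not XSLT (Python's standard library has no XSLT processor), only a sketch of the equivalent selection and replacement. The function name replace_hello_world is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def replace_hello_world(flat_opc_root):
    """Mirror the style sheet: swap exact 'Hello World' text elements.

    Returns the number of w:t elements that were replaced.
    """
    replaced = 0
    # w:document may sit anywhere in the Flat OPC tree (inside pkg:xmlData).
    for document in flat_opc_root.iter(f"{W}document"):
        # Same path as the match pattern: w:body/w:p/w:r/w:t under w:document.
        for t in document.findall(f"{W}body/{W}p/{W}r/{W}t"):
            if t.text == "Hello World":
                # The style sheet emits a fresh <w:t>, so attributes are dropped.
                t.attrib.clear()
                t.text = "Goodbye World"
                replaced += 1
    return replaced
```

Note that the comparison is exact: a text element containing ‘Hello World again’ is left alone, just as it is by the style sheet.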
This style sheet, as well as the DOCX that it transforms, is included in the bin/debug directory in the attached ZIP file. You can build the project and run it to see the transform take place.
How about transforming to HTML files?
Sure, no problem. Modify the program that starts the XSLT transform so that it doesn't convert the XML that results from the transform back to a DOCX. Then, write the XSL to transform from the Flat OPC to XHTML (or HTML). The file that results from the XSLT transform can be whatever you want it to be.
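As a rough sketch of the mapping such a transform would perform, here it is in Python rather than XSL: each paragraph (w:p) in the main document body becomes an HTML p element, and the text of all its runs is joined, since Word often splits one paragraph across several runs. The function name flat_opc_to_html is hypothetical, and formatting, tables, and images are ignored.

```python
import html
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def flat_opc_to_html(flat_opc_xml):
    """Turn the paragraphs of a Flat OPC document into bare HTML paragraphs."""
    root = ET.fromstring(flat_opc_xml)
    body = root.find(f".//{W}document/{W}body")
    paragraphs = []
    if body is not None:
        for p in body.iter(f"{W}p"):
            # Join every w:t in the paragraph so split runs read as one string.
            text = "".join(t.text or "" for t in p.iter(f"{W}t"))
            paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "<html><body>\n" + "\n".join(paragraphs) + "\n</body></html>"
```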
I'd very much like to use something like FlatToOpc in my Visual Studio 2005 web project. How?
Hi Sylvain, System.IO.Packaging is available for .NET 3.0, and should work with C# 2.0. You would need to rewrite OpcToFlatOpc and FlatToOpc to work with C# 2.0. Given that the code is less than 100 lines long, it shouldn't be too difficult. Does this answer your question?
This is exactly what FlexDoc does, except that it doesn't use the flatOpc-format: it works directly on the xml of the header-, footer- and maindocumentpart.
Check it out here: http://flexdoc.codeplex.com.
Is there a similar one for Excel format conversion?
Hi SR, I've not yet written an XSLT conversion for the Excel format.
I am very new to this technique. I have a Word file, and when I get the XML of that file, the paragraph is split into multiple RunItems. If I select the first occurrence of the node in XSL, I get only the first portion of the paragraph. Could you help me handle this using OpcXsltTransform?
Does this support XSLT 2.0?
It certainly can work. Microsoft does not have an XSLT 2.0 processor - however, you can use any of the other XSLT processors that you can use with .NET. This series of blog posts is really about transforming Open XML to XML that you can then process using any number of approaches to transform XML.