This blog is inactive. New blog: EricWhite.com/blog
You can use LINQ to XML to transform XML trees with the same power and expressiveness as XSLT, and in many cases with more.
One of the reasons that XSL is so powerful is that you can write multiple rules to transform a node. The rule that most specifically matches is the one that is applied.
To make this clear, consider the following source document:
<Parent>
  <Heading>Heading 1 text</Heading>
  <Heading>Heading 2 text</Heading>
</Parent>
We can specify a transform like this:
 1  <?xml version='1.0'?>
 2  <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
 3    <xsl:template match='/Parent'>
 4      <Root>
 5        <xsl:apply-templates/>
 6      </Root>
 7    </xsl:template>
 8    <xsl:template match='Heading[1]'>
 9      <SpecialHeading>
10        <xsl:value-of select='.'/>
11      </SpecialHeading>
12    </xsl:template>
13    <xsl:template match='Heading'>
14      <H1>
15        <xsl:value-of select='.'/>
16      </H1>
17    </xsl:template>
18  </xsl:stylesheet>
When this stylesheet is applied to the source document, we see:
<Root>
  <SpecialHeading>Heading 1 text</SpecialHeading>
  <H1>Heading 2 text</H1>
</Root>
The template defined starting on line 8 is the transform that is applied for the first <Heading> element, even though the template defined on line 13 also matches. The rule on line 8 matches more specifically (in XSLT, a pattern with a predicate has a higher default priority than a bare element name), so it is the one that is applied. This is the power of XSL: you apply transforms to nodes based on a pattern to match, and the specificity of the rule is significant. This allows you to write powerful transformations where you first handle exception cases, and then impose rules that handle all other cases in a general way.
Another reason that XSL is so powerful is that you can apply a transformation to a specific node, and use the <xsl:apply-templates> element to indicate that child nodes should be transformed per their own rules.
If we have this source document:
<Parent>
  <Heading>
    <Text>This is some text</Text>
  </Heading>
</Parent>
And transform it with this stylesheet:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
  <xsl:template match='/Parent'>
    <Root>
      <xsl:apply-templates/>
    </Root>
  </xsl:template>
  <xsl:template match='Heading'>
    <H1>
      <xsl:apply-templates/>
    </H1>
  </xsl:template>
  <xsl:template match='Text'>
    <t>
      <xsl:value-of select='.'/>
    </t>
  </xsl:template>
</xsl:stylesheet>
It results in this XML:
<Root>
  <H1>
    <t>This is some text</t>
  </H1>
</Root>
We were able to specify separate transforms for the Heading and Text elements, and by using the <xsl:apply-templates> element, the template to transform the heading doesn't have to concern itself with transforms of the child Text element.
Some time ago, I blogged on a technique for using annotations to transform LINQ to XML trees in this same style – the style of XSLT. Ever since that time, I've been mulling over the approach, thinking about how to improve it. This post summarizes and shows my current thoughts about this approach to performing document-centric transformations using LINQ to XML.
The example presented here transforms an Open XML word processing document to XHTML in less than 100 lines of code (not counting the infrastructure code that enables the transformation). The transformation that I present converts paragraphs styled as Heading1 and Heading2 to h1 and h2 nodes, and also handles hyperlinks and bolded text. It even includes a rudimentary transformation of a Word table to an XHTML table.
Note: all of the code mentioned in this post is attached to this page.
Here is the Word document that I transform:
Here is the rendering of the resulting XHTML:
The code presented here is not a complete, full-fidelity transform. However, it serves to demonstrate the technique.
Note: I have plans to enhance this code (over time) so that this transformation is more complete. In particular, I plan on enhancing this code so that I can transform a DOCX into XHTML for my blog posts. I'd really like code presented in a blog post to have an automatically inserted "Copy Code" button above each code snippet.
The code presented here has the following features:
Document-Centric Transforms
Some XML documents are "document-centric". With such documents, you don't necessarily know the shape of the child nodes of an element. For instance, an element that contains text may look like this:
<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>
For any given <text> element, there may be any number of child <b> and <i> elements.
Open XML documents contain document-centric markup. For example, the body of the document can contain any number of paragraphs; tables are siblings to paragraphs; each paragraph can contain any number of formatted text runs; hyperlinks are expressed as sibling elements to text run elements. One of the primary characteristics of document-centric XML is that you do not know exactly which child elements any particular element will have, or in what order they will appear.
If you want to transform nodes in a tree where you don't necessarily know which particular children an element may have, then the annotation-based approach described here is effective. It allows you to specify the transformation in a minimal amount of code.
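To make the "unknown shape" point concrete, here is a small sketch (not part of the attached code) that lists the child nodes of the <text> element shown earlier. It uses only standard LINQ to XML calls; the class name MixedContentDemo is just an illustrative name.

```csharp
using System;
using System.Xml.Linq;

public static class MixedContentDemo
{
    public static void Main()
    {
        XElement text = XElement.Parse(
            "<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>");

        // Text nodes and elements are interleaved; a transform cannot assume
        // a fixed sequence of children, so it has to handle each node as it comes.
        foreach (XNode node in text.Nodes())
            Console.WriteLine($"{node.NodeType}: [{node}]");

        // Prints:
        // Text: [A phrase with ]
        // Element: [<b>bold</b>]
        // Text: [ and ]
        // Element: [<i>italic</i>]
        // Text: [ text.]
    }
}
```

A transform written against a fixed child layout would break as soon as another <b> or <i> element appeared, which is why per-node rules are a better fit here.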
Overview of the Approach
The summary of the approach is:
In detail, the approach consists of:
This is analogous to the specification of transforms in XSL. The query that selects a set of nodes is analogous to the XPath expression for a template. The code to create the new node in TransformAnnotation.Replace is analogous to the sequence constructor in XSL, and as mentioned, the ApplyTransforms node is analogous in function to the <xsl:apply-templates> element in XSL.
One primary advantage of this approach is that, as you formulate queries, you are always writing them against the unmodified source tree. You don't need to concern yourself with how modifications to the tree affect the queries that you are writing.
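The reason queries keep seeing the original tree is that LINQ to XML annotations are stored alongside a node rather than in its content. A minimal sketch, using only the standard AddAnnotation and Annotation<T> methods (the string "to-H1" is just an illustrative marker, not something the attached code uses):

```csharp
using System;
using System.Xml.Linq;

public static class AnnotationDemo
{
    public static void Main()
    {
        XElement doc = XElement.Parse("<body><heading>Summary</heading></body>");
        string before = doc.ToString();

        // Attach an arbitrary object to the node. This does not change the XML.
        doc.Element("heading").AddAnnotation("to-H1");

        // The serialized tree is identical, and queries still find <heading>.
        Console.WriteLine(before == doc.ToString());                     // True
        Console.WriteLine(doc.Element("heading").Annotation<string>()); // to-H1
    }
}
```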
Another primary advantage of this approach is that you can specify that any node found throughout the source tree be transformed according to a given rule without concerning yourself with the specific child nodes of that node. Those child nodes can have their own rules to specify their transformation.
As mentioned at the top of this post, XSL lets you define multiple rules that match a given node, and its semantics specify that the most specific match is the one applied. This allows you to define very specific transforms for certain nodes, and then a more general transform that applies in all other cases. The approach presented here has analogous semantics, with order standing in for specificity: the first annotation added to a node is the one used for the transform. You can add other annotations to the node, but the transformation ignores them. In practice, this means you add the rules for exception cases first and the general rules last.
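The infrastructure code itself is in the attachment rather than in this post. To show how the pieces could fit together, the following is a minimal sketch of my own, not the attached implementation. The names TransformAnnotation, ApplyTransforms, TransformReplace, TransformRemove, and Transform come from this post; the field names, signatures, and method bodies are assumptions, and the sketch relies on Annotations<T>() enumerating annotations in the order they were added.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Assumed shape of a rule: a replacement template plus the mode it applies to.
public class TransformAnnotation
{
    public string Mode;     // null = default mode
    public XNode Replace;   // null = remove the node from the output
}

// Placeholder placed inside a template; expanded into transformed source nodes.
public class ApplyTransforms : XElement
{
    public IEnumerable<XNode> Selected;   // null = child nodes of the annotated element
    public string Mode;

    public ApplyTransforms() : base("ApplyTransforms") { }

    public ApplyTransforms(IEnumerable<XNode> selected, string mode = null) : this()
    {
        Selected = selected;
        Mode = mode;
    }
}

public static class TransformExtensions
{
    public static void TransformReplace(this XNode node, XNode replacement, string mode = null) =>
        node.AddAnnotation(new TransformAnnotation { Mode = mode, Replace = replacement });

    public static void TransformRemove(this XNode node, string mode = null) =>
        node.AddAnnotation(new TransformAnnotation { Mode = mode, Replace = null });

    // Returns the first node produced for the source node (null if it was removed).
    public static XNode Transform(this XNode node, string mode = null) =>
        TransformNode(node, mode).FirstOrDefault();

    private static IEnumerable<XNode> TransformNode(XNode source, string mode)
    {
        // The first annotation added for this mode wins; later ones are ignored.
        var rule = source.Annotations<TransformAnnotation>().FirstOrDefault(a => a.Mode == mode);
        if (rule != null)
            return rule.Replace == null ? Enumerable.Empty<XNode>() : Expand(rule.Replace, source);

        // No rule for this mode: copy the node and keep transforming below it.
        if (source is XElement e)
            return new XNode[] { new XElement(e.Name, e.Attributes(),
                e.Nodes().SelectMany(n => TransformNode(n, mode))) };
        return new[] { Clone(source) };
    }

    // Copies a template, replacing each ApplyTransforms placeholder with transformed nodes.
    private static IEnumerable<XNode> Expand(XNode template, XNode source)
    {
        if (template is ApplyTransforms at)
        {
            var selected = at.Selected ?? (source as XElement)?.Nodes() ?? Enumerable.Empty<XNode>();
            return selected.SelectMany(n => TransformNode(n, at.Mode)).ToList();
        }
        if (template is XElement t)
            return new XNode[] { new XElement(t.Name, t.Attributes(),
                t.Nodes().SelectMany(n => Expand(n, source))) };
        return new[] { Clone(template) };
    }

    private static XNode Clone(XNode n) => n switch
    {
        XCData c => new XCData(c),
        XText t => new XText(t),
        XComment c => new XComment(c),
        XProcessingInstruction p => new XProcessingInstruction(p),
        XElement e => new XElement(e),
        _ => n
    };
}

public static class Demo
{
    public static void Main()
    {
        var src = XElement.Parse(
            "<Parent><Heading>Heading 1 text</Heading><Heading>Heading 2 text</Heading></Parent>");
        src.TransformReplace(new XElement("Root", new ApplyTransforms()));
        // Exception case first, general case second.
        src.Elements("Heading").First()
            .TransformReplace(new XElement("SpecialHeading", new ApplyTransforms()));
        foreach (var h in src.Elements("Heading"))
            h.TransformReplace(new XElement("H1", new ApplyTransforms()));
        Console.WriteLine(src.Transform());
    }
}
```

With the source document from the top of this post, the demo should print a <Root> element containing a <SpecialHeading> for the first heading and an <H1> for the second, mirroring the XSLT result. Deriving ApplyTransforms from XElement is one design choice that lets the placeholder be passed as ordinary content to the XElement constructor; the attached code may do this differently.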
The following is a simple example that shows how to transform a tree. It uses a special rule to transform the first heading to the element <SpecialHeading>. Other heading elements are transformed to <H1> elements. This demonstrates that the transform that we specified for the first heading takes precedence over transforms that were subsequently specified.
Example 1:
XElement sourceDocument = XElement.Parse(
@"<document>
  <body>
    <heading>Overview of the Technique</heading>
    <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
    <heading>The Technique in Detail</heading>
    <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
    <heading>Summary</heading>
    <t>Pellentesque habitant morbi tristique.</t>
  </body>
</document>");

// transform body to Body
sourceDocument
    .Element("body")
    .TransformReplace(new XElement("Body", new ApplyTransforms()));

// transform the first heading in a special way
// (added first, so it takes precedence over the general rule below)
sourceDocument
    .Descendants("heading")
    .First()
    .TransformReplace(new XElement("SpecialHeading", new ApplyTransforms()));

// transform heading to H1
foreach (var item in sourceDocument.Descendants("heading"))
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));

Console.WriteLine(sourceDocument.Transform());
This example produces the following output:
<document>
  <Body>
    <SpecialHeading>Overview of the Technique</SpecialHeading>
    <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
    <H1>The Technique in Detail</H1>
    <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
    <H1>Summary</H1>
    <t>Pellentesque habitant morbi tristique.</t>
  </Body>
</document>
The following example demonstrates the use of modes. It uses the same source document as the above example. It defines two transforms, one where the mode = "TOC", which transforms the document into a table of contents. The second transform passes no argument to the Transform method, which means that it matches when mode = null. This transforms the document into a different form for the body of the new document.
Example 2:
XElement sourceDocument = XElement.Parse(
@"<document>
  <body>
    <heading>Overview of the Technique</heading>
    <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
    <heading>The Technique in Detail</heading>
    <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
    <heading>Summary</heading>
    <t>Pellentesque habitant morbi tristique.</t>
  </body>
</document>");

// define the root transform for the table of contents
sourceDocument.TransformReplace(
    new XElement("TableOfContents",
        new ApplyTransforms(sourceDocument.Element("body").Elements("heading"), "TOC")),
    "TOC");

// define the transform of each heading element for the table of contents
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("TocItem", (string)item), "TOC");
}

// define the transform of the document body
sourceDocument.Element("body").TransformReplace(
    new XElement("Body",
        new ApplyTransforms(sourceDocument.Element("body").Elements())
    ));

// define the transforms of heading elements for the document body
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));
}

// define the transforms of t elements for the document body
foreach (var item in sourceDocument.Descendants("t"))
{
    item.TransformReplace(new XElement("Text", new ApplyTransforms()));
}

// assemble the new document with both TOC and body
XElement newDoc = new XElement("Root",
    sourceDocument.Transform("TOC"),
    sourceDocument.Element("body").Transform());

Console.WriteLine(newDoc);
This example produces:
<Root>
  <TableOfContents>
    <TocItem>Overview of the Technique</TocItem>
    <TocItem>The Technique in Detail</TocItem>
    <TocItem>Summary</TocItem>
  </TableOfContents>
  <Body>
    <H1>Overview of the Technique</H1>
    <Text>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</Text>
    <H1>The Technique in Detail</H1>
    <Text>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</Text>
    <H1>Summary</H1>
    <Text>Pellentesque habitant morbi tristique.</Text>
  </Body>
</Root>
The final example presented here transforms an Open XML document into XHTML. It defines a number of transforms by annotating a variety of nodes. At the end, it adds annotations to every node in the tree indicating that the node should be deleted from the transformed tree. But this rule that deletes nodes is ignored for all nodes that have already been annotated.
Note: this code uses the Open XML SDK, which is available here.
DocxToHtml:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open("Test.docx", true))
{
    XDocument doc = wordDoc.MainDocumentPart.GetXDocument();
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    XNamespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    XNamespace h = "http://www.w3.org/1999/xhtml";

    // transform the document root element to the XHTML root element
    doc.Root.TransformReplace(
        new XElement(h + "html",
            new XElement(h + "head",
                new XElement(h + "title", "Test.docx")
            ),
            new ApplyTransforms(doc.Root.Elements(w + "body"))
        )
    );

    // transform the w:body element to the XHTML h:body element
    doc.Element(w + "document").Element(w + "body").TransformReplace(
        new XElement(h + "body", new ApplyTransforms()));

    // transform every hyperlink in the document to the XHTML h:A element
    foreach (var item in doc.Descendants(w + "hyperlink"))
    {
        item.TransformReplace(
            new XElement(h + "A",
                new XAttribute("href",
                    wordDoc.MainDocumentPart
                        .ExternalRelationships
                        .Where(x => x.Id == (string)item.Attribute(r + "id"))
                        .First()
                        .Uri
                ),
                new XText(item.Elements(w + "r")
                    .Elements(w + "t")
                    .Select(s => (string)s).StringConcatenate())
            )
        );
    }

    // transform every Heading1 styled paragraph to the XHTML h:h1 element
    foreach (var item in doc.Descendants(w + "p")
        .Where(z => (string)z.Elements(w + "pPr")
            .Elements(w + "pStyle")
            .Attributes(w + "val")
            .FirstOrDefault() == "Heading1"))
    {
        item.TransformReplace(new XElement(h + "h1", new ApplyTransforms()));
    }

    // transform every Heading2 styled paragraph to the XHTML h:h2 element
    foreach (var item in doc.Descendants(w + "p")
        .Where(z => (string)z.Elements(w + "pPr")
            .Elements(w + "pStyle")
            .Attributes(w + "val")
            .FirstOrDefault() == "Heading2"))
    {
        item.TransformReplace(new XElement(h + "h2", new ApplyTransforms()));
    }

    // transform every text run that is styled as bold to the XHTML h:b element
    foreach (var item in doc.Descendants(w + "r")
        .Where(z => z.Elements(w + "rPr").Elements(w + "b").Any()))
    {
        item.TransformReplace(
            new XElement(h + "b",
                item.Elements(w + "t")
                    .Select(e => (string)e).StringConcatenate()));
    }

    // transform every text run that is not styled as bold to a text node
    // that contains the text of the run
    foreach (var item in doc.Descendants(w + "r")
        .Where(z => !z.Elements(w + "rPr").Elements(w + "b").Any()))
    {
        item.TransformReplace(
            new XText(item.Elements(w + "t").Select(e => (string)e).StringConcatenate()));
    }

    // transform w:p to h:p (ignored for paragraphs already annotated as headings)
    foreach (var item in doc.Descendants(w + "p"))
    {
        item.TransformReplace(new XElement(h + "p", new ApplyTransforms()));
    }

    // transform w:tbl to h:table
    foreach (var item in doc.Descendants(w + "tbl"))
    {
        item.TransformReplace(
            new XElement(h + "table",
                new XAttribute("border", 1),
                new ApplyTransforms()
            )
        );
    }

    // transform w:tr to h:tr
    foreach (var item in doc.Descendants(w + "tr"))
    {
        item.TransformReplace(new XElement(h + "tr", new ApplyTransforms()));
    }

    // transform w:tc to h:td
    foreach (var item in doc.Descendants(w + "tc"))
    {
        item.TransformReplace(new XElement(h + "td", new ApplyTransforms()));
    }

    // the following removes any nodes that haven't been replaced
    foreach (var item in doc.DescendantNodes())
    {
        item.TransformRemove();
    }

    XElement newDoc = (XElement)doc.Root.Transform();
    newDoc.Save("test.html");
}
When run using the document attached to this post, it produces the following:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Test.docx</title>
  </head>
  <body>
    <h1>LINQ to XML Transformations in the Style of XSLT</h1>
    <h2>Styled Text</h2>
    <p>Some <b>bold</b> text.</p>
    <p>Some normal text.</p>
    <h2>Hyperlinks</h2>
    <p>See my <A href="http://blogs.msdn.com/ericwhite">blog</A>.</p>
    <h2>Tables</h2>
    <p>This text introduces the following tables:</p>
    <table border="1">
      <tr>
        <td>
          <p>
            <b>Order Number</b>
          </p>
        </td>
        <td>
          <p>
            <b>Order Date</b>
          </p>
        </td>
        <td>
          <p>
            <b>Amount</b>
          </p>
        </td>
      </tr>
      <tr>
        <td>
          <p>124245</p>
        </td>
        <td>
          <p>10/24/2008</p>
        </td>
        <td>
          <p>42.55</p>
        </td>
      </tr>
      <tr>
        <td>
          <p>147867</p>
        </td>
        <td>
          <p>10/31/2008</p>
        </td>
        <td>
          <p>88.99</p>
        </td>
      </tr>
    </table>
    <p />
    <p>Item Detail for Order 124245</p>
    <table border="1">
      <tr>
        <td>
          <p>
            <b>Line Number</b>
          </p>
        </td>
        <td>
          <p>
            <b>Item</b>
          </p>
        </td>
        <td>
          <p>
            <b>Quantity</b>
          </p>
        </td>
      </tr>
      <tr>
        <td>
          <p>1</p>
        </td>
        <td>
          <p>HH242</p>
        </td>
        <td>
          <p>3</p>
        </td>
      </tr>
      <tr>
        <td>
          <p>2</p>
        </td>
        <td>
          <p>TY149</p>
        </td>
        <td>
          <p>8</p>
        </td>
      </tr>
      <tr>
        <td>
          <p>3</p>
        </td>
        <td>
          <p>ZZTXT</p>
        </td>
        <td>
          <p>4</p>
        </td>
      </tr>
    </table>
    <p />
  </body>
</html>
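The DocxToHtml listing calls StringConcatenate, an extension method that is not part of LINQ to XML and is not shown in this post. A minimal sketch of what such a helper could look like follows; the body is my assumption about its behavior (joining the strings with no separator), not a copy of the attached code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class StringExtensions
{
    // Assumed behavior: concatenate all strings in order, with no separator.
    public static string StringConcatenate(this IEnumerable<string> source)
    {
        var sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }
}

public static class Demo
{
    public static void Main()
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // A run whose text is split across two w:t elements.
        var run = new XElement(w + "r",
            new XElement(w + "t", "Order "),
            new XElement(w + "t", "Number"));

        Console.WriteLine(
            run.Elements(w + "t").Select(t => (string)t).StringConcatenate());
        // prints "Order Number"
    }
}
```

On .NET Framework 4 and later, string.Concat over the same sequence should give the same result.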
Thanks to Dirk Myers who suggested that this approach could support modes.