When thought of in a certain way, XML documents come in two flavors – data-centric and document-centric. Further, there are two types of document-centric documents. This post presents my thoughts about approaches to various types of document-centric transformations – data-centric to document-centric, document-centric to data-centric, and document-centric to document-centric. Then, I’ll tie my thoughts back to Open XML transformations.
Data-centric to data-centric is, of course, the scenario that LINQ to XML absolutely shines at. There’s been a lot written about this. This post won’t focus on these types of transformations, but instead will give my thoughts on the wrinkle that document-centric XML documents give to transformations.
First, I’ll define what I mean by data-centric and document-centric XML documents.
Data-Centric XML Document
A data-centric XML document contains regular repeating elements. Child elements of a given element might all have the same tag name, or they might not. Typically, child element order doesn’t matter. There are lots of examples of this – many types of transforms of a relational database to XML result in data-centric XML. RSS feeds are another.
Here’s a data-centric XML document:
<Customers>
  <Customer>
    <Name>Bob</Name>
    <Age>45</Age>
  </Customer>
  <Customer>
    <Name>Jill</Name>
    <Age>37</Age>
  </Customer>
</Customers>
Document-Centric XML Document
Document-centric XML documents have the characteristic that the child elements of a given element are much less bounded – you might have many child elements of a given name, or you might have none. You might have ‘recursion’ in the hierarchy – element A is a child of element B, which is itself a child of a different element A. A number of examples: Open XML word processing markup, XHTML, and XPS.
I further divide document-centric XML documents into two camps – those that contain mixed content, and those that don’t. Mixed content is a variety of XML where significant text nodes and elements are interspersed. Insignificant text nodes are the white space that provides indenting when formatting XML. Open XML word processing markup doesn’t contain mixed content, whereas XHTML does:
An Open XML paragraph that contains some bold text:
<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>def</w:t>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>
An XHTML document that contains significant text nodes interspersed with element start and end tags:
<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>
Types of Transformations
If we’re going to divide the XML world into the categories of data-centric and document-centric, it follows that there are four types of transformations.
Data-Centric => Data-Centric
There is a lot to say (and has been said) about these types of transformations. In the LINQ to XML documentation, I included a tutorial on pure functional transformations of XML. I also have a tutorial on my blog on composing queries in the pure functional style.
Data-Centric => Document-Centric
These transforms are report writers for databases – take some subset of records, transform to XML, then transform that XML into another form – XPS, for instance. The transform may be based on another source document, the report definition. These types of transforms are straightforward to write in the pure functional style. Based on the simplicity or complexity of the report definition, this type of transform could be a few hundred lines of code or many thousands.
There are also many good examples of transforming data-centric XML to Open XML markup. We may want to transform a collection of records into a table in a word processing document, or into rows and cells in a worksheet.
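As a minimal sketch of this kind of transform, the following projects the Customers document shown earlier into the skeleton of a word processing table. The class and method names (CustomerTable, ToTable) are made up for illustration, and the result leaves out the table properties and grid elements that a complete table would normally carry, so treat it as a starting point rather than finished markup.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CustomerTable
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: one table row per Customer, one cell per child element.
    public static XElement ToTable(XElement customers)
    {
        return new XElement(w + "tbl",
            customers.Elements("Customer").Select(customer =>
                new XElement(w + "tr",
                    customer.Elements().Select(field =>
                        new XElement(w + "tc",
                            new XElement(w + "p",
                                new XElement(w + "r",
                                    new XElement(w + "t", (string)field)
                                )
                            )
                        )
                    )
                )
            )
        );
    }
}
```

Calling CustomerTable.ToTable on the Customers element returns a w:tbl element with two rows of two cells each. The whole transform is a single expression with no side effects, which is what makes this category straightforward.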
Document-Centric => Data-Centric
We write this type of transform when querying an Open XML document for some aspect of the markup. If we want to retrieve a collection of comments from a document, or if we want a collection of content controls, then we write a query that iterates over certain descendant elements, projecting a regular data structure – perhaps a collection of strings or anonymous types. The query that I develop in my functional programming tutorial is a document-centric => data-centric transform.
Another example is finding all hyperlinks in an XHTML document. It is easy to write a LINQ query to retrieve a collection of links and transform the collection to a regular repeating data structure.
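Here is a small sketch of the hyperlink case. It assumes the XHTML elements are in no namespace, as in the sample documents in this post (real XHTML normally lives in the http://www.w3.org/1999/xhtml namespace, in which case the element names would need to be qualified). The names LinkQuery, ToLinkList, Links, Link, Text, and Href are all invented for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LinkQuery
{
    // Hypothetical helper: projects every <a href="..."> in the tree
    // into a regular, repeating <Link> record.
    public static XElement ToLinkList(XElement html)
    {
        return new XElement("Links",
            html.Descendants("a")
                .Where(a => a.Attribute("href") != null)   // skip named anchors
                .Select(a => new XElement("Link",
                    new XElement("Text", a.Value),
                    new XElement("Href", (string)a.Attribute("href")))));
    }
}
```

The source tree can be as irregular as it likes. The result is data-centric: every Link element has exactly one Text child and one Href child.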
Document-Centric => Document-Centric
This is where it starts to get a little more involved. There are a variety of these types of transformations.
Common-vocabulary document-centric transform: Sometimes we want to transform a source tree to a new tree where all comments have been removed, or all tracked revisions accepted. The source document has the same vocabulary as the transformed document.
Different-vocabulary document-centric transform: Sometimes we want to transform from one document-centric vocabulary to another one – Open XML => XHTML, or XHTML => Open XML. With this type of transform, the ease with which we write the transform is directly related to whether the two vocabularies have a similar structure. For instance, there is much that is parallel between Open XML and XHTML. There is a body element. The body contains paragraphs and tables. Tables contain rows, which contain cells. Tables can contain other tables in cells.
XSLT works well for these types of transformations – you write a pattern to match a node, and then supply the transformation for just that node. In the case of XSLT, you can indicate to the transform engine to ‘continue processing rules for child elements’, so that you can specify the transforms for those child elements in their own rules. If you are aware of Flat OPC, it is pretty easy to process Open XML documents using XSLT.
Some time ago, I wrote a post on an approach for using LINQ to XML annotations for doing this type of transform. In that post, I was proving out whether you could write document-centric transforms using LINQ to XML in a style similar to XSLT. It’s easy to read the code to specify the transform if you read LINQ code easily, but there are obvious problems with the approach, not the least of which is that annotating a tree in that fashion might have performance issues if you are working with too large a tree.
Even though Open XML and XHTML have similar structures, there are places where the structures are not parallel, and in those cases, you still must jump through hoops. In XSLT, this often means generating intermediate trees to use in subsequent transforms. I’ve seen XSLT transforms where the first thing the developer did was to transform the tree to a new tree with new attributes on elements – the purpose of the attributes was to aid further transforms. If using the LINQ to XML approach that uses annotations, you must deal with the same issues – parts of the transformation are expressed as nice mappings between a pattern that matches nodes and the subsequent transform of those nodes, and parts of the transformation deal with abstractions that often must be explained in comments. It’s just more complicated to do these types of transformations.
Open XML Document-Centric Transforms
There are lots of examples of interesting common-vocabulary document-centric Open XML transforms. Removing comments and accepting revisions are two, but there are many others.
Because I know the size of documents that I potentially need to process (>2 million nodes), I rejected the annotations approach for simple transforms of Open XML documents. For performance reasons, it just wouldn’t fly.
I also rejected using XSLT – I really don’t want to step out into another language. XSLT is an attractive approach if you already have an XSLT transform written, or if you are particularly fluent in XSLT. You must deal with converting the OPC (Zip) file to the Flat OPC format, but this is easy. But when I’m writing little examples that show how to do something interesting in Open XML, XSLT isn’t appropriate.
So, for instance, for the code to accept tracked revisions, I opted for the tree-modification approach. This isn’t ideal from a functional programming purist’s point of view, but it performs well in the real world. You have to be careful when coding, but no big deal.
Recursive Approach to Transforms
Lately, I’ve been writing more of these types of transforms in a recursive style. The gist of this technique is that you write a recursive function to clone a tree, and while cloning, you trim nodes, or transform nodes, or whatever.
This approach has good performance, and it is appealing in that when you are writing a more complicated recursive transform, you can write it in terms of other simpler recursive transforms. The code should be written with no side effects, and if so, transforms are easy to write and debug.
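To make the technique concrete, here is a minimal sketch of a clone-and-trim transform. It is an illustration under assumptions, not code from any of my other posts: the names TrimTransform and CloneWithout are hypothetical, and the set of element names to drop is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TrimTransform
{
    // Returns a copy of the node, or null if the node should be trimmed.
    // Null content is ignored by the XElement constructor, so returning
    // null is all it takes to drop a node from the new tree.
    public static object CloneWithout(XNode node, HashSet<XName> namesToTrim)
    {
        XElement element = node as XElement;
        if (element == null)
            return node;   // text, comments, etc.: copied when added to the new parent
        if (namesToTrim.Contains(element.Name))
            return null;   // trim this element and everything below it
        return new XElement(element.Name,
            element.Attributes(),
            element.Nodes().Select(n => CloneWithout(n, namesToTrim)));
    }
}
```

The source tree is never modified. Nodes and attributes that already have a parent are copied when they are added to the new element, so the result is an independent tree.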
This approach has a drawback. It’s somewhat harder to intuitively see the mapping between the pattern that matches a node, and the transform for that pattern. However, we don’t lose this entirely. For example, we may want to write a recursive transform of an XHTML document to Open XML. Here is the XHTML document:
<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>
We want to transform it to this document:
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>def</w:t>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
We can write the recursive transform like this:
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static object XHtmlToOpenXml(XNode node)
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        XElement element = node as XElement;
        if (element != null)
        {
            // html => w:document (head has no counterpart, so it falls through and is dropped)
            if (element.Name == "html")
                return new XElement(w + "document",
                    new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
                    element.Elements().Select(e => XHtmlToOpenXml(e)));
            // body => w:body
            if (element.Name == "body")
                return new XElement(w + "body",
                    element.Elements().Select(e => XHtmlToOpenXml(e)));
            // p => w:p; use Nodes() so that text nodes are transformed too
            if (element.Name == "p")
                return new XElement(w + "p",
                    element.Nodes().Select(n => XHtmlToOpenXml(n)));
            // b => a run with bold run properties
            if (element.Name == "b")
                return new XElement(w + "r",
                    new XElement(w + "rPr",
                        new XElement(w + "b")),
                    new XElement(w + "t",
                        element.Value));
        }
        // a text node becomes a plain run
        XText t = node as XText;
        if (t != null)
            return new XElement(w + "r",
                new XElement(w + "t", t.Value));
        // ignore all other nodes
        return null;
    }

    static void Main(string[] args)
    {
        XElement root = XElement.Parse(
@"<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>");
        Console.WriteLine(XHtmlToOpenXml(root));
    }
}
In the above transform, each test on the element name (for example, element.Name == "html") serves the same purpose as the XPath pattern to match in an XSLT template. The new XElement(...) expression returned after each test is the “sequence constructor”. The expressions element.Nodes().Select(n => XHtmlToOpenXml(n)) and element.Elements().Select(e => XHtmlToOpenXml(e)) serve the same purpose as the xsl:apply-templates element in an XSLT template.
The key to understanding this transform is that we control which nodes are passed back through this method. We can pass every child node or element, or we can trim descendant nodes and send only a subset back through it.
I initially started talking about this approach in a post that described manually cloning XML trees. The code is short and easy to understand.
I used this approach for code to normalize an XML tree. It performs well. Of the approaches that I could have taken for coding the sample, it was by far the easiest.
I also used this approach for the code to split runs in paragraphs. Again, it was the easiest way for me to write the code.
This certainly isn’t the last word. This is what has been on my mind lately, so I wanted to blog it before I forgot about it.
I’m fascinated by XML document transformation, primarily because of the power it gives me. The ability to spin out an Open XML document in a couple hundred lines of code opens up a lot of interesting scenarios. Generating documents server-side in SharePoint or a web application allows us to use documents to make it easier for people to communicate. Document-centric transforms are key in these scenarios.
Peter Galli, Senior Product Manager at Microsoft, has posted that Microsoft will be applying the Community Promise to the ECMA 334 and ECMA 335 specs. In his post, he says, "It is important to note that, under the Community Promise, anyone can freely implement these specifications with their technology, code, and solutions." Very cool.
Last January, I blogged about an approach to normalizing LINQ to XML trees. That post is based on another post, Manually Cloning LINQ to XML Trees. In those posts, my code to clone an element would clone a self-closing element (<Tag/>) as self-closing, and an empty element with a start and end tag (<Tag></Tag>) as an element with start and end tag.
But, in fact, this was not necessary – empty elements can always be serialized as self-closing elements – the XML specification states, “The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.”
Further, per the specification, “the empty-element tag SHOULD be used, and SHOULD only be used, for elements which are declared EMPTY”. This means that it’s always safe to serialize an empty element as a self-closing element, but sometimes it’s not correct to serialize an empty element with a start and end tag.
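For context, LINQ to XML does track the difference between the two forms: an element with no child nodes at all serializes as a self-closing tag, while an element whose only content is an empty string serializes with a start and end tag. The sketch below just demonstrates that behavior. The class and method names (EmptyElementDemo, Serialize) are made up for the example.

```csharp
using System;
using System.Xml.Linq;

public static class EmptyElementDemo
{
    // Returns the serialized form of two empty elements that differ
    // only in whether they were given an empty string as content.
    public static string[] Serialize()
    {
        XElement noContent = new XElement("Tag");        // IsEmpty is true
        XElement emptyString = new XElement("Tag", "");  // IsEmpty is false
        return new[]
        {
            noContent.ToString(),    // <Tag />
            emptyString.ToString()   // <Tag></Tag>
        };
    }
}
```

This is the distinction that the trailing "" argument in the original cloning code was preserving.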
Originally, the code to clone an element looked like this:
static XElement CloneElement(XElement element)
{
    return new XElement(element.Name,
        element.Attributes(),
        element.Nodes().Select(n =>
        {
            XElement e = n as XElement;
            if (e != null)
                return CloneElement(e);
            return n;
        }),
        // add an empty string so that <Tag></Tag> is not serialized as <Tag />
        (!element.IsEmpty && !element.Nodes().OfType<XText>().Any()) ? "" : null
    );
}
I’ve revised both of the above referenced posts to remove the code to exactly serialize empty elements as they were in the source document. The new code looks like this:
static XElement CloneElement(XElement element)
{
    return new XElement(element.Name,
        element.Attributes(),
        element.Nodes().Select(n =>
        {
            XElement e = n as XElement;
            if (e != null)
                return CloneElement(e);
            return n;
        })
    );
}

static void Main(string[] args)
{
    XElement root = XElement.Parse("<Root></Root>");
    Console.WriteLine("Original tree");
    Console.WriteLine(root);
    Console.WriteLine();
    Console.WriteLine("Cloned tree");
    XElement rootClone = CloneElement(root);
    Console.WriteLine(rootClone);
}
The code is simpler and more correct. When you run this example, it produces:
Original tree
<Root></Root>
Cloned tree
<Root />
Thanks to Sean Hederman for pointing this out.
Sometimes we want to compare two word processing documents to see if they contain the same content. I’m working on a blog post to merge comments from multiple Open XML documents into a single document. This is based on a feature in Word 2007 that allows you to lock a document and prevent changes to content, yet allows users to add comments to the document. However, we don’t want to attempt to merge comments if the documents don’t contain the same content.
One Approach to Comparing Two Documents for Equivalency
If two documents contain exactly the same content, they will have the same number of paragraphs, tables, content controls, and custom XML markup, and more, and these elements will occur in the same order, and have the same content. However, two paragraphs may contain the same content yet their XML representation may be very different if one has a comment and the other does not – the paragraph with the comment may have its runs split differently. I’ve written a previous post that examines run splitting in detail, and contains a method to report where the run splits are, and a method that splits runs based on a list of split locations.
The following markup shows a very simple paragraph. We can see the paragraph element, the run element, and the text element.
<w:p>
<w:r>
<w:t>abcdefghi</w:t>
</w:r>
</w:p>
If we select “def” in the above text, and add a comment, the markup changes to look like this:
<w:p>
<w:r>
<w:t>abc</w:t>
</w:r>
<w:commentRangeStart w:id="0"/>
<w:r>
<w:t>def</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference"/>
</w:rPr>
<w:commentReference w:id="0"/>
</w:r>
<w:r>
<w:t>ghi</w:t>
</w:r>
</w:p>
We can write a query that returns a collection of a very specific subset of the elements in the XML document. This is the subset of elements that won’t change if the contents of the document don’t change. This query consists of all elements in the document except:
- w:commentRangeStart and w:commentRangeEnd – these elements will be added when the user adds comments to a document. Most commonly, these elements occur under the paragraph element (to be trimmed, see below), but it’s valid for these elements to be children of the body element, so we should trim them.
- w:proofErr – this element is added automatically by Word when there are spelling or grammar errors, and has no effect on content. Word can (and will) add this element even though the document is locked for editing with the exception of being able to add comments. Therefore, we want to trim this element from the collection returned by a query that we’re going to use to determine document equivalency.
- Finally, we want to eliminate all of the descendants of paragraphs from the query, as these elements can change quite a bit even if the contents of the document don’t change. Instead, we want to write a bit of code to determine whether two paragraphs are equivalent.
Here is the query that returns a collection of the elements that we’re interested in:
XDocument xDoc1 = doc1.MainDocumentPart.GetXDocument();
var doc1Elements = xDoc1
    .Descendants()
    .Where(e => e.Name != W.commentRangeStart &&
        e.Name != W.commentRangeEnd &&
        e.Name != W.proofErr &&
        !e.Ancestors(W.p).Any());
We can query two word processing documents, and if the elements in the returned collection are not in the exact same order, then the documents are different. And if corresponding paragraphs contain the same content, per whatever algorithm that we define, then we can say that the documents contain the same content. In the example that I present in this post, I validate paragraph equivalency by checking actual textual content, disregarding formatting changes for runs within the paragraph. In my case, this is good enough, as the transformation that I wrote (and will present in an upcoming post) that moves comments from one document to another will work properly if the paragraphs have the same text.
The above query works just fine for documents that contain tables, content controls, and custom XML. The markup for tables, content controls, and custom XML contains paragraphs, and the paragraphs will be in the same order and have the same content if the documents are equivalent.
We could change this query easily enough to define document equality in just about any way we want. If we want to disregard bookmarks, it’s easy enough to remove them from the results of the query.
The above query is not the most efficient way to do this – more efficient would be to write an iterator that goes through the Descendants axis and trims appropriately. But queries show intent in a better way, and in my informal testing, the above query performs well enough as is for many scenarios.
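For comparison, here is a rough sketch of what such an iterator might look like. It is not code that I have tuned or measured. The names TrimmedAxis and DescendantsTrimmed are hypothetical, and the paragraph name and the set of names to skip are passed in by the caller rather than taken from the W class. The point is that it never walks into the content of a paragraph, where the query above visits every descendant and then checks its ancestors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TrimmedAxis
{
    // Yields descendant elements in document order, omitting elements whose
    // names are in 'skip' and never descending into paragraph elements.
    public static IEnumerable<XElement> DescendantsTrimmed(
        XContainer container, XName paragraphName, HashSet<XName> skip)
    {
        foreach (XElement child in container.Elements())
        {
            if (!skip.Contains(child.Name))
                yield return child;
            if (child.Name == paragraphName)
                continue;   // the paragraph itself is returned, but not its content
            foreach (XElement descendant in DescendantsTrimmed(child, paragraphName, skip))
                yield return descendant;
        }
    }
}
```

Passing the XDocument as the container returns the root element first, just as Descendants() on the document does, so the results should line up with the query above.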
Now that we’ve defined the query that will return the elements that won’t change if the document content doesn’t change, we can define another query that determines if two queries, evaluated on two Open XML documents, contain the same items, and in the same order. This is a job for the Zip extension method, coming with .NET Framework 4.0.
The Zip Extension Method
The Zip extension method processes two sequences, matching up each item in one sequence with a corresponding item in another sequence. While this method won’t be part of the framework until .NET Framework 4.0, a simple implementation that we can use with C# 3.0 is trivial:
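// Pairs up corresponding items and stops at the end of the shorter sequence.
// On .NET Framework 4.0 and later, Enumerable.Zip is built in.
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> func)
{
    // dispose both enumerators when iteration ends
    using (var ie1 = first.GetEnumerator())
    using (var ie2 = second.GetEnumerator())
    {
        while (ie1.MoveNext() && ie2.MoveNext())
            yield return func(ie1.Current, ie2.Current);
    }
}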
Note: Bart De Smet has a great explanation of the Zip extension method, as well as this example of the implementation of it. That post also has a good explanation of how iterators work, using IL to explain them. With regard to iterators, it’s also useful to read section 8.14 in the C# 3.0 specification.
Using the Zip Extension Method
If we have one sequence that contains names, and another sequence that contains ages, and we know that the two sequences contain corresponding elements, we can project a new collection of anonymous objects:
string[] names = new[] { "Jim", "Bob", "Susan" };
int[] ages = new[] { 50, 35, 41 };
var q = names.Zip(ages, (name, age) => new
{
    Name = name,
    Age = age
});
foreach (var item in q)
    Console.WriteLine(item);
When you run this example, you see:
{ Name = Jim, Age = 50 }
{ Name = Bob, Age = 35 }
{ Name = Susan, Age = 41 }
Notice that for the projection, we write a lambda expression that takes two arguments – each pair of corresponding items from the two source collections is passed as arguments to the lambda expression.
The following query uses the Zip extension method to project a collection of Booleans indicating if the element or paragraph is equivalent:
IEnumerable<bool> correspondingElementEquivalency = doc1Elements.Zip(doc2Elements, (e1, e2) =>
{
    if (e1.Name != e2.Name)
        return false;
    // determine if two paragraphs contain the same content
    if (e1.Name == W.p && (GetParagraphText(e1) != GetParagraphText(e2)))
        return false;
    return true;
});
GetParagraphText is defined as:
// return the text of a paragraph with revisions accepted
public static string GetParagraphText(XElement p)
{
    return p.Descendants(W.r)
        .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom)
        .Descendants(W.t)
        .Select(t => (string)t)
        .StringConcatenate();
}
So then, we can use the Any extension method to determine if the documents are equivalent:
return ! correspondingElementEquivalency.Any(e => e != true);
This will be pretty efficient, as it uses lazy evaluation, and the Any extension method will terminate processing as soon as the code determines that the documents are different.
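One caveat is worth noting: Zip stops at the end of the shorter sequence, so if one document's element list is a strict prefix of the other's, every compared pair matches and the leftover elements are never looked at. Below is a minimal sketch of a length-aware alternative. The names SequenceCompare and PairwiseAll are invented for this example. It keeps the early exit on the first difference.

```csharp
using System;
using System.Collections.Generic;

public static class SequenceCompare
{
    // True only if both sequences have the same number of items and every
    // corresponding pair satisfies 'equivalent'. Stops at the first mismatch.
    public static bool PairwiseAll<T>(
        IEnumerable<T> first,
        IEnumerable<T> second,
        Func<T, T, bool> equivalent)
    {
        using (IEnumerator<T> e1 = first.GetEnumerator())
        using (IEnumerator<T> e2 = second.GetEnumerator())
        {
            while (true)
            {
                bool has1 = e1.MoveNext();
                bool has2 = e2.MoveNext();
                if (has1 != has2)
                    return false;   // one sequence ended before the other
                if (!has1)
                    return true;    // both ended together
                if (!equivalent(e1.Current, e2.Current))
                    return false;
            }
        }
    }
}
```

The lambda passed to Zip in the query above could be passed here instead, returning a single Boolean rather than a sequence of them.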
One Final Note
This code doesn’t process math markup – if the two documents contain a math formula, and one of the documents is commented, then this query will report that the documents differ. The structure and approach to take are exactly parallel to the approach that I take with comments in regular paragraphs. Extending this to math markup is another post.
The Code
Following is an example that compares two documents to determine if they have the same content. The code is quite short. Note this uses the Open XML SDK.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class Extensions
{
    // load the part into an XDocument once, and cache it as an annotation on the part
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    public static string StringConcatenate(this IEnumerable<string> source)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }

    // stops at the end of the shorter sequence
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
        this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        // dispose both enumerators when iteration ends
        using (var ie1 = first.GetEnumerator())
        using (var ie2 = second.GetEnumerator())
        {
            while (ie1.MoveNext() && ie2.MoveNext())
                yield return func(ie1.Current, ie2.Current);
        }
    }
}

public static class W
{
    public static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    public static XName p = w + "p";
    public static XName r = w + "r";
    public static XName t = w + "t";
    public static XName commentRangeStart = w + "commentRangeStart";
    public static XName commentRangeEnd = w + "commentRangeEnd";
    public static XName proofErr = w + "proofErr";
    public static XName del = w + "del";
    public static XName moveFrom = w + "moveFrom";
}

class Program
{
    // return the text of a paragraph with revisions accepted
    public static string GetParagraphText(XElement p)
    {
        return p.Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom)
            .Descendants(W.t)
            .Select(t => (string)t)
            .StringConcatenate();
    }

    // returns true if the documents contain the same content, otherwise false
    private static bool CompareDocuments(WordprocessingDocument doc1,
        WordprocessingDocument doc2)
    {
        XDocument xDoc1 = doc1.MainDocumentPart.GetXDocument();
        XDocument xDoc2 = doc2.MainDocumentPart.GetXDocument();
        var doc1Elements = xDoc1
            .Descendants()
            .Where(e => e.Name != W.commentRangeStart &&
                e.Name != W.commentRangeEnd &&
                e.Name != W.proofErr &&
                !e.Ancestors(W.p).Any());
        var doc2Elements = xDoc2
            .Descendants()
            .Where(e => e.Name != W.commentRangeStart &&
                e.Name != W.commentRangeEnd &&
                e.Name != W.proofErr &&
                !e.Ancestors(W.p).Any());
        IEnumerable<bool> correspondingElementEquivalency = doc1Elements
            .Zip(doc2Elements, (e1, e2) =>
            {
                if (e1.Name != e2.Name)
                    return false;
                // determine if two paragraphs contain the same content
                if (e1.Name == W.p && (GetParagraphText(e1) != GetParagraphText(e2)))
                    return false;
                return true;
            });
        return ! correspondingElementEquivalency.Any(e => e != true);
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc1 = WordprocessingDocument.Open("Test3a.docx", false))
        using (WordprocessingDocument doc2 = WordprocessingDocument.Open("Test3b.docx", false))
        {
            bool same = CompareDocuments(doc1, doc2);
            Console.WriteLine(same);
        }
    }
}
The interoperability team here at Microsoft has posted about a C# SourceForge open source project that converts from binary documents to Open XML. The blog post indicates that the code works with Mono, so it provides some level of portability across operating systems. The blog post also has a good explanation about the architecture of the project. There’s more good information on the SourceForge site. The Developer’s Corner on the SourceForge site has a number of good links to Open XML resources – the binary file format specs, and Open XML specs, and the Implementation Notes site.
What caught my eye are the well-written papers that contain detailed information about some of the issues involved in conversion. Of particular note is the guide that provides a very nice explanation of Freeform Shapes in the Office Drawing Format.
Also:
How to Retrieve Text from a Binary .doc File
A Guide to Table Formatting
The Storage of Macros and OLE Objects
Frank Rice published an article on MSDN: Programmatically Update Multiple External Data Connections in Excel 2007 by Using Open XML
Hadley Pettigrew published an article on OpenXmlDeveloper.org: Use Ruby on Rails to modify an Open XML Document
Wouter van Vugt blogged on copying a chart from a spreadsheet to a presentation.
After posting on splitting runs in word processing documents, I realized that I had neglected to account for content controls and custom XML in the transform. I’ve updated that post with the corrected code. I also included an explanation of the markup that requires a recursive approach.
(July 1, 2009 - Updated TransformRun to be recursive)
In Open XML word processing document markup, paragraphs contain runs, and runs contain text elements. Sometimes when transforming a document, we may want to split runs differently than in the original document. This post presents a couple of small functions that help us deal with paragraphs and runs – one to determine the split locations of runs, and one to split runs.
Word 2007 has a neat feature where you can lock a document and disallow editing of the content, yet allow the user to add comments. You can send this document for review to a number of users, and after the reviewers return the documents, it would be handy to have some code that merges comments from all documents into a single document. I’m currently working on a blog post that shows how to do this. However, adding a comment to a paragraph can cause runs to be split, which adds a bit of complexity.
Paragraphs, Runs, and Text Elements
The following markup shows a very simple paragraph. We can see the paragraph element, the run element, and the text element.
<w:p>
<w:r>
<w:t>abcdefghi</w:t>
</w:r>
</w:p>
If we select “def” in the above text, and add a comment, the markup changes to look like this:
<w:p>
<w:r>
<w:t>abc</w:t>
</w:r>
<w:commentRangeStart w:id="0"/>
<w:r>
<w:t>def</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference"/>
</w:rPr>
<w:commentReference w:id="0"/>
</w:r>
<w:r>
<w:t>ghi</w:t>
</w:r>
</w:p>
In this paragraph, we can see the commentRangeStart and commentRangeEnd elements. In addition, we can see a special run that contains information on the styling of the text that is commented. This special run contains a commentReference element.
If we want to programmatically insert a comment into a document, we need to split runs as appropriate so that we can insert commentRangeStart, commentRangeEnd, and the special run that contains commentReference into the paragraph.
Note that a paragraph can be split into runs for a variety of reasons, and that there are a number of other valid child elements of the paragraph element. For example, because the above text isn’t a correctly spelled word, and isn’t a sentence with proper grammar, the markup can include w:proofErr elements:
<w:p>
<w:proofErr w:type="spellStart"/>
<w:proofErr w:type="gramStart"/>
<w:r>
<w:t>abc</w:t>
</w:r>
<w:commentRangeStart w:id="0"/>
<w:r>
<w:t>def</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
<w:proofErr w:type="gramEnd"/>
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference"/>
</w:rPr>
<w:commentReference w:id="0"/>
</w:r>
<w:r>
<w:t>ghi</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
When splitting runs, we want to honor those existing run splits, and make sure that we don’t disturb those other elements.
As Open XML developers know, content controls and custom XML markup are very powerful features of Open XML. They enable a vast number of scenarios – we can make our documents smarter. However, they add an interesting twist to markup. The element for content controls is w:sdt, which contains another element, w:sdtContent, which contains the contents. This means that runs that we potentially want to split occur at different levels of the XML hierarchy:
<w:p>
<w:r>
<w:t>123</w:t>
</w:r>
<w:sdt>
<w:sdtContent>
<w:r>
<w:t>4567</w:t>
</w:r>
</w:sdtContent>
</w:sdt>
<w:r>
<w:t>890</w:t>
</w:r>
</w:p>
Custom XML markup has the same issue. The following schema defines some custom XML markup:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified"
elementFormDefault="qualified"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element name="Child"
type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
When we use this custom schema to add structure to a document, the markup looks like this:
<w:p>
<w:customXml w:uri="http://northwind.com"
w:element="Root">
<w:r>
<w:t>12</w:t>
</w:r>
<w:customXml w:uri="http://northwind.com"
w:element="Child">
<w:r>
<w:t>34</w:t>
</w:r>
</w:customXml>
<w:r>
<w:t>56</w:t>
</w:r>
</w:customXml>
<w:r>
<w:t>7890</w:t>
</w:r>
</w:p>
We may need to split runs at any level - as a child of the paragraph, as content in a content control, or within custom XML markup. A recursive transform handles this nicely, because the same function processes runs wherever they occur in the hierarchy.
Determining Run Split Locations
The first piece of functionality that we need is a method to return an array of integers indicating where run splits are. If we are moving comments from one document to another, then we want to find out where the run splits are in the source document so that we can create the same run splits in the destination document.
Here’s the prototype of a simple method to do so:
static int[] RunSplitLocations(XElement paragraph)
The following paragraph markup contains three runs of text (a fourth run holds only the comment reference and contains no text):
<w:p>
<w:r>
<w:t>abc</w:t>
</w:r>
<w:commentRangeStart w:id="0"/>
<w:r>
<w:t>def</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference"/>
</w:rPr>
<w:commentReference w:id="0"/>
</w:r>
<w:r>
<w:t>ghi</w:t>
</w:r>
</w:p>
If we call RunSplitLocations for this paragraph, it returns an array that contains:
0
3
6
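To make the arithmetic behind those three values concrete, here is a minimal sketch that computes the same offsets with a running total. It is an imperative alternative written for illustration, not the RunSplitLocations implementation from the attached code; the names RunningOffsetsSketch, SplitLocations, and SampleParagraph are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RunningOffsetsSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Each text run starts where the text of the previous text runs ends.
    public static int[] SplitLocations(XElement paragraph)
    {
        var locations = new List<int>();
        int offset = 0;
        foreach (XElement run in paragraph.Descendants(W + "r"))
        {
            // skip deleted runs, moved-from runs, and runs without w:t
            bool skipped = run.Parent.Name == W + "del"
                || run.Parent.Name == W + "moveFrom"
                || !run.Descendants(W + "t").Any();
            if (skipped)
                continue;
            locations.Add(offset);
            offset += run.Descendants(W + "t").Sum(t => ((string)t).Length);
        }
        return locations.ToArray();
    }

    // The commented paragraph from the example above.
    public static XElement SampleParagraph()
    {
        return XElement.Parse(
@"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:r><w:t>abc</w:t></w:r>
  <w:commentRangeStart w:id='0'/>
  <w:r><w:t>def</w:t></w:r>
  <w:commentRangeEnd w:id='0'/>
  <w:r>
    <w:rPr><w:rStyle w:val='CommentReference'/></w:rPr>
    <w:commentReference w:id='0'/>
  </w:r>
  <w:r><w:t>ghi</w:t></w:r>
</w:p>");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", SplitLocations(SampleParagraph())));
    }
}
```

The run that holds only the comment reference has no w:t element, so it contributes no split location and no length.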
Splitting Runs
If we have another document that contains no comments in this paragraph, and we want to split runs so that we can insert a comment on the middle three characters, we can call another method that takes an array of integers to do the splitting:
public static XElement SplitRunsInParagraph(XElement p, int[] positions)
If we have a paragraph with this markup:
<w:p>
<w:r>
<w:t>abcdefghi</w:t>
</w:r>
</w:p>
And we call SplitRunsInParagraph passing an array that contains 0, 3, and 6, it returns a paragraph that looks like this:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:r>
<w:t>abc</w:t>
</w:r>
<w:r>
<w:t>def</w:t>
</w:r>
<w:r>
<w:t>ghi</w:t>
</w:r>
</w:p>
As I previously mentioned, the paragraph may contain child elements other than runs. SplitRunsInParagraph will leave those other elements in place. Also, a run can contain styling information, which we also want to leave in place.
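The following sketch shows the idea of keeping run styling in place for a single run: every new run receives a copy of the original run's non-text children (such as w:rPr) plus its slice of the text. It is a simplified illustration under stated assumptions, not the SplitRunsInParagraph code itself: the name SplitRun is hypothetical, positions are relative to the run and must start at 0 in ascending order, and the xml:space="preserve" handling is left out.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SplitRunSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Split one run at the given positions (relative to the run, ascending,
    // starting at 0). Each new run keeps the original run properties.
    public static XElement[] SplitRun(XElement run, int[] positions)
    {
        string text = string.Concat(run.Elements(W + "t").Select(t => (string)t));
        return positions
            .Select((p, i) => i < positions.Length - 1
                ? text.Substring(p, positions[i + 1] - p)
                : text.Substring(p))
            .Select(s => new XElement(W + "r",
                // already-parented elements are cloned when added to a new tree
                run.Elements().Where(e => e.Name != W + "t"),
                new XElement(W + "t", s)))
            .ToArray();
    }

    // A bold run containing "abcdefghi".
    public static XElement SampleRun()
    {
        return new XElement(W + "r",
            new XElement(W + "rPr", new XElement(W + "b")),
            new XElement(W + "t", "abcdefghi"));
    }

    public static void Main()
    {
        foreach (XElement r in SplitRun(SampleRun(), new[] { 0, 3, 6 }))
            Console.WriteLine("{0} (has rPr: {1})",
                (string)r.Element(W + "t"), r.Element(W + "rPr") != null);
    }
}
```

Because LINQ to XML clones elements that already have a parent, the original run is left untouched, which fits the side-effect-free style used throughout this post.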
Now that we have some methods to determine where run splits are, and to create run splits, it will be pretty simple to write a pure functional transform to move comments from one document to another (if the documents contain the exact same content, with the exception of comments).
The Code
The following example contains RunSplitLocations and SplitRunsInParagraph. This code uses a node cloning technique similar to what I presented in this post. In addition, the code uses the pre-atomization approach that I showed in this post. This code implements a pure functional transformation - no side effects anywhere, which will make it easy to use when writing the next transformation.
Here’s the code (also attached):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
public static class Extensions
{
public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader streamReader = new StreamReader(part.GetStream()))
xdoc = XDocument.Load(XmlReader.Create(streamReader));
part.AddAnnotation(xdoc);
return xdoc;
}
public static string StringConcatenate(this IEnumerable<string> source)
{
StringBuilder sb = new StringBuilder();
foreach (string s in source)
sb.Append(s);
return sb.ToString();
}
}
public static class W
{
public static XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
public static XName t = w + "t";
public static XName r = w + "r";
public static XName del = w + "del";
public static XName body = w + "body";
public static XName p = w + "p";
public static XName moveFrom = w + "moveFrom";
}
class Program
{
static int GetRunLength(XElement e)
{
return e
.Descendants(W.t)
.Select(t => (string)t)
.StringConcatenate()
.Length;
}
// return the run split locations for all runs in the paragraph
static int[] RunSplitLocations(XElement paragraph)
{
// find the runs that don't have w:del or w:moveFrom as parent elements
var runElements = paragraph
.Descendants(W.r)
.Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
e.Descendants(W.t).Any());
// determine the run length of each run
var runs = runElements
.Select(r => new
{
RunElement = r,
RunLength = GetRunLength(r)
});
// determine the split locations
var runSplits = runs
.Select(r => runs
.TakeWhile(a => a.RunElement != r.RunElement)
.Select(z => z.RunLength)
.Sum());
return runSplits.ToArray();
}
// if value starts or ends with a space, return xml:space="preserve" attribute
// else return null
static XAttribute XmlSpacePreserved(string value)
{
if (value.Substring(0, 1) == " " || value.Substring(value.Length - 1) == " ")
return new XAttribute(XNamespace.Xml + "space", "preserve");
else
return null;
}
private class RunSplits
{
public XElement RunElement { get; set; }
public int RunLength { get; set; }
public int RunLocation { get; set; }
}
private static object RunTransform(XElement element,
int[] positions, IEnumerable<RunSplits> runSplits)
{
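// split runs that have child text elements; runs inside w:del or w:moveFrom
// are not in runSplits, so they fall through to the cloning code below
if (element.Name == W.r && element.Descendants(W.t).Any() &&
element.Parent.Name != W.del && element.Parent.Name != W.moveFrom)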
{
// get text of run
string text = element
.Descendants(W.t)
.Select(t => (string)t).StringConcatenate();
// find run in runSplits
RunSplits rs = runSplits.First(r => r.RunElement == element);
// find list of splits in this run
var splitsInThisRun = positions
.Where(p => p >= rs.RunLocation && p < rs.RunLocation + rs.RunLength);
// adjust splits so that split locations are relative to this run instead of
// relative to the beginning of the paragraph
var splitsIntext = splitsInThisRun
.Select(p => p - rs.RunLocation)
.ToArray();
// project collection of strings that will be in the new, split runs
var splitText = splitsIntext
.Select((p, i) =>
i != splitsIntext.Length - 1 ?
text.Substring(p, splitsIntext[i + 1] - p) :
text.Substring(p)
);
// project collection of runs that will replace the original run
return splitText.Select(r =>
new XElement(W.r,
rs.RunElement.Elements().Where(e => e.Name != W.t),
new XElement(W.t,
XmlSpacePreserved(r),
r)));
}
// clone elements other than runs
// must be recursive to handle custom XML markup and content controls
return new XElement(element.Name,
element.Attributes(),
element.Nodes().Select(n =>
{
XElement e = n as XElement;
if (e != null)
return RunTransform(e, positions, runSplits);
return n;
})
);
}
public static XElement SplitRunsInParagraph(XElement p, int[] positions)
{
// find the runs that don't have w:del or w:moveFrom as parent elements
var runElements = p
.Descendants(W.r)
.Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
e.Descendants(W.t).Any());
// calculate the run length of each run
var runs = runElements
.Select(r => new
{
RunElement = r,
RunLength = GetRunLength(r)
});
// calculate the location of each split
var runSplits = runs
.Select(r => new RunSplits
{
RunElement = r.RunElement,
RunLength = r.RunLength,
RunLocation = runs
.TakeWhile(a => a.RunElement != r.RunElement)
.Select(z => z.RunLength)
.Sum()
});
// the positions argument contains a list of locations where splits will be added
// to the paragraph. In addition, runs may already be split at various places, and
// we want those splits to remain, so we need to create the complete list of
// locations where we want run splits.
// create ordered union of desired splits and existing splits
int[] allSplits = runSplits
.Select(rs => rs.RunLocation)
.Concat(positions)
.OrderBy(s => s)
.Distinct()
.ToArray();
// transform the paragraph to a new paragraph with new splits in runs
return new XElement(W.p,
p.Elements().Select(e => RunTransform(e, allSplits, runSplits))
);
}
static void Main(string[] args)
{
using (WordprocessingDocument doc1 =
WordprocessingDocument.Open("Test.docx", true))
{
XDocument doc = doc1.MainDocumentPart.GetXDocument();
XElement p = doc.Root.Element(W.body).Element(W.p);
//XElement newPara = SplitRunsInParagraph(p, new[] { 12, 15 });
XElement newPara = SplitRunsInParagraph(p, new[] { 10 });
Console.WriteLine(newPara);
}
}
}
(Update June 25, 2009 - fixed bugs in event handlers associated with deleting last node and inserting node at beginning of list)
Occasionally I need to query LINQ to XML nodes in reverse document order. I’m currently writing some LINQ to XML queries over Open XML documents where I need to select paragraph nodes based on content in the immediately preceding paragraph. However, nodes in LINQ to XML are forward-linked only. We can see evidence of this in the XNode.NodesBeforeSelf and the XElement.ElementsBeforeSelf methods - these methods return collections of nodes in document order, not reverse document order. This was by design – LINQ to XML was designed to provide great performance for the vast majority of scenarios with the minimum memory footprint possible. The need to process nodes in reverse document order is rare, so the designers of LINQ to XML decided that it was more important to reduce memory footprint than to allow for good performance in the few scenarios that require processing in reverse document order, and of course it was a good decision. But the need does exist.
In my scenario (a functional transform that processes Open XML document revisions), it is possible that I would need to process 80,000 (or more) paragraphs. If we use the XNode.PreviousNode property, we won’t have acceptable performance. There is an easy work-around that provides us the ability to query in reverse document order in a way that performs well.
- We define a new class, PreviousNodeAnnotation, that contains one public field, public XNode PreviousNode;.
- We add instances of this class as annotations on nodes that we need to query in reverse document order.
In the following small example, I select nodes based on the value of the previous node, in a document of 50,000 nodes. The slow version exhibits O(n²) performance, because each access to XNode.PreviousNode has to walk the forward links from the beginning of the parent’s child list. When I increased the document size to 80,000 nodes (the size of one of my documents that I need to query), the execution time of the slow version exceeded my patience. In any case, it is clear that I can’t use XNode.PreviousNode for my scenario.
using System;
using System.Linq;
using System.Xml.Linq;
class PreviousNodeAnnotation
{
public XNode PreviousNode;
public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}
class Program
{
static int DocumentSize = 50000;
static void SlowPreviousNodeAccess()
{
// create a tree with lots of nodes
XElement root = new XElement("Root",
Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));
// query for all elements where the previous element has a value of 1000
DateTime start = DateTime.Now;
var q = root
.Elements()
.Where(e =>
{
XElement p = e.PreviousNode as XElement;
return (string)p == "1000";
});
var q2 = q.ToList(); // force iteration
TimeSpan duration = DateTime.Now - start;
Console.WriteLine(duration);
}
static void FastPreviousNodeAccess()
{
// create a tree with lots of nodes
XElement root = new XElement("Root",
Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));
// initialize previous node annotations
XElement prev = null;
foreach (var item in root.Elements())
{
item.AddAnnotation(new PreviousNodeAnnotation(prev));
prev = item;
}
// query for all elements where the previous element has a value of 1000
DateTime start = DateTime.Now;
var q = root
.Elements()
.Where(e =>
{
XElement p = e
.Annotation<PreviousNodeAnnotation>()
.PreviousNode as XElement;
return (string)p == "1000";
});
var q2 = q.ToList(); // force iteration
TimeSpan duration = DateTime.Now - start;
Console.WriteLine(duration);
}
static void Main(string[] args)
{
FastPreviousNodeAccess();
SlowPreviousNodeAccess();
}
}
On my old slow laptop, the fast and slow versions take about .015 seconds and 30 seconds, respectively.
We can expand on this technique a bit by declaring three extension methods:
public static XNode PreviousNodeFast(this XNode node)
public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)
public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)
It’s convenient to use these methods in queries.
In addition, while I’m partial to pure transforms with no side effects, we can declare two event handlers that keep the previous node annotations in sync when adding or deleting nodes. Here’s an example that includes these extension methods and event handlers:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
class PreviousNodeAnnotation
{
public XNode PreviousNode;
public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}
public static class Extensions
{
public static XNode PreviousNodeFast(this XNode node)
{
return node.Annotation<PreviousNodeAnnotation>().PreviousNode;
}
public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)
{
XNode currentNode = node;
while (true)
{
XNode prevNode = currentNode.PreviousNodeFast();
if (prevNode == null)
yield break;
else
yield return prevNode;
currentNode = prevNode;
}
}
public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)
{
return NodesBeforeSelfFast(element).OfType<XElement>();
}
}
class Program
{
static void ValidatePreviousNodes(XElement element)
{
XNode prev = null;
foreach (XNode node in element.Nodes())
{
if (node.PreviousNodeFast() != prev)
{
Console.WriteLine("ERROR: previous nodes are invalid");
Environment.Exit(1); // non-zero exit code signals the failure
}
prev = node;
}
Console.WriteLine("Validated");
}
static void Main(string[] args)
{
// create a tree with lots of nodes
XElement root = new XElement("Root",
Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));
// setup previous nodes after tree creation
XElement prev = null;
foreach (var item in root.Elements())
{
item.AddAnnotation(new PreviousNodeAnnotation(prev));
prev = item;
}
// add event handlers to take care of adding / deleting nodes
root.Changed += new EventHandler<XObjectChangeEventArgs>((o, e) =>
{
if (e.ObjectChange == XObjectChange.Add)
{
Console.WriteLine("Add");
XNode node = o as XNode;
// o could be an XAttribute, in which case it's not applicable
if (node != null)
{
node.AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));
if (node.NextNode != null)
{
node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
node.NextNode.AddAnnotation(new PreviousNodeAnnotation(node));
}
}
}
});
root.Changing += new EventHandler<XObjectChangeEventArgs>((o, e) =>
{
if (e.ObjectChange == XObjectChange.Remove)
{
Console.WriteLine("Remove");
XNode node = o as XNode;
// o could be an XAttribute, in which case it's not applicable
if (node != null)
{
if (node.NextNode != null)
{
node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
node.NextNode
.AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));
}
}
}
});
ValidatePreviousNodes(root);
root.Elements().ElementAt(3).AddAfterSelf(
new XElement("NewChild", 999)
);
ValidatePreviousNodes(root);
root.Nodes().ElementAt(2).Remove();
ValidatePreviousNodes(root);
root.Nodes().ElementAt(3).NodesBeforeSelfFast().Remove();
ValidatePreviousNodes(root);
ValidatePreviousNodes(root);
root.AddFirst(
new XElement("ANode", 1)
);
ValidatePreviousNodes(root);
root.Add(
new XElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().Last().NodesBeforeSelfFast().Remove();
ValidatePreviousNodes(root);
root.Add(
new XElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Add(
new XElement("ANode", 2)
);
root.Add(
new XElement("ANode", 2)
);
root.Add(
new XElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().First().Remove();
ValidatePreviousNodes(root);
root.Add(
new XElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().Last().Remove();
root.Add(Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));
XElement last = root.Elements().ElementAt(50);
foreach (var item in last.NodesBeforeSelfFast())
{
Console.WriteLine(item);
}
}
}
I would personally only define these extension methods in the module where I need good performance of reverse document order queries. It would be messy to have these extension methods in scope for modules that don't set up the annotations.
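One way to keep this contained is to put the annotation setup and the queries that depend on it next to each other. The following is a minimal sketch of that arrangement, assuming the tree is not modified after the annotations are added (so no event handlers are needed). The names ReverseOrderSketch, AnnotateChildren, and ValuesFollowing are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public sealed class PreviousNodeAnnotation
{
    public readonly XNode PreviousNode;
    public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}

public static class ReverseOrderSketch
{
    // Annotate every child node of parent with its preceding sibling.
    public static void AnnotateChildren(XElement parent)
    {
        XNode prev = null;
        foreach (XNode node in parent.Nodes())
        {
            node.AddAnnotation(new PreviousNodeAnnotation(prev));
            prev = node;
        }
    }

    // Values of the child elements whose previous sibling element
    // has the given value.
    public static string[] ValuesFollowing(XElement parent, string previousValue)
    {
        return parent.Elements()
            .Where(e =>
            {
                var a = e.Annotation<PreviousNodeAnnotation>();
                XElement p = a == null ? null : a.PreviousNode as XElement;
                // casting a null XElement to string yields null
                return (string)p == previousValue;
            })
            .Select(e => (string)e)
            .ToArray();
    }

    public static void Main()
    {
        XElement root = new XElement("Root",
            Enumerable.Range(0, 2000).Select(i => new XElement("Child", i)));
        AnnotateChildren(root);
        foreach (string v in ValuesFollowing(root, "1000"))
            Console.WriteLine(v);
    }
}
```

The query also tolerates nodes that were never annotated, which makes it a little safer if the helper is called on only part of a tree.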
Code is attached.
Darcy Thomas has written two interesting Open XML articles on OpenXmlDeveloper.org:
Converting a Facebook stream into an Open XML spreadsheet
This sample is an example of an app which pulls down a stream and puts it into a nicely formatted Open XML spreadsheet using PHPExcel. The spreadsheet reorganizes posts into a time log with rows showing the person’s photo, their name (with a link back to their profile), the actual message from the post and the time and date of the post.
Accessing the C# code from PowerTools for Open XML in a .NET application
This is an interesting article that shows how you can take advantage of the C# code that sits behind the PowerTools for Open XML. He recreates in C# the PowerShell application that Lawrence Hodson wrote for this article. The primary purpose of the PowerTools for Open XML isn’t to provide a C# library for developers to use. Its main purpose is to provide example code and guidance for the types of things you commonly want to do with Open XML. That said, this is an interesting article that shows how to look around inside the PowerTools source code, and to use the DocumentBuilder class (from C#) to generate a nicely formatted word processing document.
Gray Knowlton (Group Product Manager for Office) has posted news that Office Developer Conference (ODC) will not take place this year. Instead, the ODC content will be included within the SharePoint Conference.
From his post:
As you may have seen at PDC, TechEd or elsewhere, Office 2010 is on its way. To help you get ready, Office 2010 for Developers will be highlighted at the upcoming SharePoint Conference (October 2009, Las Vegas, NV) and TechEd conferences around the world in 2009 and 2010.
NET: Office Developer Conference will not take place this year; instead we are including the Office Developer Conference content within the SharePoint Conference. If you are an attendee of Office Developer Conference in the past, we strongly recommend you come see us at the SharePoint Conference in October, where we’ll cover Office client development in depth. Be sure to sign up for the Technical Preview as well!
We are optimizing our show presence for developers seeking opportunities to build on the Office platform, which includes Office client applications, SharePoint, Exchange and Communicator. By adding the ODC track to the 2009 SharePoint conference, we can provide better exposure to those seeking to develop solutions across the platform.
Open XML Developer
There are some great new articles:
Introduction to the Open XML SDK 2.0
XSLT transforming XML to Open XML using Java
OPC Team Blog
The Open Packaging Conventions (OPC) team here at Microsoft has started a new blog. They’ve started off a series of posts with Adventures in Packaging Episode 1. I see more and more opportunities to take advantage of OPC, and it’s clear that other folks do too.
LoBand and PDA Views of MSDN Content
If you haven’t seen this before, the Lo Band and PDA views of MSDN content ROCK! I spend a fair amount of time on the bus. I use the PDA view on my phone to fill in gaps in my knowledge on the .NET framework. Check out the LINQ to XML docs on your phone: http://msdn.microsoft.com/en-us/library/bb387098(pda).aspx The Library Experience team (LEX) has a great blog post that details the new views of MSDN content.
MSDN Code Search
The MSDN Code Search Preview lets you search for code in the MSDN Library, MSDN Code Gallery, and CodePlex. This is a much-needed addition to developers’ toolkits.
Doug Mahugh and a bunch of the standards crew (both in and out of Microsoft) have been having a great discussion on document format interoperability. They (and referenced posts) are worth reading.
Michael Kiselman and his crew have been publishing some great case studies on the use of Open XML. These case studies are hard evidence about the uptake of the format. I really love seeing people putting the document formats to good use. Folks are building innovative solutions that really wouldn’t be possible without using Open XML.
- Microsoft has a group (The Microsoft IT Business Intelligence Center of Excellence Core Scorecard Team) who delivers key information to more than 3000 people worldwide, sometimes in the form of automatically generated macro-enabled spreadsheets. They wanted to deliver these through Excel Services in SharePoint, but SharePoint doesn’t allow web delivery of macro-enabled spreadsheets. Well, they did something cool – before delivering from SharePoint to the user, they add (using Open XML) digitally signed VBA code that contains the necessary macros. This is an interesting read.
- Darwin Information Typing Architecture (DITA) is an XML-based solution for topic-based authoring. Content Technologies has built DITA Exchange, which uses some of the cool features of Open XML to allow people to author DITA XML in Word. They translate DITA XML to Open XML and back again.
- Savo Group provides an on-demand collaborative Sales Enablement solution called SAVO. Customers wanted documents delivered in Open XML. Using the Open XML SDK, SAVO generates customized documents for download to end users.
Look for these among the studies that you find here.
Often XML schemas allow for optional elements and attributes. When you write queries on these elements or attributes, you may be tempted to write code that does lots of testing for null. There is a better way to do this, laid out in this post. I covered this idiom in a previous post, but the main purpose of that post wasn’t to explain this idiom. I’m speaking on using LINQ with Open XML tomorrow at TechEd 2009, and need a better example.
The following XML document is a simplified variation of markup that you can find in Open XML word processing documents:
<document>
<body>
<p>
<r>
<t>Text of first para.</t>
</r>
</p>
<p>
<pPr>
<pStyle val="Heading1"/>
</pPr>
<r>
<t>Text of second para.</t>
</r>
</p>
</body>
</document>
The first paragraph doesn’t have a <pPr> element, whereas the second does. This is allowable in Open XML word processing documents. The first paragraph has the default style.
Our task is to write a query that returns the style name and the text of each paragraph. A paragraph with no style name has the default style. The code projects a collection of an anonymous type that contains the style name and the text. If the paragraph has the default style, the StyleName is set to null.
The approach where the code tests for null looks like this:
using System;
using System.Linq;
using System.Xml.Linq;
class Program
{
static string GetStyleName(XElement p)
{
XElement pPr = p.Element("pPr");
if (pPr != null)
{
XElement pStyle = pPr.Element("pStyle");
if (pStyle != null)
return (string)pStyle.Attribute("val");
}
return null;
}
static void Main(string[] args)
{
XElement root = XElement.Parse(
@"<document>
<body>
<p>
<t>Text of first para.</t>
</p>
<p>
<pPr>
<pStyle val='Heading1'/>
</pPr>
<t>Text of second para.</t>
</p>
</body>
</document>");
var paragraphs = root
.Element("body")
.Elements("p")
.Select(p => new
{
StyleName = GetStyleName(p),
Text = (string)p.Element("t")
});
foreach (var item in paragraphs)
Console.WriteLine(item);
}
}
This works just fine, and yields the expected results:
{ StyleName = , Text = Text of first para. }
{ StyleName = Heading1, Text = Text of second para. }
Beyond making the code harder to read, this approach introduces two additional points of possible failure. If I had neglected to write the code to test for null, my code would throw an exception.
There is another way to write this query, which is to use the Elements and Attributes extension methods that operate on IEnumerable<XElement>.
using System;
using System.Linq;
using System.Xml.Linq;
class Program
{
static void Main(string[] args)
{
XElement root = XElement.Parse(
@"<document>
<body>
<p>
<t>Text of first para.</t>
</p>
<p>
<pPr>
<pStyle val='Heading1'/>
</pPr>
<t>Text of second para.</t>
</p>
</body>
</document>");
var paragraphs = root
.Element("body")
.Elements("p")
.Select(p => new
{
StyleName = (string)p.Elements("pPr").Elements("pStyle")
.Attributes("val").FirstOrDefault(),
Text = (string)p.Element("t")
});
foreach (var item in paragraphs)
Console.WriteLine(item);
}
}
This also yields the same results, and doesn’t contain the two points of possible failure.
Here’s how this code works. In the snippet below, the expression p.Elements("pPr") evaluates to a collection of XElement objects. Notice that I used the Elements method, not the Element method, even though I know that there could only be zero or one <pPr> elements. The expression returns a collection of either zero or one items.
StyleName = (string)p.Elements("pPr").Elements("pStyle")
.Attributes("val").FirstOrDefault(),
The Elements extension method yields all child elements with the given name for each and every element in the source collection. In the snippet below, the call to .Elements("pStyle") will return either one XElement object (if there was a <pPr> element that contains a <pStyle> element), or an empty collection, if there wasn’t:
StyleName = (string)p.Elements("pPr").Elements("pStyle")
.Attributes("val").FirstOrDefault(),
Next, the code ‘dots’ into the Attributes extension method. Again, the Attributes extension method is happy to take a collection of elements as source. If an empty collection is passed to the Attributes extension method, it also returns an empty collection:
StyleName = (string)p.Elements("pPr").Elements("pStyle")
.Attributes("val").FirstOrDefault(),
The FirstOrDefault extension method either returns the first element in a collection, or it returns the default value for the type of items in the collection. The default value for all reference types (which XAttribute and XElement are) is null. In this case, FirstOrDefault will either return the one “val” attribute, or it will return null.
Finally we cast this one value (either null or an XAttribute) to string. String is a nullable type, of course, and the explicit conversion operator (the cast operator for XAttribute or XElement) is happy to take null, and return null. If there is no <pPr> element, StyleName will be set to null. If there is a pPr element, a pStyle element, and a val attribute, then StyleName will be set to the value of the attribute.
This is where the other nullable CLR types (int?, bool?, double?, etc.) come in handy. If we want to get the value of an optional element or attribute, and we know that the value is an integer or double, or whatever, instead of casting to string, we can cast to any of the other nullable types. The same semantics apply.
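Here is a minimal sketch of the same idiom with int? and bool? instead of string. The ind and keepNext elements and their attributes follow the simplified, namespace-free markup style used in this post and are only illustrative; the class and method names (NullableCastSketch, LeftIndent, KeepNext, SampleBody) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NullableCastSketch
{
    // Optional integer attribute: null when pPr, ind, or left is absent.
    public static int? LeftIndent(XElement p)
    {
        return (int?)p.Elements("pPr").Elements("ind")
            .Attributes("left").FirstOrDefault();
    }

    // Optional Boolean attribute: null when pPr, keepNext, or val is absent.
    public static bool? KeepNext(XElement p)
    {
        return (bool?)p.Elements("pPr").Elements("keepNext")
            .Attributes("val").FirstOrDefault();
    }

    public static XElement SampleBody()
    {
        return XElement.Parse(
@"<body>
  <p>
    <pPr>
      <ind left='720'/>
      <keepNext val='true'/>
    </pPr>
    <t>First</t>
  </p>
  <p>
    <t>Second</t>
  </p>
</body>");
    }

    public static void Main()
    {
        foreach (XElement p in SampleBody().Elements("p"))
            Console.WriteLine("{0}: indent={1}, keepNext={2}",
                (string)p.Element("t"), LeftIndent(p), KeepNext(p));
    }
}
```

For the second paragraph both methods return null rather than throwing, so the caller can decide what the default should be, for example with the ?? operator.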
Interestingly, this is also efficient. Here’s why:
- The LINQ to XML axes use deferred execution.
- FirstOrDefault starts the process of materialization, requesting the first item in the collection from the Attributes extension method.
- It, in turn, requests the first item from the Elements("pStyle") call.
- It, in turn, requests the first element from the XElement.Elements method, which finally yields up an XElement (or returns an empty collection if there isn’t one).
- This one element is yielded up to the Elements("pStyle") extension method.
- This one element is yielded up to the Attributes("val") extension method.
- The one attribute is yielded up to the FirstOrDefault extension method, which due to its semantics, “short circuits” the query, and never requests another item from its source.
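The short-circuiting described above can be observed directly. The following sketch wraps the first axis in a pass-through iterator that counts how many elements are pulled through it. The Counted wrapper, the ItemsPulled counter, and the thousand-run paragraph are all constructs invented for this demonstration, not part of LINQ to XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ShortCircuitSketch
{
    // How many elements the query pulled from the first axis.
    public static int ItemsPulled;

    // Pass-through iterator that counts each element it hands downstream.
    static IEnumerable<XElement> Counted(IEnumerable<XElement> source)
    {
        foreach (XElement e in source)
        {
            ItemsPulled++;
            yield return e;
        }
    }

    // Text of the first <t> found under the <r> children of p.
    public static string FirstText(XElement p)
    {
        ItemsPulled = 0;
        return (string)Counted(p.Elements("r")).Elements("t").FirstOrDefault();
    }

    public static void Main()
    {
        // a paragraph with 1000 runs
        XElement p = new XElement("p",
            Enumerable.Range(0, 1000).Select(i =>
                new XElement("r", new XElement("t", "run " + i))));
        Console.WriteLine(FirstText(p));
        Console.WriteLine(ItemsPulled);
    }
}
```

Although the paragraph holds a thousand runs, the counter ends at 1: FirstOrDefault stops asking for items as soon as it has its first result, so the rest of the tree is never visited.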
The net result is that this idiom is efficient, reduces points of possible failure, and is shorter. I started using it after one of the LINQ architects described this idiom; once I was using it regularly, it became quite natural, and I now mostly don’t write code in the other style. So I’m curious: do you LINQ users out there use this idiom?