Out of the Angle Brackets
In certain scenarios, it is important to be able to compare two XML trees for equivalence. For example, suppose you are writing a web service that serves results of queries, and you want to cache query results so that duplicate queries use previously cached results instead of always accessing the underlying database. The senders of those queries may be using a variety of tools to generate them, and these tools may introduce trivial differences into the XML. The intent of two queries may be identical, but XNode.DeepEquals returns false when you compare XML that contains semantically equivalent but trivially different queries.
This post describes an approach and presents code for normalizing LINQ to XML trees. After normalizing XML trees, you can call XNode.DeepEquals with a greater chance that semantically equivalent XML trees will evaluate as equivalent. The approach presented in this post makes use of the assistance that XSD can provide when normalizing XML trees. If an XML document has been validated using XSD, and the XML tree has been annotated with the Post Schema Validation Infoset (PSVI), then we can use that PSVI to help normalize the tree. Much of this post is based on the OASIS specification Schema Centric XML Canonicalization Version 1.0 (C14N), which describes a variety of ways to normalize an XML tree, including through the use of PSVI.
Let’s look at a few common cases where XNode.DeepEquals reports that equivalent XML trees are unequal. In the following example, the two trees have exactly the same content. The only difference is that one of them declares the namespace as the default namespace, and the other binds the same namespace to a prefix. XNode.DeepEquals will report that the two trees are different.
XElement root1 = XElement.Parse(
@"<Root xmlns='http://www.northwind.com'>
  <Child>1</Child>
</Root>");
XElement root2 = XElement.Parse(
@"<n:Root xmlns:n='http://www.northwind.com'>
  <n:Child>1</n:Child>
</n:Root>");
if (XNode.DeepEquals(root1, root2))
    Console.WriteLine("Equal");
else
    Console.WriteLine("Not Equal");
Here’s another case where XNode.DeepEquals returns false:
<Root a='1' b='2'/>
<Root b='2' a='1'/>
These two elements are equivalent, but their attributes appear in a different order. Because an element can’t have two attributes with the same qualified name, we can sort the attributes by namespace and name, and then compare.
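As a minimal sketch of the attribute-ordering idea above, the following program clones an element with its attributes sorted by namespace and then by local name. The class name AttributeOrderDemo and the helper SortAttributes are hypothetical names used only for this illustration; they are not part of the code presented later in this post, and the helper sorts only the attributes of the element it is given.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AttributeOrderDemo
{
    // Hypothetical helper: returns a clone of the element whose attributes
    // are ordered by namespace name, then by local name. Child nodes are
    // cloned unchanged (their attributes are not sorted in this sketch).
    public static XElement SortAttributes(XElement element)
    {
        return new XElement(element.Name,
            element.Attributes()
                .OrderBy(a => a.Name.NamespaceName)
                .ThenBy(a => a.Name.LocalName),
            element.Nodes());
    }

    public static void Main()
    {
        XElement root1 = XElement.Parse("<Root a='1' b='2'/>");
        XElement root2 = XElement.Parse("<Root b='2' a='1'/>");

        // Attribute order matters to DeepEquals, so this prints False.
        Console.WriteLine(XNode.DeepEquals(root1, root2));

        // After sorting the attributes of both elements, this prints True.
        Console.WriteLine(XNode.DeepEquals(SortAttributes(root1), SortAttributes(root2)));
    }
}
```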
The situation gets more interesting when we consider normalizing a document that has been validated using XSD. Consider the following simple schema:
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
<xsd:element name='Root'
type='xsd:double'/>
</xsd:schema>
Let’s say that we are comparing two small XML documents:
Document 1:
<Root>25</Root>
Document 2:
<Root>+25</Root>
These two documents have essentially the same content, but the ‘+’ before the 25 in the second document causes the two documents to compare as unequal. However, if we have first validated the documents and annotated the tree with a PSVI using the XDocument.Validate extension method, we know that the data type of the element is double, and can normalize the value of the element. Once normalized, the two documents will compare as equal.
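The following is a small sketch of that idea, assuming the schema shown above is supplied as a string. The class name DoubleNormalizationDemo and the helper NormalizeRoot are hypothetical and exist only for this illustration. The helper validates the document with PSVI annotation, checks the type code of the root element, and rebuilds the element from its typed value.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class DoubleNormalizationDemo
{
    const string Xsd =
@"<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <xsd:element name='Root' type='xsd:double'/>
</xsd:schema>";

    // Hypothetical helper: validates the document (throwing on errors),
    // then returns a normalized clone of the root element.
    public static XElement NormalizeRoot(string xml)
    {
        XmlSchemaSet schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));

        XDocument doc = XDocument.Parse(xml);
        doc.Validate(schemas, null, true);   // true: annotate the tree with PSVI

        XElement root = doc.Root;
        if (root.GetSchemaInfo().SchemaType.TypeCode == XmlTypeCode.Double)
            return new XElement(root.Name, (double)root);
        return new XElement(root);
    }

    public static void Main()
    {
        XElement r1 = NormalizeRoot("<Root>25</Root>");
        XElement r2 = NormalizeRoot("<Root>+25</Root>");
        Console.WriteLine(r2.Value);                  // 25
        Console.WriteLine(XNode.DeepEquals(r1, r2));  // True
    }
}
```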
Going further, a schema can declare default attributes and elements. If an XML document in one case has an attribute with the default value, and in another case, the XML document is missing the default attribute, the attribute can be added when validating, and the two trees will compare as equivalent.
For illustration, consider this schema and two documents:
<xsd:element name='Root'>
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base='xsd:string'>
        <xsd:attribute name='ADefaultBooleanAttribute'
                       default='false'/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
Document 1:
<Root/>
Document 2:
<Root ADefaultBooleanAttribute='false'/>
The above two documents will compare as equivalent after validation and adding PSVI.
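To illustrate, here is a sketch that wraps the element declaration above in an xsd:schema element (an assumption, since the fragment is shown without its wrapper) and validates a document that omits the attribute. The class name DefaultAttributeDemo and the helper ValidateWithPsvi are hypothetical names for this example only.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class DefaultAttributeDemo
{
    // The element declaration from the post, wrapped in xsd:schema (assumed).
    const string Xsd =
@"<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <xsd:element name='Root'>
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base='xsd:string'>
          <xsd:attribute name='ADefaultBooleanAttribute' default='false'/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>";

    // Hypothetical helper: parses and validates, asking the validator to
    // populate the tree with PSVI (which adds default attributes).
    public static XDocument ValidateWithPsvi(string xml)
    {
        XmlSchemaSet schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));
        XDocument doc = XDocument.Parse(xml);
        doc.Validate(schemas, null, true);
        return doc;
    }

    public static void Main()
    {
        XDocument doc = ValidateWithPsvi("<Root/>");
        // Prints the root element, now carrying the defaulted attribute.
        Console.WriteLine(doc.Root);
    }
}
```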
In the remainder of this post, I’ll present a list of areas where it would be possible to normalize a tree before comparison. Then, I’ll discuss a simple implementation of normalization using LINQ to XML.
Normalization Issues
The following is a list of issues that can be addressed when normalizing an XML tree:
One important point about normalization is that you can define application-specific normalization rules. It would be easy to modify the normalization code presented in this post to, say, look for a particular named complex type in the PSVI, and normalize that element per specific rules, perhaps converting attribute or element values to upper or lower case, or some other bespoke normalization.
LINQ to XML Implementation of Normalization
The approach that I took when normalizing is to generate a new, normalized XML tree instead of modifying the existing tree. This has the advantage that we can write the code in a pure functional style, and the resulting code is small and easy to maintain. In the example code that I present in this post, the pure functional transform that generates a new normalized XML tree is about 190 lines of code. For more information on this style of cloning, see the blog post “Manually Cloning LINQ to XML Trees”.
Instead of optimizing for processing speed, I optimized for compactness and readability of code. This is probably a good trade-off. This code will be pretty fast in any case, and other processes in our solution, such as database access, can be multiple orders of magnitude slower, so it probably doesn’t matter if there are ways to write the code so that it will execute slightly faster.
XML Trees Must Validate before Normalization
The first and most important point is that, when you supply a schema, this code relies on the XML validating successfully before it is normalized. The code also allows you to specify no schema, in which case it does only the normalization that doesn’t require PSVI. If you do specify a schema and the code doesn’t throw an exception, then the document(s) were valid per the schema.
White Space
The default behavior of LINQ to XML is to discard insignificant white space when populating an XML tree. No text nodes for insignificant white space are materialized. This makes it easy for us – the code presented here assumes that insignificant white space is not in the XML tree.
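A short sketch of this behavior: parsing the same markup with and without LoadOptions.PreserveWhitespace yields trees with different node counts. The class name WhiteSpaceDemo and the helper CountChildNodes are hypothetical names used only here.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class WhiteSpaceDemo
{
    // Hypothetical helper: counts the child nodes of the parsed root element.
    public static int CountChildNodes(string xml, LoadOptions options)
    {
        return XElement.Parse(xml, options).Nodes().Count();
    }

    public static void Main()
    {
        string xml = "<Root>\n  <Child>1</Child>\n</Root>";

        // Default: insignificant white space is discarded, leaving only <Child>.
        Console.WriteLine(CountChildNodes(xml, LoadOptions.None));               // 1

        // PreserveWhitespace: two white space text nodes are materialized too.
        Console.WriteLine(CountChildNodes(xml, LoadOptions.PreserveWhitespace)); // 3
    }
}
```

If your trees are loaded with LoadOptions.PreserveWhitespace, the normalization code in this post would need an extra step to drop those text nodes.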
Namespace Prefix Normalization
Because LINQ to XML has a simple model for namespaces, normalization is easy. The full namespace is included in every XName object. There can be XAttribute objects that contain namespace declarations, but these “attributes” only have an effect upon serialization. For the purposes of normalization, we can simply remove all XAttribute objects that are namespace declarations. We can determine whether an attribute is a namespace declaration using the XAttribute.IsNamespaceDeclaration property.
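As an illustration only, the sketch below clones a tree while dropping namespace declaration attributes, and then compares a default-namespace document with a prefixed one. The class name NamespaceDeclarationDemo and the helper StripNamespaceDeclarations are hypothetical; the full Normalize method later in this post does this as part of a larger transform.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDeclarationDemo
{
    // Hypothetical helper: recursively clones the element, omitting every
    // attribute that is a namespace declaration.
    public static XElement StripNamespaceDeclarations(XElement element)
    {
        return new XElement(element.Name,
            element.Attributes().Where(a => !a.IsNamespaceDeclaration),
            element.Nodes().Select(n =>
                n is XElement ? (XNode)StripNamespaceDeclarations((XElement)n) : n));
    }

    public static void Main()
    {
        XElement root1 = XElement.Parse(
            "<Root xmlns='http://www.northwind.com'><Child>1</Child></Root>");
        XElement root2 = XElement.Parse(
            "<n:Root xmlns:n='http://www.northwind.com'><n:Child>1</n:Child></n:Root>");

        // The xmlns and xmlns:n attributes differ, so this prints False.
        Console.WriteLine(XNode.DeepEquals(root1, root2));

        // Element names already carry the full namespace, so once the
        // declarations are removed this prints True.
        Console.WriteLine(XNode.DeepEquals(
            StripNamespaceDeclarations(root1),
            StripNamespaceDeclarations(root2)));
    }
}
```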
Adding Default Elements and Attributes
One of the overloads of the Validate extension method for XDocument allows us to pass a Boolean parameter that indicates that the XML tree should be populated with the PSVI. As part of this population of the PSVI, default elements and attributes are added to the tree. They are then cloned into the new tree.
Normalizing Values of Elements and Attributes of Certain Data Types
For five data types (xsd:boolean, xsd:decimal, xsd:double, xsd:dateTime, and xsd:float), we can take advantage of one particular aspect of LINQ to XML semantics: LINQ to XML is lax when converting element and attribute values to typed values (the explicit conversion operators accept a variety of lexical forms), and is strict when serializing typed values that are passed to constructors. For instance, if we know that an attribute is type xsd:double, we can create a new normalized attribute like this:
return new XAttribute(a.Name, (double)a);
And in a similar way, we can create a normalized xsd:double element like this:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(double)element);
Elements and attributes of type xsd:hexBinary and xsd:language are case insensitive. We can normalize by converting them to lower case.
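The round trip described in this section can be tried without a schema. The sketch below is illustrative only: the class name CastRoundTripDemo and the helpers NormalizeBoolean and NormalizeDouble are hypothetical, and they assume the caller already knows the data type (which in the real code comes from the PSVI).

```csharp
using System;
using System.Xml.Linq;

public static class CastRoundTripDemo
{
    // Hypothetical helper: parses a lexical xsd:boolean leniently via the
    // explicit conversion, then lets the constructor serialize it strictly.
    public static string NormalizeBoolean(string lexical)
    {
        XElement source = new XElement("Flag", lexical);
        return new XElement(source.Name, (bool)source).Value;
    }

    // Hypothetical helper: the same round trip for an xsd:double attribute.
    public static string NormalizeDouble(string lexical)
    {
        XAttribute source = new XAttribute("amount", lexical);
        return new XAttribute(source.Name, (double)source).Value;
    }

    public static void Main()
    {
        Console.WriteLine(NormalizeBoolean("1"));       // true
        Console.WriteLine(NormalizeDouble("+2.50E1"));  // 25
    }
}
```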
Removing Comments, Processing Instructions, and the Attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation
This is straightforward – the code trims these while cloning.
Ordering Attributes by Name
While cloning the tree, it is straightforward to order attributes by name.
Normalization Issues Not Handled
Schema Centric XML Canonicalization Version 1.0 (C14N) describes a number of other possibilities for normalization that I’ve not handled:
About the Code
This post presents two methods: Normalize and DeepEqualsWithNormalization.
Normalize
This method creates and returns a new, cloned, normalized XDocument.
Because it relies on schema validation, it normalizes an XDocument, not an XElement. However, it is easy to create an XDocument from an XElement if you need to normalize an XML tree rooted in an XElement.
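For example, one way to wrap an element in a document is sketched below. The class name WrapElementDemo and the helper ToDocument are hypothetical names for this illustration. The sketch copies the element first so that the original tree is left untouched.

```csharp
using System;
using System.Xml.Linq;

public static class WrapElementDemo
{
    // Hypothetical helper: the XElement copy constructor clones the element,
    // so the source element is never re-parented into the new document.
    public static XDocument ToDocument(XElement element)
    {
        return new XDocument(new XElement(element));
    }

    public static void Main()
    {
        XElement fragment = XElement.Parse("<Root><Child>1</Child></Root>");
        XDocument doc = ToDocument(fragment);

        Console.WriteLine(doc.Root.Name);                        // Root
        Console.WriteLine(ReferenceEquals(doc.Root, fragment));  // False
    }
}
```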
It is valid to pass null for the schema parameter, in which case the method will only do the normalizations that are possible without using PSVI.
Signature:
public static XDocument Normalize(XDocument source, XmlSchemaSet schema)
DeepEqualsWithNormalization
This method compares two XDocument objects after normalization.
Signature:
public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
XmlSchemaSet schemaSet)
Test Cases
In the attached code, along with the public and private methods that do the normalization, there are about a dozen test cases that provide full code coverage of the normalization code. The sample code runs each test case and reports whether the new normalized trees are equivalent.
The Code
The following listing shows the code used for XML tree normalization and comparison, including an extensions class and the Normalize and DeepEqualsWithNormalization methods:
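using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class MyExtensions
{
    // Serializes a document with each attribute on its own line (useful when
    // inspecting normalized trees).
    public static string ToStringAlignAttributes(this XDocument document)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.OmitXmlDeclaration = true;
        settings.NewLineOnAttributes = true;
        StringBuilder stringBuilder = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(stringBuilder, settings))
            document.WriteTo(xmlWriter);
        return stringBuilder.ToString();
    }
}

class Program
{
    private static class Xsi
    {
        public static XNamespace xsi = "http://www.w3.org/2001/XMLSchema-instance";
        public static XName schemaLocation = xsi + "schemaLocation";
        public static XName noNamespaceSchemaLocation = xsi + "noNamespaceSchemaLocation";
    }

    public static XDocument Normalize(XDocument source, XmlSchemaSet schema)
    {
        bool havePSVI = false;
        // validate, throw errors, add PSVI information
        if (schema != null)
        {
            source.Validate(schema, null, true);
            havePSVI = true;
        }
        return new XDocument(
            source.Declaration,
            source.Nodes().Select(n =>
            {
                // Remove comments, processing instructions, and text nodes that are
                // children of XDocument. Only white space text nodes are allowed as
                // children of a document, so we can remove all text nodes.
                if (n is XComment || n is XProcessingInstruction || n is XText)
                    return null;
                XElement e = n as XElement;
                if (e != null)
                    return NormalizeElement(e, havePSVI);
                return n;
            })
        );
    }

    public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
        XmlSchemaSet schemaSet)
    {
        XDocument d1 = Normalize(doc1, schemaSet);
        XDocument d2 = Normalize(doc2, schemaSet);
        return XNode.DeepEquals(d1, d2);
    }

    private static IEnumerable<XAttribute> NormalizeAttributes(XElement element,
        bool havePSVI)
    {
        // Drop namespace declarations and schema location hints, sort the rest,
        // and normalize values whose type is known from the PSVI.
        return element.Attributes()
            .Where(a => !a.IsNamespaceDeclaration &&
                a.Name != Xsi.schemaLocation &&
                a.Name != Xsi.noNamespaceSchemaLocation)
            .OrderBy(a => a.Name.NamespaceName)
            .ThenBy(a => a.Name.LocalName)
            .Select(
                a =>
                {
                    if (havePSVI)
                    {
                        var dt = a.GetSchemaInfo().SchemaType.TypeCode;
                        switch (dt)
                        {
                            case XmlTypeCode.Boolean:
                                return new XAttribute(a.Name, (bool)a);
                            case XmlTypeCode.DateTime:
                                return new XAttribute(a.Name, (DateTime)a);
                            case XmlTypeCode.Decimal:
                                return new XAttribute(a.Name, (decimal)a);
                            case XmlTypeCode.Double:
                                return new XAttribute(a.Name, (double)a);
                            case XmlTypeCode.Float:
                                return new XAttribute(a.Name, (float)a);
                            case XmlTypeCode.HexBinary:
                            case XmlTypeCode.Language:
                                // Invariant casing keeps the result independent
                                // of the current culture.
                                return new XAttribute(a.Name,
                                    ((string)a).ToLowerInvariant());
                        }
                    }
                    return a;
                }
            );
    }

    private static XNode NormalizeNode(XNode node, bool havePSVI)
    {
        // trim comments and processing instructions from normalized tree
        if (node is XComment || node is XProcessingInstruction)
            return null;
        XElement e = node as XElement;
        if (e != null)
            return NormalizeElement(e, havePSVI);
        // Only thing left is XCData and XText, so clone them
        return node;
    }

    private static XElement NormalizeElement(XElement element, bool havePSVI)
    {
        if (havePSVI)
        {
            var dt = element.GetSchemaInfo();
            switch (dt.SchemaType.TypeCode)
            {
                case XmlTypeCode.Boolean:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        (bool)element);
                case XmlTypeCode.DateTime:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        (DateTime)element);
                case XmlTypeCode.Decimal:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        (decimal)element);
                case XmlTypeCode.Double:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        (double)element);
                case XmlTypeCode.Float:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        (float)element);
                case XmlTypeCode.HexBinary:
                case XmlTypeCode.Language:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        ((string)element).ToLowerInvariant());
                default:
                    return new XElement(element.Name,
                        NormalizeAttributes(element, havePSVI),
                        element.Nodes().Select(n => NormalizeNode(n, havePSVI))
                    );
            }
        }
        else
        {
            return new XElement(element.Name,
                NormalizeAttributes(element, havePSVI),
                element.Nodes().Select(n => NormalizeNode(n, havePSVI)),
                // Keep <a></a> distinct from <a/> when there is no text content
                (!element.IsEmpty && !element.Nodes().OfType<XText>().Any()) ?
                    "" : null
            );
        }
    }
}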
Eric White
http://blogs.msdn.com/ericwhite/