Use of GetElementsByTagName considered harmful

As you may already be aware, there is an article on msdn about the great performance improvements we made in the V2 xml stack. This is a pretty big event for the team, since increasing performance was one of the big goals of this release.

While I posted this to give a shout-out to the team for the great perf work in they did for Whidbey, I want to take a minute to point out another interesting part of the article. Up at the top of this article you will find a this paragraph:

2. We also found that Sun is using the GetElementsByTagName method in C#, and this results in a mismatch of the C# and Java versions. Although both harnesses use XmlElement.GetElemensByTagName(), the C# implementation of this method keeps track of a live list of nodes which is negatively impacted by some of the DOM test scenarios that have edits. A better method for element selection in the C# harness is XmlElement.SelectNodes() which matches the Java harness in functionality.

When I first read this, it caused me some concern. Were we somehow ‘cheating’ by changing the nature of the test? Was there something wrong with GetElementsByTagName that I (and the various xml developers out there) should be aware of? I did a little digging, and wanted to share the results.

The W3C Level1 DOM spec has this to say about the results of GetElementsByTagName (I underlined the last sentence for emphasis):

The DOM also specifies a NodeList interface to handle ordered lists of Nodes, such as the children of a Node, or the elements returned by the Element.getElementsByTagName method, and also a NamedNodeMap interface to handle unordered sets of nodes referenced by their name attribute, such as the attributes of an Element. NodeLists and NamedNodeMaps in the DOM are "live", that is, changes to the underlying document structure are reflected in all relevant NodeLists and NamedNodeMaps.

 

In other words, GetElementsByTagName is, according to the spec, supposed to return a ‘live list’ where changes to the underlying DOM are reflected in the returned NodeList. The MS implementation of this function conforms to the spec, but the Sun/Java implementation does not.

Unfortunately, this conformance comes at a cost. In this case, the cost is a non-trivial hit to the speed and working set of code which uses GetElementsByTagName. Let me explain why.

 

In order to keep the NodeLists returned from GetElementsByTagName live, each list needs to ‘register’ somehow with the related element so that it can be notified when the children of that element change. When we first implemented this function, we thought about adding a series of events to XmlElement. NodeLists could register for these events and react appropriately. Unfortunately, this turns out to bloat the size of a DOM tree, and the performance costs caused by the related cache misses were unacceptable. Because of this we moved to a model where only XmlDocument exposed ‘OnChange’ events. When an item is added into the DOM, the XmlDocument.NodeInsterted event fires, each NodeList can determine if the new element should be added to its list or not, and then take the appropriate action.

 

This solution has the advantage that it adds no extra overhead to scenarios which do not call GetElementsByTagName, and the overhead is minimal even for scenarios which do use GetElementsByTagName, as long as those scenarios do not change the DOM (in this case the only overhead is the cost of registering for the event and a small increase in allocated memory for the created delegate).

 

The real problem shows up when you combine calls to GetElementsByTagName and edits to the DOM. Imagine that you had a moderate sized DOM (a few hundred elements) and for each element you called GetElementsByTagName, which results in a few hundred NodeLists registering for the XmlDocument.NodeInsterted. Now, after walking through the tree, you add a few elements. Each time you add an element, the NodeInserted event fires, and hundreds of delegate functions will get called. As you can imagine, this can cause quite a hit to the performance of your application.

 

The bottom line? If you are using GetElementsByTagName, especially in scenarios where you edit the DOM, you should investigate using SelectNodes instead. SelectNodes takes an XPath expression, but all you would need to do to get back the same elements as what is called by GetElementsByTagName would be to pass in “\\ElementName” [edit: that should read "//ElementName] to SelectNodes where you would have passed “ElementName” in to GetElementsByTagName. As always, you should measure how this changes your perf before switching over, especially in cases where you are not editing the DOM, since SelectNodes does a little more work to support the full syntax, but don’t be afraid to try. Even if your perf is equivalent in both cases, you may want to look at switching over, since we are very focused on improving XPath performance, so you may find your perf improve over time along with our XPath implementation (note: as always Microsoft’s plans are always subject to change, while we currently are focused on XPath performance, some future change may alter this in the future).