Please Recycle...

(My post today is on a topic some of you may already be familiar with - reusing NameTable across Xml messages.  If so, please feel free to skip this entry.  I posted it because, while I have seen other press on this issue, I thought it was important enough that I wanted to get more information out there.)

For anyone who has not met him yet, I would like to introduce you to an old friend, XmlNameTable. XmlNameTable is an abstract class used by XmlReader and XmlDocument when they want to store atomized names. For example, imagine that you parsed the following document:

<?xml version="1.0" encoding="utf-8" ?>
<stable>
  <horse>Mr. Ed</horse>
  <horse>Seabiscuit</horse>
  <horse>Man O' War</horse>
  <!-- Imagine 10,000 more horses here -->
  <horse>Quick Draw McGraw</horse>
</stable>

Then you use the following code:

System.Xml.XmlReader reader = XmlReader.Create(…);
List<string> elementNames = new List<string>();
reader.MoveToContent();
while (reader.Read())
{
    // Record the name of every element the reader visits.
    if (reader.NodeType == XmlNodeType.Element)
        elementNames.Add(reader.Name);
}

How many copies of the string "horse" do you expect to find contained in elementNames?

The answer is that the list will contain 10,000 or so references to only one string. This is because each time the parser comes upon a name, it checks to see if that name already exists in its XmlNameTable, and if so it uses that copy of the name.  This not only decreases memory pressure, but it also allows you to do reference comparisons for names.
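To make the reference-comparison point concrete, here is a minimal sketch that reads a document from a string and checks whether every collected name is the same string object. The class and method names (AtomizationDemo, CollectElementNames, AllSameInstance) are made up for illustration, and the sketch uses LocalName rather than Name, which is the same string for elements without a prefix.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper class, not part of System.Xml.
public static class AtomizationDemo
{
    // Collects the name of every element that follows the root element.
    public static List<string> CollectElementNames(string xml)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.LocalName);
            }
        }
        return names;
    }

    // True when every entry in the list is the same string object
    // (a reference comparison, not a character-by-character one).
    public static bool AllSameInstance(List<string> names)
    {
        foreach (string name in names)
        {
            if (!object.ReferenceEquals(name, names[0]))
                return false;
        }
        return true;
    }
}
```

Running CollectElementNames over the stable document and passing the result to AllSameInstance should confirm that all of the "horse" entries point at one string.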

If you look at the XmlReaderSettings class, you will see that it contains a NameTable property that takes an implementation of XmlNameTable.  This property exists so that you can pass in your own XmlNameTable when you create a new reader through the XmlReader.Create(…) API.  That lets you supply a custom implementation, which is mildly useful in some odd scenarios.  More importantly, it lets you re-use a name table across multiple documents, which can be an important performance optimization, particularly when parsing many small documents that have similar names.
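As a rough sketch of what re-use looks like, the code below keeps one NameTable in a static field and hands it to every reader through XmlReaderSettings.NameTable. The class name SharedNameTableDemo and the ReadRootName helper are assumptions for illustration only, and the sketch is meant for single-threaded use.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper class, not part of System.Xml.
public static class SharedNameTableDemo
{
    // One table shared by every reader created through these settings.
    private static readonly NameTable SharedNames = new NameTable();

    private static readonly XmlReaderSettings Settings =
        new XmlReaderSettings { NameTable = SharedNames };

    // Parses a document with the shared table and returns the root element's name.
    public static string ReadRootName(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), Settings))
        {
            reader.MoveToContent();
            return reader.LocalName;
        }
    }
}
```

Because both readers draw from the same table, calling ReadRootName on two separate documents with the same root element should hand back the same string object both times, so the second document allocates nothing new for that name.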

For example, let's say that you had some xml that represented a purchase order:

<?xml version="1.0" encoding="utf-8" ?>
<purchaseOrder poNumber="111">
  <customer name="Mr. Ed">
    <lineItems>
      <lineItem itemType="horseshoe" quantity="4" />
      <lineItem itemType="saddle" quantity="1" />
    </lineItems>
  </customer>
</purchaseOrder>

If you have to process many of these purchase orders but use a different name table for each message, you will continually pay the overhead of allocating the names 'purchaseOrder', 'customer', 'name', 'lineItems', 'lineItem', 'itemType' and 'quantity' over and over.  You will also have to allocate new entries in the underlying collection class used by the XmlNameTable.  If instead you reuse your NameTable across readers, you can see quite a bit of savings.

In order to understand these differences, I ran a test.  I created two xml documents.  The first document was similar to the 'stable' sample above, where there were very few unique names, and some of the names ('horse' in this case) were repeated over and over.  The second document was similar to the 'purchaseOrder' sample, where just about every name in the document was unique (although my document was somewhat larger than the sample shown above).

I parsed each of these documents 100,000 times.  In half of these cases I reused the same name table.  In the other half I recreated the name table each time I parsed the document.
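A test along these lines could be sketched as follows. This is not the harness behind the numbers in this post; it is a minimal approximation, and the NameTableBenchmark class and its Run method are names invented for the example.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

// Hypothetical benchmark helper, not the original test code.
public static class NameTableBenchmark
{
    // Parses the document `iterations` times and returns the elapsed time.
    // When reuseTable is true, one NameTable is shared across all parses;
    // otherwise each parse gets a brand new table.
    public static TimeSpan Run(string xml, int iterations, bool reuseTable)
    {
        var shared = new NameTable();
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var settings = new XmlReaderSettings
            {
                NameTable = reuseTable ? shared : new NameTable()
            };
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // walk every node, discard the content
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Calling Run twice on the same document, once with reuseTable set to true and once with false, gives the two timings to compare. A warm-up pass before timing helps keep JIT compilation out of the measurement.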

Here are my results:

                          Repeating Names (stable)   Unique Names
Re-use name table         2.596 seconds              1.308 seconds
Don’t re-use name table   2.696 seconds              1.771 seconds

As you can see, the performance improvement from re-using the name table (time saved relative to the run that recreated it each time) looks roughly like this:

  • Xml with many repeating names (and very few unique ones): 3.7% improvement.
  • Xml with many unique names (and very few repeating ones): 26.1% improvement.

This makes sense.  If your document consists of the same names repeating over and over again, then almost every lookup finds an existing entry, and you will not spend much of your time adding new names to the table.  If, on the other hand, you have a small document, or one with many unique names, then you will spend a proportionately higher percentage of your time in name management.  Make sure you measure your case, since your mileage will vary depending on the shape of your xml.

There are two things you need to be careful of when re-using your name table:

  • Watch out for multithreading: The default implementation is not thread-safe.  If you are parsing on multiple threads using the same name table, you will either need to roll your own thread-safe implementation or use one name table per thread (perhaps storing it in thread-local storage).
  • Watch out for name-bloat: If you are parsing untrusted data, a malicious sender can feed you a large number of different names, each of which will be added to the name table.  Over time this can starve your application of memory.  Even without attackers, you may see the size of your name table grow if you parse many xml messages with different names.  For these reasons, you may want to occasionally clear the table out.  NameTable has no method for removing entries, so in practice this means replacing it with a new instance.
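One way to handle both cautions at once is sketched below: each thread gets its own NameTable through ThreadLocal<T>, and the table is swapped for a fresh one after it has served a fixed number of documents. The NameTablePool class, the NextSettings method and the threshold of 10,000 documents are all assumptions chosen for illustration. Counting documents is only a crude stand-in for table size, and it does not protect against a single document that carries a huge number of names.

```csharp
using System;
using System.Threading;
using System.Xml;

// Hypothetical helper class, not part of System.Xml.
public static class NameTablePool
{
    // Assumed threshold: how many documents a table serves before it is replaced.
    public const int MaxDocumentsPerTable = 10000;

    private sealed class Slot
    {
        public NameTable Table = new NameTable();
        public int DocumentsParsed;
    }

    // Each thread gets its own slot, so no table is shared across threads.
    private static readonly ThreadLocal<Slot> Current =
        new ThreadLocal<Slot>(() => new Slot());

    // Returns reader settings bound to the calling thread's name table,
    // swapping in a fresh table once the threshold has been reached.
    public static XmlReaderSettings NextSettings()
    {
        Slot slot = Current.Value;
        if (slot.DocumentsParsed >= MaxDocumentsPerTable)
        {
            slot.Table = new NameTable(); // the old table becomes garbage
            slot.DocumentsParsed = 0;
        }
        slot.DocumentsParsed++;
        return new XmlReaderSettings { NameTable = slot.Table };
    }
}
```

Each parse would then call NameTablePool.NextSettings() and pass the result to XmlReader.Create. Strings already handed out by an old table stay valid after the swap, but they will no longer be reference-equal to the same names atomized by the new table, so reference comparisons should stay within a single document.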