Surprisingly, I haven't seen much information out there discussing the performance impact of XML tag name lengths (i.e., using "<c>" instead of "<table-cell>"). My last post about some of the design goals behind SpreadsheetML raised some questions from folks about where the time is actually spent when loading an XML file. There are a ton of things that we do to improve performance when opening and saving Open XML files. The move to using these formats as the default for Office 2007 meant we had to get really serious about how the formats were constructed so they could open efficiently. I'd be really interested to hear from other people who've worked on XML formats if they've had similar experiences.
For a lot of people who have worked with XML, the parsing of tags accounts for no more than a percent or two of the overall load and save times. With office document formats, that's not always the case. Just to give you a bit of an idea about the scale these documents can get to, check out the article by George Ou about performance of spreadsheet files: http://blogs.techrepublic.com.com/Ou/?p=120
In that article, the spreadsheet he uses is pretty big, but we have definitely seen much larger and more complex spreadsheets, so don't assume that it's a fringe case. If you save the spreadsheet from the article using the new Open XML format, you get the following:
So, as you can see, that's a lot of XML to parse over. As we looked at files like this, we saw that we absolutely needed to find different ways to optimize the formats to make them faster. Using shorter tag names was one of the first obvious ones.
In the profiles that we've looked at over the years, we've seen that simply using shorter tag names can significantly improve performance, depending on the type of file. For those of you really interested, you should run your own profiles and let me know what you find out. Remember that for an application like Excel, we're talking about the potential for millions of XML tags to represent a rich spreadsheet. Let's look at a couple of issues now:
I know that for a lot of people who've played around with XML, the parsing isn't really something that you would think of as being a major part of the file load times. This is not the case with office document formats, and especially spreadsheet documents.
With the latest SpreadsheetML design, we've seen that the XML parsing alone (not including our own parsing of numbers, references, and formulas) can often range from 10-40% of the entire file load. That's just the time it takes to read each tag and each attribute. This shouldn't be too surprising though, as the internal memory structures for a spreadsheet application should be fairly similar to the shapes that are used in the format design. A big piece is just reading the XML in and interpreting what the tags are.
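If you want a rough feel for that raw "read each tag and each attribute" cost on its own, one option is to run a parser whose handlers do nothing but count. This is only a minimal Python sketch and not how we profile Excel: the `make_sheet` and `count_tags` helpers, the `sheetData` wrapper, and the row and column counts are all made up for illustration, and it relies on the expat bindings in the Python standard library.

```python
import time
import xml.parsers.expat


def make_sheet(rows, cols):
    """Build a hypothetical sheet that uses short SpreadsheetML-style tag names."""
    row = "<row>" + "".join(f"<c><v>{i}</v></c>" for i in range(cols)) + "</row>"
    return ("<sheetData>" + row * rows + "</sheetData>").encode("utf-8")


def count_tags(data):
    """Parse `data` and do nothing except count elements and attributes."""
    counts = {"elements": 0, "attributes": 0}

    def on_start(name, attrs):
        # attrs is a dict of attribute name -> value for this element.
        counts["elements"] += 1
        counts["attributes"] += len(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return counts


if __name__ == "__main__":
    data = make_sheet(rows=20_000, cols=10)
    start = time.perf_counter()
    counts = count_tags(data)
    elapsed = time.perf_counter() - start
    print(f"{len(data):,} bytes, {counts['elements']:,} elements, {elapsed:.3f}s")
```

The timing you get depends entirely on your machine and parser, so treat it as a baseline to compare against your full load time rather than as a number that means anything by itself.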
SpreadsheetML was designed so that for any tag or attribute that would appear frequently, we used super short tag names. We also established naming conventions for the abbreviations shared across all three formats (so that they become easier to interpret as you work with them). Elements that may only appear once in a file often have longer tag names, since their size doesn't have nearly the same impact. Right now, most of our frequently used tag names are no more than a couple of characters in length. Let's imagine instead we decided to use longer, more descriptive names so each tag was around 5 times larger (you can use the older SpreadsheetML or the OpenDocument format for examples of longer tag names):
Short tag example:
<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row><row><c><v>4</v></c><c><v>5</v></c><c><v>6</v></c></row>
Long tag example:
<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="1"><text:p>1</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="2"><text:p>2</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="3"> <text:p>3</text:p></table:table-cell></table:table-row><table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="4"><text:p>4</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="5"><text:p>5</text:p></table:table-cell> <table:table-cell office:value-type="float" office:value="6"><text:p>6</text:p></table:table-cell></table:table-row>
For that example, the top one is using SpreadsheetML from the Ecma Office Open XML format. The second example is using the OpenDocument format. There is another optimization that SpreadsheetML does where you can optionally write out the column and row information on cells, but I removed that since it's actually another performance optimization that I'd like to discuss in a separate future post (and as I said it's optional).
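To see how the two styles scale, here is a small Python sketch that generates both shapes for any number of rows and columns and compares their sizes. The helper names (`build`, `short_markup`, `long_markup`) are hypothetical, and the templates are simply copied from the two samples above. Keep in mind that the long sample also carries attributes on every row and cell, so this compares the two markups as written, not tag names alone.

```python
SHORT_CELL = "<c><v>{v}</v></c>"
LONG_CELL = (
    '<table:table-cell office:value-type="float" office:value="{v}">'
    "<text:p>{v}</text:p></table:table-cell>"
)


def build(rows, cols, row_open, row_close, cell):
    """Emit `rows` rows of `cols` cells, numbering the cell values from 1."""
    parts = []
    value = 1
    for _ in range(rows):
        parts.append(row_open)
        for _ in range(cols):
            parts.append(cell.format(v=value))
            value += 1
        parts.append(row_close)
    return "".join(parts)


def short_markup(rows, cols):
    """Markup in the shape of the short tag example."""
    return build(rows, cols, "<row>", "</row>", SHORT_CELL)


def long_markup(rows, cols):
    """Markup in the shape of the long tag example (without its stray spaces)."""
    return build(
        rows,
        cols,
        '<table:table-row table:style-name="ro1">',
        "</table:table-row>",
        LONG_CELL,
    )


if __name__ == "__main__":
    for rows in (2, 10_000):
        short = len(short_markup(rows, 3).encode("utf-8"))
        long_ = len(long_markup(rows, 3).encode("utf-8"))
        print(f"{rows} rows: short={short:,} bytes, long={long_:,} bytes, "
              f"ratio={long_ / short:.1f}x")
```

With 2 rows and 3 columns, `short_markup` reproduces the short tag example exactly, so you can check the generator against the samples before scaling it up.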
Let's imagine we have that file I mentioned earlier with 7 million elements and 10 million attributes. If on average each attribute and element name is about 2 characters long, then you have 34 megabytes of data to parse (which is a ton), just in element names and attribute names. If instead, though, the average length of an attribute or element name were more like 10 characters, then you're talking about 170 megabytes. That is a very significant difference.
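That back-of-the-envelope math is easy to replay with your own numbers. The sketch below assumes one byte per character and counts each element and attribute name once, which is how the figures above work out (closing tags would add more on top). The `name_bytes` function is just an illustrative helper.

```python
def name_bytes(elements, attributes, avg_name_len, bytes_per_char=1):
    """Estimate the bytes spent on element and attribute names alone.

    Assumes every name has the same average length and is counted once;
    closing tags, angle brackets, and attribute values are ignored.
    """
    return (elements + attributes) * avg_name_len * bytes_per_char


ELEMENTS = 7_000_000
ATTRIBUTES = 10_000_000

short_total = name_bytes(ELEMENTS, ATTRIBUTES, avg_name_len=2)
long_total = name_bytes(ELEMENTS, ATTRIBUTES, avg_name_len=10)

print(f"2-character names:  {short_total / 1_000_000:.0f} MB")
print(f"10-character names: {long_total / 1_000_000:.0f} MB")
```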
This isn't rocket science of course. Most folks I've talked to agree that it's important to keep tag names short, especially in structures that are highly repetitive. In SpreadsheetML, you'll see that a lot of the element names actually are pretty long and descriptive, but only if they appear in a few places, and won't be much of a burden. Any element that can have a high frequency of occurrence is definitely kept to a minimum length.
Remember, we're not talking about creating a format for hobbyists. This format is supposed to be used by everyone, and most of those folks aren't going to be happy with feature loss and performance degradation just so they can save out as XML (the average user doesn't care about XML). The original SpreadsheetML from Office XP was actually more like a hobbyist format, and as a result, it was really easy to develop against, but it was bloated and slow. I wish that we didn't have to worry so much about performance, but if you really expect these formats to be used by everyone, then you have to take the training wheels off. That's why the standardization in Ecma is so important though, so that we can ensure that everything is fully documented and all the information is there to allow you to develop against them.
I'll talk more about the other parts of the design that are optimized around performance in future posts. This was one that people had some questions on though so I just wanted to clarify and make sure there wasn't any more confusion. If you've looked at similar issues in your file format designs and found other interesting things like this, I'd love to hear about them!
-Brian