I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded, it can significantly impact the performance of the actual load or save. The majority of a wordprocessing file is usually the actual text content and anything you can do to avoid making the storage of that content complex the better. The majority of a presentation is usually the rich media and styling information and it's important to store those in an efficient manner. The biggest challenge though hands down in terms of file format performance though is a spreadsheet.
It's very easy to get Spreadsheets comprised of millions of cells. In older versions of Excel, it used to be that the maximum number of columns allowed in a worksheet was 256 and the maximum number of rows was 64,000. That meant that a single worksheet in a workbook could consist of over 16 million cells. An of course there can me many worksheets in a workbook.
There were a number of customers who were hitting this limit and wanted us to increase the number of rows and columns allowed in a worksheet (this was actually one of Excel's top requests). This had been a request for awhile, but it was going to be extremely difficult to increase this limit in the old binary formats so we didn't do it. The move to the new Open XML formats though allowed us to increase the limits, so now in Excel 2007, you can have up to 16,000 columns and 1 million rows. That means you could have a worksheet with 16 billion cells. I can't imaging anyone hitting that limit, but to be honest, I couldn't believe folks were hitting the older limits.
So, while that's a look at the extreme end of things, it's not too uncommon to have very large spreadsheets, and a workbook with a million cells of data is definitely something we see a lot of.
Now, while in a Wordprocessing document you may potentially have a million words (which would be about 2000 pages), it's not as likely. More importantly though, the structure of the XML doesn't have as big of an impact. In wordprocessingML, you don't have every word contained in it's own XML element. So you while may have a million words, your XML element count would probably only be in the tens of thousands.
In a spreadsheet though, you have each cell represented by at least one, if not more XML elements which means that a spreadsheet of any reasonable size can easily have millions of XML elements and attributes. This can have significant implications on the load and save performance of the files.
There are a number of things we looked into doing to help improve the performance of these files both using current technology as well as thinking about future approaches. Some of the approaches we took that are easiest to understand are:
The simplest way to see the impact for yourself is to just take an application like OpenOffice, and get a decent sized spreadsheet (I took a file with just one worksheet, that was 15 columns by 15,000 rows). Save it in ODF and make 2 copies. In one of the copies open content.xml and change some of the namespace prefixes to make them much longer. I aimed to make it so that the resulting content.xml file was about 3 times larger than the original since this helps replicate the size difference you would see if we changed SpreadsheetML so that it's tag length was longer and more descriptive like ODF (it would probably be more than that, but I found that making content.xml 3 times larger was more than enough to get a noticable slowdown in load times). I used namespace prefix approach because it's the easiest way to replicate the effect of a larger element name without changing the actual element names (since that would prevent the file from being readable by OpenOffice). In my test, by essentially tripling the size of the element names, I found that the file went from loading in about 9 seconds to taking about 11 seconds.
There are clearly a large number of factors that play into the performance of a particular format. As you can see, it primarily comes into affect when loading or saving a file. If you do any type of delay loading though or incremental saves, it can also start to have an impact on your editing performance though. The key issues I've focused on primarily affect the general load and save times. The examples above are just a handful of issues we looked at when designing the formats.