It's been awhile since I've talked in detail about the SpreadsheetML schema and I apologize. I had a number of posts back in the summer which talked through Office XP's SpreadsheetML format that we built about 6 years ago, but obviously a lot has changed since then.

The new SpreadsheetML that is part of the Open XML formats coming with Office 2007 had to undergo serious work in order to make it ready to be the default format. As you all know, the majority of folks don't really care about what kind of format they are using, they just want it to work (remember that most end users have never even heard of XML). We wanted our formats to play a more vital role in business processes though, which is why we've slowly been progressing towards these new default XML formats. We want people to be able to easily build solutions on top of the formats, but at the same time, we don't want the average end user to feel much noticeable difference with the change (at least no negative differences).

That leads me to why we had to restructure SpreadsheetML from the original design. The two issues with the SpreadsheetML format from 6 years ago was that it wasn't full fidelity, and it wasn't optimized for performance/file size. The term "Full fidelity" just means that everything that is in your file can be saved into the format without fear of it being modified or lost. The old SpreadsheetML format didn't support a number of feature like images, charts, objects, etc. So we had to add all those additional things to the format.

The second part (performance) was a really important and challenging one. We wanted to move to an open format so that people could build solutions around our formats. Like many other applications out there, we chose a combination of ZIP and XML to achieve this. We had to write the XML though in such a way that it could be parsed extremely efficiently so that the file open and save experience wouldn't get significantly slower. There have been a number of articles related to this issue, where people have complained about performance in other applications that use XML as their format. Of course we had to keep this in mind with our design, and for those of you who have played around with it I'm sure you've noticed the difference.

While I'm not going to go into a full description on the SpreadsheetML format, I'd at least like to give you a brief introduction. A SpreadsheetML package has a few different pieces that it's comprised of. Let's lok at a basic diagram of the pieces of a spreadsheet:

The main parts I wanted to call out for today are:

  1. "sheet1" – This is the data for the worksheet. Each worksheet is stored as its own XML file within the ZIP package which means you can easily get at your data within a particular sheet without having to parse all the other sheets.
  2. "sharedStrings" – Any string (not number, just string) used in the sheet is actually stored in a separate location. There is a part called the "Shared string table" that stores all the strings used in the files. So, if you have a column called "states", and "Washington" appears 100 times in the spreadsheet, it will only need to be saved into the file once, and then just referenced.

I think an example might be best to help show what I'm talking about. Let's take a spreadsheet that looks like this:

ID

Num

Resource

1

543

F068BP106B.DWG

2

248

F068BP106B.DWG

 

In the Open XML file, there would be an XML file that contained the strings used, that would look like this:

Shared String Table

<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
  <si>
    <t>ID</t>
  </si>
  <si>
    <t>Num</t>
  </si>
  <si>
    <t>Resource</t>
  </si>
  <si>
    <t>F068BP106B.DWG</t>
  </si>
</sst>

Then, in the main sheet, there would be cell values, and pointers into the string table wherever a string occurs:

Sheet1

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
 
<sheetData>
   
<row>
     
<c t="s">
       
<v>0</v> 
     
</c> 
     
<c t="s"> 
       
<v>1</v> 
     
</c> 
     
<c t="s"> 
       
<v>2</v> 
     
</c> 
    
</row> 
    
<row> 
     
<c> 
       
<v>1</v> 
     
</c> 
     
<c> 
        <v>0</v> 
      </c> 
      <c t="s"> 
        <v>3</v> 
      </c>
    </row> 
    <row> 
      <c> 
        <v>2</v> 
      </c> 
      <c> 
        <v>0</v> 
      </c> 
      <c t="s"> 
        <v>3</v> 
      </c>
    </row>
  </sheetData>
</worksheet>

Notice that in the first row, each cell has the attribute t="s" which means it's a string. Then, the value is interpreted as in index into the string table, rather than an actual number value. In the 2nd and 3rd rows, the first two cells are interpreted as numbers, so they don't have the t="s" attribute, and the values are actual values.

This may seem a bit complex, but remember that while this format was designed for developers to be able to use, it we couldn't take the hit that comes with making it completely intuitive. Believe me, as a developer, I would have loved to make the formats more verbose and straight forward, but that would have meant that everyone else opening the files would have to suffer for it. If the example above was a more complex set of data with a number of separate worksheets, each with a few thousand rows, you can imagine how quickly the savings of the string table and terse tag names would add up. I had a couple posts back in the summer talking about some other basic things we do to make sure that the formats are quick and efficient.

This tradeoff of who you design around and how you way ease of use versus efficiency is something folks have to look at every day when they design products. Whether it's an API, a user interface, or a file format, you need to decide which target user you are going to give more weight to when you make your design decisions. We had to give more weight to the end user, and instead require a bit more knowledge from the developer. That's why the Ecma documentation is so important. We need to make sure that the format is documented 100% and there are no barrier to interoperability. The great group of people we have on TC45 are really helping a lot here. As I said last week, the Novell guys have already built some working code that allows Gnumeric to open and save Spreadsheet files in the Open XML format. I'm sure we'll see more and more implementations as we provide better documentation and get closer to a complete standard. It's really exciting! That's one of the great things we'll see more and more of up on the openxmldeveloper.org site.

-Brian