Erika Ehrli - Adventures with Office Products & Technologies
MSDN & TechNet: Releasing Office, SharePoint, Exchange & Lync Centers and content for developers and IT professionals.

Open XML File Formats: What is it, and how can I get started?


While I was at Tech Ed, a lot of people were interested in finding a way to generate documents programmatically without Interop. Some of the business scenarios involved generating over 5,000 documents, and some IT professionals wanted to find the best option. A great option for this business need is the Open XML File Formats.

Some people have been following the news and are ahead of most of us, already building solutions that generate documents using the Open XML File Formats. Others are not familiar with this technology and want to learn more, so here is a quick introduction covering what it is and how you can get started. I have to warn you that this is going to be a long blog entry, but I promise it's worth the read.

What is it?

The new formats improve file and data management, data recovery, and interoperability with line-of-business systems. They extend what is possible with the binary files of earlier versions. Any application that supports XML can access and work with data in the new file format. The application does not need to be part of the Microsoft Office system or even a Microsoft product. Users can also use standard transformations to extract or repurpose the data. In addition, security concerns are drastically reduced because the information is stored in XML, which is essentially plain text. Thus, the data can pass through corporate firewalls without hindrance.

The new Open XML File Formats take advantage of the Open Packaging Conventions, which describe the method for packaging information in a file format and describe metadata, parts, and relationships. The new Open XML Format, with a few minor exceptions, is written entirely in XML and is contained in a .zip file. This creates significant advantages over the old binary file format:

  • The file size is much smaller because of ZIP compression.
  • The file is much more robust because it is broken up into different document parts. Should one part become damaged (for example, a part describing headers), the rest of the document remains intact and still opens successfully.
  • The file is easier to work with programmatically because of the new structure. For example, it is easier to access embedded content, such as images, because they are stored in their native format inside the file.
  • Custom XML is also easier to work with because it is stored in its own part, separate from the XML that describes the bulk of a document.

The old binary file format was created when priorities in software differed from the priorities of today. Back then, the ability to transfer a Word document from computer to computer using a floppy disc ranked very high, and the tight structure of a binary format worked well. As software advanced, other priorities became clear, such as the ability to write code against a file format and make it as robust as possible. XML is a clear solution.

Microsoft began to address this issue in previous versions of Microsoft Office by introducing SpreadsheetML and WordprocessingML. However, only now, with the 2007 release of Microsoft Office, have the goals that were conceived as far back as 1999 been accomplished fully. By including the XML File Format inside a ZIP container, the benefit of a small compressed file format is also realized. Excel 2007 and PowerPoint 2007 share this new file format technology, described by the Open Packaging Conventions. Together, the shared formats are called the Microsoft Office Open XML Formats. The new Word 2007 XML Format is the default file format, although the old binary file format is still available in the 2007 Microsoft Office system.

An easy way to look inside the new file format is to save a Word 2007 document in the new default format and then rename the file with a .zip extension. By double-clicking the renamed file, you can open and look at its contents. Inside the file, you can see the document parts that make up the file, along with the relationships that describe how the parts interact with one another. However, it is important to note that, with a few exceptions defined within the Open Packaging Conventions, the actual file directory structure is arbitrary. The relationships of the files within the package, not the file structure, are what determine file validity. You can rearrange and rename the parts of a Word 2007 file inside its .zip container if you update the relationships properly so that the document parts continue to relate to one another as designed. If the relationships are accurate, the file opens without error. The initial file structure in a Word 2007 file is simply the default structure created by Word. This default structure enables developers to determine the composition of Word 2007 files easily.

Contents of a sample document in a ZIP file
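
The same inspection can be done from code instead of renaming the file by hand. The following is only a minimal sketch using Python's standard zipfile module, not the tooling discussed in this post. To stay self-contained it builds a tiny stand-in package in memory; the helper names (build_sample_package, list_parts) and the placeholder XML are assumptions for illustration, and the stand-in is not a complete Word document.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny in-memory stand-in for a .docx package (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Placeholder XML only; a real package carries full content in each part.
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list:
    """Return the names of all entries stored in the ZIP container."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


parts = list_parts(build_sample_package())
for name in parts:
    print(name)
```

To try this on a real document, read the saved file's bytes and pass them to list_parts; no rename to .zip is needed, because the container is a ZIP file regardless of its extension.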

How can I get started?

The easiest way to modify a Word 2007 XML file programmatically is to use the classes in the System.IO.Packaging namespace, available in the Microsoft® Windows® Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime Components. Using this technology, you can easily update header and footer files programmatically across numerous Word 2007 documents stored on a server.
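
To make the idea of swapping a single part concrete, here is a rough sketch in Python's standard library rather than System.IO.Packaging. It copies every entry of a package into a new ZIP container and substitutes the content of one part along the way. The function name replace_part, the part name word/header1.xml, and the placeholder XML are all assumptions for illustration; a real header part needs valid WordprocessingML and correct relationships.

```python
import io
import zipfile


def replace_part(package_bytes: bytes, part_name: str, new_content: bytes) -> bytes:
    """Return a copy of the package in which one part's content is replaced."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source:
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
            for item in source.infolist():
                # Copy every entry unchanged except the part being replaced.
                if item.filename == part_name:
                    target.writestr(item.filename, new_content)
                else:
                    target.writestr(item.filename, source.read(item.filename))
    return output.getvalue()


# Build a small stand-in package (hypothetical content, not a complete document).
sample_buffer = io.BytesIO()
with zipfile.ZipFile(sample_buffer, "w", zipfile.ZIP_DEFLATED) as sample:
    sample.writestr("word/document.xml", "<document/>")
    sample.writestr("word/header1.xml", "<hdr>Old header</hdr>")

updated = replace_part(sample_buffer.getvalue(), "word/header1.xml", b"<hdr>New header</hdr>")

with zipfile.ZipFile(io.BytesIO(updated)) as check:
    print(check.read("word/header1.xml").decode("utf-8"))
```

Running the same function over a folder of documents is a plain loop over file names, which is what makes this approach attractive for server-side batch updates. Writing a fresh container is a design choice forced by the ZIP format, which does not support editing an entry in place.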

We recently published some resources that might interest you if you are trying to learn more about the Open XML File Formats:

Open XML Snippets

  • Open XML: Get OfficeDocument Part: Given an Open XML file, retrieve the part with the http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument relationship type.
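
The lookup this snippet describes can be sketched as follows. This is not the published snippet, only an approximation in Python's standard library: it reads the package-level relationships part (_rels/.rels) and returns the target of the relationship whose type is the officeDocument URI quoted above. The function name get_office_document_part and the sample package are assumptions for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts, as defined by the Open Packaging Conventions.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type quoted in the snippet description above.
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def get_office_document_part(package_bytes: bytes):
    """Return the name of the main document part, or None if no relationship matches."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("_rels/.rels"))
        for relationship in root.findall("{%s}Relationship" % RELS_NS):
            if relationship.get("Type") == OFFICE_DOCUMENT:
                # Targets in the root relationships part are relative to the package root.
                return relationship.get("Target").lstrip("/")
    return None


# Build a small stand-in package (hypothetical content, not a complete document).
rels_xml = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId1" Type="%s" Target="word/document.xml"/>'
    "</Relationships>" % (RELS_NS, OFFICE_DOCUMENT)
)
sample_buffer = io.BytesIO()
with zipfile.ZipFile(sample_buffer, "w") as sample:
    sample.writestr("_rels/.rels", rels_xml)
    sample.writestr("word/document.xml", "<document/>")

main_part = get_office_document_part(sample_buffer.getvalue())
print(main_part)
```

Following the relationship rather than hard-coding word/document.xml matters because the directory layout inside the package is arbitrary; only the relationships say which part is the main document.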

Microsoft Office Excel Snippets

  • Excel: Add Custom UI: This snippet adds a custom UI Ribbon part to a given workbook.
  • Excel: Delete Comments by a specific User: This snippet deletes all comments from a given user from a given workbook.
  • Excel: Delete Worksheet: This snippet deletes the specified worksheet from within a given workbook and resets the selected worksheet to the next one on the list. Returns true if successful, false if failure.
  • Excel: Delete Excel 4.0 Macro sheets: This snippet deletes all the Excel 4.0 Macro (XLM) sheets from a given workbook.
  • Excel: Retrieve hidden rows or columns: This snippet returns a list of hidden row numbers or column names from a given workbook and worksheet.
  • Excel: Export Chart: Given a workbook and title of a chart, this snippet exports the chart as a Chart (.crtx) file.
  • Excel: Get Cell Value: Given a workbook, worksheet and cell address, this snippet returns the value of the cell as a string.
  • Excel: Get Comments as XML: Given a workbook, this snippet returns all the comments as an XmlDocument.
  • Excel: Get Hidden Worksheets: This snippet returns a list containing the name and type of all hidden sheets in a given workbook.
  • Excel: Get Worksheet Information: This snippet returns a list containing the name and type of all sheets in a given workbook.
  • Excel: Get Cell for Reading: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to retrieve its contents. The cell must exist for the function to find it.
  • Excel: Get Cell for Writing: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to set its value. If the cell does not exist, the snippet creates it.
  • Excel: Insert Custom XML: Given a workbook and a custom XML value, this snippet inserts the custom XML into the workbook.
  • Excel: Insert Header or Footer: Given a workbook, worksheet and text to insert and a header or footer type, this snippet inserts the header or footer with the given text into the worksheet.
  • Excel: Insert a Numeric Value into a Cell: Given a workbook, worksheet, cell address and numeric value, this snippet inserts the value into the cell.
  • Excel: Insert a String Value into a Cell: Given a workbook, worksheet, cell address and string value, this snippet inserts the value into the cell.
  • Excel: Set Recalc Option: Given a workbook and a RecalcOption, this snippet sets the recalculation property to the new option.

Microsoft Office PowerPoint Snippets

  • PowerPoint: Delete Comments by User: Given a presentation and a user name, this snippet deletes all comments by that user.
  • PowerPoint: Delete Slide by Title: Given a presentation and slide title, this snippet deletes the first instance of a slide with that title (titles are not unique).
  • PowerPoint: Get Slide Count: This snippet returns the number of slides in a given presentation.
  • PowerPoint: Get Slide Titles: Given a presentation, this snippet returns a list of the slide titles in the order presented.
  • PowerPoint: Modify Slide Title: Given a presentation, old slide title, and new slide title, this snippet changes the first instance of a slide with the given title to the new value. The snippet returns true if successful, false if not successful.
  • PowerPoint: Reorder Slides: Given a presentation, an original position, and a new position, attempt to place the slide from the original position into the new position within the deck. If the original position is outside the range of the number of slides in the deck, use the last slide. If the new position is outside the range of slides in the deck, put the selected slide at the end of the deck. The snippet returns the location where the slide was placed, or -1 on failure.
  • PowerPoint: Replace Image: Given a presentation, slide title and image file, this snippet replaces the first image on the slide with the given image.
  • PowerPoint: Retrieve Slide Location by Title: Given a presentation and a slide title, this snippet returns the 0-based location of the first slide with a matching title.

Microsoft Office Word Snippets

  • Word: Accept Revisions: Given a document and an author name, this snippet accepts the revisions by that author.
  • Word: Add Header: Given a document and a stream containing valid header content, add the stream content as a header in the document.
  • Word: Convert DOCM to DOCX: Given a macro-enabled document (.docm), this snippet removes the VBA project and converts the file to a macro-free Word Document (.docx).
  • Word: Remove Comments: Given a Word Document, this snippet removes all the comments.
  • Word: Remove Headers and Footers: This snippet removes all headers and footers from a given Word document.
  • Word: Remove Hidden Text: This snippet removes any hidden text in a given document.
  • Word: Replace Style: Given a document and valid header content, this snippet adds the content as a header in the document.
  • Word: Retrieve Application Property: Given a document name and an app property, this snippet returns the value of the property.
  • Word: Retrieve Core Property: Given a document name and a core property, this snippet returns the value of the property.
  • Word: Retrieve Custom Property: Given a document name and a custom property, this snippet returns the value of the property.
  • Word: Retrieve Table of Contents: Given a document name, this snippet returns a table of contents as an XmlDocument.
  • Word: Set Application Property: This snippet sets a property’s value given a document name, application property and value. The snippet returns the old value if successful.
  • Word: Set Core Property: Given a document name, a core property, and property value, this snippet sets the property value.
  • Word: Set Custom Property: Given a document name, a custom property, and a value, this snippet sets the property’s value. If the property does not exist, create it. Returns true if successful, false if not.
  • Word: Set Print Orientation: Given a document name, this snippet sets the print orientation for all sections in the document.

Download them here!

Finally, if you want to stay current with new resources to work with the Open XML File Formats, go to the XML in Office Developer Portal. We launched this portal recently to create a special section of the MSDN Office Developer Center where you will find bloggers, technical articles, code samples, developer documentation, and multimedia presentations on working with XML in Office.

Happy Office XML programming!


  • May I point out a few objections to the article?

    I think that you are deliberately omitting to mention automation even though it's still in the 2007 timeframe the best, most powerful, quickest and safest way to target the new xml file formats.

    I think there is an ongoing confusion between the target file format and the programming language and SDKs used to target this new file format. That the new file format is "XML by default" does not mean you are required to use an XML programming stack to create/update it. Quite the opposite, if the XML file format serializes values resulting from engine calculations (Excel formulas, Word repaginations, ...), then I wonder how the ease and simplicity of an XML programming stack is going to do any good to update those values without an appropriate running instance of Excel, Word or Powerpoint doing the bulk of the work for you, just as it always did so far. And if you maintain all of that with the XML programming stack plus all your custom code, then you are rewriting Excel, Word and Powerpoint.

    Maybe there is something obvious I have missed, though. But if you take some of the code snippets mentioned in the article, then you have to say that some of these put the document in either an unknown state or a plain corrupt state. Why? Because if you change a value somewhere, you also need to update all the places that were referencing it. So to me, telling everyone to use an XML programming stack to make any significant changes in an Office document is going to result in interesting discussions...

    Not everything is bad though, there are scenarios where you can for instance replace a part in a file and be done with it, exemplifying a decoupling. What about documenting those scenarios? Here is one, change the chart part of an Excel document, by replacing 2D bars with 3D bars.

    Automation is still the way to go. If you'd like to update a value in a document, it's one or a couple of lines of automation code, not hundreds (see the code snippets). If you add this up when you are really making big changes, or are generating a complex document, then you end up with perhaps tens of thousands of lines of code, with all the consequences attached to it. Interesting...

    There are a number of inaccuracies IMHO in the article. Let me point out a few ones,

    - "The file is much more robust because it is broken up into different document parts. Should one part become damaged (for example, a part describing headers), the rest of the document remains intact and still opens successfully." : I don't think so. I think opening up the parts lead to an entire new class of document corruptions. Since what you have done is basically let anyone corrupt anything within the document. Don't forget that so far Office developers would either use automation (directly or through VSTO interop), or a third-party : developers would never touch the file themselves.

    - "The file is easier to work with programmatically" : see the length of the code snippets, and compare that to automation code. Automation code is several orders of magnitude shorter, safer, intuitive, and so on. An Office developer using automation does not have to care about the underlying semantics, references and so on. It's not just "I like Automation better because it's my taste", it's really what it takes to make consistent changes to a file : in most cases, you need the engines to keep everything validated and consistent.

    - "Custom XML is also easier to work with" : Custom XML is a feature of Office 2003, and was already stored in a separate stream.

    - "The new Open XML Format, with a few minor exceptions, is written entirely in XML and is contained in a .zip file." : that's not quite true. The document implementation follows the specs, among which the open packaging conventions (whatever the "open" word means). And what this means is that the internals must adhere to a strict structure of parts, relations and relationships, where some of the parts may end up being XML. That's not quite the same than saying the document is mostly XML, since this would imply you are passing a real XML file (XML MIME content type) which could have had some binary data islands in some section(s). Also, there is at least one known case where OLE is involved as well (password-protected document).

    - "Users can also use standard transformations to extract or repurpose the data." : in a number of cases, you need to understand the data you deal with. What if you are dealing with an Office document where one or more VBA macros are attached to some events triggered when a value gets updated? Again, if you don't make consistent changes throughout the entire document, maintaining indexes and references, updating the values and parts accordingly, then what are you doing exactly other than dirtying the document and leaving it in an unknown state? By unknown state, I mean a document that is not ready for printing, for instance. Or a plain corruption, by the way.

    - "In addition, security concerns are drastically reduced because the information is stored in XML, which is essentially plain text. Thus, the data can pass through corporate firewalls without hindrance." : now I am confused, do you mean you are passing XML through the wire? Last time I checked, .docx/.xlsx/.pptx are binary file formats just like .doc/.xls/.ppt. It's only some internal parts of it which may end up being XML : if you are passing those documents over the wire, you are still left to the classic MIME content type exclusion. For instance, if you attach such document to an email, it may be blocked by default as per the enterprise policies.

    In the end, I am not saying the XML programming stack being suggested (WINFX open packaging API, plus whatever XML parser/serializer) is not useful in 100% of cases. But I think the confusion needs to be cleared up sooner rather than later. I think that, at the moment, you are setting expectations way too high.
  • Hi Mike,

    Thanks for your comment, probably the longest comment I have seen since I started blogging! I have seen your comments on Brian Jones’ blog. I have seen that you have been following the news for a while…
    http://blogs.msdn.com/brian_jones/archive/2006/06/02/613702.aspx

    The purpose of this blog entry is to let developers know that there IS a technology that can help you to read/write/modify Office (Word, Excel, PowerPoint) files without automation. Also, the purpose of this blog entry is to point people to resources related with the Open XML File Formats and to let them know what the File Formats are.

    My intention with this blog entry was not to say that working with the File Formats is better than using automation. It’s hard to say that automation is better than the Open XML File Formats or the other way around. Any developer that has worked with automation and WordML/SpreadsheetML can tell you the right answer is: “It depends on what you are trying to accomplish.” This is because there are scenarios where using automation is not the best option. Developers should know what options are on the menu and the pros/cons of each one in order to make the right decision when it comes to selecting a technology.

    Why are people interested in not using automation? Not that they dislike it of course. I agree with you that it’s very easy to use and that it’s the bread and butter of all Office developers. However, there are scenarios when you know that automation is not the best option. Two examples I can quickly think about (and there are more, but I am doing my best to keep this answer short):

    1. Server-side scenarios: If you have a Web application and you need to programmatically generate Word documents, or export data to an Excel Spreadsheet. I think the best option here would be to avoid the use of automation. It’s not a good idea to create an instance of a Word or Excel application on a Web server. Open XML File Formats would be a wise option to use for this specific scenario.
    2. Massive document generation: If you need to generate a single document (client-side), automation might help. But try generating 1,000 documents using automation! You will see the performance is not going to be nice. Automation makes intensive use of memory, and even if you turn some settings off to optimize performance you are still not going to be happy with the time and resources needed to accomplish the mission. Now try serializing the 1,000 documents using XML (use an XmlTextWriter) and you will see it’s much better. Also, if you use the XML File Formats, documents are more compact, so they are easier to move across the wire. Definitely an option that application developers and solution architects would consider.

    About getting corrupted files, I don’t think that’s going to be the outcome. I think one of the best things about the Open XML File Formats is that they allow documents to be loosely coupled. You have document parts that you can update without changing the entire document, and the result is not corruption. It’s just a great way to isolate document parts and update them easily. I have worked with the Office 2003 WordML and SpreadsheetML formats, where you have everything in a single file, whereas the new File Formats isolate document parts, and I can tell you it’s the dream of software engineers to work with loosely coupled applications.

    "The new Open XML Format, WITH A FEW MINOR EXCEPTIONS, is written entirely in XML and is contained in a .zip file." :  This is true, see, for OLE objects (embedded objects like Visio diagrams) or images you won’t have an XML serialization, but for everything else you have an XML document. Just do the exercise of saving a Word file as Open XML and you will see what I mean.

    If I am setting high expectations for this technology, well, that’s because based on my experience working with Office, the current Office XML schemas, and automation, I find this technology to be AMAZING and I think it provides a great additional option for developers who for very specific reasons need to consider more options.

  • Thanks for your answers.

    "The purpose of this blog entry is to let developers know that there IS a technology that can help you to read/write/modify Office (Word, Excel, PowerPoint) files without automation." : I know you are evangelizing the technology, and don't expect you to bark on it.

    "It’s hard to say that automation is better than the Open XML File Formats or the other way around." : Hmmm, this is one of the things I don't get. I think it's pretty obvious that automation is not only better, it's shorter and safer. Let me give a few examples. Let's say you'd like to delete a slide, or a worksheet. In automation code, this is really one line of code, period. How much with direct access to the internals of the file formats? And I have kept the best part of the argument : those code snippets above actually corrupt the document (this is a comment I made to Kevin Boske in his blog a few days ago).

    If you are an Office developer, you can use automation to find your way out. And, you can rely on a third-party if you'd rather not have a running Office instance. There are tons out there that have managed to reverse engineer the file formats and expose it through an Automation-equivalent API without REQUIRING to deal with the guts of the internals (which is unfortunately what this new XML is about).

    The only reason to use an XML programming stack, and dare to directly access the internals of the file format would be if there was a good reason to do so, for instance if relying on XML xpath, you end up with a very robust, short and elegant code to create/update a document. Can you give some examples of that? None of the code snippets above can beat their Automation's equivalent in robustness, conciseness and elegance. Also don't forget that the built-in Macro recorder has been providing Office developers a great way to discover Automation code snippets without having to actually learn the underlying object model. Tell me about it with the 4000+ page TC45 specs (ironically enough, those specs don't cover everything you need to know, far from that : an example is password-protected documents).

    The core of the argument is how much better the XML schemas are compared to a binary file format serialized as XML (in other words, binary bits with angle brackets). I have said in the previous comment that there could be a scenario where you could replace a chart part with another and be done with it. Even this positive view is actually oversimplifying things a bit : if you use custom error bars in charts, then I believe there are references to other parts, and all of a sudden you can't do a simple part replacement anymore.

    Where are the XML-based examples where Automation is much longer, less robust? Note that I am not saying "robust" as a reliable .exe, I am talking robust as something that keeps the document consistent, validated, ready for printing, ready for sharing, ...

    "This is true, see, for OLE objects (embedded objects like Visio diagrams) or images you won’t have an XML serialization, but for everything else you have an XML document." : I think there is an over-statement here. The way everyone understands a sentence like this is as in Word 2003's XML, i.e. you have a single XML file, with some data islands in it. But the new Word 2007's XML (for instance), XML is scattered in pieces, so it's much harder to work with it, especially since you as a developer need to maintain the indexes and references yourself. I have also pointed out to Kevin that the WinFX packaging API exposes the guts of parts, not abstracting those enough to avoid mistakes.

    And, dealing with XML, where is the validation API? How can you ensure that, after you have used one of the code snippets above, the resulting document is valid without adding all the clutter of a separate yet-to-be-mentioned XML validation stack combined with all of the required semantics of the internals of the file formats ?

    All what this is calling for is for third-parties to abstract this away, because it's my belief (from my experience as an ISV) that it's too hard, time-consuming, dangerous and complicated to deal with directly. (third-parties provide a solution to the server-side scenarios you mention above : in server reliability and speed).

    I think there is a disturbing confusion between XML as a serialization format (binary bits with angle brackets) and XML as a true programming model (strong decoupling, needs not much knowledge of core specs,...). We'll see how it goes.

  • I just spotted a great post on the TechEd Bloggers feed from Erika Ehrli - "Open XML File Formats: What...
  • An important announcement was made by Microsoft today, which was a pleasant surprise for us too:...
  • Yeah, I know long time no B(log). I was real busy those days, but don’t worry I will tell you what was...
  • Remember my last post about ODF?
    Things go quite fast in our industry. Indeed, Microsoft has decided...
  • Does anybody know where I can get the detailed reference for the Open XML file format?
    I want to write a program to read and write Open XML files directly.
  • In fact, I have a PDF version of the reference: Office Open XML Document Interchange Specification / Ecma TC45 Working Draft 1.3 / Public Distribution May 2006.
    But I want a .chm version of it.
    If anybody knows where it is, please give me a link. Thanks a lot!

  • Microsoft PacWest SharePoint Server Newsletter – July 2006

     Update on Download Availability of...
  • As far as I have seen, it is not enough to change the custom property value in the docProps\Custom.xml file because the original value of the custom property inserted in the document is still there (in the word\document.xml file). I had to manually update all field codes in the document using Word (F9) before the new custom property value will be seen.

    Comments?
  • whui1978, you can find the ECMA spec here:

    http://www.ecma-international.org/activities/office%20open%20xml%20formats/tc45_fd_xml_docform.zip

    I haven't seen a chm.

    Lars, read my August blog entries; you will find samples on how to read/write Office XML File Formats. I am writing a new document and I change the custom.xml file programmatically without changing anything else. I only replace the file and it works, so I am not sure why you are experiencing this behavior.
  • I have to convert my ppt file to XML using C#, which I was able to do using System.IO.Packaging. Now I have to reassemble the package and retrieve my original ppt. Please help me find a way of doing it.

  • How can I get my files back once I save them in the Open XML format?

