In this blog post, I’m going to cover some of the details of how we approached the challenges of testing our ODF 1.1 implementation that was released in Office 2007 SP2.
Adding support for a new document format such as ODF to Office is a large and complex project. Office has a very broad range of functionality, and we had to map that functionality to the structures defined in ODF. This mapping then needed to be rigorously tested, in isolation and also in rich documents that reflect typical usage of various combinations of features, to assure that our generated documents are conformant to the specification and to maximize interoperability with other implementations.
When we began work on our ODF 1.1 implementation, we started by developing a set of high-level guiding principles that we would follow. I covered those in a blog post last year, as well as a recent post that explained how we see the relationship between standards and interoperability.
After we had reached agreement on these principles, the various feature teams began designing the details. A “feature team” here at Microsoft is made up of three groups of people: program managers (PMs), developers, and testers. In broad simple terms, PMs are responsible for writing down the specifications, developers are responsible for implementing those specifications, and testers are responsible for verifying that everything works as intended. Since there was a specification for ODF in hand already, the main job of the feature team was to write down the details of how we would implement it. In this post I’ll be focusing on the work of the testers, although inevitably that will include some discussion of the work of the PMs and developers, because the three disciplines work very closely together in an iterative manner.
Most of the people who planned and executed our ODF implementation are members of the same teams that are responsible for other aspects of the design, development and testing of the Office clients. We created an “ODF virtual-team” that included specific individuals from each of the relevant product teams – primarily Word, Excel, PowerPoint, and graphics – and the v-team approached the project with the same management structure and business processes that we use for other work on Office. Attendees of the DII workshop in Redmond last summer had a chance to meet several key members of the ODF v-team, who gave presentations and participated in the roundtable discussions at that event.
In addition to these people in Redmond, we have other teams that we can call on for projects like this one, and for the testing work on our ODF implementation we pulled in people from the Office group in four countries, as well as people who worked on Office years ago but have moved on to other roles (for their expertise in older features that we wanted to verify are supported correctly in our ODF implementation).
Mapping Between ODF and Open XML
Office’s internal representation of documents is very closely aligned with the Open XML formats, so one of the first steps in planning our ODF implementation was to do a detailed mapping between the Open XML structures that Office already supported and the ODF structures that we would be writing to and reading from ODF 1.1 documents.
The PMs had primary responsibility for this, and they created sets of spreadsheets to capture the mappings between every ODF and Open XML element and attribute. This mapping needed to be defined in both directions: OXML->ODF for File/Save operations, and ODF->OXML for File/Open operations.
As a simple example of how that worked, here is part of the spreadsheet for the concept of bold text, as mapped from OXML to ODF:
This excerpt is just a subset of what was captured in the mapping; the PMs also identified required/optional status, default values, and other information.
And here’s the converse mapping for bold text, going from ODF to OXML:
I’ve used a very simple example here, and yet as you can see there are many details involved. There were thousands of details like this in the mapping spreadsheets, and collectively these spreadsheets served two roles:
The process of creating the mapping spreadsheets is interesting unto itself, due to the many places where ODF and Open XML had different approaches or different capabilities. I’ll cover the mapping spreadsheets in more detail in a future blog post.
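To make the two directions of the bold-text mapping concrete, here is a minimal sketch in Python. It assumes only what the two specifications say about bold: WordprocessingML expresses it as a `w:b` element inside `w:rPr`, and ODF expresses it as an `fo:font-weight` attribute on `style:text-properties`. The function names are hypothetical, the rule that numeric weights of 600 and above count as bold is an assumption made for illustration, and none of this is meant to represent how Office itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the two standards.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def oxml_bold_to_odf(rpr):
    """File/Save direction: w:rPr -> style:text-properties (bold only)."""
    props = ET.Element(f"{{{STYLE_NS}}}text-properties")
    b = rpr.find(f"{{{W_NS}}}b")
    if b is not None:
        # A bare <w:b/> means "on"; w:val can switch it off explicitly.
        val = b.get(f"{{{W_NS}}}val", "true")
        is_bold = val not in ("false", "0", "off")
        props.set(f"{{{FO_NS}}}font-weight", "bold" if is_bold else "normal")
    return props


def odf_bold_to_oxml(props):
    """File/Open direction: style:text-properties -> w:rPr (bold only)."""
    rpr = ET.Element(f"{{{W_NS}}}rPr")
    weight = props.get(f"{{{FO_NS}}}font-weight")
    if weight is not None:
        # Assumption for this sketch: numeric weights >= 600 count as bold.
        is_bold = weight == "bold" or (weight.isdigit() and int(weight) >= 600)
        b = ET.SubElement(rpr, f"{{{W_NS}}}b")
        if not is_bold:
            b.set(f"{{{W_NS}}}val", "false")
    return rpr
```

Even in this tiny case the two directions are not mirror images: ODF allows numeric weights that a simple on/off bold toggle cannot represent, so the ODF->OXML direction has to make a decision that the OXML->ODF direction never faces.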
Test Tools and Test Documents
Like any professional test team, the Office testers have a wide variety of tools they’ve built to help automate their work. Here are a few examples of the tools that were used to test Office’s ODF implementation:
These tools, and others developed by the test teams, all work against large collections of documents. These test documents came from a variety of sources:
Our libraries of test documents are dynamic and constantly growing. As a recent example, we found that the latest Committee Draft of the ODF 1.2 specification uses styles in a way that exposed a bug in Word’s implementation. (Rick Jelliffe has blogged about this bug.) So we’ve added that document to our test library going forward. (We’ve also fixed that bug and tested the fix, which will appear in a future update.)
After the developers had written code to handle the mappings as defined in the spreadsheets (which were essentially the specs for their work), the testers got to work testing this code.
One aspect of testing was the small documents for verifying specific elements and attributes. These were handled in an automated manner using tools such as Trippy and OHarness, as mentioned above.
Another aspect of this testing was the creation of complex “real-world documents” that contained combinations of functionality to test various scenarios that we’ve found typically occur in actual use of Word, Excel, or PowerPoint.
For example, many Excel users create spreadsheet documents that contain a large worksheet of raw data like this one:
… and that data is often summarized in pivot tables and/or formatted reports like these:
The test team would create documents like this one, then manually verify that the document could be saved as either an ODS or XLSX file without change in appearance or functionality. In this particular case, the test team verified that a variety of details were handled the same in Open XML and ODF, including:
As I mentioned earlier, the product teams each have a large corpus of test documents that are used for automated testing of conformance. Binary documents and Open XML documents are opened and then saved as ODF, and each of these documents is validated against the ODF schemas. By analyzing the results of these tests, the testers can identify problems that need to be corrected, and then the tests are re-run.
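The shape of that validate-and-rerun loop can be sketched in a few lines of Python. This is an illustration only, not the Office test harness: the function names are hypothetical, and the default check below is a deliberately weak stand-in that only confirms the package is a readable ZIP whose `content.xml` is well-formed XML. Real conformance testing would plug in a validator for the ODF 1.1 RELAX NG schema in its place.

```python
import xml.etree.ElementTree as ET
import zipfile


def is_well_formed_package(path):
    """Stand-in check (NOT schema validation): the file is a ZIP package
    and its content.xml parses as well-formed XML."""
    try:
        with zipfile.ZipFile(path) as pkg:
            ET.fromstring(pkg.read("content.xml"))
        return True
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return False


def find_failures(paths, validator=is_well_formed_package):
    """Return the documents that fail the given validator.

    The goal of each test pass is for this list to come back empty."""
    return [p for p in paths if not validator(p)]
```

Passing the validator in as a parameter keeps the batch loop unchanged when the weak check is swapped for a real schema validator.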
The goal of this process is simple: to drive the number of non-conformant documents to zero. We reached that goal for the Office 2007 SP2 implementation of ODF, and as of this writing I don’t know of a way to make Word, Excel or PowerPoint write a non-conformant ODF document. It may theoretically be possible to do so – and if anyone happens to come across such a scenario please let me know – but we have verified that the hundreds of thousands of documents in our test libraries can be saved as fully conformant ODF 1.1 files from Office 2007 SP2. By conformant, I mean here fully schema-compliant and also conformant with our reading of the text of the ODF 1.1 spec.
When we add support for a new format, one area that requires intensive testing is security. Does our implementation of the new format create any new security risks that need to be mitigated? Is there any way that an ODF document can be corrupted (deliberately or accidentally) that could cause a security problem? The test teams were responsible for answering these questions.
The key tool used for this aspect of the test plan was Distributed File Fuzzing (DFF). The basic concept is that thousands of documents are corrupted in random ways, and these documents are opened on large numbers of PCs in a distributed environment. Data is collected on the ways in which these corrupted files fail to open, and this data is used to verify that there are not security problems caused by bad error handlers, buffer overruns, integer overflow, or other issues.
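The core idea of corrupting a document in random ways can be shown with a very small Python sketch. This is not the DFF tool itself, whose internals are not described here; the `corrupt` function and its parameters are hypothetical. It simply overwrites a chosen number of randomly selected bytes, using a seed so that any file that triggers a failure can be regenerated exactly.

```python
import random


def corrupt(data: bytes, n_mutations: int, seed: int) -> bytes:
    """Return a copy of data with up to n_mutations random bytes overwritten.

    The seed makes each corrupted variant reproducible, so a variant that
    causes a crash can be rebuilt later for debugging."""
    rng = random.Random(seed)
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    for _ in range(n_mutations):
        pos = rng.randrange(len(buf))   # which byte to damage
        buf[pos] = rng.randrange(256)   # what to replace it with
    return bytes(buf)
```

A fuzzing run would call this with many different seeds per source document, open each result in the application under test, and record how the open attempt failed.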
When issues are found in security testing, the process is the same as in the other types of testing: the testers log bugs, and the developers check whether the problem is in design or implementation, and based on those findings we either modify the design and re-code, or correct the code. The tests are then repeated, and this process continues until the number of open security issues reaches zero.
The final piece of the testing puzzle is interoperability testing: verifying that documents created in Office can be opened in other implementations, and vice versa.
This type of testing is nothing new for the test teams, because we do it every time we add a feature to Office. In the past, we focused primarily on interoperability between various versions of Office, but now that test matrix has been expanded to include the latest versions of major ODF implementations.
To verify interoperability with other ODF implementations, the test teams created documents from scratch in OpenOffice.org and Symphony, and then opened those documents in Office. They also created documents in Office and opened them in the other implementations.
In addition to these types of simple tests, we also wanted to verify that our implementation was not dependent on details of other implementations that aren’t actually standardized in the specification.
A good example of this sort of issue is the question of how parts are named and where they’re stored in the ZIP package that comprises an ODF document. I’ve blogged in the past about this same issue in Open XML – an implementation of the Open XML standard shouldn’t assume that the document start part is word/document.xml, just because Word happens to use that name and location.
In ODF, some of those details are standardized – the start part is always named content.xml, for example – but others are not. So the testers used ODE to manually modify documents that had been created by OpenOffice.org, to change certain details such as the name of the folder containing embedded images. They then opened these documents in Office, to verify that our implementation will be able to interoperate with implementations that have made different design decisions within the range of options that the ODF standard allows.
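The testers made this kind of change by hand, but the same modification can be sketched as a script. The Python below is an illustration under stated assumptions: it assumes the source package keeps its images in a `Pictures/` folder (as OpenOffice.org does), the target folder name `Media/` is arbitrary, and the function name is hypothetical. The reference rewrite is a naive byte replacement that only handles double-quoted attribute values.

```python
import zipfile


def rename_image_folder(src, dst, old="Pictures/", new="Media/"):
    """Copy an ODF package, moving embedded images to a different folder
    and updating references to them in the package's XML parts."""
    old_ref = b'"' + old.encode()
    new_ref = b'"' + new.encode()
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():          # preserves the original entry order
            name = info.filename
            data = zin.read(name)
            if name.startswith(old):
                name = new + name[len(old):]  # move the image itself
            elif name.endswith(".xml"):
                # Covers content.xml, styles.xml and META-INF/manifest.xml.
                data = data.replace(old_ref, new_ref)
            # The mimetype entry is kept uncompressed, as the package rules expect.
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            zout.writestr(name, data, compress_type=method)
```

A document produced this way is still within what the standard allows, so opening it is a quick check that a consumer follows the references in the package rather than assuming one implementation's folder layout.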
As you can see, there are many things to consider when creating and executing a test plan for support of a new document format in Office. At an abstract level, it’s just another test plan – we design, then code, then test, with ongoing revisions to all three as needed to reach our design goals. But the specifics of the ODF implementation test plan were geared toward the details of the ODF standard, as outlined above.
Due to the work our test teams did on the ODF 1.1 implementation in Office 2007 SP2, we are very confident that the implementation we produced adheres to the details of the design we had created, as documented on the implementer notes web site. I realize that some people may disagree with some of the design decisions we made in our implementation, and we welcome constructive debate of those details.
I’m posting this from The Hague, where I will be attending the ODF plugfest today and tomorrow. My colleague Peter Amstein – who led the technical work on our ODF implementation – is also here, and we’re looking forward to learning about how other implementers approach document format interoperability testing, and discussing how we can all work together on ODF interoperability going forward.
You write quite a bit on how you did the mapping between features in OOXML and the corresponding features of ODF.
But what did you do when you came across features that didn't map? As far as I remember, the concept of "anchoring" is fundamentally different between OOXML and ODF. What did you do in these situations where ODF contained a (core) feature that OOXML doesn't?
Good question, Jesper. That's exactly the kind of detail I'm planning to provide in an upcoming post about the details of the mapping process. Stay tuned. :-)
@Jesper: if we take the case of Tracked Changes, there's a limited form of it in ODF 1.1, more limited than Office's, but more complex to implement than OXML's: revisions are removed from the document's flow and kept in a separate branch (require more DOM manipulations than OXML's system), so it has been removed: MS Office 2007 can't support tracked changes outside of the document's flow. Doug made a complete blog post about it a few weeks back (and he and I had a little exchange about the proper place of changes tracking: he believes it should be kept in a document's flow, but I would rather see it kept outside for faster viewing, indexing and repairing - stuff that happens outside of an office suite, see).
If we take the case of an unsupported type of line numbering (as presented by the ODF 1.1 norm in ODF format), it is squashed too - not downgraded to a more 'basic' style, but merely and purely squashed, ignored, whatever. Doug acknowledged it's a bug that is being corrected and should be patched soon. How it will be patched, though, is another story (add support for feature? Downgrade to rougher version? Squash it and add a line to implementer's notes?)
Thanks for the post, Doug.
I would like to know when Microsoft plans to release a fix for the ODF formula handling bugs in Office 2007, especially the lack of square brackets in the XML output (which breaks interoperability with all the other ODF implementations).
Thanks in advance.
@Mitch, our decision to not implement tracked changes had nothing to do with difficulty of implementation. Rather, ODF's current tracked-changes functionality just can't store the types of tracked changes that our users have come to expect. As for whether tracked changes should be stored in the document or not, the ODF spec says that's where to put them.
On the line-numbering issue, I think you may have mistakenly swapped my comments on the page-break question with my comments on that issue, from this thread. I’ll let you know when I have more information on line-numbering.
@Carlos, the square-bracket issue has been discussed at great length on this blog post and the comments that go with it. I refer you to that post.
@Dennis, we wish you were here too. :-)
The lack of square brackets is not a bug.
It is actually valid ODF 1.1.
@Doug: (devil's advocate mode) let's say I open, in Word 2007 SP2, an ODF file with tracked changes that were saved in, say, OpenOffice.org 3.1 with ODF 1.1 compatibility mode enabled.
Since Word supports better change tracking (a superset of ODF's defined functionality), then said changes should be converted into Word's internal representation of tracked changes, so that if I decide to save the file in, say, OOXML, then all the changes would be kept; if I try to save the file back to ODF 1.1, then Word could output a warning 'tracked changes in this document can't be saved to ODF in a reliable manner, this feature will be disabled'.
However, even though Word 2007 SP2 supports a superset of ODF 1.1's change tracking, changes are scrapped on load. Why?
From what I know of XML and DOM manipulation, and based on your post on tracked changes, a change in OOXML is tracked as a child node of the element it takes place in, and kept 'in position', while ODF creates a node including all changes details, moves it to a different branch, and saves this 'new' node along with another node containing the move's data.
If I look at the difficulty, tracking change in OOXML can be summed up with a child node creation (parameters include author, date, type of modification), while ODF's tracking change includes an independent node creation, moving it to a revision branch, storing previous location's details (which, in DOM manipulation, can be quite tricky since further changes may cause 'landmarks' to move).
So, while I can perfectly understand why Word won't save tracked changes into an ODF 1.1 file due to ODF tracked changes being a subset of Word's, I must wonder why importing them can't be done:
- Word supports the feature
- it's not technical difficulty that prevented the feature from being supported at least on import
- since ODF is not MSO's default format, and MSO prompts the user when saving to a different format may cause data/feature loss (saving to binary formats, for example), it's not even a quest for coherence,
So why were tracked changes not supported, at least on imports?
About the line numbering issue, I cite your comment on that thread: "As for your line numbering question, it looks like the ODF 1.1 spec uses a line numbering style that Word 2007 SP2 does not support. But I have to spend a little more time digging into that to figure out the details." You're right, you never said it was going to be patched. May I then infer that it won't be corrected?
So, @Jesper: if we take currently unsupported ODF features, they are not, and will not, be supported in the future, even if MSO supports them in its native format at least on import; they won't degrade gracefully, they'll just be silently destroyed. The best you can expect is fixing bugs in currently implemented features. The list currently includes (but is not limited to) tables in presentations, tracked changes, line numbering.
Mitch, change tracking simply doesn't work in any ODF implementation, as I've already demonstrated. You're correct that we decided not to import broken change tracking.
Regarding line numbering, no, I don't think it's fair for you to infer anything from my statement "I have to spend a little more time digging into that to figure out the details" other than exactly what it says. I'm on the road for another two weeks, so you'll just have to be patient. I may not get to it the first day I'm back in the office, either.
As for tables in presentations, they don't even exist in ODF 1.1. Why in the world are you pretending that we're "silently destroying" them?
@dmahugh: change tracking doesn't work outside of plain text - that, you demonstrated indeed, and it is described so in the specification. Still, if I track changes on "plain text" on OOo, it's stored inside the file and I can recover them from edit to edit in OOo while K Word displays the document with modifications (deletes and adds) accepted, like Word does. K Word doesn't support stored edits at all - but it, at least, applied all edits and didn't show stray text out of place. Like MS Word does, but on the other hand, Word supports tracked changes...
About the line numbering issue, what's up with it? Have you spent any more time on it? Can we have more details on it? Currently, only OOo and its ilk support them
About tables, if I create an Impress document with a table in it and save as ODF 1.1, I can reopen it in Impress with the table still inside (it's inside a frame tag inside content.xml); if I open the file in K Presenter, I get:
- a complete preview in the file opener (table is here)
- a gray box in the edit field (K Presenter doesn't support tables at all, so it can't display one for editing), probably representing the frame object.
So, although K Presenter doesn't support tables in presentation as a feature and won't save them, I can:
- see them when opening the file,
- see there was something when editing the file.
On the other hand, PowerPoint does support tables, but:
- doesn't show them when opening the file
- doesn't show a place holder for an unsupported object child (or maybe the frame tag itself is dropped?)
So, yes, the table is silently destroyed, while K Office 2.0 RC1, which uses an ODF 1.1 implementation completely independent from "the other" (OOo, Go-OO, NeoOffice, Symphony all use basically the same ODF parser), at least TRIES to display something - even though its feature set is more worthy of Office 95 than 2007.
Finally, ODF 1.0 didn't mention anything about tables in presentations (one could claim they weren't supported, then); ODF 1.1 specified that tables could be found inside another tag (a frame). If you consider that in ODF, a table is a table is a table, embedding a table inside a frame refers to the ODF namespace' main 'table' object, so tables are supported; it's not like in OXML where a Word table is not quite like an Excel table which is not quite like a PowerPoint table. Reference to a table object in the table namespace is thus a reference to the definition of other tables in ODF in general - which are all the same.
So, is it because the 'frame' tag isn't supported in PowerPoint's ODF import/export filter/translator/parser/magic blob that tables aren't supported at all in PowerPoint? Could be. If frames are supported but you consider having a table inside a frame is a standards violation, why then is the frame missing?
Mitch, if you have ideas for interop test scenarios that you feel would be beneficial to the community, it would be helpful if you could submit them to the wiki where everyone else in the community is defining such scenarios: http://plugtest.opendocsociety.org/doku.php
@dmahugh: before I start describing scenarios, some consistency would be nice, and I'll lay it thick here (I'll be very blunt).
Excel saves formulas in ODF with an OXML prefix (it's an extension, allowed by ODF and XML in general); other applications default to outputting these formulas as strings with a prefix (because they don't support it), while Excel will parse and run those formulas. So, if there's a hole in the specification, Excel will fill that hole with its own format.
On the other hand, OpenOffice.org 3.x saves formulas in OpenFormula format (current draft), which other applications default to outputting as strings with a prefix or parse, while Excel... destroys them. Reason given, 'degrading gracefully to text strings destroys visual fidelity' (priority over 'conserve data integrity'). So, for a data organization tool (a spreadsheet), visual fidelity has priority over data integrity. And yes, I consider a formula as actual data: what does 42 mean without the formula? Or, more realistically, E? (note: I used OOo, but I could have used Gnumeric...)
Okay, let's just admit that the Office team has visual integrity in a document as an absolute priority, and won't hesitate to make use of the XML namespace capabilities of ODF to fill a void (that's what Excel seems to point at).
Tables in presentations: stored by existing implementations that have support for them as embedded in a layer tag; indeed, this is ill-defined in the specification. Although current drafts solve this point in a backward-compatible manner, they are destroyed in PowerPoint to 'conserve data integrity: element isn't specifically defined to be present here'. So, a void in the specification must not be extended through the use of ODF's XML namespace capabilities: let's remove the element, visual integrity be damned.
I find two glaring contradictions:
- priorities order is not consistent depending on the application used: visual over data integrity for spreadsheet, data over visual integrity for presentation
- high level priorities are at odds with application's main use: presentation should have higher visual priority, spreadsheet should have higher data integrity priority
So, before I can propose interoperability scenarios (one could be, "what does the competition do to ensure interoperability"), I'll simply ask you to tell me where the logic is in your interoperability decisions:
- if visual integrity's priority is over data integrity's, then Excel should only save values and no formulas in ODF. Then, Powerpoint's behaviour would be normal.
- if data integrity's priority is over visual integrity's, then Powerpoint should be able to open tables when found in ODF presentations. Then, Excel's behaviour (of exporting OXML formulas) would be normal.
- if priorities depend on an application's core use, then Excel should keep unknown formulas as string, and Powerpoint should display either tables or (at the VERY least) an empty layer element.
In fact, the only consistency I can find here is that every contentious decision taken seems to point at (I'll be even more blunt) giving users the shaft.
Now, tracked changes is another barrel of fish: while it wouldn't be very difficult to translate ODF tracked changes into Word's system (itself a superset of ODF's described functionality), you could argue that since the functionality can't be supported on both import and export equally, it ain't supported at all. I'd argue that import and export are functionally independent operations - but that's IMHO.
Now, this could all be a very big misunderstanding: at Microsoft, "interoperability" is a very recent priority; MS developers are kept inside a very closed world (I've known a few personally), and before 2007 'interoperability' was limited to 'how to fulfill the bare minimum of a requirement and give competition the shaft': see POSIX support in NT4 (no graphics, no network support, no access to external APIs), LDAP and ActiveDirectory, Internet Explorer, Java...
On the other hand, and this is probably the biggest difference, other 'document generators' makers have tried for an interoperable format for years now: the first ODF drafts date back to before 2004. Then, although Microsoft led development on XML, competitors used it first:
- OpenOffice.org used an XML-based file format in 1999
- many presentation, markup and style formats were made based on XML: XHTML, MathML, SVG; those often have reference implementations available as independent modules
My suggestion: drop the monolithic way of thinking!
- An import filter doesn't necessarily reflect exactly its complementary export filter
- a format filter should stand on its own, independently (or as close to it as technically feasible) from the application(s) that uses it
- collaboration doesn't mean "look at what we did!" but should be more like "I wanna do this; how did you proceed in that-very-similar-to-this scenario?" or at the very least "look at what we did! I think it's great, but what would you do to make it better?"
A good start, Doug, would have been to publish to the public at large, not a closed number of registered paying customers in NDA programs, a first version of your ODF import/export filters for Office (maybe as an add-on for SP1), with your first implementor's notes, and ask for feedback.
A great library of test documents is one thing, but look: as soon as the final version came out, one guy tried an obvious document your team missed, and found a couple of bugs. That should have been done during alpha tests.
Mitch, you've misrepresented a wide variety of things above, and I feel I've spent enough time responding to your misrepresentations for one week, so I'll just briefly respond to the final one:
> A good start, Doug, would have been to publish to the public at large, not a closed number of registered paying customers in NDA programs, a first version of your ODF import/export filters for Office (maybe as an add-on for SP1), with your first implementor's notes, and ask for feedback.
We've never had an NDA program for "registered paying customers" (as opposed to non-registered non-paying ones, perhaps?) to see our built-in ODF support (there are no "import/export filters" involved). Instead, we did exactly what you suggest -- we had an open-to-the-public event (July 30 of last year) to describe our plans for ODF implementation, and I personally invited the entire ODF TC, sent out individual invitations to many persons in the ODF community, and posted an open invitation to the public on my blog: http://blogs.msdn.com/dmahugh/archive/2008/07/09/dii-workshop-on-sp2-and-odf.aspx Sorry you couldn't make it.
As a general observation, there are many people working very hard to improve ODF interoperability these days, and I stand by my recommendation that you put all of this time and energy to more productive use by joining them in those efforts. Long incoherent rants here, however satisfying they may be to you personally, aren't making any difference and aren't doing anyone any good.