There has been quite a bit of discussion lately in the blogosphere about various approaches to document format interoperability. It’s great to see all of the interest in this topic, and in this post I’d like to outline how we look at interoperability and standards on the Office team. Our approach is based on a few simple concepts:
It’s easy to get bogged down in the details when you start thinking through interoperability issues, so for this post I’m going to focus on a few simple diagrams that illustrate the basics of interoperability. (These diagrams were inspired by a recent blog post by Wouter Van Vugt.)
Interoperability without Standards
First, let’s consider how software interoperability works when it is not standards-based.
Consider the various ways that four applications can share data, as shown in the diagram to the right. There are six connections between these four applications, and each connection can be traversed in either direction, so there are 12 total types of interoperability involved. (For example., Application A can consume a data file produced by Application B, or vice versa.)
As the number of applications increases, this complexity grows rapidly. Double the number of applications to 8 total, and there will be 56 types of interoperability between them:
Let’s go back to the simple case of 4 applications that need to interoperate with one another, and take a look at another factor: software bugs. All complex software has bugs, and some bugs can present significant challenges to interoperability. Let’s consider the case that 3 of the 4 applications have bugs that affect interoperability, as shown in the diagram to the right.
The bugs will need to be addressed when data moves between these applications. Some bugs can present unsolvable roadblocks to interoperability, but for purposes of this discussion let’s assume that every one of these bugs has a workaround. That is, application A can take into account the known bug in application B and either implement the same buggy behavior itself, or try to fix up the problem when working with files that it knows came from application B.
Here’s where those workarounds will need to be implemented:
Note the complexity of this diagram. There are 6 connections between these 4 applications, and everyone one of them has a different set of workarounds for bugs along the path. Furthermore, any given connection may have different issues when data moves in different directions, leading to 12 interoperability scenarios, every one of which presents unique challenges. And what happens if one of the implementers fixes one of their bugs in a new release? That effectively adds yet another node to the diagram, increasing the complexity of the overall problem.
In the real world, interoperability is almost never achieved in this way. Standards-based interoperability is much better approach for everyone involved, whether that standard is an open one such as ODF (IS26300) or Open XML (IS29500), or a de-facto standard set by one popular implementation.
In the world of de-facto standards, one vendor ends up becoming the “reference implementation” that everyone else works to interoperate with. In actual practice, this de-facto standard may or may not even be written down – engineers can often achieve a high degree of interoperability simply by observing the reference implementation and working to follow it.
De-facto standards often (but not always) get written down to become public standards eventually. One simple example of this is the “Edison base” standard for screw-in light bulbs and sockets, which started as a proprietary approach but has long since been standardized by the IEC. In fact this is a much more common way for standards to become successful than the “green field” approach in which the standard is written down first before there are any implementations.
Once a standard becomes open and public, the process for maintaining it and the way that implementers achieve interoperability with one another changes a little.
The core premise of open standards-based interoperability is this: each application implements the published standard as written, and this provides a baseline for delivering interoperability. Standards don’t address all interoperability challenges, but the existence of a standard addresses many of the issues involved, and the other issues can be addressed through standards maintenance, transparency of implementation details, and collaborative interoperability testing.
In the standards-based scenario, the standard itself is the central mechanism for enabling interoperability between implementations:
This diagram is much simpler than the other diagram above that showed 56 possible connections between 8 implementations. The presence of the standard means that there are only 8 connections, and each connection only has to deal with the bugs in a single implementation.
How this all applies to Office 2007 SP2
I covered last summer the set of guiding principles that we used to guide the work we did to support ODF in Office 2007 SP2. These principles were applied in a specific order, and I’d like to revisit the top two guiding principles to explain how they support the view of interoperability that I’ve covered above.
Guiding Principle #1: Adhere to the ODF 1.1 Standard
In order to achieve the level of simplicity shown in the diagram above, the standard itself must be carefully written and implementers need to agree on the importance of adhering to the published version of the standard. That’s why we made “Adhere to the ODF 1.1 Standard” our #1 guiding principle. This is the starting point for enabling interoperability.
Recent independent tests have found that our implementation does in fact adhere to the ODF 1.1 standard, and I hope others will continue to conduct such tests and publish the results.
Guiding Principle #2: Be Predictable
The second guiding principle we followed in our ODF implementation was “Be Predictable.” I’ve described this concept in the past as “doing what an informed user would likely expect,” but I’d like to explain this concept in a little more detail here, because it’s a very important aspect of our approach to interoperability in general.
Being predictable is also known as the principle of least astonishment. The basic concept is that users don’t want to be surprised by inconsistencies and quirks in the software they use, and software designers should strive to minimize or eliminate any such surprises.
There are many ways that this concept comes into play when implementing a document format such as ODF or Open XML. One general category is mapping one set of options to a different set of options, and I used an example of this in the blog post mentioned above:
When OOXML is a superset of ODF, we usually map the OOXML-only constructs to a default ODF value. For example, ODF does not support OOXML’s doubleWave border style, so when we save as ODF we map that style to the default border style.
When OOXML is a superset of ODF, we usually map the OOXML-only constructs to a default ODF value. For example, ODF does not support OOXML’s doubleWave border style, so when we save as ODF we map that style to the default border style.
Our other option in this case would have to turn the text box and the border into a picture. That would have made the border look nearly identical when the user opened the file again, but we felt that users would have been astonished (in a bad way) when they discovered that they could no longer edit the text after saving and reopening the file.
What about Bugs and Deviations?
Of course, the existence of a published standard doesn’t prevent interoperability bugs from occurring. These bugs may include deviations from the requirements of the standard. In addition, they may include different interpretations of ambiguous sections of the standard.
The first step in addressing these sorts of real-world issues is transparency. It’s hard to work around bugs and deviations if you’re not sure what they are, or if you have to resort to guesswork and reverse engineering to locate them.
Our approach to the transparency issue has been to document the details of our implementation through published implementer notes. We’ve done that for our implementations of ODF 1.1 and ECMA-376, and going forward we’ll be doing the same for IS29500 and future versions of ODF when we support them.
The final piece of the puzzle is hands-on testing, to identify areas where implementations need to be adjusted to enable reliable interoperability.
This is where the de-facto standard approach meets the public standard. If the written standard is unclear or allows for multiple approaches to something, but all of the leading implementations have already chosen one particular approach, then it is easy for a new entrant to the field to see how to be interoperable. If other implementers have already chosen diverging approaches however, then it is not so clear what to do. Standards maintainers can help a great deal in this situation by clarifying and improving the written standard, and new implementers may want to wait on implementing that particular feature of the standard until the common approach settles out.
We did a great deal of interoperability testing for our ODF implementation before we released it, both internally and through community events such as the DII workshops. We’ve also worked with other implementers in a 1-on-1 manner, and going forward we’ll be participating in a variety of interoperability events. These are necessary steps in achieving the level of interoperability and predictability that customers expect these days.
In my next post, I’ll cover our testing strategy and methodology in more detail. What else would you like to know about how Office approaches document format interoperability?
Thanks for asking :]
When do you will release a fix for the formulas issues on SP2 (there's a lot of people asking me about it in Brazil... And I mean "important" people).
For information on our approach to formula support in SP2's implementation of ODF 1.1, see these posts:
As covered there, we'll be looking closely at Open Formula when it is approved and published. Personally, I think Open Formula looks very promising, and it will be good to have a published standard for formulas in future versions of ODF.
Did you have any thoughts on the topics covered in the blog post above? I'd be interested in knowing, for example, whether or not you agree that standards conformance is important for enabling standards-based interoperability.
Short answer: I do agree that standards conformance is the foundation for standards-based interoperability. Conformance is not enough, as we are reminded in many different ways, but a standard (evolved and improved as reality demands) is the proper foundation for resolving interoperabilty.
Longer story:In the case of Excel 2007 handling of table-cell formulas, there is a complicated tension and it would be good that it was better understood, whether or not people agree with how is was resolved in Excel 2007 ODF support.
It seems to me that the Be Predictable principle and the Fail Hard principle is in a tug-of-war with the Principle of least astonishment and the Fail Quietly/Softly principle. (That last tractor-puller was a surprise arrival in this match.)
The predictability is great for Excel 2007 ODF spreadsheets brought back into or interchange with Excel 2007 ODF. It also strikes me that there is no question that the Excel approach qualifies under the definition of ODF and is even further supported by thise formulas being drawn from a documented and open international standard. (Bugs we will ignore on both sides of this equation)
On the other hand, there is no question that users of other products are massively surprised by (1) their spreadsheets having their formulas lost when interchanged with Excel 2007 and (2) not being able to handle the formulase received in Excel 2007's ODF output.
Whether this is something that Microsoft is supposed to fix, as in Jomar Silva's view, is not that obvious to me.
I recall in my own experience how hard we worked to avoid solving problems that were not of our own creation. (We avoided dealing with authentication and authorization in one agreement for precisely that reason, rather than attempt yet-another private solution).
Here the problem is that the prevalent ODF spreadsheet implementations tend to use the same private namespace for an unstandardized formula-and-functions scheme. It is unclear whether those implementations conform with each other and whether there is any way to tell.
It's my impression that the Fail Hard decision was the chosen alternative over attempting to accept these other unstandardized formulas and have errors that would be unnoticed in the results or that, when noticed, would be inexplicable and difficult to resolve.
As long as I am creating my own fantasy about the state of affairs, it is also worth noting that when you already have a spreadsheet formula system that you know works, using that in the first Excel ODF support has a certain economy and helps one be confident in that initial implementation. It might even be a demonstration of the least-that-could-possible-work agility principle, and the future will determine whether the "but no simpler" line was crossed.
It is very unclear to me how much it would take to become confident in addressing interoperability with the uses of no-namespace and unstandardized namespaces and whether there is any reason to do that when the long-awaited OpenFormula addition to ODF is expected real-soon-now. Although there are those who think interoperability (with which anointed non-standardized implementation?) should have been a no-brainer, I speculate that the uncertain opportunity cost of pursuing that might have been frightening, especially with OpenFormula lurking in the wings and certain implementations claiming existing support for an unratified ODF 1.2.
I cannot fault the Microsoft approach as incorrect, and it is far too early to declare it to be unsuccessful. I was at the year-ago DII meeting where the guiding principles were announced and their application to spreadsheet formulas described. I applauded the principles and understood the reasoning for formulas. How this would impact various groups of users and non-users (who still want to interoperate) of Office 2007 did not surface in my consciousness. I little intelligence (pun intended) on that level of consideration.
Hi Doug, nice post ( and nice illustrations ).
Can you tell us about the advance of the fix in ODF formula handling in Office 2007 ( you know, the "square-bracket-thing" that prevents any real interoperability between spreadsheets generated by Microsoft(TM) Office 2007 and the Rest Of The World(TM) ODF spreadsheets ).
Thanks in advance.
There is a problem with your diagrams, in how theory maps to reality.
Remember: It is not about applications, it is about file formats.
There may be N (large value of N, >> 5) applications, but regardless of how big N is, the question is, are there any "islands" within which interoperabity (or more specifically, file format implementation compatibility, or standard interpretation, or whatever you want to call it) is "perfect", i.e. known to be good, and for which round-trip open/close preserves the complete contents of the XML encoding of the file?
The question then reduces to, "How many such 'islands' are there?" If the number is small, then the theoretical argument is just that - theoretical. If the number is small, the question is, is it small enough to justify the effort to achieve the compatiblity work, and to do so reliably? (Hint - the answer is: very few islands, like less than 5, and yes, the work is more than justified.)
But all of this ignores alternative approaches, which hit more of the "bullet points" in your Guiding Priniciples.
Are those Principles in order of importance?
If you are informed of an approach which goes further down the list, and in fact hits all but the second last one, would there not be a compelling argument that that would be a better approach?
The gist of the alternative method of preserving everything in an ODF spreadsheet without risking introducing errors (because of different ideas of what 1+2 results in, for instance), is to make the cell contents on imported cells used in foreign namespace formulas read-only (e.g. FOOBAR:=formula(foo,bar) means that "foo" and "bar" are protected from being modified, by default.)
The problem of interpreting formulas is only encountered when formulas are interpreted.
In reading in a spreadsheet, both the formulas AND THE LAST RESULTING VALUES are present. The previous values are by definition "right". If the input to the formula doesn't change, e.g. by being protected against being changed by the user, then the formula NEVER needs to be re-interpreted.
This would mean:
- ODF 1.1 is supported
- the results are predictable (a first for MS ;-))
- user intent is preserved
- visual fidelity is preserved
And, as a bonus, the foreign namespace formulas *and values* survive a round-trip completely intact.
So, what do you think of this proposal?
Do you not anticipate a demand for this sort of functionality in a spreadsheet application, claiming to "do" ODF 1.1?
@Dennis, the tractor-pull analogy is right on. The principles are indeed often in tension/opposition with one another on certain details.
There’s always room for people to debate a given approach is more or less astonishing than another approach, of course. But our Excel program managers felt strongly, based on their long experience with real customer issues, that if you open a spreadsheet someone sends you and see different results from the calculation than the sender saw that would be much worse experience than if you saw the same results they sent, but you could not easily recalculate them. With the current state of ODF 1.1 spreadsheet interop there is just no easy way for someone to know if they are seeing the same calculation results that someone else saw. And the astonishment could be severe if, for example, you accept someone’s bid to remodel your kitchen for $10,000 and then later they tell you that in their favorite spreadsheet application the total came to $15,000.
Thanks for your comment, Franco. Regarding formulas, note the responses to Jomar and Dennis above.
I think it's important that we all remember there is NO published standard for ODF spreadsheet formulas yet. Nor is there any de-facto standard that everyone agrees on. In truth, among all of the other ODF implementations real-world complex spreadsheets don’t interoperate very well either. The only way an implementer can try to make ODF spreadsheet interop work today (and by work I mean not just for trivial cases, but really work reliably for real-world complex spreadsheets) is by the “spaghetti diagram" method, with all of the complexity and risk of bugs that entails. No implementer we know of has attempted that, and the very point of my post was to explain why we don’t think this is a good approach, and why we think the standards-based approach is better.
In the case of spreadsheet formulas, help is on the way -- OpenFormula is under development for use with ODF 1.2. In the meantime, we should not pretend that all was well with spreadsheet interop before SP2 came along. And (in my opinion) we would be collectively better off to spend our energies on solving the problem instead of complaining about it.
FYI, I’d like to keep this thread on-topic around the concepts covered in this post, as outlined in my comment policy. There are already 118 comments on the two threads about formulas that I linked to above, so I’ll only be letting more comments on that topic appear here if they truly add something new to the discussion that has not already come up in those threads.
I appreciate the post, very good, because it raises these aspects which are often overlooked. Visually I would rather frame it in terms of convergence, a spiral.
You stressed the need to keep a strong compatibility with legacy formats in DIS 29500. So how does your interoperability process function when the competitor in your diagram above is your past product? How is it possible to "get it right" rather than to support 1983 legacy bugs? Y2K was as you know a strict standard compliance bug.
A format lock-in often apllies not only to the implementation but the "corpus of existing documents". The deliberate choice to implement formula totally different from competitors ("Nor is there any de-facto standard that everyone agrees on."), what will that imply for the future: No openformula because the MS-ODF legacy has to be supported? Or incompatibility of your old formats because of the literal approach?
I doubt someone would ever find a magic bullet to interoperability and user satisfaction. It is more kneading to converge. For convergence commercial and public pressure seems helpful, and of course a GATT style process to make that happen.
In terms of visual fidelity NLnet and Opendoc Society prepare an interesting project: http://www.officeshots.org/
Still, as a user, I have to say that discarding formulas in spreadsheets *sucks*, and it goes a long way towards "astonishing" (as in "least astonishment") in a very negative way anyone who is not aware of this limitation.
OpenFormula is still under development, but it's already used in a major implementation of ODF (OpenOffice), so it's not likely to evolve much in incompatible ways. There are ways to handle such a situation. You could have implemented it, and provided a service pack if changes do happen. Or if it really bothered you, then you could have included a checkbox & warning to enable/disable it according to the user's needs. It would not have been the first time Microsoft implemented a standard before its finalization (e.g. the IE team is already implementing parts of HTML5), if it's beneficial to the user.
Still no committee draft on OpenFormula spec in OASIS.
Will OASIS even allow ODF 1.2 to continue in the standardization proces when it is relying heavily on a specifcation still in draft form ?
The concept of an implementation round-tripping content that it doesn't understand is something we considered during initial planning of our ODF implementation. It was also one of the topics raised during the roundtable discussions at the DII event in Redmond last July.
The problem is that if you don't understand a piece of content, then you can't reliably know whether other changes you've made in areas that you do understand may have invalidated the content. Consider, for example, a spreadsheet like this:
Suppose we open that spreadsheet in an implementation that round-trips content it doesn't understand, and we insert a header row. What happens?
The cells all shift down, but the formula gets round-tripped without a change. So when you open the modified spreadsheet back in the original application, you see a different result than it had before:
@Andre, you're right, things get very messy when you have to deal with not only multiple products, but multiple versions of each one.
If your application can understand cell addressing (you know, the "atoms" of spreadsheets, duh), then the question can be reduced to:
Can we process relativistic moves (like inserting lines, columns, etc), by blindly manipulating the cell addresses, while leaving the formula results untouched?
The answer is yes.
The issue of round-tripping is not nearly so complicated as you make it sound, when the interpretation (of cells *values* and *formulas*) is factored out.
Adding a line, shifting rows, means updating the formula (from "=SUM(B1:B3)" to "=SUM(B2:B4)", while literally not touching the displayed result.
If the original cell B2, which was shifted to B3, had in fact been the text-formatted "2" instead of the integer '2', there still be no need to *calculate* the formula result - since it is a foreign namespace formula.
The underlying engine for formula triggers would need to be smart enough to know that cells are being "tracked", rather than a formula literally being modified. Caveat developer.
@dmahugh: technically, the choice to discard formulas and keep only the latest known good result is valid; however, there is no question that formulas etc. get destroyed on save if you open an ODF spreadsheet made with OOo Calc in MS Excel.
On the other hand, other spreadsheets seem to largely fall to another solution: formula is kept, but not interpreted, and preceded with its namespace. You'll tell me, this destroys visual fidelity (by default) because instead of the last known value, you now have a string the spreadsheet can't compute.
Stop me if I'm wrong though, the last known good value is saved in the file, can be read, and is "merely squashed" by the spreadsheet application.
Now, and THAT would have probably prevented a LOT of criticism, didn't you add a small alert box saying:
"This file contains formulas in a format that Excel can't parse."
Choice 1 "discard formulas, only keep values"
Choice 2 "discard values, display formulas for manual editing"
Check box "remember my choice"
text link to help page explaining the reasons and solutions
And that would have been ALL: have your cake, and eat it too. Frankly, how hard would it have been?
> In my next post, I’ll cover our testing strategy and methodology in more detail. What else would you like to know about how Office approaches document format interoperability?
Doug: could you shed some light if and how the MS Office-team and the ODF-converter team deal with interop testing between 2007sp2 and the plug-in?
Since MS provides technical and architectural guidance for the OpenXML/ODF converter project, I assume both teams somehow share the same methodology on testing two different products, so it could be helpful for other ODF-developers as well...
@A Nonymous, this sounds potentially fragile to me, because the consumer would not be taking full responsibility for the integrity of the content, but would still be modifying certain aspects of it.
@Mitch, your suggestion is essentially a combination of various approaches used in existing implementations. Have you suggested that the current versions of Symphony (1.2) and OpenOffice.org (3.1) should implement this multiple-choice approach as well?
In any event, I think the best path forward is for all of us to focus on making Open Formula as interoperable as possible. The wide variety of approaches that people have suggested indicates a need for an agreed-upon standard in this area.
@Bart, that's a good question, I'll plan to address it in my next post.
@Doug: yup, they should. The fact that they don't is no reason not to one-up them, is it? You could, like, innovate... If they do imitate you, then interoperability wins anyway!
Thing is, their approach (keep the formula) may allow a patient user to re-write a new formula based on the old one and get his (proper) results (in essence, to recreate user intent). Agreed, it's far from being the best solution, but there's no actual loss: result can be found again without external reference.
Current Excel system's would mean that a spreadsheet named LifeTheUniverseAndEverything.odc woud merely contain 42 in Excel - the question would be scrapped. And them Vogons are not patient.
Now, another (non-interactive, but maybe more workflow-disruptive) solution would be a new error message: #NAMESPACE:42 (namespace is not know; last known good value is 42), which would be technically correct (error:can't parse the formula), informative (error caused by unknown namespace), non-destructive (formula and namespace appear in edit mode, like in other spreadsheet programs, and last good value is kept after error message) and forward looking (the day that namespace is supported, formula can be computed again).
Or both solutions, I don't know? It could also be added to ODF 1.2 (how to manage formulas from different namespaces), and it doesn't prevent OpenFormula's development.
Of course, it would mean some careful thinking: how does one work with a spreadsheet that uses several formula namespaces? Although the only thing taken into account by a formula is the 'last known good value', there could be problems such as floating point precision (some may accept double 32-bit float precision, others be limited to single 32-bit precisions, others may use 256-bit precision...), or dates (oxml-f may accept 1902-02-29, odf-of certainly won't) to work out...
I do think those questions should be addressed as soon as possible; destroying a formula ain't a good solution, converting it from one format to another isn't either (ceil() and floor() in MS Office for example are mathematically wrong on negative values; a parameter exists in oooc and OF to deal with import, but there's no way to export floor() or ceil from those namespaces to MS formulas correctly), and merely keeping the last known good value... well, if last known good value is 256-bit precision and is read on a spreadsheet program that has single 32-bit precision, then if Excel's solution is used, not only will you lose the formula, you'll also lose the actual value (loss of precision).
I dunno, maybe I'm overlooking something, what do you think?