This is one in a series of posts on transforming Open XML WordprocessingML to XHtml. You can find the complete list of posts here.
Recently I went through the process of handling a variety of details in the XHtml converter – generating XHtml entities wherever possible, and making sure that whatever else, the XHtml converter doesn't throw exceptions, regardless of the documents that you throw at it (including invalid documents, to a certain extent). As I was doing so, I touched various pieces of the code, here and there, and I was struck by a couple of interesting (and good) dynamics of maintaining the code. First, changes to code are localized, and second, it's easy to write resilient code.
First, about the code: there are around 3500 lines of code in six modules. The code is written in the pure functional style. Variables and class members are not mutated after creation/initialization. If a variable or object is in scope, we can depend on its value not changing.
This means that for the most part, all of my code changes were local. This only makes sense – if you are writing pure functions/methods, then by definition, you will not be touching data outside the function/method.
The WordprocessingML => XHtml converter is written as a series of successive transformations. So long as I continue to produce intermediate results that are consistent with the design, I'm free to modify code as much as I like. The code is malleable, not brittle. When necessary, I can validate that a transform is producing valid Open XML documents.
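To make the idea of successive transformations concrete, here is a minimal sketch in Python. It is not the converter's actual code: the element names, the two stages, and the helpers rebuild and run_pipeline are all hypothetical, chosen only to show stages that each take a tree and return a new one without touching their input.

```python
import xml.etree.ElementTree as ET
from functools import reduce


def rebuild(element, children, tag=None):
    """Return a NEW element; the input element is never modified."""
    clone = ET.Element(tag or element.tag, dict(element.attrib))
    clone.text = element.text
    clone.tail = element.tail
    clone.extend(children)
    return clone


def drop_empty_runs(element):
    """Hypothetical stage 1: omit <r> elements with no text and no children."""
    kept = [
        drop_empty_runs(child)
        for child in element
        if not (child.tag == "r" and len(child) == 0
                and not (child.text or "").strip())
    ]
    return rebuild(element, kept)


# Hypothetical tag mapping, for illustration only.
HTML_TAGS = {"body": "div", "p": "p", "r": "span"}


def to_html(element):
    """Hypothetical stage 2: map each tag to an HTML-like tag."""
    children = [to_html(child) for child in element]
    return rebuild(element, children, HTML_TAGS.get(element.tag, element.tag))


def run_pipeline(root, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda tree, stage: stage(tree), stages, root)


source = ET.fromstring("<body><p><r>Hello</r><r/></p></body>")
result = run_pipeline(source, [drop_empty_runs, to_html])
print(ET.tostring(result, encoding="unicode"))
```

Because each stage only reads its input and builds fresh output, a stage can be rewritten freely as long as the shape of what it returns stays the same.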
In Meilir Page-Jones's books on object-oriented design, he coins the term connascence for the notion of interconnectedness of code. For example, a class has a member function that is used in a variety of other modules. A change in the name or signature of the function requires all uses of it to be updated. This is a variety of interconnectedness that the compiler catches. Another example: a class has a member that has some odd-ball semantics, and another class relies on those odd-ball semantics. If you change the behavior of that member without changing its signature, the compiler won't notice. Magic values are a great example of a horrible form of connascence.
Magic values are a design 'feature' of some programming interfaces where values from two value domains are used in the same variable or collection. For example, you could have an array of integers where the integers 0-9999 are indexes into a data structure, but -1 and -2 have special meaning. It is better to split the semantic information into multiple fields of a named or anonymous type. If a value is an index, then it should always be an index.
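As an illustration of splitting the semantic information into separate fields, here is a small Python sketch. The names Slot, SlotKind, from_magic, and live_indexes are hypothetical, and the meanings I assign to -1 and -2 (unused, deleted) are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotKind(Enum):
    INDEX = "index"
    UNUSED = "unused"    # assumed meaning of the magic value -1
    DELETED = "deleted"  # assumed meaning of the magic value -2


@dataclass(frozen=True)
class Slot:
    kind: SlotKind
    index: Optional[int] = None  # set only when kind is INDEX


def from_magic(value: int) -> Slot:
    """Translate the magic-value encoding once, at the boundary."""
    if value == -1:
        return Slot(SlotKind.UNUSED)
    if value == -2:
        return Slot(SlotKind.DELETED)
    if 0 <= value <= 9999:
        return Slot(SlotKind.INDEX, value)
    raise ValueError(f"not a valid encoded slot: {value}")


def live_indexes(slots):
    """Downstream code never compares against -1 or -2."""
    return [slot.index for slot in slots if slot.kind is SlotKind.INDEX]


slots = [from_magic(v) for v in [3, -1, 42, -2]]
print(live_indexes(slots))
```

With this shape, an index field only ever holds an index, and code that forgets about the special cases can no longer mistake -1 for a position.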
In a couple of places in the XHtml transform I have written code that is locally impure but globally pure because the code is (much) more readable when written that way. One good example of this is when accepting tracked revisions for content controls. The semantics of this transform are pretty involved – a block-level content control may need to be demoted to a run-level content control when processing a tracked deleted paragraph mark. Pure functional code would require that the revised content control be created conditionally at both the block level and the run level, and that the content of each conditionally created content control also be created conditionally. Due to the involved semantics, the resulting pure functional code would be a mess. When I make the decision to write locally impure code like that, I don’t take it lightly, and I make sure that I’m able to articulate why I’m writing imperative (not declarative) code.
There are times and places when it’s appropriate to go to any length to write pure functional code without side effects – if you MUST take advantage of multiple cores in specific ways, or if you are writing a system that can relink while running. However, when programming with Open XML, these reasons don’t apply. For one thing, I’ve yet to create an Open XML processing system that isn’t IO bound, so using multiple cores won’t help us. And the second scenario is applicable in telephone switching systems (think Erlang). Instead, I use functional code to reduce line count (by an order of magnitude in some cases) and to make debugging easier.
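To show what "locally impure but globally pure" can look like, here is a generic Python sketch. It is not the content control code described above; group_adjacent is a hypothetical helper that mutates a local list while it works, but never modifies its arguments and returns an immutable result.

```python
def group_adjacent(items, key):
    """Group consecutive items that share the same key.

    Locally impure: `groups` is mutated inside the loop.
    Globally pure: `items` is only read, and the mutable list never escapes.
    """
    groups = []  # local mutable state
    for item in items:
        k = key(item)
        if groups and groups[-1][0] == k:
            groups[-1][1].append(item)
        else:
            groups.append((k, [item]))
    # Freeze the result so callers cannot observe or alter the local state.
    return tuple((k, tuple(members)) for k, members in groups)


words = ["a", "b", "B", "c"]
print(group_adjacent(words, str.lower))
```

From the caller's point of view this behaves like any pure function: the same input always produces the same output, and nothing outside the function changes.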
There are very few places in the XHtml conversion where interconnected code crosses module boundaries. For example, two classes (MarkupSimplifier and HtmlConverter) each have a settings class (MarkupSimplifierSettings and HtmlConverterSettings). If I change a member in one of these classes, then I probably need to touch other modules. But this is mitigated because the compiler catches the resulting issues. In addition, in a few places in the code, a query 'projects' a collection of some type. Other transformations take that collection and transform it to other shapes. If I change the definition of that type, then I need to touch other pieces of code. These changes are relatively local, and again, the compiler will catch problems (so long as you don't use magic values).
In code that is written in a functional style, situations where the compiler will not catch issues of interconnectedness are vastly reduced. In many cases, a transform takes a valid WordprocessingML document and produces another valid WordprocessingML document (which is easy to validate). So long as I've satisfied those requirements, change happens locally, and the code is amenable to change.
Before discussing resilient code, I want to articulate a couple of goals of my WordprocessingML => XHtml converter.
· First, as much as reasonably possible, it shouldn't throw exceptions, even if you give the converter an invalid document.
· Second, generate *some* XHtml that is reasonable for the document. Of course, the first goal is to generate accurate XHtml, but if there is some case that I haven't handled, or if you send an invalid document through the converter, then at least generate something reasonable.
I've tested the code on a fairly wide variety of documents, including many that were generated by applications other than Word. And sure enough, some of those are invalid. In many cases, Word is pretty good at opening invalid documents, and one of my goals is to make the XHtml converter at least as accepting as Word.
To meet this goal, wherever possible, I wrote code using the approach that I detailed in Querying for Optional Elements and Attributes. Sure, it does a small bit of extra frictional work, but the end result in robustness makes it worth it.
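The general idea of querying defensively can be sketched as follows. This is a Python illustration rather than the code from that post: paragraph_style is a hypothetical helper, and it assumes the standard WordprocessingML markup in which a paragraph's style is stored in the w:val attribute of w:pPr/w:pStyle. Any of those pieces may be absent, and the query returns None instead of raising.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_style(paragraph):
    """Return the style id, or None if w:pPr, w:pStyle, or w:val is missing."""
    style = paragraph.find("w:pPr/w:pStyle", NS)  # None if either element is absent
    if style is None:
        return None
    return style.get(f"{{{W}}}val")  # None if the attribute is absent


body = ET.fromstring(
    f'<w:body xmlns:w="{W}">'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr></w:p>'
    "<w:p/>"
    "<w:p><w:pPr><w:pStyle/></w:pPr></w:p>"
    "</w:body>"
)
styles = [paragraph_style(p) for p in body.findall("w:p", NS)]
print(styles)
```

The second and third paragraphs are the interesting cases: one has no properties at all and one has a style element without a value, and neither causes an exception.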
To test the WordprocessingML to XHtml converter, I first put together a set of test documents that cover all the cases that I want to handle. Together, those test documents exercise every line of code. Once that set of documents could be converted to XHtml to my satisfaction, I ran the code over my collection of Open XML WordprocessingML documents. Over time, I’ve assembled a fairly large collection of documents – around 25,000. Running the code over this larger set of documents illuminated issues associated with edge-case documents and invalid documents. After all my specifically designed documents were converted to XHtml properly, and after running the converter over the larger set of sample documents without encountering exceptions, I could release the code with a fair amount of confidence in its quality.
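A corpus run like the one described above can be sketched in a few lines of Python. This is only an illustration, not my actual test harness: run_corpus is a hypothetical helper, and convert stands in for whatever function performs the conversion. The important part is that one bad document is recorded and does not stop the run.

```python
from pathlib import Path


def run_corpus(folder, convert):
    """Call `convert` on every .docx file under `folder`.

    Returns a list of (path, exception) pairs for the documents that failed.
    """
    failures = []
    for path in sorted(Path(folder).rglob("*.docx")):
        try:
            convert(path)
        except Exception as exc:  # record the failure and keep going
            failures.append((path, exc))
    return failures
```

An empty list means that every document in the folder went through the converter without an exception. Anything else is a ready-made list of documents to investigate.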