This blog is inactive. New blog: EricWhite.com/blog
This is one in a series of posts on transforming Open XML WordprocessingML to XHTML. You can find the complete list of posts here.
Last week, I blogged about a small project that I'm embarking on: to make a reasonably accurate transform from Open XML word-processing markup to XHTML. I wrote about the approach that I'll be taking, and my initial thoughts about how to proceed. I've done a bit of research, and this week, I'll lay out more details about the approach that I'll take.
One small note about this series of blog posts: these are going to be much more ad-hoc than my usual posts. If I go down the wrong path, then you'll see this :-). Also, I'm not going to spend too much time writing and re-writing the posts.
One of the key aspects of the approach that I'll take is to use the power of CSS:
I am not going to translate numbered or bulleted items from word-processing markup to li elements in the XHTML. Instead, I'm going to generate paragraphs of a particular class, and format that class using CSS as appropriate, so that numbered items and bulleted lists are rendered properly. Numbered items that are formatted in a simple way do translate to li elements in XHTML markup, but the capabilities of numbered items in word-processing markup are rich (RICH!), and as soon as the markup uses more than the most rudimentary capabilities, the translation breaks down. This has been one of the biggest complaints about other projects that convert Open XML to HTML: numbered items aren't translated properly. I could go down the road of translating rudimentary numbered items to li elements and translating the richer variations into paragraphs, but this is messy. Instead, I believe that I'm going to discard li elements altogether.
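To make the idea concrete, here is a minimal sketch of what a "paragraph of a particular class" could look like when built with LINQ to XML. It is not the converter's actual output: the class names `ListLevel0`, `ListLevel1`, and `ListMarker`, the helper `ListParagraph`, and the CSS rules are all hypothetical, and the sketch assumes the converter computes the marker text ("1.", "a)", a bullet) itself and emits it as ordinary content.

```csharp
using System;
using System.Xml.Linq;

public static class ListParagraphSketch
{
    static readonly XNamespace xhtml = "http://www.w3.org/1999/xhtml";

    // Hypothetical CSS: a hanging indent per list level, with the marker
    // occupying a fixed-width box so that the item text lines up.
    public const string Css =
        "p.ListLevel0 { margin-left: 0.5in; text-indent: -0.25in; }\n" +
        "p.ListLevel1 { margin-left: 1.0in; text-indent: -0.25in; }\n" +
        "span.ListMarker { display: inline-block; width: 0.25in; text-indent: 0; }\n";

    // Hypothetical helper: renders one numbered or bulleted item as a classed
    // paragraph instead of an li element. The marker is passed in as text
    // (an assumption of this sketch), so no browser list numbering is involved.
    public static XElement ListParagraph(int level, string marker, string text)
    {
        return new XElement(xhtml + "p",
            new XAttribute("class", "ListLevel" + level),
            new XElement(xhtml + "span",
                new XAttribute("class", "ListMarker"),
                marker),
            text);
    }

    public static void Main()
    {
        XElement html = new XElement(xhtml + "html",
            new XElement(xhtml + "head",
                new XElement(xhtml + "title", "List sketch"),
                new XElement(xhtml + "style",
                    new XAttribute("type", "text/css"),
                    Css)),
            new XElement(xhtml + "body",
                ListParagraph(0, "1.", "First item"),
                ListParagraph(1, "a)", "Nested item"),
                ListParagraph(0, "2.", "Second item")));
        Console.WriteLine(html);
    }
}
```

Because every item is just a p element, arbitrarily rich numbering formats only change the marker text and the CSS, never the element structure.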
As I've researched how I'll implement this, I've decided on a few limitations:
I'm sure that I'll discover other places where I will want to place limits on the transform.
The last thing I'll present in this post is the skeleton for the conversion. The following code will do a simplistic transform of simple Open XML documents to simple XHTML. I can then build and extend this code, handling more and more sophisticated varieties of markup. For a detailed explanation of how this type of transform works, see the post, Recursive Pure Functional Transforms of XML.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

namespace HtmlConverter
{
    public static class Extensions
    {
        // Load a part's XML into an XDocument once, and cache it as an
        // annotation on the part so later calls return the same instance.
        public static XDocument GetXDocument(this OpenXmlPart part)
        {
            XDocument partXDocument = part.Annotation<XDocument>();
            if (partXDocument != null)
                return partXDocument;
            using (Stream partStream = part.GetStream())
            using (XmlReader partXmlReader = XmlReader.Create(partStream))
                partXDocument = XDocument.Load(partXmlReader);
            part.AddAnnotation(partXDocument);
            return partXDocument;
        }

        public static string StringConcatenate(this IEnumerable<string> source)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in source)
                sb.Append(s);
            return sb.ToString();
        }
    }

    public static class W
    {
        public static XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        public static XName body = w + "body";
        public static XName document = w + "document";
        public static XName p = w + "p";
        public static XName pPr = w + "pPr";
        public static XName r = w + "r";
        public static XName rPr = w + "rPr";
        public static XName t = w + "t";
        public static XName tbl = w + "tbl";
        public static XName tc = w + "tc";
        public static XName tr = w + "tr";
        public static XName txbxContent = w + "txbxContent";
        public static XName val = w + "val";
        public static XName pStyle = w + "pStyle";
        public static XName b = w + "b";
    }

    public static class Xhtml
    {
        public static XNamespace xhtml = "http://www.w3.org/1999/xhtml";

        public static XName html = xhtml + "html";
        public static XName head = xhtml + "head";
        public static XName title = xhtml + "title";
        public static XName body = xhtml + "body";
        public static XName p = xhtml + "p";
        public static XName h1 = xhtml + "h1";
        public static XName h2 = xhtml + "h2";
        // XHTML element names are lowercase, so the anchor element is "a"
        public static XName A = xhtml + "a";
        public static XName href = "href";
        public static XName b = xhtml + "b";
        public static XName table = xhtml + "table";
        public static XName border = "border";
        public static XName tr = xhtml + "tr";
        public static XName td = xhtml + "td";
    }

    public static class HtmlConverter
    {
        public static object ConvertToHtmlTransform(WordprocessingDocument wordDoc,
            XNode node)
        {
            XElement element = node as XElement;
            if (element != null)
            {
                // transform the w:document element to the XHTML h:html element
                if (element.Name == W.document)
                    return new XElement(Xhtml.html,
                        new XElement(Xhtml.head,
                            new XElement(Xhtml.title, "Test.docx")
                        ),
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e))
                    );

                // transform the w:body element to the XHTML h:body element
                if (element.Name == W.body)
                    return new XElement(Xhtml.body,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every Heading1 styled paragraph to the XHTML h:h1 element
                if (element.Name == W.p &&
                    (string)element
                        .Elements(W.pPr)
                        .Elements(W.pStyle)
                        .Attributes(W.val)
                        .FirstOrDefault() == "Heading1")
                    return new XElement(Xhtml.h1,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every Heading2 styled paragraph to the XHTML h:h2 element
                if (element.Name == W.p &&
                    (string)element
                        .Elements(W.pPr)
                        .Elements(W.pStyle)
                        .Attributes(W.val)
                        .FirstOrDefault() == "Heading2")
                    return new XElement(Xhtml.h2,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:p to h:p
                if (element.Name == W.p)
                    return new XElement(Xhtml.p,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every text run that is styled as bold to the XHTML h:b element
                if (element.Name == W.r &&
                    element.Elements(W.rPr).Elements(W.b).Any())
                    return new XElement(Xhtml.b,
                        element.Elements(W.t).Select(e => (string)e).StringConcatenate());

                // transform every text run that is not styled as bold to a text node
                // that contains the text of the run
                if (element.Name == W.r &&
                    !element.Elements(W.rPr).Elements(W.b).Any())
                    return new XText(element.Elements(W.t)
                        .Select(e => (string)e).StringConcatenate());

                // transform w:tbl to h:table
                if (element.Name == W.tbl)
                    return new XElement(Xhtml.table,
                        new XAttribute(Xhtml.border, 1),
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:tr to h:tr
                if (element.Name == W.tr)
                    return new XElement(Xhtml.tr,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:tc to h:td
                if (element.Name == W.tc)
                    return new XElement(Xhtml.td,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // the following removes any nodes that haven't been transformed
                return null;
            }
            return null;
        }

        public static XElement ConvertToHtml(WordprocessingDocument wordDoc)
        {
            // TODO WE REALLY WANT TO DO THIS ON BLOCK LEVEL CONTENT, NOT JUST CHILD ELEMENTS
            // OF THE BODY ELEMENT
            return ConvertToHtml(wordDoc,
                wordDoc
                    .MainDocumentPart
                    .GetXDocument()
                    .Element(W.document)
                    .Element(W.body)
                    .Elements());
        }

        public static XElement ConvertToHtml(WordprocessingDocument wordDoc,
            IEnumerable<XElement> blockLevelContent)
        {
            if (blockLevelContent == null)
                throw new ArgumentException("blockLevelContent argument is null");
            XElement firstBlockLevelElement = blockLevelContent.FirstOrDefault();
            if (firstBlockLevelElement == null)
                throw new ArgumentException("blockLevelContent sequence is empty");
            // for now, transform the whole document that contains the block-level content
            XDocument doc = firstBlockLevelElement.Document;
            XElement xhtml = (XElement)ConvertToHtmlTransform(wordDoc, doc.Root);
            return xhtml;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string fileName = "Test.docx";
            FileInfo fi = new FileInfo(fileName);
            // strip the whole ".docx" extension (five characters, not four)
            string baseName = Path.GetFileNameWithoutExtension(fi.Name);
            string newFileName = baseName + "-Copy.docx";
            // overwrite a copy left behind by a previous run
            File.Copy(fileName, newFileName, true);
            using (WordprocessingDocument doc =
                WordprocessingDocument.Open(newFileName, true))
            {
                XElement html = HtmlConverter.ConvertToHtml(doc);
                html.Save("Test.html");
            }
        }
    }
}
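As an illustration of how this skeleton might be extended, the following cut-down sketch adds one more rule, italic runs, and exercises the transform on an in-memory fragment so that no .docx file or Open XML SDK reference is needed. It is not part of the converter: the class name `MiniTransform` and the nesting of h:i inside h:b are assumptions made for the example, and like the skeleton it treats the mere presence of w:b or w:i as "on", ignoring any w:val attribute.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class MiniTransform
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace h = "http://www.w3.org/1999/xhtml";

    // Same recursive shape as the skeleton: each rule returns a new node,
    // and anything that matches no rule is dropped by returning null.
    public static object Transform(XNode node)
    {
        XElement element = node as XElement;
        if (element == null)
            return null;

        if (element.Name == w + "body")
            return new XElement(h + "body",
                element.Elements().Select(e => Transform(e)));

        if (element.Name == w + "p")
            return new XElement(h + "p",
                element.Elements().Select(e => Transform(e)));

        if (element.Name == w + "r")
        {
            string text = string.Concat(
                element.Elements(w + "t").Select(t => (string)t));
            object content = new XText(text);
            // New rule (hypothetical extension): wrap italic runs in h:i
            if (element.Elements(w + "rPr").Elements(w + "i").Any())
                content = new XElement(h + "i", content);
            // Existing rule from the skeleton: wrap bold runs in h:b
            if (element.Elements(w + "rPr").Elements(w + "b").Any())
                content = new XElement(h + "b", content);
            return content;
        }

        return null;
    }

    public static void Main()
    {
        XElement body = new XElement(w + "body",
            new XElement(w + "p",
                new XElement(w + "r",
                    new XElement(w + "t", "Plain, ")),
                new XElement(w + "r",
                    new XElement(w + "rPr",
                        new XElement(w + "b"),
                        new XElement(w + "i")),
                    new XElement(w + "t", "bold italic"))));

        XElement html = (XElement)Transform(body);
        Console.WriteLine(html);
    }
}
```

Handling both properties in one run rule, rather than in separate bold and not-bold rules, keeps the number of cases from doubling with every new run property.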