Comparison of Html/CSS Tables to WordprocessingML Tables

[Blog Map]

(Update Nov 4, 2009: This is the 5th in a series of posts (#1, #2, #3, #4, #5) on doing a transform of WordprocessingML to XHtml.)

Html tables and WordprocessingML tables have a lot in common.  Both can present complex tables with horizontally and vertically merged cells, and both have a rich set of capabilities for formatting.  But there are differences in their models and capabilities.  This blog post presents those differences, specifically around three areas:

  • Table Layout
  • Formatting
  • Differences in capabilities at the table, row, and cell level

I'm currently in the process of coding a pure functional transform from WordprocessingML to XHtml.  Understanding the exact differences between the two types of tables enables writing this transform as accurately as possible.  In addition, if you understand CSS and Html tables, this blog post provides an easy way to learn about WordprocessingML tables.  (If you're a CSS expert, and see something I'm doing incorrectly, please correct me.)

Note: In a previous post, I talked about a plan to transform WordprocessingML styles to CSS classes.  I've decided to not use CSS classes to represent WordprocessingML styles.  Instead, I'm going to generate a style attribute for each object (p, table, tr, td, etc.) that contains all necessary formatting for that object.  My rationale for this decision is detailed in the "Differences in Formatting" section below.  This isn't a decision that I'm taking lightly, but I believe it is the correct one.  But we'll see…

Differences in Table Layout

On the surface, the layouts of WordprocessingML and Html tables look very similar.  Of course, both can present a simple table that contains data:

Both can contain horizontally and vertically merged cells:

Both can represent an irregular layout:

However, WordprocessingML and XHtml tables use a somewhat different model for layout.

In WordprocessingML, you first establish a grid with some number of grid columns.  Left and right edges of cells will always be on a grid column.  The mechanism for horizontal cell spanning is that you specify the number of grid columns that a cell spans.  You can specify that the first cell in a row starts after skipping a certain number of grid columns.

In contrast, in XHtml, there is no underlying grid on which you lay out cells.  Instead, the cells themselves form the grid.

To make this difference clear, let's look at a simple example.  Consider the following table with four cells, but the vertical rule between the top two cells isn't aligned with the vertical rule between the bottom two cells:

Here is the WordprocessingML that describes this table.  Notice the w:tblGrid, which describes the grid, and the w:gridSpan elements on the top left and bottom right cells.  While the grid describes three grid columns, there are only two cells per row.

<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="TableGrid"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="1368"/>
    <w:gridCol w:w="450"/>
    <w:gridCol w:w="1350"/>
  </w:tblGrid>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1818" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1350" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1368" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1800" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>

 

Following is markup for a similar table in XHtml.  In effect there are three cells per row instead of two.  The first two rows (the only ones we see) each contain a cell with a colspan attribute, merging two cells into one.  The third row, with no border and a height of zero pixels, defines three cells.  This is a trick based on the semantics of XHtml tables.  When determining the widths of cells, the browser looks at all rows of the table, and then calculates the column width, taking the widths of all cells of that column into consideration.  Using this approach, we need to specify column widths only once, in the last invisible row of the table.

<table style='border-collapse:collapse;border:none'>
 <tr>
  <td colspan="2"
      style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Left</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Right</p>
  </td>
 </tr>
 <tr>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Left</p>
  </td>
  <td colspan="2"
      style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Right</p>
  </td>
 </tr>
 <tr style="max-height:0px">
  <td style='width:68.4pt;border:none'></td>
  <td style='width:22.5pt;border:none'></td>
  <td style='width:67.5pt;border:none'></td>
 </tr>
</table>

 

The differences in the model become even clearer when we specify that a grid column is skipped before placing the first cell.  The following table shows a row that contains one cell that is shifted to the right:

The WordprocessingML that describes this table follows.  The w:gridBefore element specifies that the one cell in the second row is to be placed in the second grid column.

<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="TableGrid"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2000"/>
    <w:gridCol w:w="2000"/>
  </w:tblGrid>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:trPr>
      <w:gridBefore w:val="1"/>
    </w:trPr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>

 

Here is how we would form this table in XHtml:

<table style='border-collapse:collapse;border:none'>
 <tr>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Left</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Right</p>
  </td>
 </tr>
 <tr>
  <td style="border:none;padding:0in 5.4pt 0in 5.4pt">
    <p>&nbsp;</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Right</p>
  </td>
 </tr>
 <tr style="max-height:0px">
  <td width="100" style='border:none'></td>
  <td width="100" style='border:none'></td>
 </tr>
</table>

 

In XHtml, we have no choice but to place a cell in the location where there is no cell visible.  We place a non-breaking space in that cell, as some browsers may collapse the cell if it contains no data.  We also specify padding.  The table then renders as desired.

There is a simple strategy that we can take when converting the WordprocessingML to XHtml, which is to generate XHtml cells based on the grid, not on cells.  We then specify appropriate colspan and style attributes to make the table render as we wish.
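To make the grid-based strategy concrete, here is a rough sketch in Python (used here only for illustration; it is not the transform code itself).  The function name table_to_xhtml and its input shape (a list of grid column widths in dxa, plus rows given as a gridBefore count and a list of text/gridSpan pairs) are assumptions made for this example.  It emits a borderless filler cell for skipped grid columns, a colspan for each spanning cell, and the invisible sizing row described above.

```python
from html import escape

BORDER = "border:solid;border-width:1px;border-color:Black"


def dxa_to_pt(dxa):
    """Convert dxa (twentieths of a point) to points."""
    return dxa / 20


def table_to_xhtml(grid_cols, rows):
    """grid_cols: grid column widths in dxa.
    rows: list of (grid_before, cells), where cells is a list of (text, grid_span)."""
    out = ["<table style='border-collapse:collapse;border:none'>"]
    for grid_before, cells in rows:
        out.append(" <tr>")
        if grid_before:
            # XHtml cannot skip grid columns, so emit a borderless filler cell.
            span = f' colspan="{grid_before}"' if grid_before > 1 else ""
            out.append(f'  <td{span} style="border:none"><p>&nbsp;</p></td>')
        for text, grid_span in cells:
            span = f' colspan="{grid_span}"' if grid_span > 1 else ""
            out.append(f'  <td{span} style="{BORDER}"><p>{escape(text)}</p></td>')
        out.append(" </tr>")
    # Invisible last row: one cell per grid column, carrying the column widths.
    out.append(' <tr style="max-height:0px">')
    for width in grid_cols:
        out.append(f"  <td style='width:{dxa_to_pt(width):g}pt;border:none'></td>")
    out.append(" </tr>")
    out.append("</table>")
    return "\n".join(out)


# The two example tables from this post.
irregular = table_to_xhtml(
    [1368, 450, 1350],
    [(0, [("Top Left", 2), ("Top Right", 1)]),
     (0, [("Bottom Left", 1), ("Bottom Right", 2)])],
)
shifted = table_to_xhtml(
    [2000, 2000],
    [(0, [("Top Left", 1), ("Top Right", 1)]),
     (1, [("Bottom Right", 1)])],
)
print(irregular)
```

Vertically merged cells, cell margins, and real border formatting are left out of this sketch.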

This subtle difference in abstraction is one of the most important differences between tables in WordprocessingML and XHtml.  By taking this difference into account, it is easy to craft an algorithm that will produce tables that will render as we wish in XHtml.  In addition to this difference in abstraction, there are a number of differences in formatting and capabilities.  I don't believe that I've isolated all of the differences, but I think I've found most of the important ones.  In some of the conversions, I haven't yet spent the time to find the correct CSS approach, so I am still using an Html attribute approach.

Differences in Formatting

There are a number of analogous capabilities in formatting between tables in WordprocessingML and XHtml/CSS, but one of the key differences is that WordprocessingML has a rich infrastructure of style inheritance.  Table styles can inherit from other table styles.  Paragraph styles can inherit from other paragraph styles.  Run styles can inherit from other run styles.  In contrast, in CSS, we can define classes, but we can't define that one class inherits from another class.  However, when specifying the class for an element such as a table, paragraph, or span, we can specify more than one class, and the rules for every listed class are applied (when two rules of equal specificity conflict, the one that appears later in the style sheet wins, regardless of the order of the names in the class attribute).  This is analogous to style inheritance, but the mechanisms are completely different.

It might seem that we could use the ability to specify multiple classes for an XHtml object to implement a form of style inheritance, but there is one important aspect of the semantics of WordprocessingML styles that makes it impossible to use CSS classes to implement style inheritance.  Table styles in WordprocessingML have the capability to define what are called conditional table formatting properties.  These are properties that are applied in a specific order to a) the entire table, b) banded columns, c) banded rows, d) first and last row, e) first and last column, f) specific cells at the corners.  And, of course, conditional table formatting properties inherit from the same conditional formatting properties of the base style of a table style.  In theory, we could define styles for each of these conditional table formatting properties, and apply these styles in order of precedence to each cell in the table.  But let's say that we have one table style with a number of conditional formatting properties that derives from another style that also contains a number of conditional formatting properties.  When specifying the classes for a paragraph, it would look something like this:

<p class="BaseStyle BaseStyle_EntireTable BaseStyle_BandedColumns BaseStyle_BandedRows (etc.)
          DerivedStyle DerivedStyle_EntireTable DerivedStyle_BandedColumns (etc.)">Some text.</p>

 

If we had a string of derived table styles, we could end up applying 30 or 40 (or many more!) classes to a single paragraph or run.  But even so, it won't work.  Suppose that BaseStyle defines some property P, and that a conditional formatting property of BaseStyle overrides P.  Suppose further that DerivedStyle overrides P at the table level, but that its conditional formatting does not define P.  Then the value that should apply is the one defined in the conditional formatting of BaseStyle, not the one defined in DerivedStyle, even though the DerivedStyle rules would be applied after the BaseStyle rules.  It simply won't work.  We could start playing around with the order in which classes are applied, but I would hate to debug this.
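A small Python sketch may make the precedence problem easier to see.  The dictionary layout, the property name "color", and both function names are invented for this illustration; they are not part of WordprocessingML or of the transform.  The first function rolls up whole-table properties along the style chain and only then applies conditional formatting, which is the behavior described above.  The second applies each style's rules in chain order, which is roughly what a list of per-style CSS classes would do.

```python
STYLES = {
    "BaseStyle": {
        "based_on": None,
        "table": {"color": "black"},
        "conditional": {"firstRow": {"color": "white"}},
    },
    "DerivedStyle": {
        "based_on": "BaseStyle",
        "table": {"color": "blue"},
        "conditional": {},  # does not define color for the first row
    },
}


def style_chain(styles, style_id):
    """Return the styles from the root base style down to style_id."""
    chain = []
    while style_id is not None:
        chain.append(styles[style_id])
        style_id = styles[style_id]["based_on"]
    return chain[::-1]


def effective_props(styles, style_id, conditions):
    """Whole-table properties first, then conditional formatting on top."""
    chain = style_chain(styles, style_id)
    props = {}
    for style in chain:
        props.update(style["table"])
    for condition in conditions:
        for style in chain:
            props.update(style["conditional"].get(condition, {}))
    return props


def per_style_order(styles, style_id, conditions):
    """Naive approach: apply everything from one style, then the next."""
    props = {}
    for style in style_chain(styles, style_id):
        props.update(style["table"])
        for condition in conditions:
            props.update(style["conditional"].get(condition, {}))
    return props


wanted = effective_props(STYLES, "DerivedStyle", ["firstRow"])
naive = per_style_order(STYLES, "DerivedStyle", ["firstRow"])
print(wanted["color"], naive["color"])
```

For a cell in the first row, the first function keeps the color from the BaseStyle conditional formatting, while the naive ordering lets the DerivedStyle table-level color win.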

We could go through the effort of defining classes for each uniquely styled cell in each table.  This would involve rolling up all inherited styles, and implementing the appropriate semantics for overriding properties at the table, paragraph, and run level, keeping a list of uniquely styled paragraphs and runs, then generating a CSS class for each unique combination of properties.  This does have the advantages (and disadvantages) of moving styling information away from the paragraphs and runs into the internal style sheet.  These classes would have a computer-generated, non-descriptive name, so they wouldn't be helpful to a person who is reading the XHtml.  In addition, it is highly unlikely that these classes could be re-used.  It's not worth the effort, I believe.

One approach would be to define a certain set of CSS classes, then override those classes with locally applied styling information in the style attribute.  But that defeats the whole purpose of having CSS classes in the first place.  With that approach, we still don't have separation of content and presentation, and as you can see, attempting to use CSS classes to represent styles is very complex and prone to bugs.

The approach that I've decided to take is to properly roll up styling information from the WordprocessingML and store that styling information in the style attribute for each object, optimizing that styling information so that if a property is defined at a higher level, it isn't redefined.  For instance, if the paragraph specifies that a particular font is used, then the run doesn't also specify it.  This optimization can be done after assembling all formatting information for each paragraph and run.  This has the advantage that this conversion really is strictly a conversion of WordprocessingML to its presentation.  Not using CSS classes makes the conversion more straightforward, and it will be easier to debug.  I think it is useful for this conversion to simply be a transform of WordprocessingML to its presentation, without involving the complexities that CSS classes bring.  In effect, we're using XHtml, with CSS applied at the object level, purely as a presentation engine.
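The optimization step could look something like the following Python sketch.  The helper names and the short INHERITED set are assumptions for this example; the pruning is only safe for CSS properties that a child element actually inherits from its parent (font and color properties do, borders and margins do not).

```python
# A few CSS properties that are inherited by child elements.
INHERITED = {"font-family", "font-size", "font-weight", "font-style", "color"}


def prune_inherited(parent_props, child_props):
    """Drop child properties that only repeat an inherited parent value."""
    return {
        name: value
        for name, value in child_props.items()
        if not (name in INHERITED and parent_props.get(name) == value)
    }


def css_text(props):
    """Serialize a property dictionary into a style attribute value."""
    return ";".join(f"{name}:{value}" for name, value in props.items())


paragraph = {"font-family": "Calibri", "margin-bottom": "10pt"}
run = {"font-family": "Calibri", "font-weight": "bold"}

run_style = css_text(prune_inherited(paragraph, run))
print(f'<p style="{css_text(paragraph)}"><span style="{run_style}">Some text.</span></p>')
```

Here the run keeps only font-weight, because the paragraph already specifies the font.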

Table Capabilities

Following is a partial list of features of WordprocessingML tables, and how they map to XHtml table features:

  • Both support visually right-to-left tables for languages such as Hebrew and Arabic.  The w:bidiVisual element translates to the dir attribute of the table element.
  • Both support alignment of the table with respect to the margins of the containing section or object.  To translate the w:tblInd element, create a div element with the align attribute set to some value (right, left, center).
  • Both support background shading.  However, with WordprocessingML, you can specify a pattern for background shading.  It would be possible to generate images, but this isn't a key scenario.  For phase one, the conversion will convert shading with patterns to a solid color.
  • WordprocessingML contains the abstraction of themes.  In certain places, the conversion needs to retrieve font and color information from a theme.
  • Both support table and cell borders.  However, WordprocessingML contains two features not supported in XHtml.  WordprocessingML supports a large number of cell borders, including many 'clip art' varieties, such as "apples", "babyRattle", and "bats".  All of the clip art varieties will be converted to a single line border.  Commonly used styles such as solid, dotted, double lines, etc. will convert to the corresponding style in XHtml/CSS.  In addition, WordprocessingML supports diagonal borders.  These aren't commonly used, and I'm going to delay supporting them.
  • Cell margin (w:tblCellMar) maps to the CSS padding property.  Cell margin is the space between the extent of the cell contents and the cell border.  Cell margin is typically expressed in terms of dxa, which is 1/20 of a point (1/1440 of an inch).  The CSS padding property can be expressed in inches, points, or other units of measure.
  • Cell spacing (w:tblCellSpacing) maps to the cellspacing attribute of the table object.  Cell spacing is the space between cell borders, but within the table.  Cell spacing is merged between adjacent cells.  Cell spacing in WordprocessingML is typically expressed in terms of dxa, which is 1/20 of a point (1/1440 of an inch).  The XHtml cellspacing attribute is in terms of pixels.
  • Both models support flowing text around a Table.  In WordprocessingML, it is supported via floating tables (w:tblOverlap).  In XHtml and CSS, set the align attribute of table to left, and specify appropriate margins so that the table renders properly with the correct space between the table and surrounding text.
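The unit conversions behind the cell margin and cell spacing items above are simple arithmetic.  A minimal Python sketch follows; the function names are made up for this example, and the 96 pixels per inch used for cellspacing is an assumption about the rendering device, not something WordprocessingML defines.

```python
def dxa_to_pt(dxa):
    """dxa is twentieths of a point."""
    return dxa / 20


def dxa_to_px(dxa, dpi=96):
    """1440 dxa per inch; dpi is an assumed device resolution."""
    return round(dxa / 1440 * dpi)


def padding_css(top, right, bottom, left):
    """Build a CSS padding declaration from four cell margins in dxa."""
    sides = " ".join(f"{dxa_to_pt(side):g}pt" for side in (top, right, bottom, left))
    return f"padding:{sides}"


# 108 dxa left and right margins give the 5.4pt padding used in the markup above.
print(padding_css(0, 108, 0, 108))
print(dxa_to_px(15))
```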

Row Capabilities

Following is a partial list of features of WordprocessingML rows, and how they map to XHtml row features:

  • In WordprocessingML, rows have the ability to be hidden.  Given that my primary goal is simply to render the table properly, the proper conversion is to remove hidden rows from the converted XHtml.
  • In WordprocessingML, rows can be centered, aligned left, or aligned right.  There is no corresponding capability in XHtml.  For phase one, the conversion will disregard row alignment.
  • In WordprocessingML, you can specify that a particular row is a header row, which should be repeated on each printed page.  Headers in XHtml tables provide the ability to format them separately.  They take on a bold appearance by default.  These capabilities are really not analogous, so for phase one, the conversion will not map one to the other.
  • Table row height can be converted.  w:trHeight converts to the CSS height property of a row.
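As a rough illustration of the hidden-row and row-height items, here is a Python sketch using the standard xml.etree.ElementTree module.  The function names are invented for this example, and it assumes that w:trHeight carries its value in dxa in the w:val attribute and that w:hidden with a value of 0, false, or off means the row is not hidden.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
W_VAL = f"{{{W}}}val"


def is_hidden(tr):
    """True if the row has w:hidden switched on in its row properties."""
    hidden = tr.find("w:trPr/w:hidden", NS)
    return hidden is not None and hidden.get(W_VAL, "true") not in ("0", "false", "off")


def visible_rows(tbl):
    """Rows that should survive into the converted XHtml."""
    return [tr for tr in tbl.findall("w:tr", NS) if not is_hidden(tr)]


def row_height_css(tr):
    """CSS height for a row, or an empty string if no height is given."""
    height = tr.find("w:trPr/w:trHeight", NS)
    if height is None or height.get(W_VAL) is None:
        return ""
    return f"height:{int(height.get(W_VAL)) / 20:g}pt"


TABLE_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tr><w:trPr><w:trHeight w:val="400"/></w:trPr></w:tr>
  <w:tr><w:trPr><w:hidden/></w:trPr></w:tr>
  <w:tr/>
</w:tbl>
"""

rows = visible_rows(ET.fromstring(TABLE_XML))
print(len(rows), row_height_css(rows[0]))
```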

Cell Capabilities

Following is a partial list of features of WordprocessingML cells, and how they map to XHtml cell features:

  • The w:noWrap element translates to the nowrap attribute of the td element.
  • Background shading of cells can be converted.  The same issues apply as with table background shading.
  • Cell borders can be converted.  The same issues apply as with table borders.
  • WordprocessingML has the capability to alter kerning so that the text fits exactly in a cell.  The w:tcFitText element translates to the CSS fit-text property.
  • WordprocessingML supports setting the text flow direction.  This isn't supported in XHtml tables.
  • Horizontal and vertical alignment is supported in both models.
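The border conversion mentioned for tables and cells could be sketched in Python as follows.  The mapping table is deliberately short and is my assumption for this example; any border value not listed (including the clip art varieties) falls back to a single solid line, as planned above.  Border size is taken to be in eighths of a point, and treating a color of "auto" as black is also an assumption.

```python
# WordprocessingML border values that have an obvious CSS counterpart.
CSS_BORDER_STYLES = {
    "single": "solid",
    "dotted": "dotted",
    "dashed": "dashed",
    "double": "double",
    "nil": "none",
    "none": "none",
}


def border_css(val, sz_eighths, color="auto"):
    """Build a CSS border declaration from w:val, w:sz and w:color."""
    style = CSS_BORDER_STYLES.get(val, "solid")  # clip art borders become solid
    if style == "none":
        return "border:none"
    css_color = "black" if color == "auto" else f"#{color}"
    return f"border:{style} {sz_eighths / 8:g}pt {css_color}"


print(border_css("single", 4, "FF0000"))
print(border_css("apples", 8))
```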

With this post, I've detailed much of what I think I need to know to transform Open XML WordprocessingML tables to XHtml tables using CSS for formatting.  I've also outlined the strategy that I think I'll follow given the slightly different layout model of tables in WordprocessingML and tables in XHtml.  As I code the transform, I'll revise this post so that I can remember the details of the transform of WordprocessingML tables to XHtml tables.

Open XML WordprocessingML Style Inheritance (Post #4)


(Update Nov 4, 2009: This is the 4th in a series of posts (#1, #2, #3, #4, #5) on doing a transform of WordprocessingML to XHtml.)

When working with WordprocessingML, nearly all of the information that we need to render paragraphs, tables, and numbered items is contained in styles, stored in the WordprocessingML Style Definitions part.  Styles are somewhat complicated because they support inheritance: one style can be based on another style.  Rendering of text that has the derived style then depends on the derived style, its base style, that base style's base style, and so on.  The Open XML specification refers to this list of styles that are derived from other styles as the 'style chain', which accurately describes the abstraction.
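Assembling the style chain amounts to following w:basedOn references until a style has none.  Here is a minimal Python sketch using xml.etree.ElementTree; the function name style_chain is made up for this example, and the tiny styles document is a stand-in for a real Style Definitions part.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
W_VAL = f"{{{W}}}val"
W_STYLE_ID = f"{{{W}}}styleId"


def style_chain(styles_root, style_id):
    """Return the style elements from the root base style down to style_id."""
    by_id = {s.get(W_STYLE_ID): s for s in styles_root.findall("w:style", NS)}
    chain, seen = [], set()
    while style_id in by_id and style_id not in seen:  # guard against cycles
        seen.add(style_id)
        style = by_id[style_id]
        chain.append(style)
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(W_VAL) if based_on is not None else None
    chain.reverse()  # base styles first, so later entries override earlier ones
    return chain


STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal"/>
  <w:style w:type="paragraph" w:styleId="SpaceBefore">
    <w:basedOn w:val="Normal"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="SpaceBeforeAndAfter">
    <w:basedOn w:val="SpaceBefore"/>
  </w:style>
</w:styles>
"""

chain = style_chain(ET.fromstring(STYLES_XML), "SpaceBeforeAndAfter")
print([style.get(W_STYLE_ID) for style in chain])
```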

When determining the set of properties for rendering a paragraph or table, the first job is to 'roll up' all styles in the style chain, creating a single set of properties that we can apply to the paragraph or table.  This process of 'rolling up' styles is made somewhat more complicated because there are four different kinds of semantics that we must apply to elements in the rolling-up process.

However, it's not too complicated, and after carefully defining the semantics of 'rolling-up' styles in the style chain, we can write a small bit of generalized code to do this – probably less than 100 lines of code.

You'll notice something about the semantics of style inheritance: by far the most common operation when rolling up the styles is to replace any element in a base style with the corresponding element in a derived style.  In the code that I'm going to write to roll up styles, if the inheritance semantics for an element are anything other than merging attributes or merging child elements, the default behavior will be element replacement.  This will make the code as small and robust as possible.

This post probably isn't of very much interest to most people, but to the folks who are interested, it will be very important.  I'm in the process of writing a fairly compact conversion of Open XML to XHtml, and needed to work out the exact behavior of style inheritance.  After working it out, it made good sense to blog it to make life easier for others who need to work with rendering issues of WordprocessingML.

Merging Attributes

In some cases, we must iterate through attributes of a particular element, and if the element in the derived style has an attribute, we must apply that attribute, overriding the attribute in the base style.  In many cases, the base style may not define that particular attribute, so in that case, we must simply add the attribute to the element in the rolled-up style.  For example, we may have a style, SpaceBefore, which defines a style that has space before the paragraph, but no space after:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBefore">
  <w:name w:val="SpaceBefore"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:before="200"
               w:after="0"/>
  </w:pPr>
</w:style>

 

We may have a style, SpaceBeforeAndAfter, which defines the w:spacing element with a w:after attribute, like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBeforeAndAfter">
  <w:name w:val="SpaceBeforeAndAfter"/>
  <w:basedOn w:val="SpaceBefore"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:after="200"/>
  </w:pPr>
</w:style>

 

After 'rolling-up' the style chain, the style that we must apply to a paragraph that has the SpaceBeforeAndAfter style would look like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBeforeAndAfter">
  <w:name w:val="SpaceBeforeAndAfter"/>
  <w:basedOn w:val="SpaceBefore"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:before="200"
               w:after="200"/>
  </w:pPr>
</w:style>
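The attribute merge for the w:spacing example above can be sketched in a few lines of Python.  The function name merge_attributes is an assumption for this illustration; it simply starts from the base element's attributes and lets the derived element's attributes override or extend them.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def merge_attributes(base, derived):
    """Return a new element: base attributes, overridden by derived attributes."""
    merged = ET.Element(derived.tag, dict(base.attrib))
    merged.attrib.update(derived.attrib)
    return merged


# w:spacing from SpaceBefore (base) and SpaceBeforeAndAfter (derived).
base = ET.fromstring(f'<w:spacing xmlns:w="{W}" w:before="200" w:after="0"/>')
derived = ET.fromstring(f'<w:spacing xmlns:w="{W}" w:after="200"/>')

rolled_up = merge_attributes(base, derived)
print(rolled_up.attrib)
```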

 

Merging Child Elements

In some cases, we must merge child elements.  We must iterate through all child elements of an element in the derived style, and if the base style doesn't contain a particular element, we must add that element to the 'rolled-up' style.  If the base style does contain the element of interest, then we must either merge attributes or replace the child elements, based on the semantics defined for that child element.  The w:pPr and w:rPr elements are examples of elements that require this type of inheritance.

Consider the style NotIndented, which defines paragraph properties (w:pPr) as follows:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="NotIndented">
  <w:name w:val="NotIndented"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:spacing w:after="0"/>
  </w:pPr>
</w:style>

 

The following style, Indented, derives from NotIndented:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="Indented">
  <w:name w:val="Indented"/>
  <w:basedOn w:val="NotIndented"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:ind w:left="720"/>
  </w:pPr>
</w:style>

 

After rolling up all styles in the style chain, the style that we should apply to text styled as Indented would be defined as follows:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="Indented">
  <w:name w:val="Indented"/>
  <w:basedOn w:val="NotIndented"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:spacing w:after="0"/>
    <w:ind w:left="720"/>
  </w:pPr>
</w:style>

 

Note that both the w:spacing and w:ind elements require that their attributes be merged.  In most cases, per the list below, elements are replaced (as opposed to merging of attributes).

Replacing Elements

In some cases, while rolling up styles, we must replace an element and its attributes wholesale.  We don't need to iterate through attributes, replacing individual attributes.  The w:top (Paragraph Border Above Identical Paragraphs) element has these semantics.  Consider the following style that defines a single-line top border, with a size of 4 eighths of a point, and with a color of red (FF0000 in hex):

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder1">
  <w:name w:val="TopBorder1"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="007850D3"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="4"
             w:space="1"
             w:color="FF0000"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

Here is a derived style, TopBorder2, which defines a top border with a size of 18 eighths of a point, and no color defined:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder2">
  <w:name w:val="TopBorder2"/>
  <w:basedOn w:val="TopBorder1"/>
  <w:qFormat/>
  <w:rsid w:val="00315108"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="18"
             w:space="1"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

After rolling up the styles in the style chain, the resulting style that should be applied to a paragraph styled TopBorder2 should be like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder2">
  <w:name w:val="TopBorder2"/>
  <w:basedOn w:val="TopBorder1"/>
  <w:qFormat/>
  <w:rsid w:val="00315108"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="18"
             w:space="1"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

Notice that the w:color attribute was not inherited from TopBorder1.  The w:top element, along with its attributes, was replaced wholesale.
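Putting the three kinds of semantics together, a generalized roll-up could look something like this Python sketch.  It is not the code from the transform: the function names are invented, and the two dispatch sets cover only a handful of elements from the summary table.  Anything not listed is replaced wholesale, which matches the default described earlier.  It also assumes that a parent holds at most one child per element name, so it would not handle w:tabs or w:tblStylePr as written.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed dispatch sets; everything else gets "replace element" semantics.
MERGE_CHILDREN = {"pPr", "rPr", "pBdr"}
MERGE_ATTRIBUTES = {("pPr", "ind"), ("pPr", "spacing"), ("rPr", "lang")}


def local(tag):
    """Strip the namespace from a qualified tag name."""
    return tag.rsplit("}", 1)[-1]


def roll_up(base, derived):
    """Merge the derived element over the base element (same tag)."""
    result = ET.Element(base.tag, dict(base.attrib))
    result.attrib.update(derived.attrib)
    overrides = {child.tag: child for child in derived}
    for child in base:
        override = overrides.pop(child.tag, None)
        if override is None:
            result.append(copy.deepcopy(child))        # only in the base
        elif local(child.tag) in MERGE_CHILDREN:
            result.append(roll_up(child, override))    # merge child elements
        elif (local(base.tag), local(child.tag)) in MERGE_ATTRIBUTES:
            merged = copy.deepcopy(child)              # merge attributes
            merged.attrib.update(override.attrib)
            result.append(merged)
        else:
            result.append(copy.deepcopy(override))     # replace element
    for child in overrides.values():
        result.append(copy.deepcopy(child))            # only in the derived
    return result


def parse(xml):
    """Parse a w:pPr fragment, adding the namespace declaration."""
    return ET.fromstring(xml.replace("<w:pPr>", f'<w:pPr xmlns:w="{W}">', 1))


base = parse('<w:pPr><w:spacing w:before="200" w:after="0"/><w:pBdr>'
             '<w:top w:val="single" w:sz="4" w:space="1" w:color="FF0000"/>'
             '</w:pBdr></w:pPr>')
derived = parse('<w:pPr><w:spacing w:after="200"/><w:ind w:left="720"/><w:pBdr>'
                '<w:top w:val="single" w:sz="18" w:space="1"/></w:pBdr></w:pPr>')

rolled = roll_up(base, derived)
print(ET.tostring(rolled, encoding="unicode"))
```

In the result, w:spacing has merged attributes, w:ind is added from the derived style, and w:top is taken from the derived style without the base color.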

Style Conditional Table Formatting Properties

There is one special case where merging semantics are slightly more complicated.  Table styles have a very powerful feature called conditional table formatting.  This feature allows us to specify a special set of formatting properties for the top row, the first column, the bottom row, banded columns, banded rows, cells at the top left, top right, etc.  Conditional table formatting is defined in the w:tblStylePr element.  The following table style (markup has been simplified) contains a w:tblStylePr element for the first row, and a w:tblStylePr element for the first column:

<w:style w:type="table"
         w:customStyle="1"
         w:styleId="LightListRedHeader">
  <w:name w:val="Light List Red Header"/>
  <w:basedOn w:val="LightList"/>
  <w:tblStylePr w:type="firstRow">
    <w:pPr>
      <w:spacing w:before="0"
                 w:after="0"
                 w:line="240"
                 w:lineRule="auto"/>
    </w:pPr>
    <w:rPr>
      <w:b/>
      <w:bCs/>
      <w:color w:val="FFFFFF"
               w:themeColor="background1"/>
    </w:rPr>
    <w:tblPr/>
    <w:tcPr>
      <w:shd w:val="clear"
             w:color="auto"
             w:fill="FF0000"/>
    </w:tcPr>
  </w:tblStylePr>
  <w:tblStylePr w:type="firstCol">
    <w:rPr>
      <w:b/>
      <w:bCs/>
    </w:rPr>
  </w:tblStylePr>
</w:style>

 

A table style definition most often will have several w:tblStylePr elements.  We can't simply merge child elements for the w:tblStylePr element.  We must first match the w:type attribute, and then merge child elements.
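Matching on w:type before merging could be sketched in Python like this.  The function name and the contents of the base style below are hypothetical (the real LightList style is not shown in this post).  The function only pairs up the w:tblStylePr elements; each pair would then be merged child by child, and an element that appears on one side only is carried over as is.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
W_TYPE = f"{{{W}}}type"


def pair_conditional_formatting(base_style, derived_style):
    """Return (type, base tblStylePr or None, derived tblStylePr or None) tuples."""
    base = {e.get(W_TYPE): e for e in base_style.findall("w:tblStylePr", NS)}
    derived = {e.get(W_TYPE): e for e in derived_style.findall("w:tblStylePr", NS)}
    types = list(base) + [t for t in derived if t not in base]
    return [(t, base.get(t), derived.get(t)) for t in types]


# Hypothetical base style with two conditional formatting blocks.
BASE_XML = f"""
<w:style xmlns:w="{W}" w:type="table" w:styleId="LightList">
  <w:tblStylePr w:type="firstRow"><w:rPr><w:b/></w:rPr></w:tblStylePr>
  <w:tblStylePr w:type="lastRow"><w:rPr><w:b/></w:rPr></w:tblStylePr>
</w:style>
"""
DERIVED_XML = f"""
<w:style xmlns:w="{W}" w:type="table" w:styleId="LightListRedHeader">
  <w:basedOn w:val="LightList"/>
  <w:tblStylePr w:type="firstRow"><w:tcPr/></w:tblStylePr>
  <w:tblStylePr w:type="firstCol"><w:rPr><w:b/></w:rPr></w:tblStylePr>
</w:style>
"""

pairs = pair_conditional_formatting(ET.fromstring(BASE_XML), ET.fromstring(DERIVED_XML))
for cond_type, base_el, derived_el in pairs:
    print(cond_type, base_el is not None, derived_el is not None)
```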

Summary of Style Inheritance Semantics

The table at the end of this post summarizes the semantics that we must apply when 'rolling-up' styles.

A fair number of elements in the style hierarchy exist solely for the user interface or other purposes.  We are only interested in rolling up those elements that impact presentation, so I'm eliminating elements that don't apply.  A few elements (name and basedOn) are used in the rolling-up process, so I am listing those.

Note that this is only part of the story around putting together the style information for a cell in a table.  After rolling up styles in a style chain into a single set of properties for a table, we must also roll up character formatting information, which involves rolling up run formatting information for the table, for paragraph styles, and for run styles.  Before rolling any of this up, we need to take the global run properties into consideration.  And when rolling up this information over the hierarchy (table, paragraph, run), we need to handle something called toggle properties.  Finally, where appropriate, we must retrieve information from the theme of the document.  Stay tuned…


Element

Ecma376

Semantics

style

2.7.3.17

Merge child elements

  name

2.7.3.9

Used when assembling inheritance information

  basedOn

2.7.3.3

Used when assembling inheritance information

  pPr

2.7.7.2

Merge child elements

  rPr

2.7.8.1

Merge child elements

  tblPr

2.7.5.4

Merge child elements

  tblStylePr

2.7.5.6

Merge child elements (Conditional Table Formatting Properties).  See the note about this element above.

  tcPr

2.7.5.9

Merge child elements

  trPr

2.7.5.11

Merge child elements

pPr

17.7.8.2

 

  adjustRightInd

2.3.1.1

Replace element

  autoSpaceDE

2.3.1.2

Replace element

  autoSpaceDN

2.3.1.3

Replace element

  bidi

2.3.1.6

Replace element

  cnfStyle

2.3.1.8

Replace element

  contextualSpacing

2.3.1.9

Replace element

  framePr

2.3.1.11

Replace element

  ind

2.3.1.12

Merge attributes

  jc

2.3.1.13

Replace element

  keepLines

2.3.1.14

Replace element

  keepNext

2.3.1.15

Replace element

  kinsoku

2.3.1.16

Replace element

  mirrorIndents

2.3.1.18

Replace element

  numPr

2.3.1.19

Replace element

  outlineLvl

2.3.1.20

Replace element

  overflowPunct

2.3.1.21

Replace element

  pageBreakBefore

2.3.1.23

Replace element

  pBdr

2.3.1.24

Merge child elements

  rPr

2.3.1.29

Merge child elements

  shd

2.3.1.31

Replace element

  snapToGrid

2.3.1.32

Replace element

  spacing

2.3.1.33

Merge attributes

  suppressAutoHyphens

2.3.1.34

Replace element

  suppressLineNumbers

2.3.1.35

Replace element

  suppressOverlap

2.3.1.36

Replace element

  tabs

2.3.1.38

Merge child elements

  textAlignment

2.3.1.39

Replace element

  textboxTightWrap

2.3.1.40

Replace element

  textDirection

2.3.1.41

Replace element

  topLinePunct

2.3.1.43

Replace element

  widowControl

2.3.1.44

Replace element

  wordWrap

2.3.1.45

Replace element

rPr                     2.7.8.1
  b                     2.3.2.1   Replace element
  bCs                   2.3.2.2   Replace element
  bdr                   2.3.2.3   Replace element
  caps                  2.3.2.4   Replace element
  color                 2.3.2.5   Replace element
  cs                    2.3.2.6   Replace element
  dstrike               2.3.2.7   Replace element
  eastAsianLayout       2.3.2.8   Replace element
  effect                2.3.2.9   Replace element
  em                    2.3.2.10  Replace element
  emboss                2.3.2.11  Replace element
  fitText               2.3.2.12  Replace element
  highlight             2.3.2.13  Replace element
  i                     2.3.2.14  Replace element
  iCs                   2.3.2.15  Replace element
  imprint               2.3.2.16  Replace element
  kern                  2.3.2.17  Replace element
  lang                  2.3.2.18  Merge attributes
  oMath                 2.3.2.20  Replace element
  outline               2.3.2.21  Replace element
  position              2.3.2.22  Replace element
  rFonts                2.3.2.24  Replace element
  rtl                   2.3.2.28  Replace element
  shadow                2.3.2.29  Replace element
  shd                   2.3.2.30  Replace element
  smallCaps             2.3.2.31  Replace element
  snapToGrid            2.3.2.32  Replace element
  spacing               2.3.2.33  Replace element
  specVanish            2.3.2.34  Replace element
  strike                2.3.2.35  Replace element
  sz                    2.3.2.36  Replace element
  szCs                  2.3.2.37  Replace element
  u                     2.3.2.38  Replace element
  vanish                2.3.2.39  Replace element
  vertAlign             2.3.2.40  Replace element
  w                     2.3.2.41  Replace element
  webHidden             2.3.2.42  Replace element

tblPr
  bidiVisual            2.4.1     Replace element
  jc                    2.4.23    Replace element
  shd                   2.4.35    Replace element
  tblBorders            2.4.38    Merge child elements
  tblCellMar            2.4.39    Merge child elements
  tblCellSpacing        2.4.43    Replace element
  tblInd                2.4.48    Replace element
  tblLayout             2.4.49    Replace element
  tblLook               2.4.51    Replace element
  tblOverlap            2.4.53    Replace element
  tblpPr                2.4.54    Replace element
  tblStyleColBandSize   2.7.5.5   Replace element
  tblStyleRowBandSize   2.7.5.7   Replace element
  tblW                  2.4.61    Replace element

tblStylePr
  pPr                   2.7.5.1   Merge child elements
  rPr                   2.7.5.2   Merge child elements
  tblPr                 2.7.5.3   Merge child elements
  tcPr                  2.7.5.9   Merge child elements
  trPr                  2.7.5.10  Merge child elements

tcPr
  hideMark              2.4.15    Replace element
  noWrap                2.4.28    Replace element
  shd                   2.4.33    Replace element
  tcBorders             2.4.63    Merge child elements
  tcFitText             2.4.64    Replace element
  tcMar                 2.4.65    Merge child elements
  tcW                   2.4.68    Replace element
  textDirection         2.4.69    Replace element
  vAlign                2.4.80    Replace element

trPr
  cantSplit             2.4.6     Replace element
  gridAfter             2.4.10    Replace element
  gridBefore            2.4.11    Replace element
  hidden                2.4.14    Replace element
  jc                    2.4.22    Replace element
  tblCellSpacing        2.4.42    Replace element
  tblHeader             2.4.46    Replace element
  trHeight              2.4.77    Replace element
  wAfter                2.4.82    Replace element
  wBefore               2.4.83    Replace element
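
As a rough illustration of how the three behaviors in the table might be applied when one set of properties overrides another, here is a minimal Python sketch.  It rests on assumptions that I want to state loudly: "Replace element" is taken to mean the overriding element wins outright, "Merge attributes" that the overriding attributes are laid over the existing ones, and "Merge child elements" that children are replaced one by one by name.  The merge_props function and the abbreviated RULES dictionary are hypothetical names for this sketch only.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A few (parent, child) rows copied from the table. Anything not listed
# falls back to "replace". Illustrative only, not the complete table.
RULES = {
    ("pPr", "spacing"): "merge_attributes",
    ("pPr", "pBdr"): "merge_children",
    ("pPr", "tabs"): "merge_children",
    ("rPr", "lang"): "merge_attributes",
    ("tblPr", "tblBorders"): "merge_children",
    ("tcPr", "tcMar"): "merge_children",
}


def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def merge_props(base, override):
    """Return a copy of `base` (e.g. w:pPr) overlaid with `override`.

    Assumption: new children are appended, so the schema's required
    child order is NOT preserved by this sketch.
    """
    result = copy.deepcopy(base)
    parent = local(base.tag)
    for child in override:
        existing = result.find(child.tag)
        rule = RULES.get((parent, local(child.tag)), "replace")
        if existing is None or rule == "replace":
            # Replace element: the overriding element wins outright.
            if existing is not None:
                result.remove(existing)
            result.append(copy.deepcopy(child))
        elif rule == "merge_attributes":
            # Merge attributes: overriding attributes overlay the old ones.
            existing.attrib.update(child.attrib)
        else:
            # Merge child elements: replace grandchildren one by one.
            for grandchild in child:
                old = existing.find(grandchild.tag)
                if old is not None:
                    existing.remove(old)
                existing.append(copy.deepcopy(grandchild))
    return result
```

With these rules, overriding only w:after on a w:spacing element would leave an inherited w:before in place, while an overriding w:outlineLvl would replace the inherited one entirely.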

 

Transforming Open XML Word-Processing Documents to Html (Post #3)

[Blog Map]

(Update Nov 4, 2009: This is the 3rd in a series of posts (#1, #2, #3, #4, #5) on doing a transform of WordprocessingML to XHtml.)

For the last couple of weeks I've been designing and writing code to convert Open XML word-processing documents to HTML (or XHtml), and I'll continue over the next week.  My first post described in broad strokes my goals, my motivations for writing this code, and some details about the approach that I'm considering.  My second post provided more detail about how I'll proceed, my first thoughts about my use of CSS, and specific limitations that I'll place on the conversion.  It also presented my rationale for not converting numbered/bulleted items to li elements, along with a skeleton of the conversion code.  As I've been reading through the Open XML specification, more specifics about how I should proceed have become clear to me.  In this post, I'm going to detail some of my conclusions.

First, here are some additional limitations that I'm going to apply to this conversion:

·        I'm not going to attempt to convert documents that contain sub-documents.  This almost certainly would not be one of the primary scenarios.  The conversion will throw an exception if the document contains the w:subDoc element.

·        There are a number of legacy elements that I might be able to ignore: w:dayLong, w:dayShort, w:monthShort, w:monthLong, w:yearLong, w:yearShort, w:pgNum.  Conforming applications should not be writing out these elements.  At some point in the near future, I'm going to write some code to crawl my collection of sample Open XML documents, and count how many documents contain these elements.  This will help me decide whether to do the work to support these elements.

·        I'm going to ignore the w:ruby (phonetic guide) element.  This could be interesting, but I'll reserve this for a later version if it's important.  If this is important to you, I'd be very appreciative if you'd let me know.

·        For phase one of this project, I'm going to do only a rudimentary conversion of DrawingML, specifically to convert images in drawings.  DrawingML contains very rich constructs.  Doing all such transformations and generating appropriate images is in and of itself a complicated project.  However, basic images are described in DrawingML, and we need to be able to generate web pages that contain appropriate references to basic images, so it's important to handle this aspect of DrawingML.  I'll defer the high-fidelity conversion of all aspects of DrawingML to a later project.

·        I'm going to ignore all w:object elements.  Rendering w:object elements is certainly not a main-line scenario.

·        For phase one, I'm not going to attempt to render MathML markup.  This is interesting, but as with DrawingML, non-trivial.  In the interest of getting something working in the next couple of weeks, I'm not going to include conversion of MathML in phase one.

·        As I mentioned in last week's post, I'm not going to convert text separated by physical tabs in phase one.  There is no clean way to approach this, so until the best approach is clear, I'm not going to convert them.
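
The first two limitations above lend themselves to a small pre-flight check.  The following is a minimal Python sketch, not the conversion code itself: it raises an exception when a document part contains w:subDoc, and otherwise counts occurrences of the legacy elements named above.  The names check_document and UnsupportedDocumentError are hypothetical, and the sketch assumes the main document part has already been parsed into an ElementTree element.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Legacy elements listed in the post.
LEGACY = ("dayLong", "dayShort", "monthShort", "monthLong",
          "yearLong", "yearShort", "pgNum")


class UnsupportedDocumentError(Exception):
    """Raised for documents this sketch refuses to convert."""


def check_document(root):
    """Raise if `root` contains w:subDoc, else return legacy element counts.

    Returns a plain dict mapping local element name to the number of
    occurrences; names that never occur are omitted.
    """
    if next(root.iter(f"{{{W}}}subDoc"), None) is not None:
        raise UnsupportedDocumentError("document contains w:subDoc")
    counts = {}
    for name in LEGACY:
        n = sum(1 for _ in root.iter(f"{{{W}}}{name}"))
        if n:
            counts[name] = n
    return counts
```

Running check_document over each file in a sample collection and summing the returned dictionaries would give the per-element totals needed to decide whether the legacy elements are worth supporting.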

As I mentioned in the first post, I'm going to simplify the word-processing markup before transforming to HTML.  Here are some of the ways that I'll simplify:

·        I'll remove all rsid elements and attributes before doing the conversion to HTML.

·        I'll remove all comments, endnotes, and footnotes before doing the conversion.

·        I'll coalesce superfluous runs – combine adjacent runs with identical formatting into a single run.

·        And as I mentioned in the first post, I'll accept all tracked changes (tracked revisions) before doing the conversion.
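
Two of the simplifications above, removing rsid attributes and coalescing adjacent runs, can be sketched in a few lines of Python.  This is a minimal sketch under stated assumptions, not the actual simplification code: strip_rsid_attributes and coalesce_runs are hypothetical names, rsid elements (as opposed to attributes) are not handled, and only runs that contain nothing but an optional w:rPr and a single w:t are merged.  Two runs count as identically formatted only when their w:rPr elements serialize to the same bytes, which is deliberately conservative.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
R, RPR, T = (f"{{{W}}}{name}" for name in ("r", "rPr", "t"))


def strip_rsid_attributes(root):
    """Delete every attribute whose local name starts with 'rsid'."""
    for el in root.iter():
        doomed = [k for k in el.attrib
                  if k.rsplit("}", 1)[-1].startswith("rsid")]
        for key in doomed:
            del el.attrib[key]


def _is_simple(run):
    """True if the run holds only an optional w:rPr and exactly one w:t."""
    tags = [child.tag for child in run]
    return tags.count(T) == 1 and all(tag in (RPR, T) for tag in tags)


def _format_key(run):
    """Serialized w:rPr (tail whitespace dropped), or b'' if there is none."""
    rpr = run.find(RPR)
    if rpr is None:
        return b""
    clone = copy.deepcopy(rpr)
    clone.tail = None
    return ET.tostring(clone)


def coalesce_runs(paragraph):
    """Merge adjacent simple runs of `paragraph` that share formatting."""
    prev = None
    for child in list(paragraph):
        if child.tag != R or not _is_simple(child):
            prev = None  # anything else breaks the sequence of mergeable runs
            continue
        if prev is not None and _format_key(prev) == _format_key(child):
            target = prev.find(T)
            target.text = (target.text or "") + (child.find(T).text or "")
            # Keep leading/trailing spaces significant in the merged text.
            target.set(XML_SPACE, "preserve")
            paragraph.remove(child)
        else:
            prev = child
```

Stripping the rsid attributes first is a design choice rather than a requirement here, since the comparison only looks at w:rPr; it simply matches the order in which the simplifications are listed.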

Now that I've detailed what I won't convert, here is what I will convert:

·        I'll convert all paragraphs and runs, including all text, formatted with the correct font, and with correct paragraph formatting, such as space before and after each paragraph.  This includes honoring all style inheritance, as well as honoring all places where styles defer decisions on font and colors to themes.

·        I'll convert all tables with a high degree of fidelity, including theme formatting, conditional formatting, and tables within tables.  I believe I'll be able to correctly transform both horizontally and vertically merged cells.

·        I'll convert all numbered/bulleted items to straight paragraphs, not li elements.  I detailed my rationale for this decision last week.  I'm curious to see how this decision holds up in real-world situations.

·        I'll convert all images, which are represented by DrawingML.  This includes resizing, rotating, mirroring, and flipping images so that the resulting web page looks as close to the word-processing document as possible.  By far, the most important of these is resizing.

·        I'll render w:sectPr as a div element.

·        I'll appropriately render the cr, noBreakHyphen, tab, and br elements.

There are two varieties of hyperlinks as defined in Open XML: hyperlinks described in field codes, and hyperlinks in the simplified version that uses an external reference.  For simplicity, I'll convert all field code hyperlinks to the simplified version before doing the conversion to HTML.  For every other kind of field code, I'll remove the field code markup itself, leaving only the runs that render the field's result.  For example, markup for a typical field code looks like this:

<w:p>

  <w:r>

    <w:fldChar w:fldCharType="begin"/>

  </w:r>

  <w:r>

    <w:instrText xml:space="preserve"> DATE </w:instrText>

  </w:r>

  <w:r>

    <w:fldChar w:fldCharType="separate"/>

  </w:r>

  <w:r>

    <w:rPr>

      <w:noProof/>

    </w:rPr>

    <w:t>10/15/2009</w:t>

  </w:r>

  <w:r>

    <w:fldChar w:fldCharType="end"/>

  </w:r>

</w:p>
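
Stripping field-code markup from a paragraph like the one above can be sketched as a small state machine over the paragraph's runs.  The Python below is a minimal sketch under stated assumptions: strip_field_codes is a hypothetical name, nested fields and runs that are not direct children of the paragraph are not handled, hyperlink fields are assumed to have been converted beforehand, and run properties such as w:noProof on the result runs are left untouched.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = f"{{{W}}}r"
FLD_CHAR = f"{{{W}}}fldChar"
FLD_CHAR_TYPE = f"{{{W}}}fldCharType"

# Which state each w:fldChar marker switches to.
NEXT_STATE = {"begin": "code", "separate": "result", "end": "text"}


def strip_field_codes(paragraph):
    """Remove field-code runs from `paragraph`, keeping the result runs.

    States: "text" (outside any field), "code" (between begin and
    separate, i.e. the instruction), "result" (between separate and end).
    """
    state = "text"
    for run in paragraph.findall(R):
        marker = run.find(FLD_CHAR)
        if marker is not None:
            # The begin/separate/end marker runs are always dropped.
            state = NEXT_STATE.get(marker.get(FLD_CHAR_TYPE), state)
            paragraph.remove(run)
        elif state == "code":
            # Instruction text such as " DATE " is dropped as well.
            paragraph.remove(run)
        # Runs seen in the "result" or "text" state are kept as they are.
```

Applied to the paragraph above, this would leave only the run that contains the rendered date.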

 

In the simplification process, I'll transform this markup to the following, which will be easy to render in HTML:

<w:p>

  <w:r>

    <w:t>10/15/2009</w:t>

  </w:r>