Using the SharePoint 2010 Managed Client Object Model

The SharePoint 2010 Client Object Model is a very cool feature of SharePoint Foundation 2010 that enables developers to write applications that access SharePoint data from a .NET application running on a client computer, from a Silverlight application, and from JavaScript running client-side in a SharePoint web part.  The client object model enables this without requiring the developer to write or install code on the SharePoint server.  It makes it possible to build new categories of applications that integrate with Microsoft SharePoint Foundation 2010, including .NET applications, rich interactive web parts, and Silverlight applications.

Note:  This blog post is an MSDN article that I've written.  It will be published sometime in the near future.  It provides you with what you need to start working with the client object model.  Many thanks to Michael Cheng of the client object model team for technically reviewing this post, and providing me with many suggestions that improved the post.

Table of Contents

Using the Managed Client Object Model

How It Works

Creating a Windows Console Managed Client Object Model Application

The Managed Client Object Model

Object Identity

Trimming Result Sets

Creating and Populating a List

Using CAML to Query a List

Filtering the Child Collection returned by Load using LINQ

Using the LoadQuery Method

Increasing Performance by Nesting Includes in LoadQuery

Filtering the Child Collection returned by LoadQuery using LINQ

Updating Client Objects

Deleting Client Objects

Discovering the Schema for Fields

Accessing Large Lists

Asynchronous Processing

Other Resources

Introduction

A team leader has created a SharePoint team site with numerous lists that are necessary to manage her team’s mission.  She needs to make ad-hoc bulk modifications to these lists, perhaps updating assignments and estimates based on an Open XML spreadsheet, or moving items from one SharePoint list to another.  She wants to write a small custom application to help her manage this.

A software company that sells a traditional rich client application wants to integrate SharePoint document libraries and lists into their application, and they want this integration to be seamless, or even invisible to their users.

A SharePoint developer wants to build a rich web part for SharePoint that brings list data into his custom Ajax web code.  He also wants to build an even richer Silverlight application that does the same thing.

What do these people have in common?  They can make use of the SharePoint Managed Client Object Model to accomplish their goals.  The SharePoint Managed Client Object Model is a capability of SharePoint 2010 that allows us to write client-side code to work with all of the common objects in our SharePoint sites.  Programs running on the client can add and remove lists, add, update, and delete list items, modify documents in document libraries, create sites, manage permissions of items, add and remove web parts from a page, and much more.

Prior to the availability of the Managed Client Object Model, these people didn’t have a lot of options.  They could use the SharePoint Web Services, but this can be a fairly difficult task.  If the SharePoint Web Services didn’t provide the capabilities that they needed, they could write server-side code to provide a new web service (an even more difficult task).  Some IT departments disallow server-side code, or allow only code that is written by the IT department, so sometimes that isn’t even an option.  The SharePoint Client Object Model enables new types of applications, and makes it much easier for developers to write client-side code that interacts with their SharePoint data.

Using the Managed Client Object Model

To use the Managed Client Object Model (sometimes referred to as the client object model), you write .NET managed code that uses an API that is similar to the SharePoint Foundation Server-Side Object Model that developers use on the server.  The client object model has classes for accessing site collection information, site information, list and list item information, and much more.

In the case of web parts, you use an ECMAScript programming interface that is similar to the .NET API.  For Silverlight, you use a subset of the API that is available through .NET on the client.  While much of the information presented in this article is relevant to the ECMAScript and Silverlight APIs, this article will focus primarily on using the client object model from a .NET client application.

The client object model consists of two assemblies containing five namespaces.  If you look at the classes available in those namespaces, you will see that there are very, very many of them.  Don’t worry; many of those classes are used internally by the object model.  We’re interested only in a subset of them, primarily classes that have direct counterparts to some familiar classes in the SharePoint Foundation Server-Side Object Model.  The following table shows a few of the classes and their corresponding classes in the server object model:

Client                                       Server
-------------------------------------------  -------------------------------
Microsoft.SharePoint.Client.ClientContext    Microsoft.SharePoint.SPContext
Microsoft.SharePoint.Client.Site             Microsoft.SharePoint.SPSite
Microsoft.SharePoint.Client.Web              Microsoft.SharePoint.SPWeb
Microsoft.SharePoint.Client.List             Microsoft.SharePoint.SPList
Microsoft.SharePoint.Client.ListItem         Microsoft.SharePoint.SPListItem
Microsoft.SharePoint.Client.Field            Microsoft.SharePoint.SPField

 

You’ll notice that the client object model uses the same legacy naming pattern for site collections and sites as the server object model.  The Site class represents site collections, and the Web class represents sites.  My preference for using these classes is to name the variables so that the variable name indicates whether it is a site collection or a site, even though we must use the Site and Web classes to declare them:

ClientContext clientContext = new ClientContext(siteUrl);
Site siteCollection = clientContext.Site;
Web site = clientContext.Web;

 

How It Works

An application that uses SharePoint data will interact with the API in a number of ways: it calls methods and gets the return values, passes a CAML query and gets the results, and sets and gets properties.  After you use the API to accomplish a specific task, the client object model bundles up all of these uses of the API into some XML that it sends to the SharePoint server.  The server receives this request, makes the appropriate calls into the object model on the server, collects the responses, forms them into JavaScript Object Notation (JSON), and sends that JSON back to the client object model.  The client object model parses the JSON and presents the results to the application as .NET objects (or JavaScript objects in the case of ECMAScript).  The following diagram shows these interactions:

It is important to note that you control when the client object model initiates the process of sending the XML to the server and receiving the JSON back from the server.  We’ll see how shortly.

The bundling of multiple method calls into a single round trip to the server is dictated by the realities of network speed, network latency, and desired performance characteristics.  If the client object model sent data to the server with every method call, the poor performance and the increased network traffic would make the system unworkable.

As I mentioned, we must explicitly control when the client object model bundles method calls and sends a request to the server.  As part of this process, before initiating the round-trip to the server, we must explicitly specify what data we want to retrieve from the server.  This is the biggest difference between the client object model and the server object model.  But once you understand the model, it is pretty easy.  The easiest way to start understanding the difference is to see a simple application.

Creating a Windows Console Managed Client Object Model Application

This article uses Windows console applications for the sample code, but you can use the same approach with other application types.

To build the application, you need to add references to two assemblies, Microsoft.SharePoint.Client.dll and Microsoft.SharePoint.Client.Runtime.dll.  Installing SharePoint installs these assemblies on the SharePoint server.  The two assemblies are located at:

%ProgramFiles%\Common Files\Microsoft Shared\web server extensions\14\ISAPI

Note: For the pre-release versions of SharePoint Foundation 2010 and SharePoint Server 2010, you must install the assemblies as described here.  After the release, the procedure for installing the SharePoint 2010 Managed Client Object Model assemblies will change.  I will update this post and the MSDN article with the new procedures around the time of the final release of SharePoint Server 2010.

Copy those two assemblies and place them in a convenient location on your development client computer.  You will need to browse to those assemblies to add references to them when you are setting up projects that use the client object model.

To Build the Application

1. Start Microsoft Visual Studio 2010.

2. On the File menu, point to New, and then click Project.

3. In the New Project dialog box, in the Recent Template pane, expand Visual C#, and click Windows.

4. To the right of the Recent Template pane, click Console Application.

5. By default, Visual Studio creates a project that targets the .NET Framework 4, but we must target the .NET Framework 3.5.  From the list at the top of the New Project dialog box, select .NET Framework 3.5.

6. In the Name box, type a name for your project, such as FirstClientApiApplication.

7. In the Location box, type the location where you want to place the project.

8. Click OK to create the solution.

Add References to the Microsoft.SharePoint.Client and Microsoft.SharePoint.Client.Runtime Assemblies

The classes that you use in a client object model application are located in Microsoft.SharePoint.Client.dll and Microsoft.SharePoint.Client.Runtime.dll.  As I mentioned, before adding those references, you need to copy those assemblies from the SharePoint 2010 Server to your client development computer.

1. On the Project menu, click Add Reference to open the Add Reference dialog box.

2. Select the Browse tab, navigate to the location where you placed Microsoft.SharePoint.Client.dll and Microsoft.SharePoint.Client.Runtime.dll, select both DLLs, and click OK.

Add the Sample Code to the Solution

In Visual Studio, replace the contents of the Program.cs source file with the following code:

using System;
using Microsoft.SharePoint.Client;

class DisplayWebTitle
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        Web site = clientContext.Web;
        clientContext.Load(site);
        clientContext.ExecuteQuery();
        Console.WriteLine("Title: {0}", site.Title);
    }
}


 

Replace the URL in the ClientContext constructor with the URL to your SharePoint site.  Build and run the solution.  The example prints the title of the site.

Just as with the SharePoint Foundation Server-Side Object Model, you create a context for the SharePoint site that you want to access.  You can then retrieve a reference to the site from the context.

The call to ClientContext.ExecuteQuery causes the client object model to send the request to the server.  There will not be any network traffic until the application calls ClientContext.ExecuteQuery.

An important point to make about this example is that the call to the Load method doesn’t actually load anything.  Instead, it informs the client object model that when the application calls the ClientContext.ExecuteQuery method, you want to load the property values of the site object.

All interactions with the server follow this model:

  • You inform the client object model about the operations that you want to take.  This includes accessing the values of properties of objects (for example, objects of the List, ListItem, and Web classes), CAML queries that you want to run, and objects such as ListItem objects that you want to insert, update or delete.
  • You then call ClientContext.ExecuteQuery.  No network traffic takes place until you call ExecuteQuery. Until that point, your application is only registering its requests.
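To make the two steps concrete, here is a minimal sketch that registers three requests and then sends them to the server in a single round trip.  The URL is a placeholder, and the choice of objects to load (the site collection, the site, and its lists) is only an illustration.

```csharp
using System;
using Microsoft.SharePoint.Client;

class BatchedRequests
{
    static void Main()
    {
        // Placeholder URL; replace it with the URL of your own site.
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        Site siteCollection = clientContext.Site;
        Web site = clientContext.Web;
        ListCollection lists = site.Lists;

        // Step 1: register the requests.  Nothing is sent to the server yet.
        clientContext.Load(siteCollection);
        clientContext.Load(site);
        clientContext.Load(lists);

        // Step 2: one call to ExecuteQuery carries all three requests.
        clientContext.ExecuteQuery();

        Console.WriteLine("Site collection URL: {0}", siteCollection.Url);
        Console.WriteLine("Site title: {0}", site.Title);
        foreach (List list in lists)
            Console.WriteLine("List: {0}", list.Title);
    }
}
```

Calling ExecuteQuery after each Load would produce the same output, but it would cost three round trips instead of one.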

Now you have seen a simple example that demonstrates the nature of client object model applications: you first set up a query, and then you execute it, which causes the client object model to make a round trip to the server.  Let’s take a detailed look at the model, why it is designed the way it is, and how you code applications using it.

The Managed Client Object Model

There are a number of aspects of the client object model that we must examine.  There are some specific approaches that the client object model takes to minimize network traffic.  There are powerful ways that we can construct queries.  There are techniques that we can use to increase performance on the server.  We need to see how to create, update, and delete client objects.  There are approaches that we can take to work with very large lists.  But before we dive into any of these subjects, we need to examine the issue of object identity.

Object Identity

The key idea behind object identity is that client objects refer to their corresponding SharePoint object both before and after calling ExecuteQuery, and they continue to refer to that same SharePoint object through multiple calls to ExecuteQuery.

This means that in the process of setting up the query, the client object model will return objects to us that we can use further in setting up the query before calling ExecuteQuery.  This allows us to write more complex queries before starting the round-trip to the server.  We can do more interesting things in a single query, and reduce network traffic.

The following example gets the Announcements list object, and then gets all items of that list (using the simplest possible CAML query).

using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists.GetByTitle("Announcements");
        CamlQuery camlQuery = new CamlQuery();
        camlQuery.ViewXml = "<View/>";
        ListItemCollection listItems = list.GetItems(camlQuery);
        clientContext.Load(list);
        clientContext.Load(listItems);
        clientContext.ExecuteQuery();
        foreach (ListItem listItem in listItems)
            Console.WriteLine("Id: {0} Title: {1}", listItem.Id,
                listItem["Title"]);
    }
}


 

Notice the sequence in this example:

  • It first gets a Microsoft.SharePoint.Client.List object using the ListCollection.GetByTitle method.  Remember, this List object has no data in it; it won’t have data in any of its properties until the application calls the ExecuteQuery method.
  • It then calls the GetItems method on the list object, even though that list object hasn’t been populated with data.
  • It finally calls the Load method on both the list and listItems objects, and then calls the ExecuteQuery method.

The key point about this is that the client object model remembers that the list object is the one that the application initialized using the ListCollection.GetByTitle method, and that the client object model should execute the CAML query on that same list object after the list object has been retrieved from the SharePoint database.  Any class that derives from Microsoft.SharePoint.Client.ClientObject has these semantics.

And, as mentioned, we can continue to use client objects to set up further queries after calling the ExecuteQuery method.  In the following example, the code loads the list and calls the ExecuteQuery method.  It then uses that list client object to call the List.GetItems method, and then calls ExecuteQuery again.  The list object retains its identity through the call to ExecuteQuery.

using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists
            .GetByTitle("Announcements");
        clientContext.Load(list);
        clientContext.ExecuteQuery();
        Console.WriteLine("List Title: {0}", list.Title);
        CamlQuery camlQuery = new CamlQuery();
        camlQuery.ViewXml = "<View/>";
        ListItemCollection listItems = list.GetItems(camlQuery);
        clientContext.Load(listItems);
        clientContext.ExecuteQuery();
        foreach (ListItem listItem in listItems)
            Console.WriteLine("Id: {0} Title: {1}",
                listItem.Id, listItem["Title"]);
    }
}


 

Some properties and methods return objects or value types that do not derive from the Microsoft.SharePoint.Client.ClientObject class.  We benefit from using client object identity to access methods and properties only when those methods and properties return client objects or collections of them.  For instance, some classes, such as FieldUrlValue and FieldLookupValue derive from the ClientValueObject class, and we can’t make use of properties that return those types until after the call to the ExecuteQuery method.  Some properties return .NET types such as string or integer, and we also can’t use properties or methods that return those until after the call to the ExecuteQuery method.  Since we can’t use the values of any properties until those values have been populated in the ExecuteQuery call, we can’t, for instance, find an item in a list, and use the value of one of that item’s fields to select items in a further query.  If we try to use a property before it has been populated by the ExecuteQuery method, the client object model will throw a PropertyOrFieldNotInitializedException.
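The following sketch illustrates the last point.  It reads the Title property once before the call to ExecuteQuery, which fails, and once after, which succeeds.  The URL is a placeholder, and the try/catch block is there only to show the exception; you wouldn’t normally write code this way.

```csharp
using System;
using Microsoft.SharePoint.Client;

class UninitializedProperty
{
    static void Main()
    {
        // Placeholder URL; replace it with the URL of your own site.
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        Web site = clientContext.Web;
        clientContext.Load(site);

        try
        {
            // Title is a string, not a client object, so it has no value
            // until ExecuteQuery has populated it.
            Console.WriteLine("Title: {0}", site.Title);
        }
        catch (PropertyOrFieldNotInitializedException ex)
        {
            Console.WriteLine("Not loaded yet: {0}", ex.Message);
        }

        clientContext.ExecuteQuery();

        // After the round trip, the property value is available.
        Console.WriteLine("Title: {0}", site.Title);
    }
}
```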

Important note: Client object identity is valid only for a single ClientContext object.  If we initialize another ClientContext object to the same SharePoint site, we can’t use client objects from one client context with the other one.

We’ll make use of object identity behavior in a number of the examples that I present in this article.

This example doesn’t do any error handling.  If the Announcements list doesn’t exist, the client object model will throw an exception in the call to the ExecuteQuery method.  You should be prepared to catch exceptions when you write code that may fail if you ask for objects that may not exist.
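As a minimal sketch of that kind of error handling, the following code wraps the call to ExecuteQuery in a try/catch block.  I’m assuming here that a request for a list that doesn’t exist surfaces as a ServerException; the URL and the list title are placeholders, and a real application would likely do more than print the error.

```csharp
using System;
using Microsoft.SharePoint.Client;

class HandleMissingList
{
    static void Main()
    {
        // Placeholder URL; replace it with the URL of your own site.
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists.GetByTitle("Announcements");
        clientContext.Load(list);

        try
        {
            // The request is only sent here, so this is where a
            // server-side failure is reported to the client.
            clientContext.ExecuteQuery();
            Console.WriteLine("List Title: {0}", list.Title);
        }
        catch (ServerException ex)
        {
            // Assumption: a missing list is reported as a ServerException.
            Console.WriteLine("Server error ({0}): {1}",
                ex.ServerErrorTypeName, ex.Message);
        }
    }
}
```

Notice that GetByTitle itself doesn’t throw, because no request has been sent at that point.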

Trimming Result Sets

SharePoint Server 2010 is often deployed in organizations with many thousands of users.  When building an application that accesses SharePoint Server 2010 across the network, it makes sense to build it so that it uses the least amount of network traffic.  There are a number of ways that the client object model helps us do this.  The simplest approach is to use lambda expressions to specify exactly which properties the client object model should return to the application.

The following example shows how to specify that when the client object model loads the site object, it needs to load only the Title and Description properties.  This will reduce the size of the JSON response from the server back to the client object model.

using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        Web site = clientContext.Web;
        clientContext.Load(site,
            s => s.Title,
            s => s.Description);
        clientContext.ExecuteQuery();
        Console.WriteLine("Title: {0} Description: {1}",
            site.Title, site.Description);
    }
}


 

If we don’t include these lambda expressions in the call to the Load method, then by default it loads a much larger number of properties (but not all of them).  The first two examples called Load without specifying which properties to load, so the JSON packet that the server returned was somewhat larger than it needed to be.  While in these small examples, it doesn’t make much difference, when loading thousands of list items, carefully specifying the required properties will reduce network traffic.

This use of lambda expressions provides a means for us to specify a list of our desired .NET properties to the Load method.  But reducing network traffic isn’t the only benefit we derive from the client object model’s use of lambda expressions.  Later on in this article, we’ll see how we can filter result sets using lambda expressions.

Next I'll show an example that creates a list and adds some data to it.  Then we’ll have some sample data to work with for the rest of this article.

Creating and Populating a List

The following example creates a list, and adds a couple of fields and a few items to it.

using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        Web site = clientContext.Web;

        // Create a new list.
        ListCreationInformation listCreationInfo =
            new ListCreationInformation();
        listCreationInfo.Title = "Client API Test List";
        listCreationInfo.TemplateType = (int)ListTemplateType.GenericList;
        List list = site.Lists.Add(listCreationInfo);

        // Add fields to the list.
        Field field1 = list.Fields.AddFieldAsXml(
            @"<Field Type='Choice'
                     DisplayName='Category'
                     Format='Dropdown'>
                <Default>Specification</Default>
                <CHOICES>
                  <CHOICE>Specification</CHOICE>
                  <CHOICE>Development</CHOICE>
                  <CHOICE>Test</CHOICE>
                  <CHOICE>Documentation</CHOICE>
                </CHOICES>
              </Field>",
            true, AddFieldOptions.DefaultValue);
        Field field2 = list.Fields.AddFieldAsXml(
            @"<Field Type='Number'
                     DisplayName='Estimate'/>",
            true, AddFieldOptions.DefaultValue);

        // Add some data.
        ListItemCreationInformation itemCreateInfo =
            new ListItemCreationInformation();
        ListItem listItem = list.AddItem(itemCreateInfo);
        listItem["Title"] = "Write specs for user interface.";
        listItem["Category"] = "Specification";
        listItem["Estimate"] = "20";
        listItem.Update();

        listItem = list.AddItem(itemCreateInfo);
        listItem["Title"] = "Develop proof-of-concept.";
        listItem["Category"] = "Development";
        listItem["Estimate"] = "42";
        listItem.Update();

        listItem = list.AddItem(itemCreateInfo);
        listItem["Title"] = "Write test plan for user interface.";
        listItem["Category"] = "Test";
        listItem["Estimate"] = "16";
        listItem.Update();

        listItem = list.AddItem(itemCreateInfo);
        listItem["Title"] = "Validate SharePoint interaction.";
        listItem["Category"] = "Test";
        listItem["Estimate"] = "18";
        listItem.Update();

        listItem = list.AddItem(itemCreateInfo);
        listItem["Title"] = "Develop user interface.";
        listItem["Category"] = "Development";
        listItem["Estimate"] = "18";
        listItem.Update();

        clientContext.ExecuteQuery();
    }
}


 

In many cases, where it is possible to create a client object, the application can call an Add method that takes as an argument an object that specifies creation information.  In this example, you can see the use of the ListCreationInformation class for creating a List object, and the use of the ListItemCreationInformation class for creating a ListItem object.  You often will set properties of the creation information class after instantiating it.  You can see that the code sets the Title and TemplateType properties of the ListCreationInformation object.  Note that to create a list, you call the ListCollection.Add method, but to create a ListItem, you call the List.AddItem method.  One is on the collection, and the other is on the list itself.

Creating fields in a list also doesn’t use a method named Add that takes a FieldCreationInformation object as an argument, because when we create fields, we are not really creating an instance of the Microsoft.SharePoint.Client.Field class; we are creating an instance of a class that derives from Field.  There are many, many options available for those derived classes, and this would significantly complicate the design of a FieldCreationInformation class; for this reason, the client object model doesn’t include such a class.  Instead, the simplest way to create a field is to specify a little bit of XML that defines the field, and pass that XML to the FieldCollection.AddFieldAsXml method.  There is a FieldCollection.Add method that we can use to create a field, but instead of taking a FieldCreationInformation object, it takes another Field object as a parameter that it uses as a prototype for the field to be created.  This is useful in some scenarios.

In the Discovering the Schema for Fields section of this article, I’ll show you an easy way to discover the XML that you need to specify for fields that you want to create.

It’s important to note that, of course, no objects are actually added to the SharePoint database until the application calls ExecuteQuery.

There is one more item of interest in this example.  Notice that after calling the List.AddItem method, the example sets three indexed properties.  We are setting the values of the fields that were just added to the list.  After setting these properties, the application must call the ListItem.Update method, informing the client object model that those objects have been modified.  The client object model will not work properly if you don’t do so.  We will see the use of the Update method in further examples, when I show how to modify existing client objects.

Now that we have some data, let’s discover some interesting ways to query and modify it.

Using CAML to Query a List

The following example shows how to query the list that we created in the last example using CAML.  This example prints the Development items from our test list.

using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main(string[] args)
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists
            .GetByTitle("Client API Test List");
        CamlQuery camlQuery = new CamlQuery();
        camlQuery.ViewXml =
            @"<View>
                <Query>
                  <Where>
                    <Eq>
                      <FieldRef Name='Category'/>
                      <Value Type='Text'>Development</Value>
                    </Eq>
                  </Where>
                </Query>
                <RowLimit>100</RowLimit>
              </View>";
        ListItemCollection listItems = list.GetItems(camlQuery);
        clientContext.Load(
             listItems,
             items => items
                 .Include(
                     item => item["Title"],
                     item => item["Category"],
                     item => item["Estimate"]));
        clientContext.ExecuteQuery();
        foreach (ListItem listItem in listItems)
        {
            Console.WriteLine("Title: {0}", listItem["Title"]);
            Console.WriteLine("Category: {0}", listItem["Category"]);
            Console.WriteLine("Estimate: {0}", listItem["Estimate"]);
            Console.WriteLine();
        }
    }
}


 

This example produces the following output:

Title: Develop proof-of-concept.
Category: Development
Estimate: 42

Title: Develop user interface.
Category: Development
Estimate: 18

 

You may notice that there is a difference between the lambda expressions that we specify in this example and the lambda expressions in the example presented in the Trimming Result Sets section.  We must use the ClientObjectQueryableExtension.Include extension method to specify the properties that we want to load for each item in the collection that we’re loading.  The items parameter of the lambda expression is of type ListItemCollection, which of course doesn’t contain an indexed property that allows us to specify which properties to load for items in the collection.  Instead, we call the Include extension method, which allows us to specify which properties of the items in that collection to load.  Parameters to lambda expressions in the Include extension method are of the type of the items of the collection, so this allows us to specify the properties that we want to load for each item in the collection.

It isn’t entirely necessary to understand the exact semantics of this use of lambda expressions.  Just remember two coding idioms:

If you are requesting that the client object model load certain properties of a client object (not a collection of them), then specify the properties in the lambda expression that you place directly in the Load method:

clientContext.Load(site,
    s => s.Title,
    s => s.Description);

 

If you are requesting that the client object model load certain properties of each of the items in a collection of client objects, then use the Include extension method, and pass the lambda expressions that specify your desired properties to the Include method:

clientContext.Load(
    listItems,
    items => items
        .Include(
            item => item["Title"],
            item => item["Category"],
            item => item["Estimate"]));

 

Note: I'm currently working on a blog post on exactly how lambda expressions work in the client object model.  Now THAT is going to be a geeky post! :-)

Filtering the Child Collection returned by Load using LINQ

Because the ClientObjectQueryableExtension.Include extension method returns IQueryable<T>, we can chain from the Include method into the IQueryable<T>.Where extension method.  This provides a succinct way to filter the result set.  We should only use this capability when querying client object collections other than collections of ListItem objects, because while we could use this technique to filter collections of ListItem objects, the use of CAML will result in better performance.  This is important enough to say again:

Never use the IQueryable<T>.Where extension method when querying ListItem objects.  The reason is that under the covers, the client object model first evaluates the result of the CAML query, retrieves the results, and then filters the resulting collection using LINQ.  If you filter an extremely large list using LINQ instead of CAML, the client object model will attempt to retrieve all items in the list before filtering with LINQ.  It will either issue queries that consume too many system resources, or the query will fail, and the reason won't be evident unless you know how the client object model works internally.  So we need to continue to use CAML when querying list items.  There are some interesting ways that we can make CAML easier to use.  But that's another blog post.
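To show what keeping the filter in CAML looks like, here is a sketch that selects the items in our test list with an estimate greater than a threshold.  The threshold of 17 is an arbitrary value that I picked for illustration, and I’m assuming that the internal name of the field matches its display name, Estimate, as it does for Category in the earlier CAML example.

```csharp
using System;
using Microsoft.SharePoint.Client;

class FilterWithCaml
{
    static void Main()
    {
        // Placeholder URL; replace it with the URL of your own site.
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists
            .GetByTitle("Client API Test List");

        // The filter is part of the CAML query, so the server applies it
        // and returns only the matching items.
        CamlQuery camlQuery = new CamlQuery();
        camlQuery.ViewXml =
            @"<View>
                <Query>
                  <Where>
                    <Gt>
                      <FieldRef Name='Estimate'/>
                      <Value Type='Number'>17</Value>
                    </Gt>
                  </Where>
                </Query>
                <RowLimit>100</RowLimit>
              </View>";
        ListItemCollection listItems = list.GetItems(camlQuery);

        // Include trims the properties that are returned; it does not filter.
        clientContext.Load(
            listItems,
            items => items
                .Include(
                    item => item["Title"],
                    item => item["Estimate"]));
        clientContext.ExecuteQuery();

        foreach (ListItem listItem in listItems)
            Console.WriteLine("Title: {0} Estimate: {1}",
                listItem["Title"], listItem["Estimate"]);
    }
}
```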

The following example queries the client object model for all lists that are not hidden.  Notice that we must include a using directive for System.Linq.

using System;

using System.Linq;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        ListCollection listCollection = clientContext.Web.Lists;

        clientContext.Load(

            listCollection,

            lists => lists

                .Include(

                    list => list.Title,

                    list => list.Hidden)

                .Where(list => ! list.Hidden)

             );

        clientContext.ExecuteQuery();

        foreach (var list in listCollection)

            Console.WriteLine(list.Title);

    }

}

On my computer, this example produces the following output:

Announcements

Calendar

Client API Test List

Content and Structure Reports

Customized Reports

Eric's ToDo List

Eric's Wiki

Form Templates

Links

Reusable Content

Shared Documents

Site Assets

Site Collection Documents

Site Collection Images

Site Pages

Style Library

Tasks

Team Discussion

Workflow Tasks

 

Using the LoadQuery Method

The LoadQuery method is similar in functionality to the Load method, except that in certain circumstances the client object model can process the queries more efficiently and use less memory.  It also allows for a more flexible programming style.

LoadQuery has different semantics than the Load method.  Whereas the Load method populates the client object (or client object collection) with data from the server, the LoadQuery method populates and returns an entirely new collection.  This means that you can query the same object collection multiple times and retain separate result sets for each query.  For instance, you can query for all items in a project list that are assigned to a certain person, and separately query for all items whose estimated hours exceed a certain threshold, and access both result sets simultaneously.  It also allows you to let these collections go out of scope, and thereby become eligible for garbage collection.  Collections that you load using the Load method can be garbage collected only when the client context variable itself goes out of scope.  Other than these distinctions, the LoadQuery method is very similar to the Load method.

The following example uses LoadQuery to retrieve a list of all the lists in the site.

using System;

using System.Collections.Generic;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        Web site = clientContext.Web;

        ListCollection lists = site.Lists;

        IEnumerable<List> newListCollection = clientContext.LoadQuery(

            lists.Include(

                list => list.Title,

                list => list.Id,

                list => list.Hidden));

        clientContext.ExecuteQuery();

        foreach (List list in newListCollection)

            Console.WriteLine("Title: {0} Id: {1}",

                list.Title.PadRight(40), list.Id.ToString("D"));

    }

}

Notice that the LoadQuery method returns an entirely new list collection that we can iterate through.  The new list collection has a type of IEnumerable<List> instead of ListCollection.

There is one aspect of the semantics of LoadQuery that you need to pay attention to.  In the above example, the original lists variable doesn’t have its property values populated after ExecuteQuery returns.  If you want that list to be populated, you must explicitly call Load on it, specifying which properties you want loaded.
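To make the point about separate result sets concrete, here is a minimal sketch that issues two LoadQuery calls against the same ListCollection and retains both results.  It is not taken from the SDK; the site URL is the same placeholder used throughout this article, and the two filters are chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        ListCollection lists = clientContext.Web.Lists;

        // First result set: lists that are hidden.
        IEnumerable<List> hiddenLists = clientContext.LoadQuery(
            lists
                .Include(list => list.Title, list => list.Hidden)
                .Where(list => list.Hidden));

        // Second result set, from the same collection: document libraries.
        IEnumerable<List> libraries = clientContext.LoadQuery(
            lists
                .Include(list => list.Title, list => list.BaseType)
                .Where(list => list.BaseType == BaseType.DocumentLibrary));

        // A single ExecuteQuery call sends both pending queries to the server.
        clientContext.ExecuteQuery();

        foreach (List list in hiddenLists)
            Console.WriteLine("Hidden list: {0}", list.Title);
        foreach (List list in libraries)
            Console.WriteLine("Document library: {0}", list.Title);
    }
}
```

Each call returns its own IEnumerable<List>, so the two result sets can be used side by side, and either one can go out of scope independently of the other.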

Increasing Performance by Nesting Includes in LoadQuery

When calling the LoadQuery method, we can specify multiple levels of properties to load.  This allows the client object model to optimize its access to the SharePoint server internally by reducing the number of times the client object model must do a round-trip to the SharePoint server to retrieve the data that you need.  The following query retrieves all lists from the site, and all fields from each list.  It then prints them to the console, indicating if each list or field is hidden.

using System;

using System.Collections.Generic;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main()

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        IEnumerable<List> lists = clientContext.LoadQuery(

            clientContext.Web.Lists.Include(

                list => list.Title,

                list => list.Hidden,

                list => list.Fields.Include(

                    field => field.Title,

                    field => field.Hidden)));

        clientContext.ExecuteQuery();

        foreach (List list in lists)

        {

            Console.WriteLine("{0}List: {1}",

                list.Hidden ? "Hidden " : "", list.Title);

            foreach (Field field in list.Fields)

                Console.WriteLine("  {0}Field: {1}",

                    field.Hidden ? "Hidden " : "",

                    field.Title);

        }

    }

}

This approach allows the server portion of the client object model to be more efficient than if the application first loaded a list of lists, and then loaded fields for each list.
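For contrast, the following sketch shows the less efficient alternative that the nested Include avoids: loading the lists first, and then loading the fields of each list in its own round trip.  It is an illustration only, written against the same placeholder site; it is here to show what not to do when a nested Include is available.

```csharp
using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        ListCollection lists = clientContext.Web.Lists;

        // Round trip 1: retrieve the lists only.
        clientContext.Load(
            lists,
            ls => ls.Include(
                list => list.Title,
                list => list.Hidden));
        clientContext.ExecuteQuery();

        foreach (List list in lists)
        {
            // One additional round trip for every list in the site.
            FieldCollection fields = list.Fields;
            clientContext.Load(
                fields,
                fs => fs.Include(
                    field => field.Title,
                    field => field.Hidden));
            clientContext.ExecuteQuery();

            Console.WriteLine("{0}List: {1}",
                list.Hidden ? "Hidden " : "", list.Title);
            foreach (Field field in fields)
                Console.WriteLine("  {0}Field: {1}",
                    field.Hidden ? "Hidden " : "", field.Title);
        }
    }
}
```

With N lists in the site, this version calls ExecuteQuery N + 1 times, whereas the nested Include needs only one call.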

Filtering the Child Collection returned by LoadQuery using LINQ

The LoadQuery method takes an object of type IQueryable<T> as its parameter, and this allows us to write LINQ queries instead of CAML to filter the results.  This example returns a collection of all document libraries that are not hidden.

using System;

using System.Linq;

using System.Collections.Generic;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        ListCollection listCollection = clientContext.Web.Lists;

        IEnumerable<List> visibleLibraries = clientContext.LoadQuery(

            listCollection

                .Where(list => !list.Hidden &&

                       list.BaseType == BaseType.DocumentLibrary));

        clientContext.ExecuteQuery();

        foreach (var list in visibleLibraries)

            Console.WriteLine(list.Title);

    }

}

Updating Client Objects

Updating client objects using the client object model is pretty simple.  We retrieve the object(s), alter properties, call the Update method for each object we change, and then call the ExecuteQuery method.  The following example modifies items in the Client API Test List, increasing the estimate for all development items by 50% (a common operation):

using System;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        List list = clientContext.Web.Lists

            .GetByTitle("Client API Test List");

        CamlQuery camlQuery = new CamlQuery();

        camlQuery.ViewXml =

            @"<View>

                <Query>

                  <Where>

                    <Eq>

                      <FieldRef Name='Category'/>

                      <Value Type='Text'>Development</Value>

                    </Eq>

                  </Where>

                </Query>

                <RowLimit>100</RowLimit>

              </View>";

        ListItemCollection listItems = list.GetItems(camlQuery);

        clientContext.Load(

             listItems,

             items => items.Include(

                 item => item["Category"],

                 item => item["Estimate"]));

        clientContext.ExecuteQuery();

        foreach (ListItem listItem in listItems)

        {

            listItem["Estimate"] = (double)listItem["Estimate"] * 1.5;

            listItem.Update();

        }

        clientContext.ExecuteQuery();

    }

}

Deleting Client Objects

Deleting client objects is just as easy.  However, there is one very important dynamic around deleting client objects from a client object collection.  We can’t simply iterate through the collection, deleting objects.  As soon as we delete the first object, it causes the iterator of the client object collection to malfunction.  The iterator may throw an exception, or it may quietly finish but not visit all items in the collection.  Instead, we need to materialize the collection into a List<T> using the ToList method, and then iterate through that list, deleting the client objects.

The following example deletes the test items from our Client API Test List.  It shows using the ToList method to materialize the collection before we iterate through it:

using System;

using System.Linq;

using System.Collections.Generic;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext = new ClientContext("http://intranet.contoso.com");

        List list = clientContext.Web.Lists.GetByTitle("Client API Test List");

        CamlQuery camlQuery = new CamlQuery();

        camlQuery.ViewXml =

            @"<View>

                <Query>

                  <Where>

                    <Eq>

                      <FieldRef Name='Category'/>

                      <Value Type='Text'>Test</Value>

                    </Eq>

                  </Where>

                </Query>

                <RowLimit>100</RowLimit>

              </View>";

        ListItemCollection listItems = list.GetItems(camlQuery);

        clientContext.Load(

             listItems,

             items => items.Include(

                 item => item["Title"]));

        clientContext.ExecuteQuery();

        foreach (ListItem listItem in listItems.ToList())

            listItem.DeleteObject();

        clientContext.ExecuteQuery();

    }

}

The following code snippet shows the incorrect approach:

clientContext.Load(

    listItems,

    items => items.Include(

        item => item["Title"]));

clientContext.ExecuteQuery();

 

// the following line doesn’t include the call to ToList

 

foreach (ListItem listItem in listItems)

    listItem.DeleteObject();

clientContext.ExecuteQuery();
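The same hazard exists with ordinary .NET collections, which makes it easy to reproduce without a SharePoint server.  The sketch below is an analogy only: it uses a List<string> rather than a client object collection, and the class and method names are hypothetical.  It shows that iterating over a snapshot made with ToList is safe, while removing items from the collection being enumerated makes the iterator fail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SnapshotDeleteDemo
{
    // Removes every item while iterating over a snapshot of the collection.
    public static int DeleteAll(List<string> items)
    {
        int deleted = 0;
        foreach (string item in items.ToList())
        {
            items.Remove(item);
            deleted++;
        }
        return deleted;
    }

    // Attempts the same without a snapshot.  Returns true if the iterator
    // failed because the collection changed underneath it.
    public static bool DeleteWithoutSnapshotFails(List<string> items)
    {
        try
        {
            foreach (string item in items)
                items.Remove(item);
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}

class Program
{
    static void Main()
    {
        var first = new List<string> { "a", "b", "c" };
        Console.WriteLine(SnapshotDeleteDemo.DeleteAll(first));

        var second = new List<string> { "a", "b", "c" };
        Console.WriteLine(SnapshotDeleteDemo.DeleteWithoutSnapshotFails(second));
    }
}
```

List<T> reports the problem with an InvalidOperationException.  As noted above, a client object collection may instead finish quietly without visiting every item, which is harder to notice.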

 

Finally, just to clean up the Client API Test List, here is an example that deletes the list and its items:

using System;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main()

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        clientContext.Web.Lists.GetByTitle("Client API Test List")

            .DeleteObject();

        clientContext.ExecuteQuery();

    }

}

Discovering the Schema for Fields

As promised, this section shows an easy way to discover the XML schema that you need to know to create the fields that you want in a list.  First, go into your SharePoint site, and create a list that contains columns that are configured as you want them.  You can then use the small example below to output the XML that will create those fields.

The following example prints the field schemas for the fields that I added to the Client API Test List:

using System;

using System.Linq;

using System.Xml.Linq;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        List list = clientContext.Web.Lists

            .GetByTitle("Client API Test List");

        clientContext.Load(list);

        FieldCollection fields = list.Fields;

        clientContext.Load(fields);

        clientContext.ExecuteQuery();

        foreach (var f in fields)

        {

            XElement e = XElement.Parse(f.SchemaXml);

            string name = (string)e.Attribute("Name");

            if (name == "Category" || name == "Estimate")

            {

                e.Attributes("ID").Remove();

                e.Attributes("SourceID").Remove();

                e.Attributes("ColName").Remove();

                e.Attributes("RowOrdinal").Remove();

                e.Attributes("StaticName").Remove();

                Console.WriteLine(e);

                Console.WriteLine("===============");

            }

        }

    }

}

When you run this after creating the list using the example program in the section Creating and Populating a List, it produces the following output:

<Field Type="Choice" DisplayName="Category" Format="Dropdown" Name="Category">

  <Default>Specification</Default>

  <CHOICES>

    <CHOICE>Specification</CHOICE>

    <CHOICE>Development</CHOICE>

    <CHOICE>Test</CHOICE>

    <CHOICE>Documentation</CHOICE>

  </CHOICES>

</Field>

===============

<Field Type="Number" DisplayName="Estimate" Name="Estimate" />

===============

 

The example removes attributes that you don’t need for creating the field.
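Once you have the schema, you can feed it back to the FieldCollection.AddFieldAsXml method to create the same field in another list.  The following is a minimal sketch under stated assumptions: the list named "Another Test List" is hypothetical and must already exist, and the schema string is the Estimate field from the output above with its quotes changed to single quotes.

```csharp
using System;
using Microsoft.SharePoint.Client;

class Program
{
    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");

        // Hypothetical target list; it must already exist in the site.
        List list = clientContext.Web.Lists
            .GetByTitle("Another Test List");

        // Field schema copied from the discovery output.
        string estimateSchema =
            "<Field Type='Number' DisplayName='Estimate' Name='Estimate' />";

        list.Fields.AddFieldAsXml(
            estimateSchema,
            true,                           // add the field to the default view
            AddFieldOptions.DefaultValue);

        clientContext.ExecuteQuery();
    }
}
```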

Accessing Large Lists

SharePoint development guidelines indicate that you should not attempt to retrieve more than 2000 items in a single query.  If this is a possibility in your application, consider using the RowLimit element in your CAML queries to limit the amount of data that the client object model retrieves for your application.  Sometimes you must access all items in a list that may contain more than 2000 items.  If you must do so, then best practice is to page through the items 2000 at a time.  This section presents an approach to paging using the CamlQuery.ListItemCollectionPosition property.

using System;

using System.Linq;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main()

    {

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

 

        List list = clientContext.Web.Lists

            .GetByTitle("Client API Test List");

 

        // First, add 20 items to Client API Test List so that there are

        // enough records to show paging.

        ListItemCreationInformation itemCreateInfo =

            new ListItemCreationInformation();

        for (int i = 0; i < 20; i++)

        {

            ListItem listItem = list.AddItem(itemCreateInfo);

            listItem["Title"] = String.Format("New Item #{0}", i);

            listItem["Category"] = "Development";

            listItem["Estimate"] = i;

            listItem.Update();

        }

        clientContext.ExecuteQuery();

 

        // This example shows paging through the list ten items at a time.

        // In a real-world scenario, you would want to limit a page to

        // 2000 items.

        ListItemCollectionPosition itemPosition = null;

        while (true)

        {

            CamlQuery camlQuery = new CamlQuery();

            camlQuery.ListItemCollectionPosition = itemPosition;

            camlQuery.ViewXml =

                @"<View>

                    <ViewFields>

                      <FieldRef Name='Title'/>

                      <FieldRef Name='Category'/>

                      <FieldRef Name='Estimate'/>

                    </ViewFields>

                    <RowLimit>10</RowLimit>

                  </View>";

            ListItemCollection listItems = list.GetItems(camlQuery);

            clientContext.Load(listItems);

            clientContext.ExecuteQuery();

            itemPosition = listItems.ListItemCollectionPosition;

            foreach (ListItem listItem in listItems)

                Console.WriteLine("  Item Title: {0}", listItem["Title"]);

            if (itemPosition == null)

                break;

            Console.WriteLine(itemPosition.PagingInfo);

            Console.WriteLine();

        }

    }

}

This example produces the following output:

  Item Title: Write specs for user interface.

  Item Title: Develop proof-of-concept.

  Item Title: Write test plan for user interface.

  Item Title: Validate SharePoint interaction.

  Item Title: Develop user interface.

  Item Title: New Item #0

  Item Title: New Item #1

  Item Title: New Item #2

  Item Title: New Item #3

  Item Title: New Item #4

Paged=TRUE&p_ID=10

 

  Item Title: New Item #5

  Item Title: New Item #6

  Item Title: New Item #7

  Item Title: New Item #8

  Item Title: New Item #9

  Item Title: New Item #10

  Item Title: New Item #11

  Item Title: New Item #12

  Item Title: New Item #13

  Item Title: New Item #14

Paged=TRUE&p_ID=20

 

  Item Title: New Item #15

  Item Title: New Item #16

  Item Title: New Item #17

  Item Title: New Item #18

  Item Title: New Item #19
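If several parts of an application need to page through lists, the loop can be wrapped in an iterator so that callers see a single sequence of items.  The sketch below is one possible arrangement, not an SDK feature: GetAllItems is a hypothetical helper name, and it uses only the CamlQuery and ListItemCollectionPosition members shown in the example above.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Client;

class Program
{
    // Hypothetical helper: yields every item in the list, retrieving one
    // page of at most pageSize items per round trip to the server.
    static IEnumerable<ListItem> GetAllItems(
        ClientContext clientContext, List list, int pageSize)
    {
        ListItemCollectionPosition position = null;
        do
        {
            CamlQuery camlQuery = new CamlQuery();
            camlQuery.ListItemCollectionPosition = position;
            camlQuery.ViewXml = String.Format(
                @"<View>
                    <ViewFields>
                      <FieldRef Name='Title'/>
                    </ViewFields>
                    <RowLimit>{0}</RowLimit>
                  </View>", pageSize);
            ListItemCollection page = list.GetItems(camlQuery);
            clientContext.Load(page);
            clientContext.ExecuteQuery();

            // A null position means that there are no further pages.
            position = page.ListItemCollectionPosition;
            foreach (ListItem item in page)
                yield return item;
        } while (position != null);
    }

    static void Main()
    {
        ClientContext clientContext =
            new ClientContext("http://intranet.contoso.com");
        List list = clientContext.Web.Lists
            .GetByTitle("Client API Test List");

        foreach (ListItem item in GetAllItems(clientContext, list, 2000))
            Console.WriteLine("  Item Title: {0}", item["Title"]);
    }
}
```

Because the iterator is lazy, a page is requested only when the caller has consumed the previous one, so a caller that stops early does not pay for the remaining pages.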

 

Asynchronous Processing

If you are building an application that needs to attach to SharePoint sites that may or may not be available, or if you need to regularly invoke queries that may take a long time, you should consider using asynchronous processing.  This will allow you to continue to be responsive to your user while the query executes in a separate thread.  In your main thread, you can set a timer to let you know if the query has taken longer than your desired threshold, you can keep the user posted with the status of the query, and when the query finally completes, you can present the user with the results.

The ECMAScript version of the client object model and the Silverlight version (when it modifies the user interface) use asynchronous processing.  The topic Data Retrieval Overview in the SharePoint SDK contains examples of how to use asynchronous processing using ECMAScript and Silverlight.

When building a traditional .NET application, such as a Windows Forms or WPF application, you may want to use asynchronous processing.  The following example uses the BeginInvoke method to execute a query asynchronously.  You will notice that the code passes a statement lambda expression to BeginInvoke, which makes it convenient to code this pattern, because the statement lambda expression can reference automatic variables in the method that contains it.  You can see that the statement lambda expression has access to the newListCollection variable.  (C# closures make the language 'just work' the way we expect it to.)

using System;

using System.Collections.Generic;

using Microsoft.SharePoint.Client;

 

class Program

{

    static void Main(string[] args)

    {

        AsynchronousAccess asynchronousAccess = new AsynchronousAccess();

        asynchronousAccess.Run();

        Console.WriteLine("Before exiting Main");

        Console.WriteLine();

        Console.WriteLine("In a real application, the application can");

        Console.WriteLine("continue to be responsive to the user.");

        Console.WriteLine();

        Console.ReadKey();

    }

}

 

class AsynchronousAccess

{

    delegate void AsynchronousDelegate();

 

    public void Run()

    {

        Console.WriteLine("About to start a query that will take a long time.");

        Console.WriteLine();

        ClientContext clientContext =

            new ClientContext("http://intranet.contoso.com");

        ListCollection lists = clientContext.Web.Lists;

        IEnumerable<List> newListCollection = clientContext.LoadQuery(

            lists.Include(

                list => list.Title));

        AsynchronousDelegate executeQueryAsynchronously =

            new AsynchronousDelegate(clientContext.ExecuteQuery);

        executeQueryAsynchronously.BeginInvoke(

            arg =>

            {

                Console.WriteLine("Long running query has completed.");

                foreach (List list in newListCollection)

                    Console.WriteLine("Title: {0}", list.Title);

            }, null);

    }

}

The example produces the following output:

About to start a query that will take a long time.

 

Before exiting Main

 

In a real application, the application can

continue to be responsive to the user.

 

Long running query has completed.

Title: Announcements

Title: Cache Profiles

Title: Calendar

Title: Client API Test List

Title: Content and Structure Reports

Title: Content type publishing error log

Title: Converted Forms

Title: Customized Reports

Title: Eric's ToDo List

Title: Eric's Wiki

Title: Form Templates

Title: Links

 

Other Resources

The SharePoint SDK contains some good resources and example code for the client object model that will be useful in a variety of scenarios:

SharePoint 2010 Developer Center

How to: Work with Web Sites

How to: Work with Users and Groups

How to: Break Role Assignment Inheritance

How to: Work with User Custom Actions

How to: Work with WebParts on a Page

Setting Up a Basic ASPX Page for ECMAScript

Using the Silverlight Object Model

Availability of Public Office 2010 and SharePoint 2010 Betas

Today at PDC, we announced the availability of our public Office 2010 and SharePoint 2010 betas.  The next few months are simply going to be fun.  We can finally talk about all the cool new features, and especially from my point of view, the new stuff for developers.  It's a whole new world.  We're making this beta available in seven languages - English, Spanish, Japanese, Simplified Chinese, Russian, French and German.

Gray Knowlton has a fun post on the Office 2010 and SharePoint 2010 PDC Keynote.  It gave me a sense of being there. :)  And Erika Ehrli covered the beta release nicely as well.

Here are some links to places to get started:

Download Microsoft Office Professional Plus 2010 Beta

Project 2010 Beta

Project Server 2010 Beta

SharePoint Designer 2010 Beta (32-bit)

SharePoint Designer 2010 Beta (64-bit)

SharePoint Foundation 2010 Beta

SharePoint Server 2010 Beta

Visio 2010 Beta

Visual Studio 2010 Beta 2

The new developer centers are live:

Office 2010 Beta Developer Center

SharePoint 2010 Beta Developer Center

Visual Studio 2010 Beta Developer Center

The Office Developer Documentation team made an announcement that is important for any developers who are getting started in 2010 development.  If you want to make sure that you are viewing the most up-to-date help for the Office 2010 client applications, read about the Developer Help Updates for Office 2010 Beta.

Assembling Paragraph and Run Properties for Cells in an Open XML WordprocessingML Table

(Update Nov 11, 2009: This is the 6th in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

When we want to render a paragraph and its runs inside of a cell, we need to assemble the paragraph and run properties from a number of places.  In a previous post, I explained how style inheritance works, and how you 'roll-up' styles from the style chain.  That is only part of the story.  This post details how we assemble styling information from:

  • Table styles
  • The formatting directly applied to tables, paragraphs, and runs
  • The global default paragraph and run properties.

In the process of assembling paragraph and run properties, we also need to correctly handle something called 'Toggle Properties'.

Note: next, I'm going to tackle the semantics of numbering styles, and then I believe I'm ready to start coding in earnest.  I've made a decision in this project to first implement a transform to XHtml without styling information.  The resulting XHtml will contain just the content of the document.  I decided to do this because it is useful in its own right, and we need it for another project.  Having code to extract the contents of a document in the most succinct form possible has a lot of uses.  Of course, this XHtml still can be rendered in a browser, and in some cases, it will be useful to do this.  Then, after publishing that code, I'll start implementing styling behavior.

Table Styles

A very powerful and cool feature of Word is that when you are applying a table style you pick and choose which aspects of the style you want to apply.  For consistency, you can apply the same style to all tables in your document, but some tables may have a total row, and other tables may not:

You can apply the same style to both, then pick and choose which aspects of the table style to apply.  When you are applying a table style, this is what the Ribbon in Word 2007 looks like:

You can see the range of check boxes in the Table Style Options section of the Ribbon, and how you can pick and choose which aspects of the style you want to apply.  This has ramifications for us when we are assembling styling information for a cell.  We know the table style for the table, and we also know the values of those check boxes, so we have to apply the various aspects of the table style per the user's preferences from those check boxes.

Before we dive into table styles in depth, we need to cover toggle properties, which play a part in how table styles work.

Toggle Properties

Toggle properties are a set of run properties with a twist in their semantics that matters when we assemble formatting information in preparation for rendering paragraphs.  The w:b element (which styles a run as bold) is a good example of a toggle property.

Here's how toggle properties work:

Toggle properties only have their toggle behavior when associated with table styles, paragraph styles, and character styles.  If a run has been made bold per the table style, and the user applies a paragraph style that also has the w:b element, the net result is that the originally bold text is no longer bold.  And if some portion of that paragraph has a bold character style applied to it, that portion is made bold again.

This makes sense.  The table style designer designated that a cell be bold.  The paragraph style designer had the intention of making text in the paragraph stand out.  But the text is already bold, so that intent won't be satisfied, so to make it stand out, we reverse the boldness of the text.  The same reasoning also applies to a character style that has the w:b element.

It's just these three types of styles (table, paragraph, and character) that we need to process in this fashion.  If the user subsequently selects that text and presses the bold button on the toolbar, setting the properties on the run itself (not a style), we honor his or her intention, regardless of the boldness of the table, paragraph, or run styles.  Also, the global run properties completely override the toggling behavior (but not directly applied formatting).  If the w:b element is set on the global run property, effectively making the entire document bold, the entire document remains bold, unless formatting is set directly on a run.

The set of toggle properties are: §2.3.2.1 (Bold), §2.3.2.2 (Complex Script Bold), §2.3.2.4 (Display All Characters as Capital Letters), §2.3.2.11 (Embossing), §2.3.2.14 (Italics), §2.3.2.15 (Complex Script Italics), §2.3.2.16 (Imprinting), §2.3.2.21 (Display Character Outline), §2.3.2.29 (Shadow), §2.3.2.31 (Small Caps), §2.3.2.35 (Single Strikethrough), and §2.3.2.39 (Hidden Text).  The section numbers are for Ecma-376 version 1.
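The rules above can be sketched in a few lines of code.  This is a deliberate simplification rather than the eventual implementation: it treats each level as a plain Boolean meaning "w:b is present", ignores explicit w:val settings, and the ToggleProperty.Resolve name and its parameters are hypothetical.

```csharp
using System;

static class ToggleProperty
{
    // direct is null when no formatting is applied directly to the run.
    public static bool Resolve(
        bool inTableStyle,
        bool inParagraphStyle,
        bool inCharacterStyle,
        bool inGlobalDefaults,
        bool? direct)
    {
        // Directly applied formatting always wins.
        if (direct.HasValue)
            return direct.Value;

        // The global default overrides the toggling of the three styles.
        if (inGlobalDefaults)
            return true;

        // Each style that sets the property flips the result.
        return inTableStyle ^ inParagraphStyle ^ inCharacterStyle;
    }
}

class Program
{
    static void Main()
    {
        // Bold in the table style only.
        Console.WriteLine(ToggleProperty.Resolve(true, false, false, false, null));
        // Table and paragraph styles both set bold, so they cancel out.
        Console.WriteLine(ToggleProperty.Resolve(true, true, false, false, null));
        // A bold character style on top of those two flips it back.
        Console.WriteLine(ToggleProperty.Resolve(true, true, true, false, null));
    }
}
```

The three calls print True, False, and True, which matches the table, paragraph, and character style sequence described above.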

Assembling Styling Properties for Cells in a Table

Due to the richness of table styles, as shown above, table, row, cell, paragraph and run properties can be stored in multiple places in a table style.  Determining the properties for a table style involves rolling up those styles, in the exact same fashion as I described for rolling up style properties in the previous blog post.  While rolling up that information, we need to either merge attributes, merge child elements, or replace elements.

Shading of the table cells comes from the table cell properties (w:tcPr).  Formatting of the text in table cells comes from the paragraph properties (w:pPr) and run properties (w:rPr).  Other necessary properties for rendering come from the table properties (w:tblPr) and table row properties (w:trPr).  The process for assembling the correct table styling information for a cell is the same for each of these.  In the following section, I describe the process of assembling styling information for runs in a table per the table style, but the same approach applies to assembling styling information for the other aspects of a table style (table, row, cell, and paragraph properties).  When I write code to do this, of course, I'm going to write only one set of methods to do this assembling of styling information, and parameterize those methods so that I can use it for assembling all aspects of conditional table formatting properties.

To determine the run properties from a style for a cell in a table, we do the following, in order:

  • We first roll-up all table styles in the table style chain, per my post, Open XML WordprocessingML Style Inheritance.
  • We retrieve the value of the w:tblLook element from the table that we're rendering, which indicates which of the conditional table formatting properties we will apply to the table.
  • We create an empty list of the run style properties (the w:rPr element).  In the following steps, we will be adding run style properties to this list, based on the circumstances, and after assembling all the items in the list, we will roll them up to give us the appropriate styling information for the cell.  Note that in the following steps, if the w:tblStylePr element does not exist, it is not an error.  It just means that we don't need to do anything for that particular step.
  • We add to the list:
    • The run style property for the whole table style from w:tblStylePr[@w:type = 'wholeTable'].
    • If we should apply column banding, per the w:tblLook element
      • If the cell is an odd banded column cell, then add the run style property from w:tblStylePr[@w:type = 'band1Vert']
      • If the cell is an even banded column cell, then add the run style property from w:tblStylePr[@w:type = 'band2Vert']
    • If we should apply row banding, per the w:tblLook element
      • If the cell is an odd banded row cell, then add the run style property from w:tblStylePr[@w:type = 'band1Horz']
      • If the cell is an even banded row cell, then add the run style property from w:tblStylePr[@w:type = 'band2Horz']
    • If we should apply the first row formatting, per the w:tblLook element
      • If the cell is in the first row, then add the run style property from w:tblStylePr[@w:type = 'firstRow']
      • In addition, if the cell is in a row with the w:tblHeader element, then add the run style property from w:tblStylePr[@w:type = 'firstRow']
    • If we should apply the last row formatting, per the w:tblLook element
      • If the cell is in the last row, then add the run style property from w:tblStylePr[@w:type = 'lastRow']
    • If we should apply the first column formatting, per the w:tblLook element
      • If the cell is in the first column, then add the run style property from w:tblStylePr[@w:type = 'firstCol']
    • If we should apply the last column formatting, per the w:tblLook element
      • If the cell is in the last column, then add the run style property from w:tblStylePr[@w:type = 'lastCol']
    • If the cell is the top left cell, then add the run style property from w:tblStylePr[@w:type = 'nwCell']
    • If the cell is the top right cell, then add the run style property from w:tblStylePr[@w:type = 'neCell']
    • If the cell is the bottom left cell, then add the run style property from w:tblStylePr[@w:type = 'swCell']
    • If the cell is the bottom right cell, then add the run style property from w:tblStylePr[@w:type = 'seCell']
  • Now that we have a list of run properties, we roll them up.  We now have a set of style run properties that we can apply to the cell.
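The selection steps above can be sketched as a function that returns, in order, the w:tblStylePr types that apply to a given cell.  This is a minimal sketch under stated assumptions: TableLook is a hypothetical stand-in for the flags carried by w:tblLook, row and column indexes are zero-based, the band size is one, and banding is counted from the first row and column.  Looking up each w:rPr and rolling the results up is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the options recorded in w:tblLook.
class TableLook
{
    public bool FirstRow, LastRow, FirstColumn, LastColumn;
    public bool RowBanding, ColumnBanding;
}

static class ConditionalFormatting
{
    // Returns the w:tblStylePr types for one cell, in the order listed above.
    public static List<string> GetStyleTypes(
        TableLook look, int row, int column,
        int rowCount, int columnCount, bool rowHasTblHeader)
    {
        bool firstRow = row == 0, lastRow = row == rowCount - 1;
        bool firstCol = column == 0, lastCol = column == columnCount - 1;
        var types = new List<string> { "wholeTable" };

        // Zero-based index 0 is the first, and therefore odd, band.
        if (look.ColumnBanding)
            types.Add(column % 2 == 0 ? "band1Vert" : "band2Vert");
        if (look.RowBanding)
            types.Add(row % 2 == 0 ? "band1Horz" : "band2Horz");

        if (look.FirstRow && (firstRow || rowHasTblHeader))
            types.Add("firstRow");
        if (look.LastRow && lastRow) types.Add("lastRow");
        if (look.FirstColumn && firstCol) types.Add("firstCol");
        if (look.LastColumn && lastCol) types.Add("lastCol");

        if (firstRow && firstCol) types.Add("nwCell");
        if (firstRow && lastCol) types.Add("neCell");
        if (lastRow && firstCol) types.Add("swCell");
        if (lastRow && lastCol) types.Add("seCell");
        return types;
    }
}

class Program
{
    static void Main()
    {
        var look = new TableLook
        {
            FirstRow = true, LastRow = true,
            FirstColumn = true, LastColumn = true,
            RowBanding = true, ColumnBanding = true
        };
        // Top left cell of a table with three rows and three columns.
        List<string> types =
            ConditionalFormatting.GetStyleTypes(look, 0, 0, 3, 3, false);
        Console.WriteLine(String.Join(", ", types.ToArray()));
    }
}
```

For the top left cell of a 3 x 3 table with every option switched on, this prints wholeTable, band1Vert, band1Horz, firstRow, firstCol, nwCell.  A type that is missing from the table style is simply skipped when the properties are looked up, as noted in the steps above.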

Note that this only gets the run properties for a table style.  Once we have rolled up the run properties for the table style, we assemble the following, in order:

  • The run properties for the table style (per the above procedure)
  • The run properties for the paragraph style for the paragraph that contains the run
  • The run properties for the run style applied to the run

We then roll these three up, implementing the toggling behavior for toggle properties that I described earlier.  Once we have done this process, we assemble the following, in order:

  • The global default run properties.
  • The rolled up run properties from the table styles.
  • The rolled up run properties from a directly applied run style.
  • The global defaults, with all properties except toggle properties removed.  (This will provide the behavior that global properties trump style toggle properties.)
  • The run properties that are applied directly to a run.

We roll these up, and we finally have the run properties that we can apply to the run.

When we're assembling the paragraph properties for a table style, we follow a similar procedure.  Once we have that rolled-up property, we need to assemble a new list of paragraph properties, in the following order:

  • The global default paragraph properties
  • The table style paragraph properties (per the above procedure)
  • The paragraph properties applied directly to a paragraph

We then roll up these three sets of paragraph properties, and we have the paragraph properties that we can apply to the paragraph in the cell.

This seems harder than it actually is.  While this is a bit involved, this is what enables the very cool table styling capabilities that we see in Word.  I just have to say, this is one of those cases where I really appreciate LINQ to XML.  I personally really would not want to write old-style imperative code to do this.

One more point about this: in an earlier post, I mentioned an approach of adding paragraph and run properties, with ordering applied, to every paragraph and run in the document.  I still think that this approach will work best.  It means that I can assemble the style paragraph properties for a cell, then add them to every paragraph in the cell.  I can assemble the style run properties for a cell, then add them to every run.  This means that I'll only need to compute the style paragraph properties for a particular cell once, not for every paragraph in the cell.  The same holds true for runs.

SharePoint 2010 and Office 2010 Developer Training Courses Launched on Channel 9

Channel 9 launched new developer training courses for SharePoint 2010 and Office 2010.  I know the folks who produced this, and have personally seen a lot of it.  This is good stuff.  The training consists of extensive recordings from top MVPs/experts on how to develop with both SharePoint and Office 2010.  As a developer, I'm personally particularly enthused about three developer aspects of SharePoint 2010:  Client Object Model, Sandboxed Solutions, and LINQ to SharePoint, but there's a lot more there than that.  If you haven't seen what the buzz is about, check it out.

SharePoint 2010 Developer Training: SharePoint 2010 has evolved into a first-class developer platform.  Visual Studio 2010 integration is fantastic.

Office 2010 Developer Training: There are a number of new features and UI extensions that you'll want to know about.

Also, while I'm at it, I want to let you know about Windows Server 2008 R2 training.  I really appreciate Windows Server 2008 R2.

Comparison of Html/CSS Tables to WordprocessingML Tables (Post #5)


(Update Nov 11, 2009: This is the 5th in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

Html tables and WordprocessingML tables have a lot in common.  Both can present complex tables with horizontally and vertically merged cells, and both have a rich set of capabilities for formatting.  But there are differences in their models and capabilities.  This blog post presents those differences, specifically around three areas:

  • Table Layout
  • Formatting
  • Differences in capabilities at the table, row, and cell level

I'm currently in the process of coding a pure functional transform from WordprocessingML to XHtml.  Understanding the exact differences between the two types of tables enables writing this transform as accurately as possible.  In addition, if you understand CSS and Html tables, this blog post provides an easy way to learn about WordprocessingML tables.  (If you're a CSS expert, and see something I'm doing incorrectly, please correct me. :)

Note: In a previous post, I talked about a plan to transform WordprocessingML styles to CSS classes.  I've decided to not use CSS classes to represent WordprocessingML styles.  Instead, I'm going to generate a style attribute for each object (p, table, tr, td, etc.) that contains all necessary formatting for that object.  My rationale for this decision is detailed in the "Differences in Formatting" section below.  This isn't a decision that I'm taking lightly, but I believe it is the correct one.  But we'll see…

Differences in Table Layout

On the surface, the layouts of WordprocessingML and Html tables look very similar.  Of course, both can present a simple table that contains data:

Both can contain horizontally and vertically merged cells:

Both can represent an irregular layout:

However, WordprocessingML and XHtml tables use a somewhat different model for layout.

In WordprocessingML, you first establish a grid with some number of grid columns.  Left and right edges of cells will always be on a grid column.  The mechanism for horizontal cell spanning is that you specify the number of grid columns that a cell spans.  You can specify that the first cell in a row starts after skipping a certain number of grid columns.

In contrast, in XHtml, there is no underlying grid on which you lay out cells.  Instead, the cells themselves form the grid.

To make this difference clear, let's look at a simple example.  Consider the following table with four cells, but the vertical rule between the top two cells isn't aligned with the vertical rule between the bottom two cells:

Here is the WordprocessingML that describes this table.  Notice the w:tblGrid, which describes the grid, and the w:gridSpan elements on the top left and bottom right cells.  While the grid describes three grid columns, there are only two cells per row.

<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="TableGrid"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="1368"/>
    <w:gridCol w:w="450"/>
    <w:gridCol w:w="1350"/>
  </w:tblGrid>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1818" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1350" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1368" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="1800" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
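As a hedged illustration of the grid model, the following Python sketch reads w:tblGrid, w:gridSpan, and w:gridBefore, and reports the starting grid column, span, and width for each cell.  The helper names val and cell_extents are hypothetical, and the sample markup is a trimmed version of the table above with only the elements that matter for layout.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def val(element, path, default):
    """Read the integer w:val of the element found via path, or default."""
    found = element.find(path, NS)
    return default if found is None else int(found.get(f"{{{W}}}val"))


def cell_extents(tbl):
    """For each row, list (first_grid_col, span, width_in_dxa) per cell."""
    grid = [int(c.get(f"{{{W}}}w")) for c in tbl.findall("w:tblGrid/w:gridCol", NS)]
    rows = []
    for tr in tbl.findall("w:tr", NS):
        col = val(tr, "w:trPr/w:gridBefore", 0)  # grid columns skipped at row start
        cells = []
        for tc in tr.findall("w:tc", NS):
            span = val(tc, "w:tcPr/w:gridSpan", 1)
            cells.append((col, span, sum(grid[col:col + span])))
            col += span
        rows.append(cells)
    return rows


SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tblGrid>
    <w:gridCol w:w="1368"/><w:gridCol w:w="450"/><w:gridCol w:w="1350"/>
  </w:tblGrid>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
    <w:tc/>
  </w:tr>
  <w:tr>
    <w:tc/>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
  </w:tr>
</w:tbl>
"""
rows = cell_extents(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The widths computed from the grid (1818 and 1350 for the first row, 1368 and 1800 for the second) agree with the w:tcW values in the markup above.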

 

Following is markup for a similar table in XHtml.  The table has three columns instead of two.  The first two rows (the only ones we see) each contain a cell with a colspan attribute, which merges two columns into one cell.  The third row, with no border and a height of zero pixels, defines three cells.  This is a trick based on the semantics of XHtml tables.  When determining the widths of cells, the browser looks at all rows of the table, and then calculates the column width, taking widths of all cells of that column into consideration.  Using this approach, we need to specify column widths only once, in the last invisible row of the table.

<table style='border-collapse:collapse;border:none'>
 <tr>
  <td colspan="2"
      style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Left</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Right</p>
  </td>
 </tr>
 <tr>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Left</p>
  </td>
  <td colspan="2"
      style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Right</p>
  </td>
 </tr>
 <tr style="max-height:0px">
  <td style='width:68.4pt;border:none'></td>
  <td style='width:22.5pt;border:none'></td>
  <td style='width:67.5pt;border:none'></td>
 </tr>
</table>

 

The differences in the model become even clearer when we specify that a grid column is skipped before placing the first cell.  The following table shows a row that contains one cell that is shifted to the right:

The WordprocessingML that describes this table follows.  The w:gridBefore element specifies that the one cell in the second row is to be placed in the second grid column.

<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="TableGrid"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2000"/>
    <w:gridCol w:w="2000"/>
  </w:tblGrid>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Left</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Top Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:trPr>
      <w:gridBefore w:val="1"/>
    </w:trPr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2000" w:type="dxa"/>
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>Bottom Right</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>

 

Here is how we would form this table in XHtml:

<table style='border-collapse:collapse;border:none'>
 <tr>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Left</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Top Right</p>
  </td>
 </tr>
 <tr>
  <td style="border:none;padding:0in 5.4pt 0in 5.4pt">
    <p>&nbsp;</p>
  </td>
  <td style="border:solid;border-width:1px;border-color:Black;padding:0in 5.4pt 0in 5.4pt">
    <p>Bottom Right</p>
  </td>
 </tr>
 <tr style="max-height:0px">
  <td width="100" style='border:none'></td>
  <td width="100" style='border:none'></td>
 </tr>
</table>

 

In XHtml, we have no choice but to place a cell in the location where there is no cell visible.  We place a non-breaking space in that cell, as some browsers may collapse the cell if it contains no data.  We also specify padding.  The table then renders as desired.

There is a simple strategy that we can take when converting the WordprocessingML to XHtml, which is to generate XHtml cells based on the grid, not on cells.  We then specify appropriate colspan and style attributes to make the table render as we wish.
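Here is a rough Python sketch of that strategy, under some simplifying assumptions: the table has already been reduced to a list of grid column widths (in dxa) plus, for each row, the number of grid columns to skip and a list of (grid span, text) pairs.  The function name table_to_xhtml and this input shape are hypothetical, vertical merging and most formatting are ignored, and the border style is hard-coded.

```python
from html import escape

BORDER = "border:1px solid black"


def table_to_xhtml(grid_dxa, rows):
    """Build an XHtml table from a grid description.

    grid_dxa: grid column widths in dxa (twentieths of a point).
    rows: one (grid_before, cells) pair per row; cells is a list of
          (grid_span, text) pairs.
    """
    def colspan(span):
        return f' colspan="{span}"' if span > 1 else ""

    out = ['<table style="border-collapse:collapse;border:none">']
    for grid_before, cells in rows:
        out.append(" <tr>")
        if grid_before:
            # XHtml cannot skip columns, so emit an invisible filler cell.
            out.append(f'  <td{colspan(grid_before)} style="border:none"><p>&nbsp;</p></td>')
        for span, text in cells:
            out.append(f'  <td{colspan(span)} style="{BORDER}"><p>{escape(text)}</p></td>')
        out.append(" </tr>")
    # Invisible sizing row: one empty cell per grid column carries the widths.
    out.append(' <tr style="max-height:0px">')
    for width in grid_dxa:
        out.append(f'  <td style="width:{width / 20:g}pt;border:none"></td>')
    out.append(" </tr>")
    out.append("</table>")
    return "\n".join(out)


html = table_to_xhtml(
    [2000, 2000],
    [(0, [(1, "Top Left"), (1, "Top Right")]),
     (1, [(1, "Bottom Right")])],
)
print(html)
```

Because every row is generated against the same grid, the sizing row at the end is the only place where widths need to be specified.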

This subtle difference in abstraction is one of the most important differences between tables in WordprocessingML and XHtml.  By taking this difference into account, it is easy to craft an algorithm that will produce tables that will render as we wish in XHtml.  In addition to this difference in abstraction, there are a number of differences in formatting and capabilities.  I don't believe that I've isolated all of the differences, but I think I've found most of the important ones.  In some of the conversions, I haven't yet spent the time to find the correct CSS approach, so I am still using an Html attribute approach.

Differences in Formatting

There are a number of analogous capabilities in formatting between tables in WordprocessingML and XHtml/CSS, but one of the key differences is that in WordprocessingML, there is a rich infrastructure of style inheritance.  Table styles can inherit from other table styles.  Paragraph styles can inherit from other paragraph styles.  Run styles can inherit from other run styles.  In contrast, in CSS, we can define classes, but we can't define that one class inherits from another class.  However, when specifying the class for an element such as a table, paragraph, or span, we can specify more than one class, and each class is applied in turn.  This is analogous to style inheritance, but the mechanisms are completely different.

It might seem that we could use the ability to specify multiple classes for an XHtml object to implement a form of style inheritance, but there is one important aspect of the semantics of WordprocessingML styles that makes it impossible to use CSS classes to implement style inheritance.  Table styles in WordprocessingML have the capability to define what are called conditional table formatting properties.  These are properties that are applied in a specific order to a) the entire table, b) banded columns, c) banded rows, d) first and last row, e) first and last column, f) specific cells at the corners.  And, of course, conditional table formatting properties inherit from the same conditional formatting properties of the base style of a table style.  In theory, we could define styles for each of these conditional table formatting properties, and apply these styles in order of precedence to each cell in the table.  But let's say that we have one table style with a number of conditional formatting properties that derives from another style that also contains a number of conditional formatting properties.  When specifying the classes for a paragraph, it would look something like this:

<p class="BaseStyle BaseStyle_EntireTable BaseStyle_BandedColumns BaseStyle_BandedRows (etc.)
          DerivedStyle DerivedStyle_EntireTable DerivedStyle_BandedColumns (etc.)">Some text.</p>

 

If we had a string of derived table styles, we could end up applying 30 or 40 (or many more!) classes to a single paragraph or run.  But even so, it won't work.  Suppose that BaseStyle defines some property P, and a conditional formatting property of BaseStyle overrides P.  DerivedStyle then overrides P, but the conditional formatting of DerivedStyle does not define it.  The value that should apply is the one defined in the conditional formatting of BaseStyle, not the one defined in DerivedStyle, yet applying the classes in the order listed above yields the DerivedStyle value.  We could start playing around with the ordering of the classes, but I would hate to debug this.
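The problem can be shown with a small Python simulation.  This is only a sketch: styles are modeled as plain dictionaries, the property name and values are made up for illustration, and roll_up is a hypothetical helper in which later layers simply override earlier ones.

```python
def roll_up(*layers):
    """Merge property dictionaries; later layers override earlier ones."""
    result = {}
    for layer in layers:
        result.update(layer)
    return result


base = {"color": "black"}             # BaseStyle defines property P
base_first_row = {"color": "white"}   # BaseStyle conditional formatting overrides P
derived = {"color": "blue"}           # DerivedStyle overrides P
derived_first_row = {}                # DerivedStyle conditional formatting leaves P alone

# WordprocessingML semantics as described above: roll up each kind of
# property along the style chain first, then let the conditional
# formatting override the whole-table properties.
expected = roll_up(
    roll_up(base, derived),
    roll_up(base_first_row, derived_first_row),
)

# Naive CSS approach: apply the classes in the order they are listed.
naive = roll_up(base, base_first_row, derived, derived_first_row)

print(expected["color"], naive["color"])
```

The two results differ: the conditional formatting inherited from BaseStyle wins under the WordprocessingML semantics, while the naive class ordering lets DerivedStyle win.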

We could go through the effort of defining classes for each uniquely styled cell in each table.  This would involve rolling up all inherited styles, and implementing the appropriate semantics for overriding properties at the table, paragraph, and run level, keeping a list of uniquely styled paragraphs and runs, then generating a CSS class for each unique combination of properties.  This does have the advantages (and disadvantages) of moving styling information away from the paragraphs and runs into the internal style sheet.  These classes would have a computer-generated, non-descriptive name, so they wouldn't be helpful to a person who is reading the XHtml.  In addition, it is highly unlikely that these classes could be re-used.  It's not worth the effort, I believe.

One approach would be to define a certain set of CSS classes, then override those classes with locally applied styling information in the style attribute.  But that defeats the whole purpose of having CSS classes in the first place.  With that approach, we still don't have separation of content and presentation, and as you can see, attempting to use CSS classes to represent styles is very complex and prone to bugs.

The approach that I've decided to take is to properly roll up styling information from the WordprocessingML and store that styling information in the style attribute for each object, optimizing that styling information so that if a property is defined at a higher level, it isn't redefined.  For instance, if the paragraph specifies that a particular font is used, then the run doesn't also specify it.  This optimization can be done after assembling all formatting information for each paragraph and run.  This has the advantage that this conversion really is strictly a conversion of WordprocessingML to its presentation.  Not using CSS classes makes the conversion more straightforward, and it will be easier to debug.  I think it is useful for this conversion to simply be a transform of WordprocessingML to its presentation, without involving the complexities that CSS classes bring.  In effect, we're using XHtml, with CSS applied at the object level, purely as a presentation engine.

Table Capabilities

Following is a partial list of features of WordprocessingML tables, and how they map to XHtml table features:

  • Both support visually right-to-left tables for languages such as Hebrew and Arabic.  The w:bidiVisual element translates to the dir attribute of the table element.
  • Both support alignment of the table with respect to the margins of the containing section or object.  To translate the table alignment (the w:jc element in w:tblPr), create a div element with the align attribute set to some value (right, left, center).
  • Both support background shading.  However, with WordprocessingML, you can specify a pattern for background shading.  It could be possible to generate images, but this isn't a key scenario.  For phase one, the conversion will convert shading with patterns to a solid color.
  • WordprocessingML contains the abstraction of themes.  In certain places, the conversion needs to retrieve font and color information from a theme.
  • Both support table and cell borders.  However, WordprocessingML contains two features not supported in XHtml.  WordprocessingML supports a large number of cell borders, including many 'clip art' varieties, such as "apples", "babyRattle", and "bats".  All of the clip art varieties will be converted to a single line border.  Commonly used styles such as solid, dotted, double lines, etc. will convert to the corresponding style in XHtml/CSS.  In addition, WordprocessingML supports diagonal borders.  These aren't commonly used, and I'm going to delay supporting them.
  • Cell margin (w:tblCellMar) maps to the CSS padding property.  Cell margin is the space between the extent of the cell contents and the cell border.  Cell margin is typically expressed in terms of dxa, or 1/20 of a point (1/1440 of an inch).  The CSS padding property can be expressed in inches, points, or other units of measure.
  • Cell spacing (w:tblCellSpacing) maps to the cellspacing attribute of the table element.  Cell spacing is the space between cell borders, but within the table.  Cell spacing is merged between adjacent cells.  Cell spacing in WordprocessingML is typically expressed in terms of dxa, or 1/20 of a point (1/1440 of an inch).  The XHtml cellspacing attribute is in terms of pixels.
  • Both models support flowing text around a table.  In WordprocessingML, it is supported via floating tables (w:tblOverlap).  In XHtml and CSS, set the align attribute of table to left, and specify appropriate margins so that the table renders properly with the correct space between the table and surrounding text.
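Several of the mappings above involve converting dxa to CSS units.  Here is a small Python sketch of those conversions.  The function names are hypothetical, and the pixel conversion assumes the CSS convention of 96 pixels per inch.

```python
def dxa_to_pt(dxa):
    """Convert dxa (twentieths of a point) to points."""
    return dxa / 20


def dxa_to_px(dxa, px_per_inch=96):
    """Convert dxa (1/1440 of an inch) to whole pixels."""
    return round(dxa / 1440 * px_per_inch)


def padding_css(top, right, bottom, left):
    """Build a CSS padding declaration from four cell margins in dxa."""
    sides = " ".join(f"{dxa_to_pt(side):g}pt" for side in (top, right, bottom, left))
    return f"padding:{sides}"


print(padding_css(0, 108, 0, 108))  # cell margins of 108 dxa on the left and right
print(dxa_to_px(1440))              # one inch expressed in pixels
```

A left and right cell margin of 108 dxa corresponds to the 5.4pt padding used in the XHtml examples earlier in this post.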

Row Capabilities

Following is a partial list of features of WordprocessingML rows, and how they map to XHtml row features:

  • In WordprocessingML, rows have the ability to be hidden.  Given my primary goal of simply rendering the table properly, the proper conversion is to remove hidden rows from the converted XHtml.
  • In WordprocessingML, rows can be centered, aligned left, or aligned right.  There is no corresponding capability in XHtml.  For phase one, the conversion will disregard row alignment.
  • In WordprocessingML, you can specify that a particular row is a header row, and should be repeated on each printed page.  Headers in XHtml tables provide the ability to format them separately.  They take on a bold appearance by default.  These capabilities are really not analogous, so for phase one, the conversion will not convert one to the other.
  • Table row height can be converted.  w:trHeight converts to the CSS height property of a row.

Cell Capabilities

Following is a partial list of features of WordprocessingML cells, and how they map to XHtml cell features:

  • The w:noWrap element translates to the nowrap attribute of the td element.
  • Background shading of cells can be converted.  The same issues apply as with table background shading.
  • Cell borders can be converted.  The same issues apply as with table borders.
  • WordprocessingML has the capability to alter kerning so that the text fits exactly in a cell.  The w:tcFitText element translates to the CSS fit-text property.
  • WordprocessingML supports setting the text flow direction.  This isn't supported in XHtml tables.
  • Horizontal and vertical alignment is supported in both models.

With this post, I've detailed much of what I think I need to know to transform Open XML WordprocessingML tables to XHtml tables using CSS for formatting.  I've also outlined the strategy that I think I'll follow given the slightly different layout model of tables in WordprocessingML and tables in XHtml.  As I code the transform, I'll revise this post so that I can remember the details of the transform of WordprocessingML tables to XHtml tables.

Open XML WordprocessingML Style Inheritance (Post #4)


(Update Nov 11, 2009: This is the 4th in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

When working with WordprocessingML, nearly all of the information that we need to render paragraphs, tables, and numbered items is contained in styles, stored in the WordprocessingML Style Definitions part.  Styles are somewhat complicated because styles have inherited behavior: one style can be based on another style.  Rendering of text that has the derived style then depends on the derived style, its base style, that base style's base style, and so on.  The Open XML specification refers to this list of styles that are derived from other styles as the 'style chain', which accurately describes the abstraction.

When determining the set of properties for rendering a paragraph or table, the first job is to 'roll up' all styles in the style chain, creating a single set of properties that we can apply to the paragraph or table.  This process of 'rolling up' styles is made somewhat more complicated because there are four variations of semantics that we must apply to elements in the rolling-up process.

However, it's not too complicated, and after carefully defining the semantics of 'rolling-up' styles in the style chain, we can write a small bit of generalized code to do this – probably less than 100 lines of code.

You'll notice something about the semantics of style inheritance: by far the most common operation when rolling up styles is to replace any element in a base style with the corresponding element in a derived style.  In the code that I'm going to write to roll up styles, if the inheritance semantics are other than merging attributes or merging child elements, then the default behavior will be to do element replacement.  This will make the code as small and robust as possible.

This post probably isn't of very much interest to most people, but to the folks who are interested, it will be very important.  I'm in the process of writing a fairly compact conversion of Open XML to XHtml, and needed to work out the exact behavior of style inheritance.  After working it out, it made good sense to blog it to make life easier for others who need to work with rendering issues of WordprocessingML.

Merging Attributes

In some cases, we must iterate through the attributes of a particular element, and if the element in the derived style has an attribute, we must apply that attribute, overriding the attribute in the base style.  In many cases, the base style may not define that particular attribute, so in that case, we must simply add the attribute to the element in the rolled-up style.  For example, we may have a style, SpaceBefore, that specifies space before the paragraph, but no space after:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBefore">
  <w:name w:val="SpaceBefore"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:before="200"
               w:after="0"/>
  </w:pPr>
</w:style>

 

We may have another style, SpaceBeforeAndAfter, that is based on SpaceBefore and defines the w:spacing element with only a w:after attribute, like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBeforeAndAfter">
  <w:name w:val="SpaceBeforeAndAfter"/>
  <w:basedOn w:val="SpaceBefore"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:after="200"/>
  </w:pPr>
</w:style>

 

After 'rolling-up' the style chain, the style that we must apply to a paragraph that has the SpaceBeforeAndAfter style would look like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="SpaceBeforeAndAfter">
  <w:name w:val="SpaceBeforeAndAfter"/>
  <w:basedOn w:val="SpaceBefore"/>
  <w:qFormat/>
  <w:rsid w:val="00A670C6"/>
  <w:pPr>
    <w:spacing w:before="200"
               w:after="200"/>
  </w:pPr>
</w:style>
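Merging attributes is a small operation.  The following Python sketch, using the standard xml.etree.ElementTree module, shows one way to do it.  The function name merge_attributes is hypothetical, and this is an illustration rather than the code of the transform.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def merge_attributes(base, derived):
    """Return a copy of base whose attributes are overridden or extended by derived."""
    merged = ET.Element(base.tag, dict(base.attrib))
    merged.attrib.update(derived.attrib)
    return merged


# The w:spacing elements of the SpaceBefore and SpaceBeforeAndAfter styles.
base = ET.fromstring(f'<w:spacing xmlns:w="{W}" w:before="200" w:after="0"/>')
derived = ET.fromstring(f'<w:spacing xmlns:w="{W}" w:after="200"/>')
rolled_up = merge_attributes(base, derived)
print(rolled_up.attrib)
```

The rolled-up element keeps w:before from the base style and takes w:after from the derived style.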

 

Merging Child Elements

In some cases, we must merge child elements.  We must iterate through all child elements of an element in the derived style, and if the base style doesn't contain a particular element, we must add that element to the 'rolled-up' style.  If the base style does contain the element of interest, then we must either merge attributes or replace the child elements, based on the semantics defined for that child element.  The w:pPr and w:rPr elements are examples of elements that require this type of inheritance.

Consider the style NotIndented, which defines paragraph properties (w:pPr) as follows:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="NotIndented">
  <w:name w:val="NotIndented"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:spacing w:after="0"/>
  </w:pPr>
</w:style>

 

The following style, Indented, derives from NotIndented:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="Indented">
  <w:name w:val="Indented"/>
  <w:basedOn w:val="NotIndented"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:ind w:left="720"/>
  </w:pPr>
</w:style>

 

After rolling up all styles in the style chain, the style that we should apply to text styled as Indented would be defined as follows:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="Indented">
  <w:name w:val="Indented"/>
  <w:basedOn w:val="NotIndented"/>
  <w:qFormat/>
  <w:rsid w:val="00082E03"/>
  <w:pPr>
    <w:spacing w:after="0"/>
    <w:ind w:left="720"/>
  </w:pPr>
</w:style>

 

Note that both the w:spacing and w:ind elements require that their attributes be merged.  In most cases, per the list below, elements are replaced (as opposed to merging of attributes).

Replacing Elements

In some cases, while rolling up styles, we must replace an element and its attributes wholesale.  We don't need to iterate through attributes, replacing individual attributes.  The w:top (Paragraph Border Above Identical Paragraphs) element has these semantics.  Consider the following style that defines a single line, with a size of 4 eighths of a point, and with a color of red (FF0000 in hex):

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder1">
  <w:name w:val="TopBorder1"/>
  <w:basedOn w:val="Normal"/>
  <w:qFormat/>
  <w:rsid w:val="007850D3"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="4"
             w:space="1"
             w:color="FF0000"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

Here is a derived style, TopBorder2, which defines a top border, with a size of 18 eighths of a point, and no color defined:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder2">
  <w:name w:val="TopBorder2"/>
  <w:basedOn w:val="TopBorder1"/>
  <w:qFormat/>
  <w:rsid w:val="00315108"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="18"
             w:space="1"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

After rolling up the styles in the style chain, the resulting style that should be applied to a paragraph styled TopBorder2 should be like this:

<w:style w:type="paragraph"
         w:customStyle="1"
         w:styleId="TopBorder2">
  <w:name w:val="TopBorder2"/>
  <w:basedOn w:val="TopBorder1"/>
  <w:qFormat/>
  <w:rsid w:val="00315108"/>
  <w:pPr>
    <w:pBdr>
      <w:top w:val="single"
             w:sz="18"
             w:space="1"/>
    </w:pBdr>
  </w:pPr>
</w:style>

 

Notice that the w:color attribute was not inherited from TopBorder1.  The w:top element, along with its attributes, was replaced wholesale.
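The three kinds of semantics described so far can be combined into one small recursive function.  The following Python sketch is one possible shape for it, not the code of the transform.  The function name roll_up and the helper ppr are hypothetical, and the two sets list only the elements used in the examples in this post.  It also assumes that a parent contains at most one child with a given name.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MERGE_CHILDREN = {f"{{{W}}}{name}" for name in ("pPr", "rPr", "pBdr")}
MERGE_ATTRIBUTES = {f"{{{W}}}{name}" for name in ("spacing", "ind")}


def roll_up(base, derived):
    """Combine two elements that have the same tag; derived takes precedence."""
    if base.tag not in MERGE_CHILDREN and base.tag not in MERGE_ATTRIBUTES:
        return copy.deepcopy(derived)  # default semantics: replace wholesale
    merged = ET.Element(base.tag, dict(base.attrib))
    merged.attrib.update(derived.attrib)
    if base.tag in MERGE_ATTRIBUTES:
        return merged
    # Merge child elements: recurse where both styles define the child.
    derived_children = {child.tag: child for child in derived}
    for child in base:
        match = derived_children.pop(child.tag, None)
        merged.append(copy.deepcopy(child) if match is None else roll_up(child, match))
    for child in derived_children.values():
        merged.append(copy.deepcopy(child))
    return merged


def ppr(inner):
    """Parse a w:pPr element from its inner markup (helper for the examples)."""
    return ET.fromstring(f'<w:pPr xmlns:w="{W}">{inner}</w:pPr>')


# NotIndented rolled up with Indented, then TopBorder1 with TopBorder2.
indented = roll_up(ppr('<w:spacing w:after="0"/>'), ppr('<w:ind w:left="720"/>'))
border = roll_up(
    ppr('<w:pBdr><w:top w:val="single" w:sz="4" w:space="1" w:color="FF0000"/></w:pBdr>'),
    ppr('<w:pBdr><w:top w:val="single" w:sz="18" w:space="1"/></w:pBdr>'),
)
top = border.find(f"{{{W}}}pBdr/{{{W}}}top")
print([child.tag for child in indented])
print(top.attrib)
```

To roll up a whole style chain, the same function can be applied pairwise, starting from the style at the root of the chain and working toward the most derived style.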

Style Conditional Table Formatting Properties

There is one special case where merging semantics are slightly more complicated.  Table styles have a very powerful feature called conditional table formatting.  This feature allows us to specify a special set of formatting properties for the top row, the first column, the bottom row, banded columns, banded rows, cells at the top left, top right, etc.  Conditional table formatting is defined in the w:tblStylePr element.  The following table style (markup has been simplified) contains a w:tblStylePr element for the first row, and a w:tblStylePr element for the first column:

<w:style w:type="table"
         w:customStyle="1"
         w:styleId="LightListRedHeader">
  <w:name w:val="Light List Red Header"/>
  <w:basedOn w:val="LightList"/>
  <w:tblStylePr w:type="firstRow">
    <w:pPr>
      <w:spacing w:before="0"
                 w:after="0"
                 w:line="240"
                 w:lineRule="auto"/>
    </w:pPr>
    <w:rPr>
      <w:b/>
      <w:bCs/>
      <w:color w:val="FFFFFF"
               w:themeColor="background1"/>
    </w:rPr>
    <w:tblPr/>
    <w:tcPr>
      <w:shd w:val="clear"
             w:color="auto"
             w:fill="FF0000"/>
    </w:tcPr>
  </w:tblStylePr>
  <w:tblStylePr w:type="firstCol">
    <w:rPr>
      <w:b/>
      <w:bCs/>
    </w:rPr>
  </w:tblStylePr>
</w:style>

 

A table style definition most often will have several w:tblStylePr elements.  We can't simply merge child elements for the w:tblStylePr element.  We must first match the w:type attribute, and then merge child elements.
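As a rough Python sketch of this matching step: the code below pairs the w:tblStylePr elements of a base style and a derived style by their w:type attribute before merging.  The function names are hypothetical, the sample markup is made up for illustration, and merge_children is deliberately shallow (a derived child simply replaces the base child of the same name).  A complete implementation would apply the per-element semantics summarized in this post instead.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TYPE = f"{{{W}}}type"


def merge_children(base, derived):
    """Shallow merge: a derived child replaces the base child with the same tag."""
    merged = ET.Element(base.tag, dict(base.attrib))
    children = {child.tag: child for child in base}
    children.update({child.tag: child for child in derived})
    for child in children.values():
        merged.append(copy.deepcopy(child))
    return merged


def roll_up_conditional(base_style, derived_style):
    """Pair w:tblStylePr elements by w:type, then merge each pair."""
    path = f"{{{W}}}tblStylePr"
    by_type = {e.get(TYPE): e for e in base_style.findall(path)}
    for e in derived_style.findall(path):
        kind = e.get(TYPE)
        by_type[kind] = merge_children(by_type[kind], e) if kind in by_type else e
    return by_type


def style(inner):
    """Parse a w:style element from its inner markup (helper for the example)."""
    return ET.fromstring(f'<w:style xmlns:w="{W}">{inner}</w:style>')


base = style(
    '<w:tblStylePr w:type="firstRow"><w:rPr><w:b/></w:rPr></w:tblStylePr>'
    '<w:tblStylePr w:type="lastRow"><w:rPr><w:i/></w:rPr></w:tblStylePr>'
)
derived = style(
    '<w:tblStylePr w:type="firstRow"><w:tcPr><w:shd w:fill="FF0000"/></w:tcPr></w:tblStylePr>'
    '<w:tblStylePr w:type="firstCol"><w:rPr><w:b/></w:rPr></w:tblStylePr>'
)
rolled_up = roll_up_conditional(base, derived)
print(sorted(rolled_up))
```

In this example, the firstRow entry ends up with both the w:rPr from the base style and the w:tcPr from the derived style, while lastRow and firstCol are carried over unchanged.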

Summary of Style Inheritance Semantics

The table at the end of this post summarizes the semantics that we must apply when 'rolling-up' styles.

A fair number of elements in the style hierarchy exist solely for the user interface or other purposes.  We are only interested in rolling up those elements that impact presentation, so I'm eliminating elements that don't apply.  A few elements (name and basedOn) are used in the rolling-up process, so I am listing those.

Note that this is only part of the story around putting together the style information for a cell in a table.  After rolling up styles in a style chain into a single set of properties for a table, we must also roll up character formatting information, which involves rolling up run formatting information for the table, for paragraph styles, and for run styles.  Before rolling any of this up, we need to take the global run properties into consideration.  And when rolling up this information over the hierarchy (table, paragraph, run), we need to handle something called toggle properties.  Finally, where appropriate, we must retrieve information from the theme of the document.  Stay tuned…


Element                 Ecma376     Semantics
style                   2.7.3.17    Merge child elements
  name                  2.7.3.9     Used when assembling inheritance information
  basedOn               2.7.3.3     Used when assembling inheritance information
  pPr                   2.7.7.2     Merge child elements
  rPr                   2.7.8.1     Merge child elements
  tblPr                 2.7.5.4     Merge child elements
  tblStylePr            2.7.5.6     Merge child elements (Conditional Table Formatting Properties).  See the note about this element above.
  tcPr                  2.7.5.9     Merge child elements
  trPr                  2.7.5.11    Merge child elements
pPr                     17.7.8.2
  adjustRightInd        2.3.1.1     Replace element
  autoSpaceDE           2.3.1.2     Replace element
  autoSpaceDN           2.3.1.3     Replace element
  bidi                  2.3.1.6     Replace element
  cnfStyle              2.3.1.8     Replace element
  contextualSpacing     2.3.1.9     Replace element
  framePr               2.3.1.11    Replace element
  ind                   2.3.1.12    Merge attributes
  jc                    2.3.1.13    Replace element
  keepLines             2.3.1.14    Replace element
  keepNext              2.3.1.15    Replace element
  kinsoku               2.3.1.16    Replace element
  mirrorIndents         2.3.1.18    Replace element
  numPr                 2.3.1.19    Replace element
  outlineLvl            2.3.1.20    Replace element
  overflowPunct         2.3.1.21    Replace element
  pageBreakBefore       2.3.1.23    Replace element
  pBdr                  2.3.1.24    Merge child elements
  rPr                   2.3.1.29    Merge child elements
  shd                   2.3.1.31    Replace element
  snapToGrid            2.3.1.32    Replace element
  spacing               2.3.1.33    Merge attributes
  suppressAutoHyphens   2.3.1.34    Replace element
  suppressLineNumbers   2.3.1.35    Replace element
  suppressOverlap       2.3.1.36    Replace element
  tabs                  2.3.1.38    Merge child elements
  textAlignment         2.3.1.39    Replace element
  textboxTightWrap      2.3.1.40    Replace element
  textDirection         2.3.1.41    Replace element
  topLinePunct          2.3.1.43    Replace element
  widowControl          2.3.1.44    Replace element
  wordWrap              2.3.1.45    Replace element
rPr                     2.7.8.1
  b                     2.3.2.1     Replace element
  bCs                   2.3.2.2     Replace element
  bdr                   2.3.2.3     Replace element
  caps                  2.3.2.4     Replace element
  color                 2.3.2.5     Replace element
  cs                    2.3.2.6     Replace element
  dstrike               2.3.2.7     Replace element
  eastAsianLayout       2.3.2.8     Replace element
  effect                2.3.2.9     Replace element
  em                    2.3.2.10    Replace element
  emboss                2.3.2.11    Replace element
  fitText               2.3.2.12    Replace element
  highlight             2.3.2.13    Replace element
  i                     2.3.2.14    Replace element
  iCs                   2.3.2.15    Replace element
  imprint               2.3.2.16    Replace element
  kern                  2.3.2.17    Replace element
  lang                  2.3.2.18    Merge attributes
  oMath                 2.3.2.20    Replace element
  outline               2.3.2.21    Replace element
  position              2.3.2.22    Replace element
  rFonts                2.3.2.24    Replace element
  rtl                   2.3.2.28    Replace element
  shadow                2.3.2.29    Replace element
  shd                   2.3.2.30    Replace element
  smallCaps             2.3.2.31    Replace element
  snapToGrid            2.3.2.32    Replace element
  spacing               2.3.2.33    Replace element
  specVanish            2.3.2.34    Replace element
  strike                2.3.2.35    Replace element
  sz                    2.3.2.36    Replace element
  szCs                  2.3.2.37    Replace element
  u                     2.3.2.38    Replace element
  vanish                2.3.2.39    Replace element
  vertAlign             2.3.2.40    Replace element
  w                     2.3.2.41    Replace element
  webHidden             2.3.2.42    Replace element
tblPr
  bidiVisual            2.4.1       Replace element
  jc                    2.4.23      Replace element
  shd                   2.4.35      Replace element
  tblBorders            2.4.38      Merge child elements
  tblCellMar            2.4.39      Merge child elements
  tblCellSpacing        2.4.43      Replace element
  tblInd                2.4.48      Replace element
  tblLayout             2.4.49      Replace element
  tblLook               2.4.51      Replace element
  tblOverlap            2.4.53      Replace element
  tblpPr                2.4.54      Replace element
  tblStyleColBandSize   2.7.5.5     Replace element
  tblStyleRowBandSize   2.7.5.7     Replace element
  tblW                  2.4.61      Replace element
tblStylePr
  pPr                   2.7.5.1     Merge child elements
  rPr                   2.7.5.2     Merge child elements
  tblPr                 2.7.5.3     Merge child elements
  tcPr                  2.7.5.9     Merge child elements
  trPr                  2.7.5.10    Merge child elements
tcPr
  hideMark              2.4.15      Replace element
  noWrap                2.4.28      Replace element
  shd                   2.4.33      Replace element
  tcBorders             2.4.63      Merge child elements
  tcFitText             2.4.64      Replace element
  tcMar                 2.4.65      Merge child elements
  tcW                   2.4.68      Replace element
  textDirection         2.4.69      Replace element
  vAlign                2.4.80      Replace element
trPr
  cantSplit             2.4.6       Replace element
  gridAfter             2.4.10      Replace element
  gridBefore            2.4.11      Replace element
  hidden                2.4.14      Replace element
  jc                    2.4.22      Replace element
  tblCellSpacing        2.4.42      Replace element
  tblHeader             2.4.46      Replace element
  trHeight              2.4.77      Replace element
  wAfter                2.4.82      Replace element
  wBefore               2.4.83      Replace element

 

Transforming Open XML Word-Processing Documents to Html (Post #3)


(Update Nov 11, 2009: This is the 3rd in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

Over the last couple of weeks, I've been designing and writing some code to convert Open XML word-processing documents to HTML (or XHTML), and I'll continue this work over the next week.  My first post described in broad strokes my goals, my motivations for writing this code, and some details about the approach that I'm considering.  My second post provided more detail about how I'll proceed, my first thoughts about my use of CSS, and specific limitations that I'll place on the conversion.  It also presented my rationale for not converting numbered/bulleted items to li elements, along with a skeleton of the conversion code.  As I've been reading through the Open XML specification, more specifics about how I should proceed have become clear to me.  In this post, I'm going to detail some of my conclusions.

First, here are some additional limitations that I'm going to apply to this conversion:

·        I'm not going to attempt to convert documents that contain sub-documents.  This almost certainly would not be one of the primary scenarios.  The conversion will throw an exception if the document contains the w:subDoc element.

·        There are a number of legacy elements that I might be able to ignore: w:dayLong, w:dayShort, w:monthShort, w:monthLong, w:yearLong, w:yearShort, w:pgNum.  Conforming applications should not be writing out these elements.  At some point in the near future, I'm going to write some code to crawl my collection of sample Open XML documents, and count how many documents contain these elements.  This will help me decide whether to do the work to support these elements.

·        I'm going to ignore the w:ruby (phonetic guide) element.  This could be interesting, but I'll reserve this for a later version if it's important.  If this is important to you, I'd be very appreciative if you'd let me know.

·        For phase one of this project, I'm going to do only a rudimentary conversion of DrawingML, specifically to convert images in drawings.  DrawingML contains very rich constructs.  Doing all such transformations and generating appropriate images is in and of itself a complicated project.  However, basic images are described in DrawingML, and we need to be able to generate web pages that contain appropriate references to basic images, so it's important to handle this aspect of DrawingML.  I'll defer the high-fidelity conversion of all aspects of DrawingML to a later project.

·        I'm going to ignore all w:object elements.  Rendering w:object elements is certainly not a main-line scenario.

·        For phase one, I'm not going to attempt to render MathML markup.  This is interesting, but as with DrawingML, non-trivial.  In the interest of getting something working in the next couple of weeks, I'm not going to include conversion of MathML in phase one.

·        As I mentioned in last week's post, I'm not going to convert text separated by physical tabs in phase one.  There is no clean way to approach this, so until the best approach is clear, I'm not going to convert them.

As I mentioned in the first post, I'm going to simplify the word-processing markup before transforming to HTML.  Here are some of the ways that I'll simplify:

·        I'll remove all rsid elements and attributes before doing the conversion to HTML.

·        I'll remove all comments, end notes, and foot notes before doing the conversion.

·        I'll coalesce superfluous runs – combine adjacent runs with identical formatting to a single run.

·        And as I mentioned in the first post, I'll accept all tracked changes (tracked revisions) before doing the conversion.
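The first simplification in the list above is small enough to sketch.  The following is an illustration, not the project's code: RsidRemover is a hypothetical name, and the sketch assumes that every attribute or element in the w namespace whose local name starts with "rsid" is revision-save information that can be dropped.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class RsidRemover
{
    public static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: anything in the w namespace named rsid* is revision-save data.
    static bool IsRsid(XName name)
    {
        return name.Namespace == w
            && name.LocalName.StartsWith("rsid", StringComparison.Ordinal);
    }

    // Removes rsid attributes (w:rsidR, w:rsidRPr, ...) and rsid elements
    // (w:rsids, w:rsid, ...) from the tree rooted at 'root', in place.
    public static void RemoveRsids(XElement root)
    {
        // The Remove extension methods snapshot the query before deleting,
        // so it is safe to remove while the query ranges over the same tree.
        root.DescendantsAndSelf().Attributes().Where(a => IsRsid(a.Name)).Remove();
        root.Descendants().Where(e => IsRsid(e.Name)).Remove();
    }

    public static void Main()
    {
        var paragraph = new XElement(w + "p",
            new XAttribute(w + "rsidR", "00A77427"),
            new XAttribute(w + "rsidRDefault", "007F1D13"),
            new XElement(w + "r",
                new XAttribute(w + "rsidRPr", "00F54B2E"),
                new XElement(w + "t", "Hello")));
        RemoveRsids(paragraph);
        Console.WriteLine(paragraph);
    }
}
```

Removing these attributes first also makes the run-coalescing step easier, because adjacent runs that differ only in their rsid values become identical.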

Now that I've detailed what I won't convert, here is what I will convert:

·        I'll convert all paragraphs and runs, including all text, formatted with the correct font, and with correct paragraph formatting, such as space before and after each paragraph.  This includes honoring all style inheritance, as well as honoring all places where styles defer decisions on font and colors to themes.

·        I'll convert all tables with a high degree of fidelity, including theme formatting, conditional formatting, and tables within tables.  I believe I'll be able to correctly transform both horizontally and vertically merged cells.

·        I'll convert all numbered/bulleted items to straight paragraphs, not li elements.  I detailed my rationale for this decision last week.  I'm curious to see how this decision holds up in real-world situations.

·        I'll convert all images, which are represented by DrawingML.  This includes resizing, rotating, mirroring, and flipping images so that the resulting web page looks as close to the word-processing document as possible.  By far, the most important of these is resizing.

·        I'll render w:sectPr as a div element.

·        I'll appropriately render the cr, noBreakHyphen, tab, and br elements.

There are two varieties of hyperlinks as defined in Open XML: hyperlinks described in field codes, and hyperlinks in the simplified version that uses an external reference.  For simplicity, I'll convert all field code hyperlinks to the simplified version before doing the conversion to HTML.  Other than hyperlink field codes, I'll remove all other field code markup, leaving the rendering markup.  For example, markup for a typical field code looks like this:

<w:p>

  <w:r>

    <w:fldChar w:fldCharType="begin"/>

  </w:r>

  <w:r>

    <w:instrText xml:space="preserve"> DATE </w:instrText>

  </w:r>

  <w:r>

    <w:fldChar w:fldCharType="separate"/>

  </w:r>

  <w:r>

    <w:rPr>

      <w:noProof/>

    </w:rPr>

    <w:t>10/15/2009</w:t>

  </w:r>

  <w:r>

    <w:fldChar w:fldCharType="end"/>

  </w:r>

</w:p>

 

In the simplification process, I'll transform this markup to the following, which will be easy to render in HTML:

<w:p>

  <w:r>

    <w:t>10/15/2009</w:t>

  </w:r>

</w:p>
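A minimal sketch of that field-code simplification might look like the following.  The class name FieldCodeSimplifier is hypothetical, and the sketch makes several assumptions: fields are not nested, a field begins and ends within one paragraph, and hyperlink fields have already been converted separately.  Unlike the hand-written result above, it keeps the w:rPr of the result run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class FieldCodeSimplifier
{
    public static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns a copy of the paragraph that keeps only the rendered result
    // of each field, dropping the begin/separate/end markers and the
    // instruction text between "begin" and "separate".
    public static XElement Simplify(XElement paragraph)
    {
        bool inInstruction = false;
        var kept = new List<XElement>();
        foreach (XElement child in paragraph.Elements())
        {
            string fldCharType = (string)child.Elements(w + "fldChar")
                .Attributes(w + "fldCharType").FirstOrDefault();
            if (fldCharType != null)
            {
                // Marker runs only delimit the field; they are always dropped.
                inInstruction = fldCharType == "begin";
                continue;
            }
            if (inInstruction)
                continue; // w:instrText and anything else before "separate"
            kept.Add(child);
        }
        return new XElement(paragraph.Name, paragraph.Attributes(), kept);
    }

    static XElement FldCharRun(string type)
    {
        return new XElement(w + "r",
            new XElement(w + "fldChar", new XAttribute(w + "fldCharType", type)));
    }

    public static void Main()
    {
        var paragraph = new XElement(w + "p",
            FldCharRun("begin"),
            new XElement(w + "r",
                new XElement(w + "instrText",
                    new XAttribute(XNamespace.Xml + "space", "preserve"), " DATE ")),
            FldCharRun("separate"),
            new XElement(w + "r",
                new XElement(w + "rPr", new XElement(w + "noProof")),
                new XElement(w + "t", "10/15/2009")),
            FldCharRun("end"));
        Console.WriteLine(Simplify(paragraph));
    }
}
```

Because Simplify builds a new paragraph rather than editing in place, it fits the pure functional transform style used elsewhere in this series.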

 

Perhaps the most complicated thing to sort out is applying paragraph, table, numbering, and character styling per all the rules as laid out in the specification.  For a single paragraph in a cell, there are properties that must be applied from a) document defaults, b) table formatting, including conditional table formatting, c) paragraph formatting, and d) run formatting.  All of these properties must be applied in a very specific order.  Further, fonts may be defined in themes, and if so, need to be rendered with those fonts as appropriate.

Section 2.7.2 of the Ecma 376 specification (17.7.2 of IS29500) has the following diagram indicating precedence of application of styles:

Here is a description of the process:

·        First, the document defaults are applied to all runs and paragraphs in the document.

·        Next, the table style properties are applied to each table in the document, following the conditional formatting inclusions and exclusions specified per table.

·        Next, numbered item and paragraph properties are applied to each paragraph formatted with a numbering style.

·        Next, paragraph and run properties are applied to each paragraph as defined by the paragraph style.

·        Next, run properties are applied to each run with a specific character style applied.

·        Finally, we apply direct formatting (paragraph or run properties not from styles). If this direct formatting includes numbering, that numbering + the associated paragraph properties are applied.

I could write the code so that as I process each run, I go back through all locations where there are elements that might impact styling, and collect that information, coalesce that information, and then render the HTML.  This might be the most efficient way to proceed in terms of processing time and memory consumption.  However, it means that a specific rule for styling as dictated by the Open XML specification will be rendered in code that is scattered throughout the conversion code, which is not the easiest to debug.

Instead, I've decided on an approach where I process markup per the rules in Ecma section 2.7.2 (IS29500 17.7.2) of the specification, as well as the rules in 2.7.5 (IS29500 17.7.5), and add all styling information to every paragraph and run in the document.  As I add this styling information, I'll add my own custom attributes that specify the order in which this styling information must be applied.  I'll also add an attribute that describes where that styling information came from.  To make this clear, suppose I have a document that has the following document defaults, which define the paragraph and run properties that should be applied to all paragraphs and runs in the document:

<w:docDefaults>

  <w:rPrDefault>

    <w:rPr>

      <w:rFonts w:asciiTheme="minorHAnsi"

                w:eastAsiaTheme="minorHAnsi"

                w:hAnsiTheme="minorHAnsi"

                w:cstheme="minorBidi"/>

      <w:sz w:val="22"/>

      <w:szCs w:val="22"/>

      <w:lang w:val="en-US"

              w:eastAsia="en-US"

              w:bidi="ar-SA"/>

    </w:rPr>

  </w:rPrDefault>

  <w:pPrDefault>

    <w:pPr>

      <w:spacing w:after="200"

                 w:line="276"

                 w:lineRule="auto"/>

    </w:pPr>

  </w:pPrDefault>

</w:docDefaults>

 

And suppose I have a paragraph with paragraph properties, that contains a run that has run properties:

<w:p>

  <w:pPr>

    <w:spacing w:after="0"/>

  </w:pPr>

  <w:r>

    <w:t xml:space="preserve">This </w:t>

  </w:r>

  <w:r>

    <w:rPr>

      <w:rFonts w:ascii="Courier New"

                w:hAnsi="Courier New"

                w:cs="Courier New"/>

      <w:b/>

    </w:rPr>

    <w:t>is</w:t>

  </w:r>

  <w:r>

    <w:t xml:space="preserve"> a test.</w:t>

  </w:r>

</w:p>

 

So before doing the final transform to HTML, I'll transform the paragraph and run to contain properties both from the document defaults, and properties that were directly applied:

<w:p>

  <w:pPr

    ptoxml:Order="0"

    ptoxml:Source="document defaults: style part: /w:styles/w:docDefaults/w:pPrDefault/w:pPr">

    <w:spacing w:after="200"

               w:line="276"

               w:lineRule="auto"/>

  </w:pPr>

  <w:pPr>

    <w:spacing w:after="0"/>

  </w:pPr>

  <w:r>

    <w:t xml:space="preserve">This </w:t>

  </w:r>

  <w:r>

    <w:rPr

      ptoxml:Order="0"

      ptoxml:Source="document defaults: style part: /w:styles/w:docDefaults/w:rPrDefault/w:rPr">

      <w:rFonts w:asciiTheme="minorHAnsi"

                w:eastAsiaTheme="minorHAnsi"

                w:hAnsiTheme="minorHAnsi"

                w:cstheme="minorBidi"/>

      <w:sz w:val="22"/>

      <w:szCs w:val="22"/>

      <w:lang w:val="en-US"

              w:eastAsia="en-US"

              w:bidi="ar-SA"/>

    </w:rPr>

    <w:rPr>

      <w:rFonts w:ascii="Courier New"

                w:hAnsi="Courier New"

                w:cs="Courier New"/>

      <w:b/>

    </w:rPr>

    <w:t>is</w:t>

  </w:r>

  <w:r>

    <w:t xml:space="preserve"> a test.</w:t>

  </w:r>

</w:p>

 

This means that I'll be creating an intermediate XML document that is larger than the original – memory use will be somewhat higher.  But there are two important advantages of this approach.  First, the code to deal with each area of styling markup will be localized.  Second, the code will be significantly easier to debug.

For example, there will be only a few lines of code, located in one place, to deal with document default properties (Ecma 2.7.4, IS29500 17.7.4).  This code will duplicate the default paragraph properties on every paragraph in the document, and will duplicate the default run properties on every run in the document.  Then, I'll process the table styling properties in a similar fashion: I'll duplicate the paragraph and run properties, placing them on every paragraph and run in the table.  Then I'll process paragraph and run properties, per the diagram above.  As I add each of these paragraph and run properties to their respective locations, I'll assign an order of application to them.  When it comes time to render the paragraphs and runs in HTML, I can simply roll up all paragraph and run properties per their assigned order, coalesce them, and then render the HTML.  If I'm not getting the results I want, then debugging becomes a simple matter of looking at the paragraph and run properties before coalescing, and seeing where I went wrong in assembling the pile of formatting properties.
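To illustrate the two steps described above (propagate, then roll up), here is a small sketch.  None of it is the project's actual code: the class FormattingRollUp, its method names, and the ptoxml namespace URI are all hypothetical.  The sketch treats property elements that carry no Order attribute as direct formatting, applied last, and it applies 'Merge attributes' only to the three properties that the summary table lists that way (ind and spacing in pPr, lang in rPr); everything else is 'Replace element'.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class FormattingRollUp
{
    public static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: a made-up namespace URI for the custom ptoxml attributes.
    public static readonly XNamespace ptoxml = "http://example.com/ptoxml";

    // Copies a property group and tags the copy with its order and source.
    static XElement Annotate(XElement props, int order, string source)
    {
        return new XElement(props.Name,
            new XAttribute(ptoxml + "Order", order),
            new XAttribute(ptoxml + "Source", source),
            props.Elements());
    }

    // Duplicates the default w:pPr on every paragraph and the default
    // w:rPr on every run, ahead of any directly applied properties.
    public static void PropagateDocDefaults(XElement docDefaults, XElement body)
    {
        XElement pPr = docDefaults.Elements(w + "pPrDefault")
            .Elements(w + "pPr").FirstOrDefault();
        XElement rPr = docDefaults.Elements(w + "rPrDefault")
            .Elements(w + "rPr").FirstOrDefault();
        if (pPr != null)
            foreach (XElement p in body.Descendants(w + "p").ToList())
                p.AddFirst(Annotate(pPr, 0, "document defaults: w:pPrDefault"));
        if (rPr != null)
            foreach (XElement r in body.Descendants(w + "r").ToList())
                r.AddFirst(Annotate(rPr, 0, "document defaults: w:rPrDefault"));
    }

    static bool MergesAttributes(XName group, XName prop)
    {
        return (group == w + "pPr" && (prop == w + "ind" || prop == w + "spacing"))
            || (group == w + "rPr" && prop == w + "lang");
    }

    // Coalesces every 'group' element (w:pPr or w:rPr) under 'parent'
    // into one, applying them from the lowest Order to the highest.
    public static XElement RollUp(XElement parent, XName group)
    {
        var rolled = new XElement(group);
        var ordered = parent.Elements(group)
            .OrderBy(e => (int?)e.Attribute(ptoxml + "Order") ?? int.MaxValue);
        foreach (XElement props in ordered)
            foreach (XElement prop in props.Elements())
            {
                XElement existing = rolled.Element(prop.Name);
                if (existing != null && MergesAttributes(group, prop.Name))
                {
                    foreach (XAttribute a in prop.Attributes())
                        existing.SetAttributeValue(a.Name, a.Value);
                }
                else
                {
                    if (existing != null) existing.Remove();
                    rolled.Add(prop);
                }
            }
        return rolled;
    }

    public static void Main()
    {
        var docDefaults = new XElement(w + "docDefaults",
            new XElement(w + "pPrDefault", new XElement(w + "pPr",
                new XElement(w + "spacing",
                    new XAttribute(w + "after", "200"),
                    new XAttribute(w + "line", "276")))));
        var body = new XElement(w + "body",
            new XElement(w + "p",
                new XElement(w + "pPr",
                    new XElement(w + "spacing", new XAttribute(w + "after", "0"))),
                new XElement(w + "r", new XElement(w + "t", "This is a test."))));
        PropagateDocDefaults(docDefaults, body);
        Console.WriteLine(RollUp(body.Element(w + "p"), w + "pPr"));
    }
}
```

With the sample data, the rolled-up w:spacing takes w:after="0" from the direct formatting and keeps w:line="276" from the document defaults.  When something renders incorrectly, inspecting the paragraph before calling RollUp shows every contributing w:pPr along with its Order and Source.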

One key advantage of this approach is that the code and its algorithm will have a direct correlation to the text of the specification.  The code should be relatively simple to read.  One interesting point to note is that with this approach, it will take extra work to generate CSS classes for paragraph and character styles.  It would be easiest to generate the CSS as in-line only, forgoing generation of classes.  This is the approach that I'm going to take first.  Then, I'll take a look at how much work it would be to generate CSS styles, and determine if the payoff is worth the cost.  As I mentioned in my last post on this topic, my primary goal is to provide a reasonably accurate conversion for use in certain developer scenarios.  It isn't a primary goal to provide perfect separation between content and presentation, which would be the primary advantage of generation of CSS styles.

My next work item is to write the code to propagate formatting properties per this approach.  Following that, I'll write the code to roll up (coalesce) the formatting properties.  And finally, I can write the code to do the transform to HTML (or XHtml).

Transforming Open XML Word-Processing Documents to XHtml (Post #2)


(Update Nov 11, 2009: This is the 2nd in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

Last week, I blogged about a small project that I'm embarking on: to make a reasonably accurate transform from Open XML word-processing markup to XHTML.  I wrote about the approach that I'll be taking, and my initial thoughts about how to proceed.  I've done a bit of research, and this week, I'll lay out more details about the approach that I'll take.

One small note about this series of blog posts – these are going to be much more ad-hoc than my usual posts.  If I go down the wrong path, then you'll see this :-).  Also, I'm not going to spend too much time writing and re-writing the posts.

One of the key aspects of the approach that I'll take is to use the power of CSS:

  • I'll generate a CSS style for every block-level style that is used in the word-processing document.  I'll try, as far as possible, to generate the appropriate CSS style that will render the word-processing document accurately.  These styles will be applied to p and div elements.
  • I'll generate a CSS style for in-line styles.  These styles will be applied to span elements.
  • I'll preface the generated style names with Ptoxml (PowerTools for Open XML), i.e. PtoxmlNormal, PtoxmlH1, etc.  If this generated markup is embedded in other HTML, this will prevent class name collisions.
  • The classes for block-level and in-line styles will be generated in an internal style sheet.
  • If there is direct formatting applied to a paragraph or to a run within a paragraph, I'll generate the appropriate CSS as an in-line style.  My goal here is not to generate a document where the content is perfectly separated from the presentation.  Instead, my goal is to provide a conversion of a small chunk of word-processing markup to usable XHTML that can be used programmatically in a variety of contexts.  Open XML word-processing markup has 'cascading' semantics – a paragraph can be of a specific style, and the user can override aspects of that style for a paragraph.  This is a direct parallel to the semantics of CSS – a paragraph can be of a specific class, and can be overridden for a paragraph.

One key aspect of the approach that I'm going to take: I am not going to translate numbered/bulleted items from word-processing markup to li elements in the XHTML.  Instead, I'm going to generate paragraphs of a particular class, and format that class using CSS as appropriate, so that numbered items and bulleted lists are rendered properly.  While numbered items that are formatted in a simple way translate to li elements in the XHTML markup, the capabilities of numbered items in word-processing markup are rich (RICH!), and as soon as the markup uses more than the most rudimentary capabilities, the translation breaks down.  This has been one of the biggest complaints about other projects that convert Open XML to HTML – that numbered items aren't translated properly.  I could go down the road of translating rudimentary numbered items to li elements, and then translating the richer variations into paragraphs, but this is messy.  Instead, I believe that I'm going to discard using li elements altogether.

As I've researched how I'll implement this, I've decided on a few limitations:

  • Multi-column layout will be converted to single column layout.  We're more concerned about accurately surfacing the content than exact representation.
  • Themes will be converted to straight CSS styles, both at the class level, and where over-ridden, at the in-line level.  The abstraction of themes won't be carried over to the XHTML.
  • In cases where there is no direct correspondence between CSS styles and the specific representation in word-processing markup, I'll simplify the representation to whatever can be represented in CSS.  For instance, there are a lot of varieties of underline styles in word-processing markup.  All underline styles will be transformed to a simple underline in the generated CSS and XHTML.
  • Open XML word-processing markup has an abstraction of tabs – you can display the ruler above your document, below the ribbon, and insert tab stops, then place tabs in your document.  There is no corresponding abstraction in CSS/XHTML.  This could be approximated using spaces, but at best, it will be inaccurate – text won't align properly vertically.  My personal experience is that these days, people prefer tables for laying out text instead of tabs, and tables do translate properly.  I think that for phase 1, I'll not attempt any sort of hacked conversion, but I've not yet decided on how to convert these.  It could be neat to convert tabbed text to tables with invisible grids with merged cells, but I'm not sure how this would work practically.  One problem is that numbered/bulleted items in word-processing markup make use of physical tabs – if I don't have a way to render tabs, there will be a small loss in fidelity of placement of text of numbered items.

I'm sure that I'll discover other places where I will want to place limits on the transform.

The last thing I'll present in this post is the skeleton for the conversion.  The following code will do a simplistic transform of simple Open XML documents to simple XHTML.  I can then build and extend this code, handling more and more sophisticated varieties of markup.  For a detailed explanation of how this type of transform works, see the post, Recursive Pure Functional Transforms of XML.

using System;

using System.Collections.Generic;

using System.IO;

using System.Linq;

using System.Text;

using System.Xml;

using System.Xml.Linq;

using DocumentFormat.OpenXml.Packaging;

 

namespace HtmlConverter

{

    public static class Extensions

    {

        public static XDocument GetXDocument(this OpenXmlPart part)

        {

            XDocument partXDocument = part.Annotation<XDocument>();

            if (partXDocument != null)

                return partXDocument;

            using (Stream partStream = part.GetStream())

            using (XmlReader partXmlReader = XmlReader.Create(partStream))

                partXDocument = XDocument.Load(partXmlReader);

            part.AddAnnotation(partXDocument);

            return partXDocument;

        }

 

        public static string StringConcatenate(this IEnumerable<string> source)

        {

            StringBuilder sb = new StringBuilder();

            foreach (string s in source)

                sb.Append(s);

            return sb.ToString();

        }

    }

 

    public static class W

    {

        public static XNamespace w =

            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        public static XName body = w + "body";

        public static XName document = w + "document";

        public static XName p = w + "p";

        public static XName pPr = w + "pPr";

        public static XName r = w + "r";

        public static XName rPr = w + "rPr";

        public static XName t = w + "t";

        public static XName tbl = w + "tbl";

        public static XName tc = w + "tc";

        public static XName tr = w + "tr";

        public static XName txbxContent = w + "txbxContent";

        public static XName val = w + "val";

        public static XName pStyle = w + "pStyle";

        public static XName b = w + "b";

    }

 

    public static class Xhtml

    {

        public static XNamespace xhtml = "http://www.w3.org/1999/xhtml";

        public static XName html = xhtml + "html";

        public static XName head = xhtml + "head";

        public static XName title = xhtml + "title";

        public static XName body = xhtml + "body";

        public static XName p = xhtml + "p";

        public static XName h1 = xhtml + "h1";

        public static XName h2 = xhtml + "h2";

        public static XName A = xhtml + "a";  // XHTML element names are lowercase

        public static XName href = "href";

        public static XName b = xhtml + "b";

        public static XName table = xhtml + "table";

        public static XName border = "border";

        public static XName tr = xhtml + "tr";

        public static XName td = xhtml + "td";

    }

 

    public static class HtmlConverter

    {

        public static object ConvertToHtmlTransform(WordprocessingDocument wordDoc,

            XNode node)

        {

            XElement element = node as XElement;

            if (element != null)

            {

                if (element.Name == W.document)

                    return new XElement(Xhtml.html,

                        new XElement(Xhtml.head,

                            new XElement(Xhtml.title, "Test.docx")

                        ),

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e))

                    );

 

                // transform the w:body element to the XHTML h:body element

                if (element.Name == W.body)

                    return new XElement(Xhtml.body,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // transform every Heading1 styled paragraph to the XHTML h:h1 element

                if (element.Name == W.p && (string)element

                    .Elements(W.pPr)

                    .Elements(W.pStyle)

                    .Attributes(W.val)

                    .FirstOrDefault() == "Heading1")

                    return new XElement(Xhtml.h1,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // transform every Heading2 styled paragraph to the XHTML h:h2 element

                if (element.Name == W.p && (string)element

                    .Elements(W.pPr)

                    .Elements(W.pStyle)

                    .Attributes(W.val)

                    .FirstOrDefault() == "Heading2")

                    return new XElement(Xhtml.h2,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

 

                // transform w:p to h:p

                if (element.Name == W.p)

                    return new XElement(Xhtml.p,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // transform every text run that is styled as bold to the XHTML h:b element

                if (element.Name == W.r &&

                    element.Elements(W.rPr).Elements(W.b).Any())

                    return new XElement(Xhtml.b,

                        element.Elements(W.t).Select(e => (string)e).StringConcatenate());

 

                // transform every text run that is not styled as bold to a text node that

                // contains the text of the run.

                if (element.Name == W.r &&

                    !element.Elements(W.rPr).Elements(W.b).Any())

                    return new XText(element.Elements(W.t)

                        .Select(e => (string)e).StringConcatenate());

 

                // transform w:tbl to h:table

                if (element.Name == W.tbl)

                    return new XElement(Xhtml.table,

                        new XAttribute(Xhtml.border, 1),

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // transform w:tr to h:tr

                if (element.Name == W.tr)

                    return new XElement(Xhtml.tr,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // transform w:tc to h:td

                if (element.Name == W.tc)

                    return new XElement(Xhtml.td,

                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

 

                // the following removes any nodes that haven't been transformed

                return null;

            }

            return null;

        }

 

 

        public static XElement ConvertToHtml(WordprocessingDocument wordDoc)

        {

            // TODO WE REALLY WANT TO DO THIS ON BLOCK LEVEL CONTENT, NOT JUST CHILD ELEMENTS

            // OF THE BODY ELEMENT

            return ConvertToHtml(wordDoc, wordDoc

                .MainDocumentPart

                .GetXDocument()

                .Element(W.document)

                .Element(W.body)

                .Elements());

        }

 

        public static XElement ConvertToHtml(WordprocessingDocument wordDoc,

            IEnumerable<XElement> blockLevelContent)

        {

            if (blockLevelContent == null)

                throw new ArgumentNullException("blockLevelContent");

            XElement firstBlockLevelElement = blockLevelContent.FirstOrDefault();

            if (firstBlockLevelElement == null)

                throw new ArgumentException("blockLevelContent sequence is empty");

            XDocument doc = firstBlockLevelElement.Document;

            XElement xhtml = (XElement)ConvertToHtmlTransform(wordDoc, doc.Root);

            return xhtml;

        }

    }

 

    class Program

    {

        static void Main(string[] args)

        {

            string fileName = "Test.docx";

            FileInfo fi = new FileInfo(fileName);

            string baseName = Path.GetFileNameWithoutExtension(fi.Name);

            string newFileName = baseName + "-Copy.docx";

            File.Copy(fileName, newFileName, true);  // overwrite the copy left by a previous run

            using (WordprocessingDocument doc =

                WordprocessingDocument.Open(newFileName, true))

            {

                XElement html = HtmlConverter.ConvertToHtml(doc);

                html.Save("Test.html");

            }

        }

    }

}

 

Generating Documents from SharePoint Lists using Open XML Content Controls


It's often the case that a department manager needs to regularly send a nicely formatted status report to her general manager or that a team leader needs to send a weekly status report to a number of interested parties.  To collaborate with others in their organizations, both the manager and the team leader can maintain status information in SharePoint lists.  The question for developers is how to include the information in the lists in a document such as a status report.

I've written an article entitled Generating Documents from SharePoint with Open XML Content Controls, published in the October issue of MSDN Magazine, that details how you can put together a simple, flexible, and powerful system for generating Open XML word-processing documents, using SharePoint lists as the sources of data for tables in those documents.

The key to a successful approach is that the people who need to generate reports can do so without involving a software developer.  We can adopt an approach where the user configures the sources of data for tables in a document using content controls.  The user can surround a table with a content control, and set the content control tag to the name of the SharePoint list.  The user can insert content controls into cells in a nicely formatted table, and set the tags to the names of the columns of the SharePoint list.  We then have enough information to retrieve the necessary data from the SharePoint list (or lists) and generate a document that contains that data.  If the department manager or team leader subsequently adds a new column to their SharePoint list, they can add a new column to their table, insert a new content control, set the tag, and thereafter, their reports will contain data from the new column in the SharePoint list.

To make this clear, a typical SharePoint list looks like this:

 

The template Open XML word-processing document might look like the following.  Notice that there is a content control surrounding the entire table, with a tag and title of "Team Members", and there is a content control in a table cell with tag and title of "TeamMemberName".  If we were to put the insertion point in the Role cell, you would see that there is a content control there.  There are similar content controls around and in the Current Work Items table.

The resulting generated report looks like this:

 

One aspect of the approach that I took is that I abstracted the operations involving the content controls into a ContentControlManager class.  You can call a method, GetContentControls, which returns some XML that describes the content controls in the document.  In the example that I present with the article, GetContentControls returns the following XML:

<ContentControls>
  <Table Name="Team Members">
    <Field Name="TeamMemberName" />
    <Field Name="Role" />
  </Table>
  <Table Name="Item List">
    <Field Name="ItemName" />
    <Field Name="Description" />
    <Field Name="EstimatedHours" />
    <Field Name="AssignedTo" />
  </Table>
</ContentControls>
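
To make the idea concrete, here is a minimal sketch of how a description like the one above could be derived with LINQ to XML.  This is not the ContentControlManager code from the article; the class and method names are hypothetical, and the sketch assumes that a table-level content control is a w:sdt element whose w:sdtContent directly contains a w:tbl, and that every content control inside it is a field.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not the ContentControlManager class from the article.
public static class ContentControlSketch
{
    // WordprocessingML main namespace.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads w:sdt/w:sdtPr/w:tag/@w:val; returns "" when the control has no tag.
    static string TagOf(XElement sdt)
    {
        return (string)sdt.Elements(W + "sdtPr")
            .Elements(W + "tag")
            .Attributes(W + "val")
            .FirstOrDefault() ?? "";
    }

    // Builds a <ContentControls> tree with one <Table> per content control
    // that wraps a table, and one <Field> per content control nested inside it.
    public static XElement DescribeContentControls(XElement body)
    {
        return new XElement("ContentControls",
            body.Descendants(W + "sdt")
                .Where(sdt => sdt.Elements(W + "sdtContent")
                                 .Elements(W + "tbl")
                                 .Any())
                .Select(tableSdt => new XElement("Table",
                    new XAttribute("Name", TagOf(tableSdt)),
                    tableSdt.Elements(W + "sdtContent")
                        .Descendants(W + "sdt")
                        .Select(fieldSdt => new XElement("Field",
                            new XAttribute("Name", TagOf(fieldSdt)))))));
    }
}
```

You would pass the w:body element of the main document part to DescribeContentControls.  A production version would also need to decide what to do with nested tables and untagged controls, which this sketch does not address.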

 

This gives us the information that we need to use the SharePoint object model to retrieve the necessary data.  After retrieving that data, we can construct a small XML tree that looks like this:

<ContentControls>
  <Table Name="Team Members">
    <Field Name="TeamMemberName" />
    <Field Name="Role" />
    <Row>
      <Field Name="TeamMemberName"
             Value="Bob" />
      <Field Name="Role"
             Value="Developer" />
    </Row>
    <Row>
      <Field Name="TeamMemberName"
             Value="Susan" />
      <Field Name="Role"
             Value="Program Manager" />
    </Row>
    <Row>
      <Field Name="TeamMemberName"
             Value="Jack" />
      <Field Name="Role"
             Value="Test" />
    </Row>
  </Table>
  <Table Name="Item List">
    <Field Name="ItemName" />
    <Field Name="Description" />
    <Field Name="EstimatedHours" />
    <Field Name="AssignedTo" />
    <Row>
      <Field Name="ItemName"
             Value="Learn SharePoint 2010" />
      <Field Name="Description"
             Value="This should be fun!" />
      <Field Name="EstimatedHours"
             Value="80" />
      <Field Name="AssignedTo"
             Value="All" />
    </Row>
    <Row>
      <Field Name="ItemName"
             Value="Finalize Import Module Specification" />
      <Field Name="Description"
             Value="Make sure to handle all document formats." />
      <Field Name="EstimatedHours"
             Value="35" />
      <Field Name="AssignedTo"
             Value="Susan" />
    </Row>
    <Row>
      <Field Name="ItemName"
             Value="Write Test Plan" />
      <Field Name="Description"
             Value="Include regression testing items." />
      <Field Name="EstimatedHours"
             Value="20" />
      <Field Name="AssignedTo"
             Value="Jack" />
    </Row>
  </Table>
</ContentControls>
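
As a rough illustration, a tree of this shape could be assembled with a functional construction like the following.  The names here are hypothetical: AddRows is not part of the article's download, and the getListItems delegate stands in for whatever SharePoint object model code retrieves the items of a list by name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper that shows the shape of the data, not the article's code.
public static class ContentControlDataSketch
{
    // Returns a new tree in which each Table keeps its Field declarations and
    // gains one Row per list item.  getListItems is an assumed callback that
    // maps a list name to its items (column name -> display value).
    public static XElement AddRows(
        XElement contentControls,
        Func<string, IEnumerable<IDictionary<string, string>>> getListItems)
    {
        return new XElement("ContentControls",
            contentControls.Elements("Table").Select(table =>
            {
                var fieldNames = table.Elements("Field")
                    .Select(f => (string)f.Attribute("Name"))
                    .ToList();
                return new XElement("Table",
                    table.Attribute("Name"),   // cloned into the new tree
                    table.Elements("Field"),   // cloned into the new tree
                    getListItems((string)table.Attribute("Name")).Select(item =>
                        new XElement("Row",
                            fieldNames.Select(name => new XElement("Field",
                                new XAttribute("Name", name),
                                // Missing or null columns become empty strings.
                                new XAttribute("Value",
                                    item.TryGetValue(name, out var v) && v != null
                                        ? v : ""))))));
            }));
    }
}
```

Keeping the data access behind a delegate means the XML construction can be exercised without a SharePoint server.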

 

We can pass that XML to the SetContentControls method in the ContentControlManager class, which will modify the document so that the tables in the document are populated with the data from the XML.  This is useful functionality in its own right, and could be used in scenarios other than generating documents from SharePoint lists.

The MSDN article includes a download that contains the ContentControlManager class, as well as the SharePoint code to populate tables with data from SharePoint lists.

Transforming Open XML Word-Processing Documents to XHtml


(Update Nov 11, 2009: This is the 1st in a series of posts (#1, #2, #3, #4, #5, #6) on doing a transform of WordprocessingML to XHtml.)

Over the next couple of weeks, I'm going to spend some time writing some LINQ to XML code to transform Open XML word-processing documents to XHtml.  Just for fun, as I go, I'm going to post my progress, posting the code, talking about the issues I come across, and in general, being transparent about this development process.  I welcome your thoughts and opinions.  And shortly, we'll have a useful chunk of code that we can use in a variety of cool scenarios.

This code will be part of the PowerTools for Open XML project and will be released under the Ms-PL license.  I'll be posting zip files on the project site.

A few example scenarios:

  • Convert word docs to Html, populate a SharePoint wiki.
  • Find some text in a word doc, grab the three paragraphs before and after, change the formatting of the Open XML text that we found, and convert the chunk to Html, then display the results, with the found text highlighted in some fashion.
  • Build a specialized Html converter for my blog, which puts a 'Copy Code' button above each code snippet.

So, as I start, here are some thoughts and ideas I have about this project:

I'll try to transform to XHTML that is validated against the schema, unless I run into blocking issues.

There already is code plus an XSLT style sheet that can convert Open XML word-processing docs to HTML.  This is the CodePlex/OpenXmlViewer project.  I have different goals from that project – that project aims for high fidelity (the resulting HTML looks as close as possible to the original word-processing document), and is (I think) primarily used as a browser plug-in.  My goals – less effort spent on full fidelity, and more on making this easy and convenient for developers to modify and enhance for specific development efforts.

Also, I want to be able to convert a small selected chunk of a word-processing document, whereas the OpenXmlViewer project converts entire documents.

Finally, I want a chunk of code that is super-easy for a C# developer to customize and incorporate into another application.

I'm going to write this code as a pure functional transform that uses recursion.  After a fairly long selection process, I've settled on this approach for a variety of reasons.  I don't want to use XSLT, as I'm going to add extension points where developers can interject their own custom transformations for specific pieces.  For example, a developer can provide a lambda expression (delegate) for images – the lambda gets an image as an argument – you can do what you want with the image – post it on a server, or whatever, and then return the link to the transform.  This will give wide latitude in how you deal with images.
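
To give a feel for the style, here is a minimal sketch of a recursive transform with an extension point for images.  It is only an illustration: the class name, the element-to-element mapping, and the shape of the imageHandler delegate (it receives the w:drawing element and returns the XHtml element to emit) are assumptions, not the design the finished converter will necessarily have.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Illustrative sketch only; names and mappings are assumptions.
public static class RecursiveTransformSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns new XHtml content for the element, or null to drop it.
    // Nothing in the source tree is modified.
    public static object Transform(XElement element,
        Func<XElement, XElement> imageHandler)
    {
        // w:body becomes a container div
        if (element.Name == W + "body")
            return new XElement(Xhtml + "div",
                element.Elements().Select(e => Transform(e, imageHandler)));

        // w:p becomes h:p
        if (element.Name == W + "p")
            return new XElement(Xhtml + "p",
                element.Elements().Select(e => Transform(e, imageHandler)));

        // w:r contributes only its transformed children
        if (element.Name == W + "r")
            return element.Elements().Select(e => Transform(e, imageHandler));

        // w:t becomes a text node
        if (element.Name == W + "t")
            return new XText(element.Value);

        // extension point: the caller decides what an image turns into
        if (element.Name == W + "drawing")
            return imageHandler(element);

        // anything not handled above is removed
        return null;
    }
}
```

A caller might pass a lambda such as d => new XElement(xhtml + "img", new XAttribute("src", "image1.png")), after saving or uploading the image wherever it likes.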

I've discarded the approach of using annotations for doing document-centric transforms, as it has performance issues when used for large documents.  (Actually, I'm not completely sure about this, but I've had the sense of this as I've written various transforms on larger documents.)  In contrast, pure functional recursive transforms are blindingly fast.  The code that I wrote in the recursive style to accept revisions does no fewer than seven successive transformations, producing entirely new trees, and the code is fast.  On a Dell D600, 2 GHz, single core, it processes extremely large documents (800 pages) in less than a second.

The disadvantage of using recursive LINQ transforms is that not too many developers are comfortable with this style of development.  However, if I do this properly, developers won't need to plumb the depths of the transform, and instead can use it as a 'black box'.  Besides, if I make this process transparent, maybe more developers will understand the power of this approach.

One key aspect of the approach I'll take: I'll accept all tracked revisions before doing the conversion.  This will make my code much simpler to write.  The resulting code will be more robust.  I've been postponing writing the HTML converter until I had a revision accepter that I am satisfied with.

I'm also considering doing an initial transformation to the simplest word-processing markup possible.  For example, if there are two adjacent runs that have the same formatting, I can combine them into a single run.  I'll also discard superfluous markup, such as proofing errors.  If I simplify the markup, then there are more possibilities for straight one-to-one conversions between the Open XML markup and HTML.
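
A simplification pass along these lines might look like the following sketch.  The names are hypothetical, and it makes simplifying assumptions: it works on one paragraph at a time, it only merges runs that contain nothing but w:rPr and w:t, and it drops w:proofErr elements so that they don't separate otherwise adjacent runs.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Illustrative sketch only; not code from PowerTools for Open XML.
public static class RunMergeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A run is mergeable when it holds only formatting (w:rPr) and text (w:t).
    static bool IsSimpleTextRun(XElement e)
    {
        return e.Name == W + "r" &&
            e.Elements().All(c => c.Name == W + "rPr" || c.Name == W + "t");
    }

    // Two runs have the same formatting when their w:rPr elements are
    // both absent or structurally identical.
    static bool SameFormatting(XElement a, XElement b)
    {
        XElement pa = a.Element(W + "rPr"), pb = b.Element(W + "rPr");
        if (pa == null || pb == null) return pa == pb;
        return XNode.DeepEquals(pa, pb);
    }

    static string TextOf(XElement run)
    {
        return string.Concat(run.Elements(W + "t").Select(t => t.Value));
    }

    // Returns a new paragraph; the original is left untouched.
    public static XElement MergeAdjacentRuns(XElement paragraph)
    {
        var merged = new List<XElement>();
        foreach (var child in paragraph.Elements()
            .Where(e => e.Name != W + "proofErr"))
        {
            var prev = merged.LastOrDefault();
            if (prev != null && IsSimpleTextRun(prev) &&
                IsSimpleTextRun(child) && SameFormatting(prev, child))
            {
                // Replace the previous run with one run holding both texts.
                merged[merged.Count - 1] = new XElement(W + "r",
                    prev.Element(W + "rPr"),
                    new XElement(W + "t",
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        TextOf(prev) + TextOf(child)));
            }
            else
            {
                merged.Add(child);
            }
        }
        return new XElement(paragraph.Name, paragraph.Attributes(), merged);
    }
}
```

The xml:space="preserve" attribute is added so that leading or trailing spaces in the combined text survive.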

I think that it would be useful to preserve bookmarks and internal links, and construct the corresponding markup in HTML.

For this initial version, I'm going to discard comments.  It could be interesting to build a conversion that surfaces comments, but this isn't one of the main scenarios.  In most cases, we don't want comments placed into the resulting HTML, I think.

I probably will also discard footnotes and endnotes in the resulting HTML.  These are interesting, but probably only to a small subset of developers.  If there is a lot of demand for this, then I can enhance the code later to incorporate these conversions.  But I'd have to decide how I would want them to be rendered in HTML, and this is a more complicated decision.

I want this code to have the highest fidelity that I can accomplish without jumping through too many hoops.  Key goals – preservation of textual content – if there's text in the source document, the text shows up in the same place, with the same font, in the resulting HTML.  Images should convert and show up in the same place, and if there is an easy way to make the text flow in the same fashion as the source document, the HTML will do so.

Because I'm writing this in the pure functional recursive style, you can almost prove that a) the code can't fail, because I'll make every effort to reduce 'points of possible failure', and b) the code can only produce valid XHtml.  This adds robustness and reliability to applications that use this code.

Finally, I want to write this transform in the smallest amount of code possible.  My off-the-cuff estimate is that the conversion should be 1000 lines of code or less.  But we'll see how large the code becomes as I progress.

Anyway, on to more research and coding.  I'll post the next update in about a week.  This is going to be fun!

New Open XML Developer Center on MSDN

As most developers who work with Microsoft technologies know, there are Developer Centers on MSDN.  Each developer center, run by a team devoted to the technology, contains the best resources for that technology, including links to SDK downloads, documentation, whitepapers, code samples on Code Gallery and CodePlex, bloggers and their posts, forums, and more.  These developer centers are great resources, and should often be the first stop in your search for a solution to your developer challenges.

I'm happy to announce that we've revamped our Open XML Developer Center.  The new dev center is much better than the old one, including links to many more resources, blog posts, tools, screen-casts, visual How-To articles (articles with an accompanying screen-cast), and more.  There are also some appearance changes, but most important is the list of content, assembled in one place, that can help you find what you need.

The most important feature of the new Open XML Developer Center is the navigation panel:

This is a complete list of all the pages in the Developer Center.  That's it – 16 pages, so it's easy to navigate. (Some of them have a lot of material on them, so you'll need to scroll to see it all.)

Here's my quick summary of what's available: