Reposting from http://openxmldeveloper.org/articles/MOSSconvert.aspx.

Article submitted by: Robert Orleth, Microsoft

MOSS (Microsoft Office SharePoint Server) has a feature that lets you convert OpenXML files to web pages. All you need to do is select “Convert Document” from the dropdown when you look at a document library.

However, the conversion to HTML has a problem: HTML has no support for embedded images; all it can do is link to images. This means that in a straight conversion from OpenXML to HTML, links to images (which you create in Word by opening the dropdown on the Insert button and clicking “Link To File”) get transferred into the HTML, but embedded images (inserted using the default option, “Insert”) get dropped.

Thanks to the structure of OpenXML, it would actually be easy to extract the images out of the .docx file and upload them, but in the current infrastructure of MOSS, there is no way to directly link those images to the document in a way that:

  • Follows in the same workflow as the document
  • Takes the same scheduling settings as the document

Hence this is not done as part of the page conversion process in MOSS. In some cases (e.g. when you’re not using workflow or scheduling) you may be willing to live with those shortcomings, or to work around them (e.g. by scheduling the images separately by hand). For that case, you can write a little tool that extracts the images from a .docx file that lives in a SharePoint document library, uploads those images to the same library, and then fixes up the .docx to point to the uploaded versions of the images. When you open the changed .docx file, it should look precisely the same as before, but the images are now linked rather than only embedded, and if you run the conversion to web page, you’ll get the full web page, including your images.

This article presents the full source code for a simple version of such a tool. You should be able to copy and paste the code fragments below into a single file that you hook up to your liking: either a command line application, or something with a fancier UI that has knobs and switches to control some of the policy decisions that I can’t make for you. Another attractive option may be to create your own workflow action that manages the whole conversion process, including embedded images; you can use this as a starting point to get there.

To get started, you first need to instantiate the SharePoint object that represents the file and get its stream. In addition to the usual System, System.IO, System.Collections.Generic and System.Xml namespaces, the using statements you’ll need are:

using Metro = System.IO.Packaging;
using Microsoft.SharePoint;

 

The corresponding references you need to add are “Microsoft.SharePoint” (which after MOSS installation lives in %ProgramFiles%\common files\microsoft shared\Web Server Extensions\12\ISAPI) and WindowsBase, which Visual Studio offers as an option when you are running on Vista, or on Windows 2003/XP with the .NET Framework 3.0 installed.

The code to get the stream is this (assuming fileName is a string containing the fully qualified URL of the file, provided by whatever framework you hook this into; it might be as simple as a command line parameter):

using (SPSite site = new SPSite(fileName))
using (SPWeb web = site.OpenWeb())
{
    SPFile webFile = web.GetFile(fileName);
    Stream fileStream = webFile.OpenBinaryStream();

 

Note the using statements (their block should be closed at the far end of the code, once you’re done with the SPFile object). They make sure that the site and web objects are explicitly disposed when you’re done with them; if you don’t do that, you’ll effectively introduce a memory leak, which is a problem if, for example, you’re building a workflow.

Great, now we have the stream. So far we have only used SharePoint APIs, but this is about to change: we’ll read the stream into a ZipPackage object, because that’s what a .docx file is:

Metro.ZipPackage pack = (Metro.ZipPackage) Metro.ZipPackage.Open(fileStream, 
    System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite);

 

We need to work on the relationships of the Word content, so we’ll enumerate those. The enumeration by type takes the relationship type URI of the main document. I am defining this here, along with a couple of things to be used further down the line:

private const string wordDocId = 
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
private const string imageId = 
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";
const string WordMlNamespaceUri = 
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
const string RelationShipUri = 
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
private const string linkedImageRelId = "r:link";
private const string embeddedImageRelId = "r:embed";

 

I have the definitions – get the collection of Word relationships (I’ve only ever seen single instances, but then there might be more):

Metro.PackageRelationshipCollection packRels = 
    pack.GetRelationshipsByType(wordDocId);

 

Looping through the parts, we’ll grab the Word document part using its URI – there’s a neat little helper function to manage the URI handling:

foreach (Metro.PackageRelationship r in packRels)
{
    Uri uri = Metro.PackUriHelper.CreatePartUri(r.TargetUri);
    Metro.ZipPackagePart wordDoc = (Metro.ZipPackagePart)
        pack.GetPart(uri);

 

and we’ll enumerate the images that are referenced by this Word part (imageId definition see above):

    Metro.PackageRelationshipCollection docRels = 
        wordDoc.GetRelationshipsByType(imageId);

 

Because I’ll be enumerating the image relationships and also want to make changes to that collection, I’ll introduce two storage helper objects, so that I can first loop through the relationships and record what I need to do, and then go and do it. externalFileMapping maps the internal relationship ID to the URL of the uploaded external image, and externalRelationships maps each previously internal relationship ID to what is now its external relationship ID:

Dictionary<string, string> externalFileMapping =
    new Dictionary<string, string>();
Dictionary<string, string> externalRelationships =
    new Dictionary<string, string>();

 

Now, looping through the image relationships, I’ll take the internal ones, upload each image, and store the resulting external file URL. Technically, the assumption I am making, that all existing external links are reachable from the web page, is incorrect to begin with (imagine you put in a link to c:\mypic.jpg; that’s not going to resolve well). Resolving that, however, involves asking the user where the external links should point, and what form that should take very much depends on your environment (such as what machine the document was created on).

To store the image outside of the OpenXML file, I need to generate a filename for the image. This can actually be a non-trivial exercise, depending on how you intend to manage the image files that get created. Here I only take the internal name and stuff the file into the same document library as the .docx file. Your application may require images to live somewhere else, and you might want to add some logic, such as a serial number or a conversion timestamp, that makes the image file name unique in case you want to run this tool multiple times. Making it unique, however, means that you should clean up the image files that are no longer referenced to avoid bloating your storage, and for that you need some way of identifying which images are obsolete. There are many ways to solve that problem (e.g. naming conventions, storing the image file names as SharePoint properties on the document, storing created images in a per-document folder, etc.). Here I pick the simplest one, as the purpose is to demonstrate how the OpenXML file can be manipulated to provide a basis for solving the scenario.
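As an illustration of the naming-convention option, here is a minimal sketch. MakeImageFileName is a hypothetical helper that is not part of the tool in this article, and prefixing the image with the document name is just one assumption about what might suit your library. The name stays the same across runs, so a re-run would replace the earlier upload (provided the upload call is asked to overwrite) instead of leaving orphaned copies behind.

```csharp
using System;
using System.IO;

static class ImageNaming
{
    // Hypothetical helper (an assumption, not part of the article's tool):
    // builds the external image name from the document's file name and the
    // image's internal name, e.g. "report.docx" + "image1.png".
    public static string MakeImageFileName(string docFileName, string internalImageName)
    {
        // Strip the ".docx" extension so it does not end up mid-name
        string docBase = Path.GetFileNameWithoutExtension(docFileName);
        return docBase + "_" + internalImageName;
    }
}
```

Two documents in the same folder can then both contain an image1.png without overwriting each other's uploads.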

Looping through the image relationship of the document part that we’re looking at:

foreach (Metro.PackageRelationship i in docRels)
{
    // only interested in internal relationships
    if (i.TargetMode == System.IO.Packaging.TargetMode.Internal)
    {
        Uri internalImageUri = Metro.PackUriHelper.ResolvePartUri(uri,i.TargetUri);

        // grab the part that contains the internal image
        Metro.ZipPackagePart imagePart = (Metro.ZipPackagePart)
            pack.GetPart(internalImageUri);

        // as external file name just take the internal name and 
        // put it in the same folder as the original file
        string imageFileName = internalImageUri.OriginalString.Substring(
            internalImageUri.OriginalString.LastIndexOf('/')+1);

        SPFile imageFile = webFile.ParentFolder.Files.Add(imageFileName,
            imagePart.GetStream(), true);

        externalFileMapping.Add(i.Id, imageFile.ServerRelativeUrl);
    }
}

 

OK, now that we’re done looping through the relationships, have uploaded the images, and have stored which internal ID maps to which now-external file, we can touch the collection and actually add the new external relationships:

foreach (KeyValuePair<string, string> externalFileEntry in
        externalFileMapping)
{
    Metro.PackageRelationship extRel = wordDoc.CreateRelationship(
        new Uri(externalFileEntry.Value, UriKind.Relative), 
        System.IO.Packaging.TargetMode.External, imageId);
    externalRelationships.Add(externalFileEntry.Key, extRel.Id);
}

 

I need to squirrel away which internal ID now has which external relationship ID – that’s the purpose of the externalRelationships helper dictionary.
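If you want to experiment with the relationship handling without a SharePoint server, the collect-then-modify pattern can be tried against an in-memory package. The following is only a sketch under stated assumptions: RelationshipDemo, AddExternalTwins and the baseUrl parameter are hypothetical names, and the upload step is replaced by simply prefixing the image name with baseUrl.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

static class RelationshipDemo
{
    public const string ImageRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";

    // Hypothetical helper: for every internal image relationship of the part,
    // creates an external relationship pointing at baseUrl + image name.
    // Returns a map from the internal relationship ID to the new external ID.
    public static Dictionary<string, string> AddExternalTwins(PackagePart part, string baseUrl)
    {
        // Pass 1: only read the collection and remember what has to be added
        Dictionary<string, Uri> pending = new Dictionary<string, Uri>();
        foreach (PackageRelationship rel in part.GetRelationshipsByType(ImageRelType))
        {
            if (rel.TargetMode == TargetMode.Internal)
            {
                string target = rel.TargetUri.OriginalString;
                string name = target.Substring(target.LastIndexOf('/') + 1);
                pending.Add(rel.Id, new Uri(baseUrl + name, UriKind.Relative));
            }
        }

        // Pass 2: the enumeration is finished, so the collection can be changed
        Dictionary<string, string> map = new Dictionary<string, string>();
        foreach (KeyValuePair<string, Uri> entry in pending)
        {
            PackageRelationship ext = part.CreateRelationship(
                entry.Value, TargetMode.External, ImageRelType);
            map.Add(entry.Key, ext.Id);
        }
        return map;
    }
}
```

A package created with Package.Open on a MemoryStream, with one part and one internal image relationship, is enough to exercise this.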

So, now we have the images extracted and we have the .docx file fixed up with relationships that point to those images. What’s missing? Oh yeah, we need to change every reference that so far points to the internal image so that it references the external image. In theory, this could be completely transparent to the body of the document if there was indeed no difference in the way external and internal images are treated, but there are subtle differences. An image that’s embedded says:

    <a:blip r:embed="rIdx"/>

 

(with x being a number), whereas a linked image says:

    <a:blip r:link="rIdx"/>

 

And one that’s both linked AND embedded says:

    <a:blip r:embed="rIdx" r:link="rIdy"/>

 

Incidentally, this is what we want as outcome – we don’t want to delete the internal image, but we want to make it possible for the conversion to find the external one.

So now we need to find all the r:embed image attributes on images that do not have an r:link attribute and put an r:link pointing to our shiny new external relationship. Here’s how that works – I start by loading an XmlDocument object from the document stream:

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(wordDoc.GetStream());

 

Now, in order to keep all the namespaces straight, I need a namespace manager object that I load with the proper namespace definitions:

XmlNamespaceManager nsm = new XmlNamespaceManager(xmlDoc.NameTable);

nsm.AddNamespace("w", WordMlNamespaceUri);
nsm.AddNamespace("r", RelationShipUri);
nsm.AddNamespace("a", 
    "http://schemas.openxmlformats.org/drawingml/2006/main");
nsm.AddNamespace("pic", 
    "http://schemas.openxmlformats.org/drawingml/2006/picture");
nsm.AddNamespace("wp", 
    "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing");

 

The xpath query that gets me the list of nodes that I am interested in here for this main scenario is w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip, so:

XmlNodeList linkNodes = xmlDoc.SelectNodes(
    "//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip", nsm);

 

Now loop through each of the nodes, check whether it has just an r:embed attribute, and if so, add an r:link attribute pointing to the new external ID that I’ve stored in the externalRelationships helper object (the little helper function getAttributeValue() is defined further down; it’s basically just a macro):

foreach (XmlNode linkDataNode in linkNodes)
{
    string linkedId = getAttributeValue(linkDataNode, linkedImageRelId);
    string embeddedId = getAttributeValue(linkDataNode, embeddedImageRelId);

    if (String.IsNullOrEmpty(linkedId) && !String.IsNullOrEmpty(embeddedId))
    {
        if (externalRelationships.ContainsKey(embeddedId))
        {
            XmlAttribute externalLinkAttr = xmlDoc.CreateAttribute(
                linkedImageRelId, RelationShipUri);
            externalLinkAttr.Value = externalRelationships[embeddedId];
            linkDataNode.Attributes.Append(externalLinkAttr);
        }
    }
}

 

The little helper function getAttributeValue is reproduced here:

private static string getAttributeValue(XmlNode node, string name)
{
    string value = string.Empty;
    XmlAttribute attribute = node.Attributes[name];
    if (attribute != null && attribute.Value != null)
    {
        value = attribute.Value;
    }
    return value;
}
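Note that getAttributeValue looks attributes up by their qualified name (such as "r:link"), so it relies on the document binding the prefix r to the relationships namespace. A variant that matches on the namespace URI instead might look like the sketch below. BlipLinker and AddLinks are hypothetical names, and the simplified //a:blip query is an assumption made to keep the example short; it is not the query used above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class BlipLinker
{
    public const string RelNs =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    public const string DrawingNs =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Hypothetical helper: adds a link attribute (in the relationships
    // namespace) to every a:blip that has an embed attribute but no link
    // attribute. Returns the number of elements that were changed.
    public static int AddLinks(XmlDocument doc, IDictionary<string, string> embedToLink)
    {
        XmlNamespaceManager nsm = new XmlNamespaceManager(doc.NameTable);
        nsm.AddNamespace("a", DrawingNs);
        nsm.AddNamespace("r", RelNs);

        // Snapshot the matches first, so that editing the document
        // cannot disturb the XPath iteration
        List<XmlElement> blips = new List<XmlElement>();
        foreach (XmlElement e in doc.SelectNodes("//a:blip[@r:embed and not(@r:link)]", nsm))
        {
            blips.Add(e);
        }

        int changed = 0;
        foreach (XmlElement blip in blips)
        {
            // Namespace-aware lookup: works whatever prefix the file uses
            string embedId = blip.GetAttribute("embed", RelNs);
            string linkId;
            if (embedToLink.TryGetValue(embedId, out linkId))
            {
                XmlAttribute link = doc.CreateAttribute("r", "link", RelNs);
                link.Value = linkId;
                blip.Attributes.Append(link);
                changed++;
            }
        }
        return changed;
    }
}
```

Passing the externalRelationships dictionary as embedToLink would give the same result as the loop above for documents that use the r prefix.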

 

If you’re dealing with “upgraded” documents (i.e. documents that were created in earlier versions of Word and then saved as .docx), the images embedded in them are linked differently. The XPath query to get at the node holding the relationship is //w:pict/v:shape/v:imagedata, and the relationship attribute is called r:id, regardless of whether it’s an internal or an external relationship. So all we need to do here is change what r:id points to:

nsm.AddNamespace("v", "urn:schemas-microsoft-com:vml");

XmlNodeList compatibleLinkNodes = 
    xmlDoc.SelectNodes("//w:pict/v:shape/v:imagedata", nsm);

foreach (XmlNode linkDataNode in compatibleLinkNodes)
{
    XmlAttribute attribute = linkDataNode.Attributes["r:id"];

    if (attribute != null && attribute.Value != null)
    {
        if (!String.IsNullOrEmpty(attribute.Value) && 
            externalRelationships.ContainsKey(attribute.Value))
        {
            attribute.Value = externalRelationships[attribute.Value];
        }
    }
}

 

All done – in memory. Saving this back:

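// FileMode.Create truncates the part, so no stale bytes of the old XML remain
xmlDoc.Save(wordDoc.GetStream(System.IO.FileMode.Create, System.IO.FileAccess.Write));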

 

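And, once the loop over the document parts has finished, flush the package, save the stream back to SharePoint, and close the blocks that were opened at the start:

} // end of the foreach over packRels

    pack.Flush();
    webFile.SaveBinary(fileStream);
    fileStream.Close();
} // end of the using block for site and web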

 

After running this on your document with embedded images, you can easily convert that document to a web page – though you’ll need to manage the images on your own, which may or may not be a problem in your context. I hope this helps somebody out to get started – feedback appreciated!