Gray recently posted on his blog about a scanning tool that detects custom XML markup in Word Open XML files (*.docx, *.docm, *.dotm, and *.dotx). The tool is built with the Open XML SDK, and we want to take the opportunity to show you how the solution works.
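The scenario we want to support is: given a directory, find all Word Open XML documents that contain custom XML markup (w:customXml elements). To accomplish this, the solution (1) finds every Word Open XML file in the directory and its subdirectories, (2) opens each file with the Open XML SDK, (3) collects all the parts in the package, (4) scans each part for custom XML markup, and (5) reports the results in a log file.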
If you want to jump straight into the code, feel free to download this solution here.
The solution uses the December 2009 CTP of the Open XML SDK 2.0 for Microsoft Office, which you can learn more about in this introduction to the Open XML SDK. You can certainly use version 1.0 of the Open XML SDK, but version 2.0 makes things a lot easier, especially with all the improvements added to the December 2009 CTP as described in the CTP announcement blog post.
For the sake of simplicity we are going to build a command line solution that expects one argument, which will represent the directory that will be scanned by the tool. Given this directory and all of its subdirectories, we are going to look for all Word Open XML documents, which have the following extensions: .docx, .docm, .dotx, and .dotm. Performing this task is pretty simple with the class DirectoryInfo and the method GetFiles. The only issue is that the GetFiles method only allows you to search for one extension at a time. Files or directories that cannot be scanned, for example due to file permissions, will be reported in a separate list. Here is a code snippet to solve this issue and look for all files given multiple extension types:
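A minimal sketch of this step might look like the following. The class name WordFileFinder and the shape of the errors list are our own assumptions for illustration. The sketch walks the subdirectories by hand rather than passing SearchOption.AllDirectories to GetFiles, because a single inaccessible subdirectory would otherwise abort the whole search.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collects Word Open XML files under a root directory.
static class WordFileFinder
{
    static readonly string[] Patterns = { "*.docx", "*.docm", "*.dotx", "*.dotm" };

    public static List<FileInfo> Find(DirectoryInfo root, List<string> errors)
    {
        var files = new List<FileInfo>();
        Collect(root, files, errors);
        return files;
    }

    static void Collect(DirectoryInfo dir, List<FileInfo> files, List<string> errors)
    {
        try
        {
            // GetFiles takes one search pattern at a time, so loop over the extensions.
            foreach (string pattern in Patterns)
            {
                foreach (FileInfo file in dir.GetFiles(pattern))
                {
                    // Guard against loose wildcard matches by comparing the extension exactly.
                    if (file.Extension.Equals(pattern.Substring(1), StringComparison.OrdinalIgnoreCase))
                        files.Add(file);
                }
            }

            // Recurse manually so one unreadable subdirectory does not stop the scan.
            foreach (DirectoryInfo sub in dir.GetDirectories())
                Collect(sub, files, errors);
        }
        catch (UnauthorizedAccessException ex)
        {
            errors.Add(dir.FullName + "\t" + ex.Message);
        }
        catch (IOException ex)
        {
            errors.Add(dir.FullName + "\t" + ex.Message);
        }
    }
}
```

Calling WordFileFinder.Find(new DirectoryInfo(args[0]), errors) from Main would return the files to scan and fill errors with any directories that could not be read.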
Now that we have all the Word files to scan, our next step is to open each of these documents with the Open XML SDK. The Open XML SDK should be able to handle most Word Open XML files. However, a document occasionally has issues that prevent the SDK from opening it. For example, documents protected with Information Rights Management (IRM) cannot be opened with the Open XML SDK. To ensure our solution keeps running despite these issues, we can simply wrap the SDK Open method in a try/catch block. Any errors detected will be reported in the errors list. Here is the code snippet to accomplish this task:
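As a rough sketch of this step, assuming a current release of the Open XML SDK (the DocumentFormat.OpenXml package), the open call can be wrapped like this. The SafeOpener class, the TryScan method, and the tab-separated error format are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper: opens a document and never lets a bad file stop the scan.
static class SafeOpener
{
    // Opens the file read-only and runs `scan` on it. Returns false and records
    // the failure in `errors` if the SDK cannot open or process the document.
    public static bool TryScan(string path, Action<WordprocessingDocument> scan, List<string> errors)
    {
        try
        {
            // Second argument false = not editable, so the file is never modified.
            using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
            {
                scan(doc);
            }
            return true;
        }
        catch (Exception ex)
        {
            // Corrupt packages, IRM-protected files, locked files, and so on end up here.
            errors.Add(path + "\t" + ex.Message);
            return false;
        }
    }
}
```

Catching the general Exception type is deliberate in this sketch: for a scanning tool, every failure is handled the same way, by logging it and moving on to the next file.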
At this point we have the file opened with the Open XML SDK. The next step is to get all the parts within the package. For this task we are going to leverage some source code from Eric White's post on how to create a list of all parts in an Open XML document. Essentially, we are going to use two methods, GetAllParts and AddPart, to recursively find all XML-based parts within a package. Here is the code snippet necessary to accomplish this task:
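The following is a sketch in the spirit of that approach, not a copy of Eric White's code. The PartWalker class name and the content-type test used to decide what counts as an "XML-based" part are our assumptions.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper: walks the relationship graph of a package.
static class PartWalker
{
    public static List<OpenXmlPart> GetAllParts(WordprocessingDocument doc)
    {
        var visited = new HashSet<OpenXmlPart>();
        var xmlParts = new List<OpenXmlPart>();
        foreach (var pair in doc.Parts)
            AddPart(visited, xmlParts, pair.OpenXmlPart);
        return xmlParts;
    }

    static void AddPart(HashSet<OpenXmlPart> visited, List<OpenXmlPart> xmlParts, OpenXmlPart part)
    {
        // A part can be reachable through several relationships; visit it once.
        if (!visited.Add(part))
            return;

        // Assumption: treat a part as XML-based when its content type ends in "xml".
        if (part.ContentType.EndsWith("xml", StringComparison.OrdinalIgnoreCase))
            xmlParts.Add(part);

        // Recurse into the parts this part has relationships to.
        foreach (var pair in part.Parts)
            AddPart(visited, xmlParts, pair.OpenXmlPart);
    }
}
```

The visited set matters because parts can be shared: without it, the same part would be scanned more than once, and a cycle in the relationships would recurse forever.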
Now that we have all the XML-based parts contained within our Word document, the next step is to scan each of those parts for custom XML markup. This task is pretty easy with the Open XML SDK. All files with detected custom XML markup will be reported in the results list. Here is the code snippet necessary to accomplish this task:
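One way to sketch this scan, under the assumption that matching on the element name is acceptable, is to walk each part's element tree and count every element named customXml in the WordprocessingML namespace. The MarkupScanner class and the "path, tab, count" result format are hypothetical.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper: counts w:customXml elements by name.
static class MarkupScanner
{
    const string WordNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts w:customXml elements in one part. Parts without a typed root
    // element (for example images) count as zero.
    public static int CountInPart(OpenXmlPart part)
    {
        OpenXmlElement root = part.RootElement;
        if (root == null)
            return 0;

        int count = 0;
        foreach (OpenXmlElement element in root.Descendants())
        {
            if (element.LocalName == "customXml" && element.NamespaceUri == WordNamespace)
                count++;
        }
        return count;
    }

    // Adds "path<TAB>count" to results when any part of the file contains markup.
    public static int ScanFile(string path, IEnumerable<OpenXmlPart> parts, List<string> results)
    {
        int total = 0;
        foreach (OpenXmlPart part in parts)
            total += CountInPart(part);

        if (total > 0)
            results.Add(path + "\t" + total);
        return total;
    }
}
```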
The NumberOccurrencesCustomXMLMarkup method simply looks for the following SDK objects related to custom XML markup:
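As an illustration only, here is how such a method could be written against the strongly typed classes in the SDK's DocumentFormat.OpenXml.Wordprocessing namespace that represent w:customXml in its different contexts (block, run, table row, table cell, and ruby). The method signature and the exact set of classes checked are our assumptions, not necessarily what the downloadable solution uses.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical container class for the counting method.
static class CustomXmlCounter
{
    // Returns the number of typed custom XML elements found in one part.
    public static int NumberOccurrencesCustomXMLMarkup(OpenXmlPart part)
    {
        OpenXmlElement root = part.RootElement;
        if (root == null)
            return 0; // the part has no typed root element, so there is nothing to scan

        return root.Descendants<CustomXmlBlock>().Count()   // around paragraphs and tables
             + root.Descendants<CustomXmlRun>().Count()     // around runs inside a paragraph
             + root.Descendants<CustomXmlRow>().Count()     // around table rows
             + root.Descendants<CustomXmlCell>().Count()    // around table cells
             + root.Descendants<CustomXmlRuby>().Count();   // inside ruby (phonetic guide) content
    }
}
```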
Pretty easy stuff!
The last step in the solution is to report the results as a text based log file. Here is the code snippet to accomplish this task:
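A minimal sketch of the reporting step follows. The LogWriter class, the column layout, and the "Status" labels are hypothetical. The sketch assumes each entry in the results and errors lists is already a tab-separated "path, detail" string.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: writes scan results and errors as a tab-delimited log.
static class LogWriter
{
    public static void Write(string logPath, IEnumerable<string> results, IEnumerable<string> errors)
    {
        using (var writer = new StreamWriter(logPath))
        {
            // Header row so the columns are labeled when the log is opened in a spreadsheet.
            writer.WriteLine("Status\tFile\tDetail");

            foreach (string line in results)
                writer.WriteLine("CustomXml\t" + line);

            foreach (string line in errors)
                writer.WriteLine("Error\t" + line);
        }
    }
}
```

Tabs are a convenient delimiter here because file paths cannot contain them, while commas are common in file names. Error messages that contain tabs or line breaks would need to be cleaned up before logging.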
Running this code on a directory produces a tab-delimited log file that lists all the files that contain custom XML markup. Here is a screenshot of what the log file looks like when opened in Microsoft Excel:
Hopefully this solution shows you how easy it is to interrogate an Open XML file with the Open XML SDK.
Brian Jones + Zeyad Rajabi