One of the more common scenarios for Wordprocessing documents is the need to sanitize a document by removing personally identifiable information. What do I mean by personally identifiable information? Among other things, I am talking about comments, revisions, hidden text, and personal information such as the author name. This type of content may need to be stripped out of a document before it is sent outside a corporation.
This scenario is so important to Office that we added a Document Inspector feature in Office 2007, which is able to find and remove these types of personally identifiable information. You can find this feature by clicking the Office button | Prepare | Inspect Document. Here is what the feature looks like:
How do I perform the same actions programmatically, let's say on the server? Well, here is where the Open XML SDK can help. Today I am going to show you how to remove comments within a Wordprocessing document. This post is similar to Eric's post on using LINQ to remove comments from a document, except I will show you a solution that builds on top of version 2 of the Open XML SDK.
Imagine I have a document that has multiple comments, where some of the comments may even contain images. If you crack open the package you will notice that a Wordprocessing document that contains comments will have the following content:
Here is a screenshot of an example document with comments:
To remove comments from a Wordprocessing document we need to take the following actions:
My post will talk about using version 2 of the SDK.
If you just want to jump straight into the code, feel free to download this solution here.
The following code snippet accomplishes all six tasks discussed in the Solution section above. It builds upon some of the topics discussed in the Traversing in the Open XML SDK DOM and Open XML SDK... The Basics posts. In particular, the Descendants() method is used to find the specific elements associated with comments, and the generic OpenXmlElement class is used to manipulate them. Another thing to note is that deleting a part via the Open XML SDK deletes not only that part but also all parts referenced by it, so any images embedded in comments go away along with the comments part.
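As a minimal sketch of how these steps could fit together with version 2 of the SDK, consider the helper below. The class and method names (`CommentRemover`, `RemoveComments`) are hypothetical, and the step breakdown in the comments is an assumption about how the work divides up rather than a definitive implementation. It assumes the comment markers to remove are the `CommentRangeStart`, `CommentRangeEnd`, and `CommentReference` elements in the main document part.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class CommentRemover
{
    // Hypothetical helper: strips all comments from the document at 'path'.
    public static void RemoveComments(string path)
    {
        // Open the document for editing (second argument: isEditable).
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, true))
        {
            // Get the main document part, which owns the comments part.
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // Delete the comments part if the document has one.
            WordprocessingCommentsPart commentsPart = mainPart.WordprocessingCommentsPart;
            if (commentsPart != null)
            {
                mainPart.DeletePart(commentsPart);
            }

            // Find every comment marker in the body. ToList() materializes the
            // results so the tree is not modified while it is being enumerated.
            List<OpenXmlElement> markers = mainPart.Document
                .Descendants()
                .Where(e => e is CommentRangeStart
                         || e is CommentRangeEnd
                         || e is CommentReference)
                .ToList();

            // Detach each marker from its parent element.
            foreach (OpenXmlElement marker in markers)
            {
                marker.Remove();
            }

            // Write the modified DOM back to the main document part.
            mainPart.Document.Save();
        }
    }
}
```

A call such as `CommentRemover.RemoveComments(@"C:\docs\report.docx")` (the path is only an example) would then clean the file in place. Note that this sketch leaves behind the now-empty runs that used to hold a comment reference; they should be harmless, but could be removed as well if a tidier document is wanted.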
Putting everything together and running my code, I end up with a document that is completely devoid of comments. Sweet!
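If you would rather confirm the result from code than by opening the file, a small check along the following lines could help. This is only a sketch: the class and method names (`CommentCheck`, `HasCommentContent`) are hypothetical, and it assumes that "comment content" means either a comments part or any comment range or reference element left in the main document part.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class CommentCheck
{
    // Hypothetical helper: reports whether the document at 'path' still
    // contains a comments part or any comment markers in its body.
    public static bool HasCommentContent(string path)
    {
        // Open read-only; nothing is modified here.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // Is there still a comments part attached to the main part?
            bool hasCommentsPart = mainPart.WordprocessingCommentsPart != null;

            // Are there any leftover comment markers in the document body?
            bool hasMarkers = mainPart.Document
                .Descendants()
                .Any(e => e is CommentRangeStart
                       || e is CommentRangeEnd
                       || e is CommentReference);

            return hasCommentsPart || hasMarkers;
        }
    }
}
```

For a document that has been fully cleaned, `HasCommentContent` should return `false`.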
Here is a screenshot of the final document: