[Table of Contents] [Next Topic] [Blog Map] This blog is inactive. New blog: EricWhite.com/blog
Our next goal is to retrieve the text of the paragraphs in the document. Text is stored in the "t" nodes that are contained in "r" nodes that are children of the paragraph node. Text may be broken up into multiple "t" nodes, so we have to concatenate all of the text in the "t" nodes.
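As a minimal sketch of what that concatenation means, the snippet below parses a made-up paragraph whose sentence is split across three runs and joins the "t" values back together. The module name, the `SamplePara` and `GetText` helpers, and the sample markup are illustrative assumptions, not part of the program built in this tutorial.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module SplitRunsSketch
    ' WordprocessingML main namespace, the same one the article binds to "w".
    Private ReadOnly w As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical paragraph: one sentence split across three runs.
    Public Function SamplePara() As XElement
        Return XElement.Parse( _
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" & _
            "<w:r><w:t>Hello</w:t></w:r>" & _
            "<w:r><w:t xml:space='preserve'>, </w:t></w:r>" & _
            "<w:r><w:t>world</w:t></w:r>" & _
            "</w:p>")
    End Function

    ' Concatenates every w:t found under the w:r children of a paragraph.
    Public Function GetText(ByVal para As XElement) As String
        Return String.Concat( _
            para.Elements(w + "r") _
                .Elements(w + "t") _
                .Select(Function(t) t.Value))
    End Function

    Sub Main()
        ' Prints: Hello, world
        Console.WriteLine(GetText(SamplePara()))
    End Sub
End Module
```

Reading any single "t" node here would return only a fragment of the sentence, which is why the query has to gather all of them.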
Even though we could modify our query to include the code that extracts the text of each paragraph, for demonstration purposes we're going to approach the problem in a different way. We're going to write a new query that uses our first query as its source. Due to lazy evaluation, this is basically as efficient as modifying the first query directly. The approach creates more short-lived objects on the heap, but if it makes our code clearer, it is a good tradeoff.
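The lazy-evaluation point can be checked with a small sketch that is independent of Open XML. It layers one query on another and counts how often the first projection runs. The `Measure` helper, the `ProjectionCalls` counter, and the word list are hypothetical stand-ins for the per-paragraph work, chosen only for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module LazyLayeringSketch
    ' Counts how many times the first query's projection has actually run.
    Public ProjectionCalls As Integer = 0

    ' Hypothetical stand-in for the per-paragraph work done by the first query.
    Private Function Measure(ByVal word As String) As Integer
        ProjectionCalls += 1
        Return word.Length
    End Function

    ' Builds a second query on top of a first one without enumerating either.
    Public Function BuildLabels(ByVal words As IEnumerable(Of String)) _
        As IEnumerable(Of String)
        Dim measured = words.Select(Function(s) _
            New With {.Word = s, .Length = Measure(s)})
        Return measured.Select(Function(m) m.Word & ":" & m.Length.ToString())
    End Function

    Sub Main()
        Dim words() As String = {"alpha", "beta", "gamma"}
        Dim labels = BuildLabels(words)
        ' Both queries are defined, but nothing has been projected yet: prints 0.
        Console.WriteLine(ProjectionCalls)
        For Each label In labels
            Console.WriteLine(label)
        Next
        ' Each source item went through the first projection exactly once: prints 3.
        Console.WriteLine(ProjectionCalls)
    End Sub
End Module
```

Each item flows through both projections in a single pass when the loop runs, so stacking the second query on the first does not cause the source to be walked twice.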
We can add the query to our program, as follows:
    Dim defaultStyle As String = _
        CStr( _
            ( _
                From style In styleDoc.Root _
                    .Elements(w + "style") _
                Where ( _
                    CStr(style.Attribute(w + "type")) = "paragraph" AndAlso _
                    CStr(style.Attribute(w + "default")) = "1") _
            ) _
            .First() _
            .Attribute(w + "styleId") _
        )

    Dim paragraphs = _
        mainPartDoc.Root _
            .Element(w + "body") _
            .Descendants(w + "p") _
            .Select(Function(p) _
                New With { _
                    .ParagraphNode = p, _
                    .Style = GetParagraphStyle(p, defaultStyle) _
                } _
            )

    Dim paragraphsWithText = _
        paragraphs.Select(Function(p) _
            New With { _
                .ParagraphNode = p.ParagraphNode, _
                .Style = p.Style, _
                .Text = p.ParagraphNode _
                    .Elements(w + "r") _
                    .Descendants(w + "t") _
                    .StringConcatenate(Function(s) CStr(s)) _
            } _
        )
The above code uses the StringConcatenate aggregate operator that we showed in the aggregation topic.
One of the features of Open XML is that a user can turn on the "Track Changes" feature, and the document will track all changes to text. The above code would only work if there were no tracked changes. However, it is easy to modify our code so that we retrieve the correct text for each paragraph regardless of whether there are tracked changes or not. To do this, we need to find all of the children of the w:p element that have the name w:r or w:ins, and ignore all other elements. We can modify the last of the three above queries, as follows:
    Dim paragraphsWithText = _
        paragraphs.Select(Function(p) _
            New With { _
                .ParagraphNode = p.ParagraphNode, _
                .Style = p.Style, _
                .Text = p.ParagraphNode _
                    .Elements() _
                    .Where(Function(z) z.Name = w + "r" OrElse z.Name = w + "ins") _
                    .Descendants(w + "t") _
                    .StringConcatenate(Function(s) CStr(s)) _
            } _
        )
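To see the difference the filter makes, here is a minimal sketch that runs both selections over a made-up paragraph containing one tracked insertion and one tracked deletion. The sample markup, the module name, and the two helper functions are assumptions for illustration, and the sketch uses `String.Concat` instead of the StringConcatenate extension so that it stands alone.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TrackedInsertSketch
    Private ReadOnly w As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical paragraph with one tracked insertion and one tracked deletion.
    Public Function SamplePara() As XElement
        Return XElement.Parse( _
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" & _
            "<w:r><w:t xml:space='preserve'>The quick </w:t></w:r>" & _
            "<w:ins w:id='1' w:author='Reviewer'>" & _
            "<w:r><w:t xml:space='preserve'>brown </w:t></w:r></w:ins>" & _
            "<w:del w:id='2' w:author='Reviewer'>" & _
            "<w:r><w:delText xml:space='preserve'>red </w:delText></w:r></w:del>" & _
            "<w:r><w:t>fox</w:t></w:r>" & _
            "</w:p>")
    End Function

    ' Text from direct w:r children only, as in the first version of the query.
    Public Function TextFromRunsOnly(ByVal para As XElement) As String
        Return String.Concat( _
            para.Elements(w + "r") _
                .Descendants(w + "t") _
                .Select(Function(t) t.Value))
    End Function

    ' Text from w:r and w:ins children, as in the revised query.
    Public Function TextWithInsertions(ByVal para As XElement) As String
        Return String.Concat( _
            para.Elements() _
                .Where(Function(z) z.Name = w + "r" OrElse z.Name = w + "ins") _
                .Descendants(w + "t") _
                .Select(Function(t) t.Value))
    End Function

    Sub Main()
        Dim para = SamplePara()
        ' Prints: The quick fox
        Console.WriteLine(TextFromRunsOnly(para))
        ' Prints: The quick brown fox
        Console.WriteLine(TextWithInsertions(para))
    End Sub
End Module
```

In this sample the deleted text lives in a `w:delText` element rather than a `w:t` element, so neither selection picks it up.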
Comparing names inside the Where clause introduces a small issue. In LINQ to XML, all names are atomized; that is, if two XName objects are in the same namespace and have the same local name, they share the same instance. It takes a little bit of work for LINQ to XML to look up the atomized name each time it builds an XName, and the Where clause above does so for every child element. In certain scenarios in LINQ to XML, atomization can be a significant percentage of processor time. You can easily minimize this. A separate post describes atomization in more detail. If we pre-atomize our names, our query will execute faster, at least in theory. In practice, I can't say that I've ever been in a situation where this would make a difference, but when processing huge files, it might. In general, when I have code like this, I pre-atomize my XName objects:
    Dim r As XName = w + "r"
    Dim ins As XName = w + "ins"

    Dim paragraphsWithText = _
        paragraphs.Select(Function(p) _
            New With { _
                .ParagraphNode = p.ParagraphNode, _
                .Style = p.Style, _
                .Text = p.ParagraphNode _
                    .Elements() _
                    .Where(Function(z) z.Name = r OrElse z.Name = ins) _
                    .Descendants(w + "t") _
                    .StringConcatenate(Function(s) CStr(s)) _
            } _
        )
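Atomization itself is easy to observe. The following sketch, with hypothetical module and function names, checks that two names built separately from the same namespace and local name are one and the same object, and that an element parsed from text carries that same instance as its Name.

```vb
Imports System
Imports System.Xml.Linq

Module AtomizationSketch
    Private ReadOnly w As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' True when two separately built names are the very same XName object.
    Public Function NamesShareInstance() As Boolean
        Dim nameA As XName = w + "r"
        Dim nameB As XName = w + "r"
        Return Object.ReferenceEquals(nameA, nameB)
    End Function

    ' True when an element parsed from text carries that same instance as its Name.
    Public Function ParsedNameSharesInstance() As Boolean
        Dim r As XName = w + "r"
        Dim el As XElement = XElement.Parse( _
            "<w:r xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'/>")
        Return Object.ReferenceEquals(el.Name, r)
    End Function

    Sub Main()
        ' Both lines print True.
        Console.WriteLine(NamesShareInstance())
        Console.WriteLine(ParsedNameSharesInstance())
    End Sub
End Module
```

Because the instances are shared, hoisting `w + "r"` into a local variable saves only the repeated lookup; the comparison in the Where clause gives the same result either way.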
The complete program now looks like this.
    Imports System
    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq
    Imports System.Text
    Imports System.Xml
    Imports System.Xml.Linq
    Imports DocumentFormat.OpenXml.Packaging

    Module Module1
        ' Returns the element's path from the root, built from local names.
        <System.Runtime.CompilerServices.Extension()> _
        Public Function GetPath(ByVal el As XElement) As String
            Return el _
                .AncestorsAndSelf _
                .InDocumentOrder _
                .Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
        End Function

        ' Projects each item to a string and concatenates the results.
        <System.Runtime.CompilerServices.Extension()> _
        Function StringConcatenate(Of T) _
            (ByVal source As IEnumerable(Of T), _
             ByVal projectionFunc As Func(Of T, String)) _
            As String
            Return source.Aggregate(New StringBuilder, _
                Function(sb, i) sb.Append(projectionFunc(i)), _
                Function(sb) sb.ToString)
        End Function

        ' Loads the XML content of a package part into an XDocument.
        Public Function LoadXDocument(ByVal part As OpenXmlPart) _
            As XDocument
            Using streamReader As StreamReader = New StreamReader(part.GetStream())
                Using reader As XmlReader = XmlReader.Create(streamReader)
                    Return XDocument.Load(reader)
                End Using
            End Using
        End Function

        ' Returns the paragraph's style ID, or the default style if none is set.
        Public Function GetParagraphStyle(ByVal para As XElement, _
            ByVal defaultStyle As String) As String
            Dim w As XNamespace = _
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            Dim paraStyle = CStr(para.Elements(w + "pPr") _
                .Elements(w + "pStyle") _
                .Attributes(w + "val") _
                .FirstOrDefault())
            If (paraStyle Is Nothing) Then
                Return defaultStyle
            Else
                Return paraStyle
            End If
        End Function

        Sub Main()
            Dim w As XNamespace = _
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            Dim filename As String = "SampleDoc.docx"
            Using wordDoc As WordprocessingDocument = _
                WordprocessingDocument.Open(filename, True)
                Dim mainPart As MainDocumentPart = _
                    wordDoc.MainDocumentPart
                Dim styleDefinitionPart As StyleDefinitionsPart = _
                    mainPart.StyleDefinitionsPart
                ' Assumes the document has comments; the part is Nothing otherwise.
                Dim commentsPart As WordprocessingCommentsPart = _
                    mainPart.WordprocessingCommentsPart
                Dim mainPartDoc As XDocument = LoadXDocument(mainPart)
                Dim styleDoc As XDocument = LoadXDocument(styleDefinitionPart)
                Dim commentsDoc As XDocument = LoadXDocument(commentsPart)

                Dim defaultStyle As String = _
                    CStr( _
                        ( _
                            From style In styleDoc.Root _
                                .Elements(w + "style") _
                            Where ( _
                                CStr(style.Attribute(w + "type")) = "paragraph" AndAlso _
                                CStr(style.Attribute(w + "default")) = "1") _
                        ) _
                        .First() _
                        .Attribute(w + "styleId") _
                    )

                Dim paragraphs = _
                    mainPartDoc.Root _
                        .Element(w + "body") _
                        .Descendants(w + "p") _
                        .Select(Function(p) _
                            New With { _
                                .ParagraphNode = p, _
                                .Style = GetParagraphStyle(p, defaultStyle) _
                            } _
                        )

                ' Pre-atomize the names compared in the Where clause below.
                Dim r As XName = w + "r"
                Dim ins As XName = w + "ins"

                Dim paragraphsWithText = _
                    paragraphs.Select(Function(p) _
                        New With { _
                            .ParagraphNode = p.ParagraphNode, _
                            .Style = p.Style, _
                            .Text = p.ParagraphNode _
                                .Elements() _
                                .Where(Function(z) z.Name = r OrElse z.Name = ins) _
                                .Descendants(w + "t") _
                                .StringConcatenate(Function(s) CStr(s)) _
                        } _
                    )

                For Each p In paragraphsWithText
                    Console.WriteLine("{0} {1}", _
                        p.Style.PadRight(12), _
                        p.Text)
                Next
            End Using
        End Sub
    End Module