Retrieving the Text of the Paragraphs - VB

Our next goal is to retrieve the text of the paragraphs in the document. Text is stored in the "t" nodes that are contained in "r" nodes that are children of the paragraph node. Text may be broken up into multiple "t" nodes, so we have to concatenate all of the text in the "t" nodes.
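
To make the markup concrete, here is a minimal sketch of a paragraph whose text is split across two runs, and of the concatenation we need. The paragraph content is made up for illustration, and String.Join stands in for the StringConcatenate operator used later in this topic.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module SplitRunsSketch
    Sub Main()
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Hypothetical paragraph: one sentence split across two runs.
        Dim para As XElement = _
            New XElement(w + "p", _
                New XElement(w + "r", New XElement(w + "t", "Hello, ")), _
                New XElement(w + "r", New XElement(w + "t", "world.")))

        ' Concatenate every "t" node found under the paragraph's "r" children.
        Dim text As String = _
            String.Join("", para.Elements(w + "r") _
                                .Descendants(w + "t") _
                                .Select(Function(e) CStr(e)) _
                                .ToArray())

        Console.WriteLine(text)  ' Hello, world.
    End Sub
End Module
```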

Even though we could modify our query to include the code to extract the text of each paragraph, for demonstration purposes, we're going to approach the problem in a different way.  We're going to write a new query that uses our first query as its source.  Due to lazy evaluation, this is basically as efficient as if we were to simply modify the first query.  The approach creates more short-lived objects on the heap, but if this approach makes our code more clear, it is a good tradeoff.
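
The lazy evaluation mentioned above can be seen in a small sketch that is unrelated to Open XML. The TraceStep helper and the sample words are hypothetical; the helper only prints which query is handling which item. Nothing runs until the For Each loop starts, and each item then passes through both queries in a single pass.

```vb
Imports System
Imports System.Linq

Module LazyEvaluationSketch
    ' Hypothetical helper: prints which query touched which value.
    Function TraceStep(ByVal label As String, ByVal value As String) As String
        Console.WriteLine(label & ": " & value)
        Return value
    End Function

    Sub Main()
        Dim words() As String = {"alpha", "beta"}

        ' The second query uses the first query as its source.
        Dim firstQuery = words.Select(Function(s) TraceStep("first", s))
        Dim secondQuery = firstQuery.Select( _
            Function(s) TraceStep("second", s.ToUpper()))

        Console.WriteLine("queries defined, nothing evaluated yet")

        For Each item In secondQuery
            Console.WriteLine("result: " & item)
        Next
    End Sub
End Module
```

The "first" and "second" lines print interleaved, item by item, after the "queries defined" line. The source is walked once, not once per query.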

We can add the query to our program, as follows:

Dim defaultStyle As String = _
    CStr( _
            ( _
                From style in styleDoc.Root _
                    .Elements(w + "style") _
                Where( _
                    CStr(style.Attribute(w + "type")) = "paragraph" AndAlso _
                    CStr(style.Attribute(w + "default")) = "1") _
            ) _
            .First() _
            .Attribute(w + "styleId") _
        )
 
Dim paragraphs = _
    mainPartDoc.Root _
        .Element(w + "body") _
        .Descendants(w + "p") _
        .Select(Function(p) _
            New With { _
                .ParagraphNode = p, _
                .Style = GetParagraphStyle(p, defaultStyle) _
            } _
        )
 
Dim paragraphsWithText = _
    paragraphs.Select(Function(p) _
        New With { _
            .ParagraphNode = p.ParagraphNode, _
            .Style = p.Style, _
            .Text = p.ParagraphNode _
                .Elements(w + "r") _
                .Descendants(w + "t") _
                .StringConcatenate(Function(s) CStr(s)) _
        } _
    )
 

The above code uses the StringConcatenate aggregate operator that we showed in the aggregation topic.

Word's "Track Changes" feature records every edit to the text, and Open XML stores those edits in the document markup. The above code returns the correct text only if the document has no tracked changes: an inserted run is wrapped in a w:ins element, so it is not a direct child of w:p and the query misses it. However, it is easy to modify our code so that we retrieve the correct text for each paragraph regardless of whether there are tracked changes or not. To do this, we need to find all of the children of the w:p element that have the name w:r or w:ins, and ignore all other elements. Deleted runs sit inside w:del and keep their text in w:delText rather than w:t, so they stay out of the result. We can modify the last of the three queries above, as follows:

Dim paragraphsWithText = _
    paragraphs.Select(Function(p) _
        New With { _
            .ParagraphNode = p.ParagraphNode, _
            .Style = p.Style, _
            .Text = p.ParagraphNode _
                .Elements() _
                .Where(Function(z) z.Name = w + "r" OrElse z.Name = w + "ins") _
                .Descendants(w + "t") _
                .StringConcatenate(Function(s) CStr(s)) _
        } _
    )
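
To see the difference between the two versions of the query, here is a minimal sketch run against a hand-built paragraph. The paragraph is hypothetical and omits the attributes (author, date, id) that Word writes on w:ins and w:del; String.Join again stands in for StringConcatenate.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TrackedChangesSketch
    Sub Main()
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Hypothetical paragraph: "Hello" as a plain run, " there" as a
        ' tracked insertion, and " world" as a tracked deletion.
        Dim para As XElement = _
            New XElement(w + "p", _
                New XElement(w + "r", New XElement(w + "t", "Hello")), _
                New XElement(w + "ins", _
                    New XElement(w + "r", New XElement(w + "t", " there"))), _
                New XElement(w + "del", _
                    New XElement(w + "r", New XElement(w + "delText", " world"))))

        ' Runs only: the inserted text is missed.
        Dim runsOnly As String = _
            String.Join("", para.Elements(w + "r") _
                                .Descendants(w + "t") _
                                .Select(Function(e) CStr(e)) _
                                .ToArray())

        ' Runs and insertions: the inserted text is included.
        Dim withInsertions As String = _
            String.Join("", para.Elements() _
                                .Where(Function(z) z.Name = w + "r" OrElse z.Name = w + "ins") _
                                .Descendants(w + "t") _
                                .Select(Function(e) CStr(e)) _
                                .ToArray())

        Console.WriteLine(runsOnly)        ' Hello
        Console.WriteLine(withInsertions)  ' Hello there
    End Sub
End Module
```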
 

This approach introduces a small issue. In LINQ to XML, all names are atomized; that is, if two XName objects have the same namespace and the same local name, they are the same instance. It takes a little bit of work for LINQ to XML to atomize a name, whether through the implicit conversion from a string or through an expression such as w + "r", and the Where lambda above does that work for every child element it tests. In certain scenarios in LINQ to XML, atomization can be a significant percentage of processor time, and it is easy to minimize. A separate post describes atomization in more detail. So if we pre-atomize our names, our query will execute faster, at least in theory. In practice, I can't say that I've ever been in a situation where this would make a difference, but when processing huge files, it might. Still, as a general habit, when I have code like this, I pre-atomize my XName objects:

Dim r As XName = w + "r"
Dim ins As XName = w + "ins"
 
Dim paragraphsWithText = _
    paragraphs.Select(Function(p) _
        New With { _
            .ParagraphNode = p.ParagraphNode, _
            .Style = p.Style, _
            .Text = p.ParagraphNode _
                .Elements() _
                .Where(Function(z) z.Name = r OrElse z.Name = ins) _
                .Descendants(w + "t") _
                .StringConcatenate(Function(s) CStr(s)) _
        } _
    )
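
The atomization described above can be checked directly. This is a small sketch, not part of the program: it builds the same name twice and compares the two references.

```vb
Imports System
Imports System.Xml.Linq

Module AtomizationSketch
    Sub Main()
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Each w + "r" expression asks LINQ to XML for the atomized name,
        ' so both variables end up holding the one shared instance.
        Dim first As XName = w + "r"
        Dim second As XName = w + "r"

        Console.WriteLine(Object.ReferenceEquals(first, second))  ' True
    End Sub
End Module
```

Because both expressions yield the same object, holding it in a local variable such as r or ins changes nothing about the comparison; it only avoids repeating the lookup.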
 

The complete program now looks like this.

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq
Imports DocumentFormat.OpenXml.Packaging
 
Module Module1
    <System.Runtime.CompilerServices.Extension()> _
    Public Function GetPath(ByVal el As XElement) As String
        Return el _
            .AncestorsAndSelf _
            .InDocumentOrder _
            .Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
    End Function
 
    <System.Runtime.CompilerServices.Extension()> _
    Function StringConcatenate(Of T) _
            (ByVal source As IEnumerable(Of T), ByVal projectionFunc As Func(Of T, String)) _
            As String
        Return source.Aggregate(New StringBuilder, _
            Function(sb, i) sb.Append(projectionFunc(i)), _
            Function(sb) sb.ToString)
    End Function
 
    Public Function LoadXDocument(ByVal part As OpenXmlPart) _
            As XDocument
        Using streamReader As StreamReader = New StreamReader(part.GetStream())
            Using reader As XmlReader = XmlReader.Create(streamReader)
                Return XDocument.Load(reader)
            End Using
        End Using
    End Function
 
    Public Function GetParagraphStyle(ByVal para As XElement, _
                                      ByVal defaultStyle As String) As String
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        Dim paraStyle = CStr(para.Elements(w + "pPr") _
                       .Elements(w + "pStyle") _
                       .Attributes(w + "val") _
                       .FirstOrDefault())
        If (paraStyle Is Nothing) Then
            Return defaultStyle
        Else
            Return paraStyle
        End If
    End Function
 
    Sub Main()
        Dim w As XNamespace = _
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        Dim filename As String = "SampleDoc.docx"
        Using wordDoc As WordprocessingDocument = _
            WordprocessingDocument.Open(filename, False)
            Dim mainPart As MainDocumentPart = _
                wordDoc.MainDocumentPart
            Dim styleDefinitionPart As StyleDefinitionsPart = _
                mainPart.StyleDefinitionsPart
            Dim commentsPart As WordprocessingCommentsPart = _
                mainPart.WordprocessingCommentsPart
            Dim mainPartDoc As XDocument = LoadXDocument(mainPart)
            Dim styleDoc As XDocument = LoadXDocument(styleDefinitionPart)
            Dim commentsDoc As XDocument = LoadXDocument(commentsPart)
 
            Dim defaultStyle As String = _
                CStr( _
                        ( _
                            From style in styleDoc.Root _
                                .Elements(w + "style") _
                            Where( _
                                CStr(style.Attribute(w + "type")) = "paragraph" AndAlso _
                                CStr(style.Attribute(w + "default")) = "1") _
                        ) _
                        .First() _
                        .Attribute(w + "styleId") _
                    )
 
            Dim paragraphs = _
                mainPartDoc.Root _
                    .Element(w + "body") _
                    .Descendants(w + "p") _
                    .Select(Function(p) _
                        New With { _
                            .ParagraphNode = p, _
                            .Style = GetParagraphStyle(p, defaultStyle) _
                        } _
                    )
 
            Dim r As XName = w + "r"
            Dim ins As XName = w + "ins"
 
            Dim paragraphsWithText = _
                paragraphs.Select(Function(p) _
                    New With { _
                        .ParagraphNode = p.ParagraphNode, _
                        .Style = p.Style, _
                        .Text = p.ParagraphNode _
                            .Elements() _
                            .Where(Function(z) z.Name = r OrElse z.Name = ins) _
                            .Descendants(w + "t") _
                            .StringConcatenate(Function(s) CStr(s)) _
                    } _
                )
 
            For Each p In paragraphsWithText
                Console.WriteLine("{0} {1}", _
                    p.Style.PadRight(12), _
                    p.Text)
            Next
 
        End Using
    End Sub
End Module
 
