This blog is inactive. New blog: EricWhite.com/blog
There are two groups of paragraphs in our document that are styled as "Code". The first group contains the C# code that we want to test. The second group contains a single paragraph that is the output of the code in the first group. Next in the process of formulating our query, we want to retrieve each block of code as a separate group.
The problem is that the GroupBy extension method doesn't do what we want. It groups all items in the collection that share a key, regardless of whether they are separated by other items. It would join our two groups of code, which we want to keep separate.
For instance, if we amend the code to group the paragraphs, adding one more query to the bottom of our string of queries, as follows:
    Dim defaultStyle As String = _
        CStr( _
            ( _
                From style In styleDoc.Root _
                    .Elements(w + "style") _
                Where ( _
                    CStr(style.Attribute(w + "type")) = "paragraph" And _
                    CStr(style.Attribute(w + "default")) = "1") _
            ) _
            .First() _
            .Attribute(w + "styleId") _
        )

    Dim paragraphs = _
        mainPartDoc.Root _
        .Element(w + "body") _
        .Descendants(w + "p") _
        .Select(Function(p) _
            New With { _
                .ParagraphNode = p, _
                .Style = GetParagraphStyle(p, defaultStyle) _
            } _
        )

    Dim r As XName = w + "r"
    Dim ins As XName = w + "ins"

    Dim paragraphsWithText = _
        paragraphs.Select(Function(p) _
            New With { _
                .ParagraphNode = p.ParagraphNode, _
                .Style = p.Style, _
                .Text = p.ParagraphNode _
                    .Elements() _
                    .Where(Function(z) z.Name = r Or z.Name = ins) _
                    .Descendants(w + "t") _
                    .StringConcatenate(Function(s) CStr(s)) _
            } _
        )

    Dim groupedCodeParagraphs = _
        paragraphsWithText.GroupBy(Function(p) p.Style)

    For Each g In groupedCodeParagraphs
        Console.WriteLine("Group of paragraphs styled {0}", g.Key)
        For Each p In g
            Console.WriteLine("{0} {1}", _
                p.Style.PadRight(12), _
                p.Text)
        Next
        Console.WriteLine()
    Next
Then we see:
    Group of paragraphs styled Heading1
    Heading1     Parsing WordprocessingML with LINQ to XML

    Group of paragraphs styled Normal
    Normal       The following example prints to the console.
    Normal       This example produces the following output:

    Group of paragraphs styled Code
    Code         using System;
    Code
    Code         class Program {
    Code             public static void Main(string[] args) {
    Code                 Console.WriteLine("Hello World");
    Code             }
    Code         }
    Code
    Code         Hello World
This grouped the "Hello World" with the code, which is not what we want.
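The same behavior can be seen without a Word document. The following is a minimal sketch, assuming a made-up sequence of style names; the module name GroupByDemo and the helper DescribeGroups are hypothetical and exist only for this illustration. It shows that the standard GroupBy operator merges every item that shares a key, even when other items sit between them.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module GroupByDemo
    ' Hypothetical helper: builds one summary line per group that GroupBy produces.
    Public Function DescribeGroups(ByVal styles As IEnumerable(Of String)) As List(Of String)
        Return styles _
            .GroupBy(Function(s) s) _
            .Select(Function(g) g.Key & ": " & g.Count().ToString() & " paragraph(s)") _
            .ToList()
    End Function

    Sub Main()
        ' Made-up style sequence: two runs of "Code" separated by a "Normal" paragraph.
        Dim styles As String() = {"Heading1", "Normal", "Code", "Code", "Normal", "Code"}

        For Each summary In DescribeGroups(styles)
            Console.WriteLine(summary)
        Next
        ' Prints:
        '   Heading1: 1 paragraph(s)
        '   Normal: 2 paragraph(s)
        '   Code: 3 paragraph(s)
    End Sub
End Module
```

The two separate runs of "Code" end up in a single group of three, which is the same merging we saw with the real document.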
As it turns out, there isn't a standard query operator that does exactly what we want. We want an operator that groups only adjacent items with a common key. So let's write one. In addition to the GroupAdjacent extension method, we need a GroupOfAdjacent class that we can iterate through for each grouping. It only takes a couple dozen lines of code to implement this.
Unlike the C# version, the GroupAdjacent implementation for Visual Basic is not lazy. But this really doesn’t impact performance in any noticeable way, even for large documents.
Before this version of GroupAdjacent returns the first group, it iterates through the entire collection, creating a list of lists.
To use GroupAdjacent, we pass it a lambda that selects a key for each item. Whenever that key changes from one item to the next, the operator starts a new group. GroupAdjacent returns a sequence of groups, each of which contains a sequence of items of type TElement and exposes the shared key through its Key property.
Here is the listing:
    Imports System.IO
    Imports System.Xml
    Imports System.Text
    Imports DocumentFormat.OpenXml.Packaging

    ' One run of adjacent items that share the same key.
    Public Class GroupOfAdjacent(Of TElement, TKey)
        Implements IEnumerable(Of TElement)

        Private _key As TKey
        Private _groupList As List(Of TElement)

        Public Property GroupList() As List(Of TElement)
            Get
                Return _groupList
            End Get
            Set(ByVal value As List(Of TElement))
                _groupList = value
            End Set
        End Property

        Public ReadOnly Property Key() As TKey
            Get
                Return _key
            End Get
        End Property

        Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of TElement) _
            Implements System.Collections.Generic.IEnumerable(Of TElement).GetEnumerator
            Return _groupList.GetEnumerator
        End Function

        Public Function GetEnumerator1() As System.Collections.IEnumerator _
            Implements System.Collections.IEnumerable.GetEnumerator
            Return _groupList.GetEnumerator
        End Function

        Public Sub New(ByVal key As TKey)
            _key = key
            _groupList = New List(Of TElement)
        End Sub
    End Class

    Module Module1
        ' Groups adjacent items that share a key. Not lazy: the whole source
        ' is read and every group is built before the list is returned.
        <System.Runtime.CompilerServices.Extension()> _
        Public Function GroupAdjacent(Of TElement, TKey)(ByVal source As IEnumerable(Of TElement), _
            ByVal keySelector As Func(Of TElement, TKey)) As List(Of GroupOfAdjacent(Of TElement, TKey))
            ' The default comparer handles value-type keys and Nothing keys safely.
            Dim comparer As EqualityComparer(Of TKey) = EqualityComparer(Of TKey).Default
            Dim lastKey As TKey = Nothing
            Dim currentGroup As GroupOfAdjacent(Of TElement, TKey) = Nothing
            Dim allGroups As List(Of GroupOfAdjacent(Of TElement, TKey)) = New List(Of GroupOfAdjacent(Of TElement, TKey))()
            For Each item In source
                Dim thisKey As TKey = keySelector(item)
                ' The key changed: close the group that was being built.
                If currentGroup IsNot Nothing AndAlso Not comparer.Equals(thisKey, lastKey) Then
                    allGroups.Add(currentGroup)
                    currentGroup = Nothing
                End If
                ' Start a new group for the first item and after every key change.
                If currentGroup Is Nothing Then
                    currentGroup = New GroupOfAdjacent(Of TElement, TKey)(thisKey)
                End If
                currentGroup.GroupList.Add(item)
                lastKey = thisKey
            Next
            ' Add the final group, if the source was not empty.
            If currentGroup IsNot Nothing Then
                allGroups.Add(currentGroup)
            End If
            Return allGroups
        End Function

        <System.Runtime.CompilerServices.Extension()> _
        Public Function GetPath(ByVal el As XElement) As String
            Return el _
                .AncestorsAndSelf _
                .InDocumentOrder _
                .Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
        End Function

        <System.Runtime.CompilerServices.Extension()> _
        Function StringConcatenate(Of T) _
            (ByVal source As IEnumerable(Of T), ByVal projectionFunc As Func(Of T, String)) _
            As String
            Return source.Aggregate(New StringBuilder, _
                Function(sb, i) sb.Append(projectionFunc(i)), _
                Function(sb) sb.ToString)
        End Function

        Public Function LoadXDocument(ByVal part As OpenXmlPart) _
            As XDocument
            Using streamReader As StreamReader = New StreamReader(part.GetStream())
                Using xmlReader As XmlReader = xmlReader.Create(streamReader)
                    Return XDocument.Load(xmlReader)
                End Using
            End Using
        End Function

        ' Returns the paragraph's style ID, or the default style if none is set.
        Public Function GetParagraphStyle(ByVal para As XElement, _
            ByVal defaultStyle As String) As String
            Dim w As XNamespace = _
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            Dim paraStyle = CStr(para.Elements(w + "pPr") _
                .Elements(w + "pStyle") _
                .Attributes(w + "val") _
                .FirstOrDefault())
            If (paraStyle Is Nothing) Then
                Return defaultStyle
            Else
                Return paraStyle
            End If
        End Function

        Sub Main()
            Dim w As XNamespace = _
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            Dim filename As String = "SampleDoc.docx"
            Using wordDoc As WordprocessingDocument = _
                WordprocessingDocument.Open(filename, True)
                Dim mainPart As MainDocumentPart = _
                    wordDoc.MainDocumentPart
                Dim styleDefinitionPart As StyleDefinitionsPart = _
                    mainPart.StyleDefinitionsPart
                Dim commentsPart As WordprocessingCommentsPart = _
                    mainPart.WordprocessingCommentsPart
                Dim mainPartDoc As XDocument = LoadXDocument(mainPart)
                Dim styleDoc As XDocument = LoadXDocument(styleDefinitionPart)
                Dim commentsDoc As XDocument = LoadXDocument(commentsPart)

                Dim defaultStyle As String = _
                    CStr( _
                        ( _
                            From style In styleDoc.Root _
                                .Elements(w + "style") _
                            Where ( _
                                CStr(style.Attribute(w + "type")) = "paragraph" And _
                                CStr(style.Attribute(w + "default")) = "1") _
                        ) _
                        .First() _
                        .Attribute(w + "styleId") _
                    )

                Dim paragraphs = _
                    mainPartDoc.Root _
                    .Element(w + "body") _
                    .Descendants(w + "p") _
                    .Select(Function(p) _
                        New With { _
                            .ParagraphNode = p, _
                            .Style = GetParagraphStyle(p, defaultStyle) _
                        } _
                    )

                Dim r As XName = w + "r"
                Dim ins As XName = w + "ins"

                Dim paragraphsWithText = _
                    paragraphs.Select(Function(p) _
                        New With { _
                            .ParagraphNode = p.ParagraphNode, _
                            .Style = p.Style, _
                            .Text = p.ParagraphNode _
                                .Elements() _
                                .Where(Function(z) z.Name = r Or z.Name = ins) _
                                .Descendants(w + "t") _
                                .StringConcatenate(Function(s) CStr(s)) _
                        } _
                    )

                ' Group only adjacent paragraphs that share a style.
                Dim groupedCodeParagraphs = _
                    paragraphsWithText.GroupAdjacent(Function(p) p.Style)

                For Each g In groupedCodeParagraphs
                    Console.WriteLine("Group of paragraphs styled {0}", g.Key)
                    For Each p In g
                        Console.WriteLine("{0} {1}", _
                            p.Style.PadRight(12), _
                            p.Text)
                    Next
                    Console.WriteLine()
                Next
            End Using
        End Sub
    End Module

When run against the sample document, this prints:

    Group of paragraphs styled Heading1
    Heading1     Parsing WordprocessingML with LINQ to XML

    Group of paragraphs styled Normal
    Normal       The following example prints to the console.

    Group of paragraphs styled Code
    Code         using System;
    Code
    Code         class Program {
    Code             public static void Main(string[] args) {
    Code                 Console.WriteLine("Hello World");
    Code             }
    Code         }
    Code

    Group of paragraphs styled Normal
    Normal       This example produces the following output:

    Group of paragraphs styled Code
    Code         Hello World
This is what we want.
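For comparison with the non-lazy listing, here is a rough sketch of how a lazy variant might look. It assumes Visual Basic 11 (Visual Studio 2012) or later, where Iterator functions and the Yield statement are available. The name GroupAdjacentLazy is hypothetical, and to keep the sketch self-contained it returns KeyValuePair values rather than the GroupOfAdjacent class. Each group is handed to the caller as soon as its last item has been seen, instead of after the whole source has been read.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module LazyGroupAdjacentSketch
    ' Hypothetical lazy variant: yields each run of adjacent items as soon as it is complete.
    <System.Runtime.CompilerServices.Extension()> _
    Public Iterator Function GroupAdjacentLazy(Of TElement, TKey)( _
        ByVal source As IEnumerable(Of TElement), _
        ByVal keySelector As Func(Of TElement, TKey)) _
        As IEnumerable(Of KeyValuePair(Of TKey, List(Of TElement)))

        Dim comparer As EqualityComparer(Of TKey) = EqualityComparer(Of TKey).Default
        Dim currentKey As TKey = Nothing
        Dim currentItems As List(Of TElement) = Nothing

        For Each item In source
            Dim thisKey As TKey = keySelector(item)
            ' The key changed: hand the finished run to the caller.
            If currentItems IsNot Nothing AndAlso Not comparer.Equals(thisKey, currentKey) Then
                Yield New KeyValuePair(Of TKey, List(Of TElement))(currentKey, currentItems)
                currentItems = Nothing
            End If
            ' Start a new run for the first item and after every key change.
            If currentItems Is Nothing Then
                currentKey = thisKey
                currentItems = New List(Of TElement)()
            End If
            currentItems.Add(item)
        Next

        ' Yield the final run, if the source was not empty.
        If currentItems IsNot Nothing Then
            Yield New KeyValuePair(Of TKey, List(Of TElement))(currentKey, currentItems)
        End If
    End Function

    Sub Main()
        ' Made-up style sequence: two runs of "Code" separated by a "Normal" paragraph.
        Dim styles As String() = {"Normal", "Code", "Code", "Normal", "Code"}

        For Each g In styles.GroupAdjacentLazy(Function(s) s)
            Console.WriteLine("{0}: {1} paragraph(s)", g.Key, g.Value.Count)
        Next
        ' Prints:
        '   Normal: 1 paragraph(s)
        '   Code: 2 paragraph(s)
        '   Normal: 1 paragraph(s)
        '   Code: 1 paragraph(s)
    End Sub
End Module
```

Each run is still buffered in a list before it is yielded, so this is lazy only at the level of groups. For documents of ordinary size the difference from the non-lazy version is unlikely to matter.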