Welcome to MSDN Blogs Sign in | Join | Help
How to Create Hierarchy from Flat Data using LINQ

This is a fun, geeky post for those interested in functional programming.

Sometimes you have flat data where there is hierarchy expressed in the data, but the form of the data is flat.  You may need to transform this data into a hierarchy, such as an XML tree.

For instance, you may have the following data:

Heading[] headings = new[] {

    new Heading { Level = 1, Text = "Intro" },

    new Heading { Level = 2, Text = "Hello" },

    new Heading { Level = 2, Text = "Summary" },

    new Heading { Level = 1, Text = "Technical Info" },

    new Heading { Level = 2, Text = "Class API" },

    new Heading { Level = 3, Text = "Class1" },

    new Heading { Level = 3, Text = "Class2" },

    new Heading { Level = 2, Text = "Interoperability" },

    new Heading { Level = 3, Text = "Scenario 1" },

    new Heading { Level = 3, Text = "Scenario 2" },

    new Heading { Level = 1, Text = "In Conclusion" }

};

And you want to transform this data into the following XML:

<Root>

  <Heading1 Name="Intro">

    <Heading2 Name="Hello" />

    <Heading2 Name="Summary" />

  </Heading1>

  <Heading1 Name="Technical Info">

    <Heading2 Name="Class API">

      <Heading3 Name="Class1" />

      <Heading3 Name="Class2" />

    </Heading2>

    <Heading2 Name="Interoperability">

      <Heading3 Name="Scenario 1" />

      <Heading3 Name="Scenario 2" />

    </Heading2>

  </Heading1>

  <Heading1 Name="In Conclusion" />

</Root>

This is easy to do – you create a recursive function that returns the headings for the next level down. This function can use the SkipWhile and TakeWhile extension methods to return the next level down headings. This is certainly not the most efficient way to do it, but I was interested in coding this as fast as possible – I’m only going to run this once, and don’t care very much how long it takes.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Xml.Linq;

 

namespace ConsoleApplication1

{

    class Heading

    {

        public int Level { get; set; }

        public string Text { get; set; }

    }

 

    class Program

    {

        public static IEnumerable<XElement> GetChildrenHeadings(

            IEnumerable<Heading> headingList,

            Heading parent)

        {

            return

                headingList

                    .SkipWhile(h => h != parent)

                    .Skip(1)

                    .TakeWhile(h => h.Level > parent.Level)

                    .Where(h => h.Level == parent.Level + 1)

                    .Select(h =>

                        new XElement("Heading" + h.Level,

                            new XAttribute("Name", h.Text),

                            GetChildrenHeadings(headingList, h)

                        )

                    );

        }

 

        public static void Main(string[] args)

        {

            Heading[] headings = new[] {

                new Heading { Level = 1, Text = "Intro" },

                new Heading { Level = 2, Text = "Hello" },

                new Heading { Level = 2, Text = "Summary" },

                new Heading { Level = 1, Text = "Technical Info" },

                new Heading { Level = 2, Text = "Class API" },

                new Heading { Level = 3, Text = "Class1" },

                new Heading { Level = 3, Text = "Class2" },

                new Heading { Level = 2, Text = "Interoperability" },

                new Heading { Level = 3, Text = "Scenario 1" },

                new Heading { Level = 3, Text = "Scenario 2" },

                new Heading { Level = 1, Text = "In Conclusion" }

            };

 

            var toc = new XElement("Root",

                    headings

                        .Where(h => h.Level == 1)

                        .Select

                        (

                            h => new XElement("Heading" + h.Level,

                                new XAttribute("Name", h.Text),

                                GetChildrenHeadings(headings, h)

                            )

                        )

                );

 

            Console.WriteLine(toc);

        }

    }

}

Today, I had the need to take the Ecma Open XML spec part 4 and generate a TOC in XML.  I used the code from this blog post to return a collection of the paragraphs, and the technique shown above to generate the TOC.  Loading the doc takes 7 seconds on my machine, generating the TOC in this manner is less than a second, so even though it is not the most efficient, it is more than efficient enough for my purposes.

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Xml.Linq;

using Microsoft.Examples.LtxOpenXml;

using DocumentFormat.OpenXml.Packaging;

 

class Heading

{

    public string StyleName { get; set; }

    public string Text { get; set; }

    public int HeadingLevel { get; set; }

}

 

class Program

{

    public static IEnumerable<XElement> GetChildrenHeadings(

        IEnumerable<Heading> headingList, Heading parent)

    {

        return

            headingList

                .SkipWhile(h => h != parent)

                .Skip(1)

                .TakeWhile(h => h.HeadingLevel > parent.HeadingLevel)

                .Where(h => h.HeadingLevel == parent.HeadingLevel + 1)

                .Select(h =>

                    new XElement(h.StyleName,

                        new XAttribute("Name", h.Text),

                        GetChildrenHeadings(headingList, h)

                    )

                );

    }

 

    public static void Main(string[] args)

    {

        using (WordprocessingDocument doc =

            WordprocessingDocument.Open(

                "Office Open XML Part 4 - Markup Language Reference.docx",

                false))

        {

            // materialize to list for efficiency

            List<Heading> headingList =

                doc.MainDocumentPart

                    .Paragraphs()

                    .Where(p => p.StyleName.StartsWith("Heading"))

                    .Select(p =>

                        {

                            int headingLevel;

                            Int32.TryParse(p.StyleName.Substring(7),

                                out headingLevel);

                            return new Heading

                            {

                                StyleName = p.StyleName,

                                Text = p.Text,

                                HeadingLevel = headingLevel

                            };

                        }

                    )

                    .ToList();

 

            var toc = new XElement("Root",

                    headingList

                        .Where(p => p.HeadingLevel == 1)

                        .Select

                        (

                            p => new XElement(p.StyleName,

                                new XAttribute("Name", p.Text),

                                GetChildrenHeadings(headingList, p)

                            )

                        )

                );

 

            toc.Save("toc.xml");

            Console.WriteLine(headingList.Count());

            Console.WriteLine(toc.Descendants().Count());

        }

    }

}

When run, it produces:

<?xml version="1.0" encoding="utf-8"?>

<Root>

  <Heading1 Name="Part Overview">

    <Heading2 Name="WordprocessingML Part Summary" />

    <Heading2 Name="SpreadsheetML Part Summary" />

    <Heading2 Name="PresentationML Part Summary" />

    <Heading2 Name="DrawingML Part Summary" />

    <Heading2 Name="Shared Part Summary" />

  </Heading1>

  <Heading1 Name="WordprocessingML Reference Material">

    <Heading2 Name="Table of Contents" />

    <Heading2 Name="Main Document Story">

      <Heading3 Name="background (Document Background)" />

      <Heading3 Name="body (Document Body)" />

      <Heading3 Name="document (Document)" />

    </Heading2>

    <Heading2 Name="Paragraphs and Rich Formatting">

      <Heading3 Name="Paragraphs">

        <Heading4 Name="adjustRightInd (Automatically Adjust Right Indent When Using Document Grid)" />

        <Heading4 Name="autoSpaceDE (Automatically Adjust Spacing of Latin and East Asian Text)" />

        <Heading4 Name="autoSpaceDN (Automatically Adjust Spacing of East Asian Text and Numbers)" />

        <Heading4 Name="bar (Paragraph Border Between Facing Pages)" />

        ...

Code is attached.

Posted: Thursday, July 10, 2008 10:17 PM by EricWhite
Attachment(s): CreateHierarchyFromFlat.zip

Comments

Doug Mahugh said:

Eric White on Linq and Open XML. Eric White has been posting some great code samples lately, including

# July 15, 2008 10:09 PM

Julien Chable said:

Ce post n’a pas voulu partir ni jeudi ni vendredi, le voici donc ! Des mise à jours à n’en plus finir

# August 18, 2008 3:55 AM

Julien Chable said:

Vous générer des documents Open XML, Office ne veut pas les ouvrir ? Voici quelques moyens d’en découdre

# August 18, 2008 7:00 AM
Leave a Comment

(required) 

(required) 

(optional)

(required) 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Page view tracker