November, 2008

  • Eric White's Blog

    Using LINQ to Query Excel Tables

    • 33 Comments

    Excel has a very cool feature where you can declare that a range of cells is a table.  It is a feature that allows you to use Excel very much like a database.  You can add new rows as necessary, sort the table by columns, do some simple filtering, calculate the sum of columns, and more.  Each table has a unique table name, and each column has a column name.  Because these tables are stored in Open XML documents, we can implement some simple extension methods and some classes so that we can query these tables using LINQ in a manner that is similar to querying a SQL database.  This post presents a bit of code to do this.  The code and sample documents are attached to this post.

    This blog is inactive.
    New blog: EricWhite.com/blog

    Blog TOC
    (Updated July 21, 2010 - Fixed a bug where the code would return the wrong value for cells in the table if the columns had been moved.)

    Note: this code is presented as an example - a proof-of-concept.  This code could be further optimized, so that it performs better (although it performs quite well as is).  And it may be interesting in the future to modify the code to use a strongly-typed approach – as the code is currently implemented, if you misspell a table or column name, the code throws an exception.  However, this code is useful as is for doing ad-hoc queries of Excel tables.  I certainly will be using it! :-)

    This code uses the Open XML SDK, either V1, or the CTP of V2.  You can download V1 of the SDK here.  You can download CTP1 of V2 of the SDK here.

    Thanks to Brian Jones who suggested this project.

    Following is a screen clipping of an Excel spreadsheet that contains a table:

    You can see the four columns of this table: Item, Qty, Price, and Extension.  In addition, in the Design tab of the ribbon, in the far left box, you can see that this table has a table name of “Inventory”.  Using the code presented in this post, you can query this table as follows:

    var query =
        from i in spreadsheet.Table("Inventory").TableRows()
        where (int)i["Qty"] > 2
        select i;

    foreach (var r in query)
    {
        Console.WriteLine(r["Item"]);
        Console.WriteLine(r["Qty"]);
        Console.WriteLine(r["Price"]);
        Console.WriteLine(r["Extension"]);
        Console.WriteLine();
    }

    When you run this code, it produces:

    Book
    44
    2
    88

    Phone
    4
    10
    40

    As you can see from the above code, to access a particular column from a table row, you can use a default indexed property, passing the name of the column:

    Console.WriteLine(r["Item"]);
    Console.WriteLine(r["Qty"]);
    Console.WriteLine(r["Price"]);
    Console.WriteLine(r["Extension"]);

    This allows us to write code that is easy to read.

    The table class (returned by the Table method) has a TableColumns method that iterates the columns in the table:

    // list all of the columns in the Inventory table
    Console.WriteLine("Table: Inventory");
    foreach (var c in spreadsheet.Table("Inventory").TableColumns())
        Console.WriteLine("  {0}", c.Name);

    When you run this code, you see:

    Table: Inventory
      Item
      Qty
      Price
      Extension

    The LtxOpenXml Namespace

    Some time ago, I wrote some code that enabled querying Open XML spreadsheets using LINQ to XML, presented in the blog post ‘Open XML SDK and LINQ to XML’.  I’ve added the code to query tables to the code presented in that post.  The extension methods that enable querying tables make use of that code.  The enhanced LtxOpenXml namespace now contains code for:

    • Querying word processing documents
    • Querying spreadsheets
    • Querying tables contained in spreadsheets

    The code for querying word processing documents and spreadsheets is unmodified.  Refer to the above mentioned blog post for details on using those extension methods.

    The code that enables querying of spreadsheet tables is, of course, written in the pure functional style.  No state is maintained, and all methods to query the document are lazy.

    If you have questions about how to write functional code (like the code that implements the extension methods and classes associated with this post), go through this Functional Programming Tutorial.

    I’ve provided a summary of the types and extension methods included in the LtxOpenXml namespace at the end of this post.

    Use of Data Types

    Here’s another example of a table that contains a few more columns with more data types:

    Each row returned by the TableRows method is a collection of TableCell objects.  I’ve defined explicit conversions between TableCell and some of the most common .NET types, so that you can simply cast a TableCell to your desired type.  Here’s a query to list all vehicles in the table:

    // list all vehicles
    var q = from c in spreadsheet.Table("Vehicles").TableRows()
            select new VehicleRecord()
            {
                Vehicle = (string)c["Vehicle"],
                Color = (string)c["Color"],
                Year = (int)c["Year"],
                HorsePower = (int)c["HorsePower"],
                Cost = (decimal)c["Cost"],
                AcquisitionDate = (DateTime)c["AcquisitionDate"],
                ExecutiveUseOnly = (bool)c["ExecutiveUseOnly"]
            };

    Console.WriteLine("List of all vehicles");
    PrintVehicles(q);
    Console.WriteLine();

    I’ve written a PrintVehicles method:

    public static void PrintVehicles(IEnumerable<VehicleRecord> list)
    {
        int[] tabs = new[] { 12, 10, 6, 6, 10, 14, 10 };
        foreach (var z in list)
            Console.WriteLine("{0}{1}{2}{3}{4}{5}{6}",
                z.Vehicle.PadRight(tabs[0]),
                z.Color.PadRight(tabs[1]),
                z.Year.ToString().PadRight(tabs[2]),
                z.HorsePower.ToString().PadRight(tabs[3]),
                z.Cost.ToString().PadRight(tabs[4]),
                ((DateTime)z.AcquisitionDate).ToShortDateString()
                    .PadRight(tabs[5]),
                ((bool)z.ExecutiveUseOnly).ToString()
                    .PadRight(tabs[6]));
    }

    When you run the above query, you see:

    List of all vehicles
    Pickup      White     2002  165   23000     2/22/2002     False
    Pickup      Red       2004  185   32000     10/21/2004    False
    Sports Car  Red       2003  165   23000     1/1/2004      True
    Sedan       Blue      2005  200   21000     2/25/2005     False
    Limo        Black     2008  440   72000     4/1/2008      True

    You can query for all executive vehicles, like this:

    // list all executive vehicles
    q = from c in spreadsheet.Table("Vehicles").TableRows()
            where (bool)c["ExecutiveUseOnly"] == true
            select new VehicleRecord()
            {
                Vehicle = (string)c["Vehicle"],
                Color = (string)c["Color"],
                Year = (int)c["Year"],
                HorsePower = (int)c["HorsePower"],
                Cost = (decimal)c["Cost"],
                AcquisitionDate = (DateTime)c["AcquisitionDate"],
                ExecutiveUseOnly = (bool)c["ExecutiveUseOnly"]
            };

    You can write queries that select on data types such as DateTime:

    // list all vehicles acquired after 2004
    q = from c in spreadsheet.Table("Vehicles").TableRows()
        where (DateTime)c["AcquisitionDate"] >= new DateTime(2004, 1, 1)
        select new VehicleRecord()
        {
            Vehicle = (string)c["Vehicle"],
            Color = (string)c["Color"],
            Year = (int)c["Year"],
            HorsePower = (int)c["HorsePower"],
            Cost = (decimal)c["Cost"],
            AcquisitionDate = (DateTime)c["AcquisitionDate"],
            ExecutiveUseOnly = (bool)c["ExecutiveUseOnly"]
        };

    And of course, you can use all of the grouping, ordering, and filtering capabilities of LINQ queries:

    // vehicles grouped by user
    var groups = from v in spreadsheet.Table("Vehicles").TableRows()
                 group v by v["ExecutiveUseOnly"];

    foreach (var g in groups)
    {
        Console.WriteLine("Executive Use: {0}", (bool)g.Key);
        foreach (var v in g)
            Console.WriteLine("  Vehicle:{0}  Year:{1}",
                v["Vehicle"], v["Year"]);
        Console.WriteLine();
    }

    I’ve imported the Customers and Orders from the Northwind database into a spreadsheet, where the Customers table is in one sheet, and the Orders table is in another sheet within the worksheet.  Here is the Customers table:

    And here is the Orders table:

    We can now write a query that joins the customers and orders tables:

    using (SpreadsheetDocument spreadsheet =
        SpreadsheetDocument.Open(filename, false))
    {
        // list all of the columns in the Customer table
        Console.WriteLine("Table: Customer");
        foreach (var c in spreadsheet.Table("Customer").TableColumns())
            Console.WriteLine("  {0}", c.Name);
        Console.WriteLine();

        // list all of the columns in the Order table
        Console.WriteLine("Table: Order");
        foreach (var o in spreadsheet.Table("Order").TableColumns())
            Console.WriteLine("  {0}", o.Name);
        Console.WriteLine();

        // query for all customers with city == London,
        // then select all orders for that customer
        var q = from c in spreadsheet.Table("Customer").TableRows()
                where (string)c["City"] == "London"
                select new
                {
                    CustomerID = c["CustomerID"],
                    CompanyName = c["CompanyName"],
                    ContactName = c["ContactName"],
                    Orders = from o in spreadsheet.Table("Order").TableRows()
                             where (string)o["CustomerID"] ==
                                   (string)c["CustomerID"]
                             select new
                                 {
                                     CustomerID = o["CustomerID"],
                                     OrderID = o["OrderID"]
                                 }
                };

        // print the results of the query
        int[] tabs = new[] { 20, 25, 30 };
        Console.WriteLine("{0}{1}{2}",
            "CustomerID".PadRight(tabs[0]),
            "CompanyName".PadRight(tabs[1]),
            "ContactName".PadRight(tabs[2]));
        Console.WriteLine("{0} {1} {2} ", new string('-', tabs[0] - 1),
            new string('-', tabs[1] - 1), new string('-', tabs[2] - 1));
        foreach (var v in q)
        {
            Console.WriteLine("{0}{1}{2}",
                v.CustomerID.Value.PadRight(tabs[0]),
                v.CompanyName.Value.PadRight(tabs[1]),
                v.ContactName.Value.PadRight(tabs[2]));
            foreach (var v2 in v.Orders)
                Console.WriteLine("  CustomerID:{0}  OrderID:{1}",
                    v2.CustomerID, v2.OrderID);
            Console.WriteLine();
        }
    }

    This code produces the following output:

    Table: Customer
      CustomerID
      CompanyName
      ContactName
      ContactTitle
      Address
      City
      Region
      PostalCode
      Country
      Phone
      Fax

    Table: Order
      OrderID
      CustomerID
      EmployeeID
      OrderDate
      RequiredDate
      ShipVia
      Freight
      ShipName
      ShipAddress
      ShipCity
      ShipRegion
      ShipPostalCode
      ShipCountry

    CustomerID          CompanyName              ContactName
    ------------------- ------------------------ -----------------------------
    AROUT               Around the Horn          Thomas Hardy
      CustomerID:AROUT  OrderID:10355
      CustomerID:AROUT  OrderID:10383
      CustomerID:AROUT  OrderID:10453
      CustomerID:AROUT  OrderID:10558
      CustomerID:AROUT  OrderID:10707
      CustomerID:AROUT  OrderID:10741
      CustomerID:AROUT  OrderID:10743
      CustomerID:AROUT  OrderID:10768
      CustomerID:AROUT  OrderID:10793
      CustomerID:AROUT  OrderID:10864
      CustomerID:AROUT  OrderID:10920
      CustomerID:AROUT  OrderID:10953
      CustomerID:AROUT  OrderID:11016

    BSBEV               B's Beverages            Victoria Ashworth
      CustomerID:BSBEV  OrderID:10289
      CustomerID:BSBEV  OrderID:10471
      CustomerID:BSBEV  OrderID:10484
      CustomerID:BSBEV  OrderID:10538
      CustomerID:BSBEV  OrderID:10539
      CustomerID:BSBEV  OrderID:10578
      CustomerID:BSBEV  OrderID:10599
      CustomerID:BSBEV  OrderID:10943
      CustomerID:BSBEV  OrderID:10947
      CustomerID:BSBEV  OrderID:11023

    CONSH               Consolidated Holdings    Elizabeth Brown
      CustomerID:CONSH  OrderID:10435
      CustomerID:CONSH  OrderID:10462
      CustomerID:CONSH  OrderID:10848

    EASTC               Eastern Connection       Ann Devon
      CustomerID:EASTC  OrderID:10364
      CustomerID:EASTC  OrderID:10400
      CustomerID:EASTC  OrderID:10532
      CustomerID:EASTC  OrderID:10726
      CustomerID:EASTC  OrderID:10987
      CustomerID:EASTC  OrderID:11024
      CustomerID:EASTC  OrderID:11047
      CustomerID:EASTC  OrderID:11056

    NORTS               North/South              Simon Crowther
      CustomerID:NORTS  OrderID:10517
      CustomerID:NORTS  OrderID:10752
      CustomerID:NORTS  OrderID:11057

    SEVES               Seven Seas Imports       Hari Kumar
      CustomerID:SEVES  OrderID:10359
      CustomerID:SEVES  OrderID:10377
      CustomerID:SEVES  OrderID:10388
      CustomerID:SEVES  OrderID:10472
      CustomerID:SEVES  OrderID:10523
      CustomerID:SEVES  OrderID:10547
      CustomerID:SEVES  OrderID:10800
      CustomerID:SEVES  OrderID:10804
      CustomerID:SEVES  OrderID:10869

    Summary of the LtxOpenXml Namespace

    This section summarizes the LtxOpenXml extension methods and types that make it easy to work with Open XML SpreadsheetML tables.

    For details on the extension methods and types for word processing documents and spreadsheets (other than Tables within spreadsheets), see the post, Open XML SDK and LINQ to XML.

    Tables Extension Method

    This method returns a collection of all tables in the spreadsheet.  Its signature:

    public static IEnumerable<Table> Tables(this SpreadsheetDocument spreadsheet)

    Table Extension Method

    This method returns the Table object with the specified table name.  Its signature:

    public static Table Table(this SpreadsheetDocument spreadsheet,
        string tableName)

    Table Class

    This method represents an Excel Table.  Its definition:

    public class Table
    {
        public int Id { get; set; }
        public string TableName { get; set; }
        public string DisplayName { get; set; }
        public string Ref { get; set; }
        public int? HeaderRowCount { get; set; }
        public int? TotalsRowCount { get; set; }
        public string TableType { get; set; }
        public TableDefinitionPart TableDefinitionPart { get; set; }
        public WorksheetPart Parent { get; set; }
        public Table(WorksheetPart parent) { Parent = parent; }
        public IEnumerable<TableColumn> TableColumns()
        {
            ...
        }
        public IEnumerable<TableRow> TableRows()
        {
            ...
        }
    }

    This class contains a number of properties about the table.  In addition, it contains two methods, TableColumns, which returns a collection of TableColumn objects (the columns of the table), and TableRows, which returns a collection of TableRow objects (the rows of the table).

    TableColumn Class

    This class represents a column of a table.  Its definition:

    public class TableColumn
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public int? FormatId { get; set; }  // dataDxfId
        public int? QueryTableFieldId { get; set; }
        public string UniqueName { get; set; }
        public Table Parent { get; set; }
        public TableColumn(Table parent) { Parent = parent; }
    }

    The most important property of this class is the Name property.

    TableRow Class

    This class represents a row of a table.  Its definition:

    public class TableRow
    {
        public Row Row { get; set; }
        public Table Parent { get; set; }
        public TableRow(Table parent) { Parent = parent; }
        public TableCell this[string columnName]
        {
            get
            {
                ...
            }
        }
    }

    The most important feature of this class is the default indexed property that takes a column name and returns a TableCell object.  This is what allows us to write code like this:

    Console.WriteLine(r["Item"]);
    Console.WriteLine(r["Qty"]);
    Console.WriteLine(r["Price"]);
    Console.WriteLine(r["Extension"]);

    TableCell Class

    This class represents a cell of a row of a table.  It implements IEquatable<T> so that you can do a value compare of two cells.  It also implements a number of explicit conversions to other data types so that it’s easy to deal with columns of various types.  Its definition:

    public class TableCell : IEquatable<TableCell>
    {
        public string Value { get; set; }
        public TableCell(string v)
        {
            Value = v;
        }
        public override string ToString()
        {
            return Value;
        }
        public override bool Equals(object obj)
        {
            return this.Value == ((TableCell)obj).Value;
        }
        bool IEquatable<TableCell>.Equals(TableCell other)
        {
            return this.Value == other.Value;
        }
        public override int GetHashCode()
        {
            return this.Value.GetHashCode();
        }
        public static bool operator ==(TableCell left, TableCell right)
        {
            if ((object)left != (object)right) return false;
            return left.Value == right.Value;
        }
        public static bool operator !=(TableCell left, TableCell right)
        {
            if ((object)left != (object)right) return false;
            return left.Value != right.Value;
        }
        public static explicit operator string(TableCell cell)
        {
            if (cell == null) return null;
            return cell.Value;
        }
        public static explicit operator bool(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return cell.Value == "1";
        }
        public static explicit operator bool?(TableCell cell)
        {
            if (cell == null) return null;
            return cell.Value == "1";
        }
        public static explicit operator int(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return Int32.Parse(cell.Value);
        }
        public static explicit operator int?(TableCell cell)
        {
            if (cell == null) return null;
            return Int32.Parse(cell.Value);
        }
        public static explicit operator uint(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return UInt32.Parse(cell.Value);
        }
        public static explicit operator uint?(TableCell cell)
        {
            if (cell == null) return null;
            return UInt32.Parse(cell.Value);
        }
        public static explicit operator long(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return Int64.Parse(cell.Value);
        }
        public static explicit operator long?(TableCell cell)
        {
            if (cell == null) return null;
            return Int64.Parse(cell.Value);
        }
        public static explicit operator ulong(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return UInt64.Parse(cell.Value);
        }
        public static explicit operator ulong?(TableCell cell)
        {
            if (cell == null) return null;
            return UInt64.Parse(cell.Value);
        }
        public static explicit operator float(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return Single.Parse(cell.Value);
        }
        public static explicit operator float?(TableCell cell)
        {
            if (cell == null) return null;
            return Single.Parse(cell.Value);
        }
        public static explicit operator double(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return Double.Parse(cell.Value);
        }
        public static explicit operator double?(TableCell cell)
        {
            if (cell == null) return null;
            return Double.Parse(cell.Value);
        }
        public static explicit operator decimal(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return Decimal.Parse(cell.Value);
        }
        public static explicit operator decimal?(TableCell cell)
        {
            if (cell == null) return null;
            return Decimal.Parse(cell.Value);
        }
        public static implicit operator DateTime(TableCell cell)
        {
            if (cell == null) throw new ArgumentNullException("TableCell");
            return new DateTime(1900, 1, 1).AddDays(Int32.Parse(cell.Value) - 2);
        }
        public static implicit operator DateTime?(TableCell cell)
        {
            if (cell == null) return null;
            return new DateTime(1900, 1, 1).AddDays(Int32.Parse(cell.Value) - 2);
        }
    }

Page 1 of 10 (10 items) 12345»
Page 1 of 1 (10 items)