I did some volunteer work a few years ago that required processing lots of CSV files. So I solved the problems by writing a C# CSV reader, which I wanted to share here. The basic features here are:

  1. be easy to use
  2. read and write CSV files (and support tab and “|” delimiters too)
  3. create CSV files around IEnumerable<T>, dictionaries, and other sources.
  4. Provide a “linq to CSV” experience
  5. provide both in-memory mutable tables and streaming over large data sources (thank you polymorphism!)
  6. provide basic analysis operations like histogram, join, find duplicates, etc. The operations I implemented were driven entirely by the goals I had for my volunteer work.
  7. Read from Excel
  8. Work with Azure. (This primarily means no foolish dependencies, and support TextReader/TextWriter instead of always hitting the file system)

I went ahead and put it on github  at https://github.com/MikeStall/DataTable. And it’s available for download via Nuget (see “CsvTools”).  It’s nice to share, and maybe somebody else will find this useful. But selfishly, I’ve used this library for quite a few tasks over the years and putting it on Github and Nuget also makes it easier for me to find for future projects.

There are the obvious disclaimers here that this was just a casual side project I did as a volunteer, and so use as is.

Step 1: Install “CsvTools” via Nuget:

When you right click on the project references node, just select “Add Library Package Reference”. That will bring up the nuget dialog which will search the online repository for packages. Search for “CsvTools” and then you can instantly install it. It’s built against CLR 4.0, but has no additional dependencies.

image

 

Example 1: Loading from a CSV file

Here’s a CSV at: c:\temp\test.csv

name, species
Kermit, Frog
Ms. Piggy, Pig
Fozzy, Bear

To open and print the contents of the file:

using System;
using DataAccess; // namespace that Csv reader lives in

class Program
{
    static void Main(string[] args)
    {
        DataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv");

        // Query via the DataTable.Rows enumeration.
        foreach (Row row in dt.Rows)
        {
            Console.WriteLine(row["name"]);
        }        
    }
}

There are a bunch of extension methods hanging off “DataTable.New” to provide different ways of loading a table. ReadCsv will load everything into memory, which allows mutation operations (see below).  But this also supports streaming operations via the methods with “lazy” in their name, such as ReadLazy().

Example 2: Creating a CSV from an IEnumerable<T> and saving back to a file

Here’s creating a table from an IEnumerable<T>, and then saving that back to a TextWriter (in this case, Console.Out).

var vals = from i in Enumerable.Range(1, 10) select new { N = i, NSquared = i * i };
DataTable dt = DataTable.New.FromEnumerable(vals);
dt.SaveToStream(Console.Out);  


Which produces this CSV:

N,NSquared
1,1
2,4
3,9
4,16
5,25
6,36
7,49
8,64
9,81
10,100

 

Example 3: Mutations

DataTable is actually an abstract base class. There are two primary derived classes:

  1. MutableDataTable,, which loads everything into memory, stores it in column major order, and provides mutation operations.
  2. streaming data table, which provides streaming access over a rows. This is obviously row major order, and doesn’t support mutation. The streaming classes are non-public derived classes of DataTable.

Most of the builder functions that load in memory actually return the derived MutableDataTable object anyways. A MutableDataTable is conceptually a giant 2d string array stored in column major order. So adding new columns or rearranging columns is cheap. Adding rows is expensive. Here’s an example of some mutations:

static void Main(string[] args)
{
    MutableDataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv");

    // Mutations
    dt.ApplyToColumn("name", originalValue => originalValue.ToUpper());
    dt.RenameColumn(oldName:"species", newName: "kind");
    
    
    int id = 0;
    dt.CreateColumn("id#", row => { id++; return id.ToString(); });

    dt.GetRow(1)["kind"] = "Pig!!"; // update in place by row
    dt.Columns[0].Values[2] = "Fozzy!!"; // update by column

    // Print out new table
    dt.SaveToStream(Console.Out);        
}

Produces and prints this table:

name,kind,id#
KERMIT,Frog,1
MS. PIGGY,Pig!!,2
Fozzy!!,Bear,3

 

There’s a builder function, DataTable.New.GetMutableCopy, which produces a mutable copy from an arbitrary DataTable.

Example 4: Analysis

I needed some basic analysis functions, like join, histogram, select duplicates, sample, and where. These sit as static methods in the Analyze class.

Here’s an example of creating a table with random numbers, and then printing the histogram:

static void Main(string[] args)
{   
    // Get a table of 1000 random numbers
    Random r = new Random();
    DataTable dt = DataTable.New.FromEnumerable(
        from x in Enumerable.Range(1, 1000) 
        select r.Next(1, 10));

    Tuple<string,int>[] hist = Analyze.AsHistogram(dt, columnIdx: 0);
    
    // Convert the tuple[] to a table for easy printing
    DataTable histTable = DataTable.New.FromTuple(hist, 
        columnName1: "value",
        columnName2: "frequency");
    histTable.SaveToStream(Console.Out);
}

Produces this result:

value,frequency
9,151
8,124
2,118
7,110
3,107
5,104
1,101
6,99
4,86