I did some volunteer work a few years ago that required processing lots of CSV files. So I solved the problems by writing a C# CSV reader, which I wanted to share here. The basic features here are:
I went ahead and put it on github at https://github.com/MikeStall/DataTable. And it’s available for download via Nuget (see “CsvTools”). It’s nice to share, and maybe somebody else will find this useful. But selfishly, I’ve used this library for quite a few tasks over the years and putting it on Github and Nuget also makes it easier for me to find for future projects.
There are the obvious disclaimers here that this was just a casual side project I did as a volunteer, and so use as is.
Step 1: Install “CsvTools” via Nuget:
When you right click on the project references node, just select “Add Library Package Reference”. That will bring up the nuget dialog which will search the online repository for packages. Search for “CsvTools” and then you can instantly install it. It’s built against CLR 4.0, but has no additional dependencies.
Example 1: Loading from a CSV file
Here’s a CSV at: c:\temp\test.csv
name, species Kermit, Frog Ms. Piggy, Pig Fozzy, Bear
To open and print the contents of the file:
using System; using DataAccess; // namespace that Csv reader lives in class Program { static void Main(string[] args) { DataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv"); // Query via the DataTable.Rows enumeration. foreach (Row row in dt.Rows) { Console.WriteLine(row["name"]); } } }
There are a bunch of extension methods hanging off “DataTable.New” to provide different ways of loading a table. ReadCsv will load everything into memory, which allows mutation operations (see below). But this also supports streaming operations via the methods with “lazy” in their name, such as ReadLazy().
Example 2: Creating a CSV from an IEnumerable<T> and saving back to a file
Here’s creating a table from an IEnumerable<T>, and then saving that back to a TextWriter (in this case, Console.Out).
var vals = from i in Enumerable.Range(1, 10) select new { N = i, NSquared = i * i }; DataTable dt = DataTable.New.FromEnumerable(vals); dt.SaveToStream(Console.Out);
Which produces this CSV:
N,NSquared 1,1 2,4 3,9 4,16 5,25 6,36 7,49 8,64 9,81 10,100
Example 3: Mutations
DataTable is actually an abstract base class. There are two primary derived classes:
Most of the builder functions that load in memory actually return the derived MutableDataTable object anyways. A MutableDataTable is conceptually a giant 2d string array stored in column major order. So adding new columns or rearranging columns is cheap. Adding rows is expensive. Here’s an example of some mutations:
static void Main(string[] args) { MutableDataTable dt = DataTable.New.ReadCsv(@"C:\temp\test.csv"); // Mutations dt.ApplyToColumn("name", originalValue => originalValue.ToUpper()); dt.RenameColumn(oldName:"species", newName: "kind"); int id = 0; dt.CreateColumn("id#", row => { id++; return id.ToString(); }); dt.GetRow(1)["kind"] = "Pig!!"; // update in place by row dt.Columns[0].Values[2] = "Fozzy!!"; // update by column // Print out new table dt.SaveToStream(Console.Out); }
Produces and prints this table:
name,kind,id# KERMIT,Frog,1 MS. PIGGY,Pig!!,2 Fozzy!!,Bear,3
There’s a builder function, DataTable.New.GetMutableCopy, which produces a mutable copy from an arbitrary DataTable.
Example 4: Analysis
I needed some basic analysis functions, like join, histogram, select duplicates, sample, and where. These sit as static methods in the Analyze class.
Here’s an example of creating a table with random numbers, and then printing the histogram:
static void Main(string[] args) { // Get a table of 1000 random numbers Random r = new Random(); DataTable dt = DataTable.New.FromEnumerable( from x in Enumerable.Range(1, 1000) select r.Next(1, 10)); Tuple<string,int>[] hist = Analyze.AsHistogram(dt, columnIdx: 0); // Convert the tuple[] to a table for easy printing DataTable histTable = DataTable.New.FromTuple(hist, columnName1: "value", columnName2: "frequency"); histTable.SaveToStream(Console.Out); }
Produces this result:
value,frequency 9,151 8,124 2,118 7,110 3,107 5,104 1,101 6,99 4,86