I’d like to demonstrate a technique for storing compressed rowsets in SQL Server using CLR integration.
This is really the story of several .NET and SQL Server technologies coming together. Some have been around for a while, some are new. If you’ve been around Microsoft .NET’s data programming stack for long you’ll remember the ADO.NET DataSet and DataTable. They provide an in-memory representation of tabular data, along with the ability to marshal data back and forth to a database. .NET 2.0 added efficient binary serialization of DataSets and DataTables; prior to this, DataSets were always serialized as XML, and the serialized size was quite large. .NET 2.0 also added the System.IO.Compression namespace, with the stream compressors GZipStream and DeflateStream. The compression performance of these was not terribly impressive at first, and in .NET 4.0 it was improved substantially. SQL Server 2005 added the APPLY operators to T-SQL, about which more later, and SQL Server 2008 added support for large custom CLR types. In SQL Server 2012 the version of the CLR hosted inside SQL Server was upgraded to .NET 4.0.
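To make the serialize-and-compress step concrete, here’s a minimal sketch (not the sample’s actual code) of how those pieces fit together in C#: opt a DataTable into the binary wire format, then run the serialized bytes through GZipStream to produce a single byte array suitable for a VARBINARY(MAX) column. The class and method names are illustrative.

```csharp
using System.Data;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization.Formatters.Binary;

// Illustrative sketch: binary-serialize a DataTable and compress the
// result into one byte[] that can be stored as a scalar BLOB.
public static class RowsetBlob
{
    public static byte[] Compress(DataTable table)
    {
        // Opt in to the compact binary format added in .NET 2.0;
        // the default is the much larger XML representation.
        table.RemotingFormat = SerializationFormat.Binary;

        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                new BinaryFormatter().Serialize(gzip, table);
            }
            return buffer.ToArray();
        }
    }

    public static DataTable Decompress(byte[] blob)
    {
        using (var buffer = new MemoryStream(blob))
        using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
        {
            return (DataTable)new BinaryFormatter().Deserialize(gzip);
        }
    }
}
```

Inside SQL Server the same round trip would live in CLR functions, with the byte array passed in and out as SqlBytes.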
So putting this all together, SQL Server 2012 has the ability to take an arbitrary query, store the results in memory, compress them, and store them in a single scalar BLOB. Why is this interesting? A couple of scenarios I’ve worked on lately are good candidates for this. One was an application that handled engineering models for various engineering and analysis tasks. Depending on the exact task, the kinds of input and output datasets would vary, and these datasets tended to be fairly big. The application didn’t need to report on this data, or even read and write individual rows; it just needed to store and retrieve it, so semi-structured storage is a good fit. The users are also distributed across the globe, so payload size matters, and XML, the go-to choice for storing semi-structured data, was perhaps not the best choice.
Another scenario involved what’s commonly called time series data, where for a given entity at a given time there’s a sequence of measurements, payments, or projections. Time series data is challenging to store in relational databases at scale, since you end up with a very large number of narrow rows. This works fine up to a point, a point which can be extended with things like SQL Server’s table and index compression, but it’s not the most efficient way to store this kind of data. That’s why most of the big commercial systems dealing with this kind of data use some alternate kind of storage (sometimes inside a traditional relational database) for the actual time series.
The data in these scenarios has a couple of things in common:
1) The detail data has a large number of rows, and may have a variable shape
2) The database engine only needs to address the data in bulk, not on a row-by-row basis
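This second point is where the APPLY operator mentioned earlier comes in: when you do occasionally need to crack a blob back open into rows, a CLR table-valued function plus CROSS APPLY can do it per row of the outer table. A hypothetical sketch, assuming a table Model(ModelId, Payload) with the compressed rowset in a VARBINARY(MAX) column and a CLR table-valued function dbo.DecompressRows (both names are mine, not the sample’s):

```sql
-- Expand one stored blob back into rows; DecompressRows is a
-- hypothetical CLR table-valued function over the compressed payload.
SELECT m.ModelId, r.*
FROM dbo.Model AS m
CROSS APPLY dbo.DecompressRows(m.Payload) AS r
WHERE m.ModelId = 42;
```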
So I wrote a little sample, which is available here: SQL Server Compressed Rowset Sample
The code and instructions are there. Check it out and let me know what you think.