In SQL Server 2008, we implemented a set of features, including Sparse Column, Column Set + Wide Table, Filtered Index and HierarchyID, to support Semi-Structure data management.   Today, I will talk about how Sparse Column will save disk space. 

When a table has no sparse column defined, and no data compression defined, the null value storage for that table depends on the data type of the column.  For fix size data type, such as datetime, int, char, the Null value will be the size of the target data type.  For example, it will take 8 bytes to store a Null value for a BigInt column.   For variable length column, in addition to storing the data value, we also need to store the starting offset of the value.  Thus, we have a variable-length column offset array which takes two bytes per variable length columns in each row. As a result, we can use 0 byte to represent the NULL value itself, but we still need 2 byte for the offset value.   Paul Randal ‘s blog “Poking about with DBCC PAGE (Part 1 of ?)” can give your guy more details about this. In addition, we also have a NULL Bitmap array, the size is the number of column/8.  Each bit in the array represent whether a column is NULL or not.  Please see here for more information about why we need the Bitmap array.

In SQL Server 2008, we introduce the Sparse Column feature.  The basic idea is that we want to optimize the storage of NULL value.  Suppose in a table with 200 columns while 160 columns are defined as sparse column, the values of these 160 columns will store together into a complex structure, which we called it as “Sparse Vector”.  The structure of the “Sparse Vector” will looks like:

<4 byte header> <a list of columns which values are not null> <a list of column offset> <real data>

For each not null value in a row, there will be 2 bytes for the id of the column in the “column list”, and 2 bytes for the “offset” of that column in the real data. In one word, not null value for a sparse column will take 4 +data size to store, and null column will take 0 byte to store. If all sparse columns in a row are NULL, the “Sparse Vector” will be empty, and it will take 0 bytes.  The NULL Bitmap array will only contain non-sparse columns, i.e., the Null Bitmap will take 5 bytes instead of 25 bytes for that table.

In SQL Server, the Data Compression feature can also compress the null values.  The Row compression will change all fix-length columns in a table into variable length columns, so that the Null value will not take the size of the data type.  Also, it does not have the 2 bytes offset for the variable length columns, instead, it has 4 bits per column (no matter the value is null or not null).  Data Compression is table based, sparse column is column based.  Unless a column is defined as sparse, it will not be compressed using sparse technique.  In the following table, I compared the data compression side by side

Scenario

Sparse Column

Data Compression

Compressing not null values

No

Yes

Compressing null values

Only the column is defined as “Sparse”

Yes

Compressing indexes

No

Yes

Length of Null Bitmap

No. of non-sparse column/8

0

Length of Column Offset  array

No. of non-sparse variable length column *2

0

Additional Overhead

4 byte (sparse vector header) + 2 bytes per not null sparse column

No of columns /2 bytes

CPU overhead

Light, accessing non sparse column has no impact, accessing not null sparse column has slight more CPU overhead

Heavy, not only need decompress the value, also need to calculate the offset for a given column.

 Base on the above description, we can see that Sparse Column is very useful when the table has large amount of Null values in term of space saving.   My next blog will provide several examples of using sparse column for performance tuning.

Another fact I like to point out is that you can still achieve null compression in SQL Server 2005.   If your table have many null values in variable length columns, you can re-arrange your column order to make these columns to the end of table (the order of fix length columns can be anywhere in the table).  The reason behind this is that the order of variable length columns can help to remove trailing nulls.  If the last N values of a row is all null, SQL Server will take 0 bytes to store these N column offset. Blog http://zine.net.pl/blogs/sqlgeek/archive/2008/06/06/en-column-order-does-it-matter.aspx also suggest that.   The present of Sparse Vector does not hurt this feature as well.