SQL Server deploys two strategies to compress the data
· First, it stores all fixed length data types in variable length format. If you recall, SQL Server 2005/SP2 provided a new variable length storage format decimal and numeric. Please refer to the series of blogs vardecimal-storage-format for details. SQL Server 2008 extends this to all fixed length data types. Some examples of fixed length data types are integer, char, and float data types. One important point to notes is that even though, the SQL Server stores these data types in variable length format, the semantics of the data type remains unchanged (i.e. it is still the fixed length data type from the perspective of the application). This means that you can avail the benefits for data compression without requiring any changes in your application(s).
Let us consider some examples to illustrate the space savings you can get by storing data in variable length storage format.
Example-1: Let us say you have a table that has an integer column along with many other columns. If the common integer value of this column can fit in 1 byte (i.e. the value is <= 255), then there is no reason to take 4 bytes to store it. With variable length storage format, SQL Server can shave off 3 bytes from most integer values for this case there by saving you 3 bytes off 4 byte value, a space saving of 75%.
Example-2: Let us say you have a table that has a CHAR(100) column. In fixed length format, this column will take 100 characters independent of the actual value stored in it. So if you store “Hello” or “This is a longer string”, both values will take space required to store 100 characters. However, when CHAR (100) is stored in variable length storage format, it will only take 5 characters for the first value (“hello”) and 23 characters for the second value (“This is a longer string”). You can see in this case, we are able to reduce the size of the first value by 95% and for the second value by 77%. A significant space saving.
Of course, if the character value in example-2 has many more characters or if the common integer values in example-1 take 2 or more bytes, the space savings will be lower. This means the space savings you can achieve will depend on the data distribution. Other point that is not so obvious is that the space savings achieved will also depend on the schema of the table as well. So for example, if the column was declared to be CHAR (150) instead of CHAR (100), you would get significantly more space (i.e. additional 50 characters) savings. Also, note as a special case, the 0s and NULL values require no storage.
Storing values in variable length format will typically get you some space savings, but there is a catch. When a fixed length value is stored in variable length format, the SQL Server needs to store its offset within the row. Note, this offset was not needed when the column value was stored as fixed length because its relative position in the row was always at the same offset. If you recall the record storage format in SQL Server record-structure-blog-post, it takes 2 bytes to store the offset of each variable length column. Given this, it makes little sense to create a column of type varchar(2) as the minimum overhead is 2 bytes already. Same is true for data type such as smallint or even for integer for that matter. SQL Server 2008 addresses this issue by introducing a new record format to be used for compressed data that minimizes the overhead of storing the variable length values. It uses only 4 bits of overhead to store the length of the variable length column that is less than or equal to 8 bytes. Note that the previous record format is fully supported in SQL Server 2008 and is the default record format.
This strategy is exposed as ROW compression externally through DDL. Also note that ROW compression is a superset of vardecimal storage format and that vardecimal storage format is fully supported in SQL Server 2008.
· Second, it minimizes the data redundancy in columns in one or more rows on a given page. It does that by storing the redundant data once on the page and then referencing it from the multiple columns. So for example, consider the following table
Table employee (name varchar(100),
status varchar (10) default ‘full time’)
Now, if you insert multiple rows into this table, it is possible that many rows will have the same value (i.e. the default) for status column. SQL Server can take advantage of this by storing ‘full time’ value once on the page and then referring to this value from all other occurrences of this value on the same page. As you can imagine, the space savings using this strategy will depend on the amount redundant data on the page.
This strategy is exposed as PAGE compression externally through DDL. A customer can choose to enable ROW or PAGE compression on a table or an index or even on an individual partition(s). PAGE compression includes ROW compression.
I will provide details on PAGE compression later but for now it will suffice to know that ROW compression means storing fixed length data in variable length storage format and PAGE compression means minimizing the data redundancy.
As you can imagine, enabling compression is an expensive operation because the uncompressed data needs to be converted to new record format and for PAGE compression, the data redundancy needs to be minimized. So before you enable compression, you may want to know how much space savings you can get. In the next blog, I will describe how to estimate compression savings.
Thanks for the great tech specs, as record format. It would be even better if these essential specs were in a wiki.
Question: In any version of SQL (and especially SQL Server 2005 & 2008),
WHAT COMPRESSION is there FOR A COLUMN/FIELD OF REPEATING VALUES
(contiguous series of rows with the same value for that that field/column)
--MOST ESPECIALLY FOR SHORT FEW-BYTE VALUES as INT?
Say in an Order-Item table: an order is generally entered all once (or about the same time), each Order-Item tends to reference the same Order ID as the prior Order-Item (very frequently)).
As a simple do-it-yourself version of this, you could have your row-insert procedure (plus update & delete procedures) put null for a value of a field/column (expected to have a value) if that field's value desired value happens to be same value as the prior row's value for that field/column -- storing the same data with just 1 bit (for the null option, which is allocated anyway in all cases in SQL Server 2008 server). I call this DITTO Compression 20080103075141, because it's exactly what the ditto mark does, used here in the cell of a table. Then have a view of the data which actually restores the desired value (when the view sees null, it uses the prior non-null value). Then even index this computed column (and perhaps persist it) to get the same performance as if you actually had this underlying data, but without having to permanently waste the space. How well would that work?
However, something as basic & simple as this ("ditto compression") I would expect would already be built-in to SQL. Indeed I seem to recall reading years ago it was, perhaps in Oracle. But I can't find docs of this, for SQL Server nor SQL in general. Is such basic "ditto compression" built-in? If not, why not?
-- http://www.infoq.com/news/2007/11/SQL-Server-Compression talks of "page level" "prefix" compression (which is quite clever), but that's not until the latest 2008, and it seems like that might only work well on strings, and still require (for a 100% repeated value) a minimum of 3 bytes or so, instead of a bit.
Thanks! Mike Parker
The CTP6 release of SQL Server 2008 includes row and page compression. It’s a feature that will only
Your article is impressive. Thanks for all these internals.
One Question : Will this work for BLOB data as well, since you mentioned compression happens at binary level.
only in-row data gets compressed. if BLOB is stored off row, a common case, it won't be compressed. you have two options for off row
(1) use file stream
(2) use RBS (remote blob storage)
thanks for your interest and comments
There is a vast array of new features in SQL Server 2008. While I would love to explore all of them,
What about if have many VARCHAR declaration will the "ROW" compression still benefit or should I go for "PAGE" compression?
Yes, it can. PAGE compression reduces data redundancy on the page. So for example, if two VARCHAR values are same
Once the table is compressed with Row, is there way I can check how much compression I am getting?
Quick question. If I have got compressed column in the GROUP BY clause of my query do the values get uncompressed before or after the GROUP BY aggregate is done. You can see why I'm asking, right? If the uncompression is done *after* the GROUP BY then there is a lot less data to uncompress.
Values get uncompressed before "group by". The query processing layer only sees uncompressed data
OK, thanks Sunil. Any chance that the QP could aggregate compressed data in the future and gain anything from it? Worth submitting to Connect?
Ideally what you want is to have compression state preserved thru query processing and let SQL Server make decision when to compress intermediate rowsets by weighing compression/decompression cost and the size. For your specific case, are you running into any performance issue? if your selectivity of the predicate is high, then you will get fewer rows to group-by. If you want to submit the connect request, please provide real data on
(1) The performance problem you are seeing and why do you think that compressing intermediate result will give you better performance
(2) Is there a way to re-write the query for your specific case.
Not that much intersted..Make it clear..