When we think of compressing the database, one of the first questions that pops up is: why not just use a compressed volume? You may know that the Windows OS has supported compression of individual files, folders, and entire NTFS volumes since Windows 2000. Given this, you may wonder why we don't simply create databases on a compressed volume and get the disk savings.
Well, there are a few issues with this approach:
· Compression methods are generally effective on large chunks of data, so you may need to compress many pages (think megabytes) together before you get a good compression ratio. This runs counter to how data is actually accessed in a database. Most user queries access data “randomly” and in “small” parts, which means that retrieving a random piece of data on a compressed volume forces many pages to be read and decompressed before the requested data is available. This becomes prohibitively expensive.
· When data is read into SQL Server memory (i.e. the buffer pool), it is decompressed. So there is no benefit from a memory perspective either.
· I/O on a compressed volume is synchronous, which will impact the performance of your workload. Here is some information from an excellent white paper written by Bob Dorr: http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx . “SQL Server relies heavily on asynchronous I/O capabilities to maximize resource usage. SQL Server Support has debugged problems with some filter drivers that do not allow the I/O request to be completed asynchronously. Instead, the filter driver requires the I/O to complete before returning control to SQL Server. This is easily observed by watching the disk queue lengths. When SQL Server is running, it commonly keeps more than one I/O posted. When I/Os become synchronous, the disk queue is often held at one outstanding I/O. This causes the SQL Server code to unnecessarily block. Because the disk sec/transfer time may not be a fair statistic, use it with caution. When less I/O is outstanding on a drive, the disk sec/transfer is often fast. The longer the disk queue length, the more variations that can occur on disk sec/transfer. However, because SQL Server understands the asynchronous nature of the I/O patterns, longer disk queue lengths, and somewhat longer disk sec/transfer can result in overall better throughput and resource utilization.”
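The large-chunk point above can be made concrete with a small sketch. This is not SQL Server's or NTFS's actual compression algorithm; it just uses Python's zlib on synthetic, somewhat-redundant 8 KB "pages" (the page size and sample row format are illustrative assumptions) to show that compressing many pages as one chunk yields a better ratio than compressing each page independently, and that reading a single page back from the big chunk forces the whole chunk to be decompressed first:

```python
import zlib

PAGE_SIZE = 8 * 1024   # SQL Server pages are 8 KB
NUM_PAGES = 128        # 1 MB of sample data

# Build somewhat-redundant sample pages (repeated rows compress well).
# The row layout here is purely illustrative.
row = b"customer_id=%06d;region=EMEA;status=active;"
pages = []
for p in range(NUM_PAGES):
    buf = bytearray()
    i = 0
    while len(buf) < PAGE_SIZE:
        buf += row % (p * 1000 + i)
        i += 1
    pages.append(bytes(buf[:PAGE_SIZE]))

# Strategy A: compress each 8 KB page on its own.
per_page_total = sum(len(zlib.compress(pg)) for pg in pages)

# Strategy B: compress all pages together as one 1 MB chunk.
whole_chunk = zlib.compress(b"".join(pages))

print(f"per-page total: {per_page_total} bytes")
print(f"one big chunk:  {len(whole_chunk)} bytes")

# To fetch just ONE page from strategy B, the ENTIRE chunk must be
# decompressed first -- the random-access penalty described above.
restored = zlib.decompress(whole_chunk)
page42 = restored[42 * PAGE_SIZE:43 * PAGE_SIZE]
assert page42 == pages[42]
```

On data like this, strategy B produces a noticeably smaller output because the compressor can exploit redundancy across page boundaries, which is exactly why volume-level compression favors large compression units and penalizes small random reads.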
In the next blog post, I will describe how SQL Server compresses the data.
Great post. I wonder, though, what the cost of the compression is when considering updates and inserts. Do you have any design-time considerations that would allow us to choose the correct scenarios for compression?
Mark: as you correctly pointed out, there is a cost to be paid for compression. SELECT operators pay the cost of decompressing the data, while DML operators (e.g. UPDATE, INSERT) pay the cost of compressing the data. We are planning to publish (before SQL Server 2008 RTM) a white paper/case study on how to use the compression feature to strike the right balance between space savings and CPU overhead.
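The read/write cost asymmetry mentioned in the reply can be sketched generically. Again, this uses Python's zlib rather than SQL Server's row/page compression, and the sample payload is an invented placeholder; it simply times the two directions to illustrate that writes (which compress) and reads (which decompress) pay different CPU costs:

```python
import time
import zlib

# Invented, repetitive sample payload (~168 KB) standing in for row data.
payload = b"order_line: sku=ABC-123 qty=10 price=9.99\n" * 4096

t0 = time.perf_counter()
compressed = zlib.compress(payload, level=6)
compress_s = time.perf_counter() - t0      # cost paid by INSERT/UPDATE

t0 = time.perf_counter()
restored = zlib.decompress(compressed)
decompress_s = time.perf_counter() - t0    # cost paid by SELECT

assert restored == payload
print(f"compress:   {compress_s * 1e3:.3f} ms")
print(f"decompress: {decompress_s * 1e3:.3f} ms")
```

With most general-purpose compressors, compression is the more expensive direction, which is one reason write-heavy tables deserve extra scrutiny before enabling compression.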
Don't you also lose the ability to do asynchronous I/O when NTFS compression is used?
Good point. I had missed it. I will update the blog entry.