SQL Server Storage Engine

How does your schema affect your shrink run-time?

For part two of my short series on data file shrinking, I want to look at how elements of your schemas can cause shrink to take much longer to run. In SQL Server 2005, three things in your schema will drastically affect the run-time of data file shrink.

 

LOB data

By LOB data, I mean any large-value data-type (e.g. text, image, varchar(max), XML). The problem here is with the way that LOB values are stored. They’re usually stored off-row, which means they’re stored in separate text pages from the table or index rows that they’re part of. Each row has a physical link to the LOB value(s) on the different text page(s).

 

The big catch is that the link is one-way only – from the data/index row to the LOB value – there is no backwards link from the LOB value back to the ‘owning’ data/index row. So, if a LOB value needs to be moved to a new page as part of the shrink process, how can the owning data/index row be updated with the new physical location?

 

Not easily is the answer. Because there’s no backwards link from the LOB value back to the owning row, shrink has to scan all rows that could own the text value (i.e. all records in the partition, index, or base table that the text page is part of). As you can imagine, this is a slow process and it has to happen for every LOB value that is moved.

 

Note: This was the same in SQL Server 2000 as well.

 

Row-overflow data

This is quite non-obvious. Row-overflow data is where you have a schema that allows row sizes larger than can fit in a page (see here for more details and here for how to examine the on-disk structure of row-overflow data values). The key point is that row-overflow data is stored in exactly the same way as LOB values, and so the same algorithm kicks in during shrink when a page with one or more of these values needs to be moved.

 

Heaps with non-clustered indexes

Non-clustered index rows have to have a link back to the matching base table (heap or clustered index) record, so that queries that have used non-covering non-clustered indexes can retrieve the desired columns from the base table record. For tables with a clustered index, this link is a logical link – the clustering key of the matching base table record. For tables without a clustered index (i.e. just a heap), this link is a physical link – the actual physical location of the matching base table record.

 

So, if a clustered index data-page moves, then nothing needs to change in the non-clustered indexes, because the clustering keys of the records on that page haven’t changed – only their location. However, if a heap data-page needs to be moved, than all the matching non-clustered index records need to be updated with the new location of the base table records!

 

On SQL Server 2000, this isn’t as bad as it seems because the Storage Engine could do this manually itself. However, we made a bunch of architectural changes in SQL Server 2005 that means that the Storage Engine no longer does this. Now the Query Processor is the only part of the SQL Server Engine that can update non-clustered index records, so when the Storage Engine wants to move a heap page in SQL Server 2005, it needs to ask the Query Processor to update the non-clustered index records for it. It turns out that this is a lot slower than the SQL 2000 method.

 

Summary

 

More reasons not to use shrink. Next up – how to avoid using shrink in SQL Server 2005.

Published Thursday, March 29, 2007 10:58 AM by Paul Randal - MSFT
Filed under:

Comments

 

SQL Server Storage Engine said:

Late night blog post to round out spring-break vacation... A number of customers I’ve spoken to in the

April 16, 2007 2:47 AM
Anonymous comments are disabled

About Paul Randal - MSFT

Paul started in the industry in 1994 working for DEC on the VMS file system and check/repair tools. In 1999 he moved to Microsoft to work on SQL Server, specifically on DBCC. For SQL Server 2000, he concentrated on index fragmentation (writing DBCC INDEXDEFRAG and DBCC SHOWCONTIG) plus various algorithms in DBCC CHECKDB. During SQL Server 2005 development Paul was the lead developer/manager of one the core dev teams in the Storage Engine, responsible for data access and storage (DBCC, allocation, indexes & heaps, pages/records, text/LOB storage, snapshot isolation, etc). He also spent several years rewriting DBCC CHECKDB and repair. For SQL Server 2008, Paul managed the Program Management team for the core Storage Engine to become more focused on customer/partner engagement and feature set definition. In 2007, after 8.5 years on the SQL Server team, Paul left Microsoft to join his wife, Kimberly Tripp, running SQLskills.com and pursuing his passion for presenting and consulting. Paul regularly presents at conferences and user groups around the world on high-availability, disaster recovery and Storage Engine internals. His popular blog is at http://www.sqlskills.com/blogs/paul/.

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker