SQL Server Storage Engine

Fragmentation (part 1): What are records?

This blogging thing sucks you in, doesn't it? Not content with having an ongoing series on disaster recovery and CHECKDB (with another 6 and 25 more posts planned respectively), I'm starting a new series on fragmentation. This will begin from first principles and work up, in approximately 18 posts over the next few months. The first few posts could be skipped by some people and will cover:

  1. What are records?
  2. What are pages?
  3. What are extents?
  4. What is a heap?
  5. What is a clustered index?
  6. What is a non-clustered index?

Bear with me as I build up the terminology. You could also check out Kalen's excellent book Inside SQL Server 2000 for details on these topics. (Her upcoming volume on the Storage Engine for Inside SQL Server 2005 should be out sometime this summer - buy it!).

The idea for this series came to mind last Friday at TechEd when I spent about 3 hours repeating a deck on fragmentation I gave to the North Texas SSUG in April 2004, people kept wandering by and stopping to listen. If that many people are interested in this stuff, it should make good blog material, and it also seems to be frequently misunderstood. The level of sophistication here ranges from not having any idea what fragmentation is up to defragging or rebuilding all indexes every night. I didn't see anyone there who knew just when it was worth removing fragmentation so I decided to explain here. The first 6 posts will also be useful as background for the CHECKDB internals series I'm doing.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...

Data records

  • These store table rows in a heap, or in the leaf level of a clustered index. They store all the columns of the table row that will fit in a single 8Kb page.
  • Data records are stored on data pages.
  • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then there's a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when schema has been set to store LOB columns 'in-row' when possible. This is when a LOB value is small enough to fit within the confines of the data page that holds the data record, and so is stored in the same data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
  • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored 'off-row' as part of the new capability (called row-overflow) of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
  • I'll go into how the columns are laid out differently between heaps and clustered indexes in parts 4 and 5. There are also some twists in the record types for heaps that I'll cover in part 4.

Index records

  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes.
  • I'll explain more about the differences between these in parts 5 and 6 - it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records are stored on index pages.
  • Index records typically do not contain all the column values in a table (although some do - called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page.
  • An in-depth explanation of the various types and how they hang together is beyond the scope of this post and discussion.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from the leaf level of an index.
  • The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.
  • Ghost records will be mentioned later in the series when I discuss page compaction.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page).
  • These are not relevant to a discussion of  fragmentation so won't be considered further here.

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure isn't relevant to a discussion on fragmentation but is for CHECKDB internals, so here it is:

  • record header
    • 4 bytes long
    • two bytes of record metadata (record type)
    • two bytes pointing forward in the record to the NULL bitmap
  • fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • NULL bitmap
    • two bytes for count of columns in the record
    • variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • this allows an optimization when reading columns that are NULL
  • variable-length column offset array
    • two bytes for the count of variable-length columns
    • two bytes per variable length column, giving the offset to the start of the column value
  • versioning tag
    • this is in SQL Server 2005 only
    • this is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

If you have any questions on this stuff - put them in the comments of drop me an email.

Next time - what are pages?

Published Friday, June 23, 2006 4:41 PM by Paul Randal - MSFT

Comments

 

MikeWalsh said:

thanks for the good info again.
June 24, 2006 12:18 PM
 

Gaurav Bindlish said:

Awesome series .... Can't wait for future posts.
June 24, 2006 1:26 PM
 

oj_n said:

Thanks much for the detailed info on sql server storage engine. It's awesome that you and other PMs on the storage engine team are sharing this vast knowledge.

Btw, I would love to hear more on how sparse file and alternate data stream are incorporated in sql2k5.

Thanks again,
--
-oj

June 26, 2006 6:33 PM
 

Wesley Backelant said:

Great job on your blog posts lately Paul, keep up the good work!

I was wondering what exactly you mean by "this allows an optimization when reading columns that are NULL".  Could you elaborate on that?

Thanks
Alba, Gu Bra
June 28, 2006 12:39 PM
 

Paul Randal - MSFT said:

Hi oj,

Kevin or I will cover database snapshots in another post - this is the feature that makes use of sparse files and alternate data streams.

Thanks
June 28, 2006 10:15 PM
 

Glorf.It - Bedenkliches aus dem Alltag - » SQL Server Storage Engine said:

July 3, 2006 5:03 PM
 

Glorf.It - Bedenkliches aus dem IT-Alltag » SQL-Server: Ghost-Cleanup-Tasks said:

August 24, 2006 4:01 PM
 

SQL Server Storage Engine said:

Have you ever tried updating a variable length column and fail? Well, it can happen if the modified row

January 4, 2007 9:47 PM
 

SSQA- Users & SQL tools said:

On day-to-day basis a DBA might come across with the issues on the fragmentation on the database, it

April 23, 2007 5:42 AM
 

SQL Server Storage Engine said:

SQL Server deploys two strategies to compress the data · First, it stores all fixed length data types

November 12, 2007 12:17 PM
 

Noticias externas said:

SQL Server deploys two strategies to compress the data · First, it stores all fixed length data types

November 12, 2007 12:50 PM
 

SQLServerPedia » Does a database backup/restore update statistics? said:

December 13, 2007 4:58 AM
 

Danny's SQL Server and Internals Viewer Blog said:

The CTP6 release of SQL Server 2008 includes row and page compression. It’s a feature that will only

February 24, 2008 2:58 PM
 

Anthony Bloesch's Web Log said:

We have recently been looking at database fragmentation for real usage of the "Oslo" repository. However, since database fragmentation is a major cause of poor performance I thought a discussion of how to minimize and deal with database fragmentation

December 18, 2008 3:52 PM
 

Null takes place? | keyongtech said:

January 18, 2009 12:36 PM
 

char or varchar? | keyongtech said:

January 22, 2009 12:59 AM
 

Glorf IT - Bedenkliches aus dem IT-Alltag » SQL Server Storage Engine said:

February 18, 2009 1:26 PM
 

SQL Server Storage Engine Fragmentation part 1 What are records | Shed Kits said:

May 27, 2009 3:12 AM
Anonymous comments are disabled

About Paul Randal - MSFT

Paul started in the industry in 1994 working for DEC on the VMS file system and check/repair tools. In 1999 he moved to Microsoft to work on SQL Server, specifically on DBCC. For SQL Server 2000, he concentrated on index fragmentation (writing DBCC INDEXDEFRAG and DBCC SHOWCONTIG) plus various algorithms in DBCC CHECKDB. During SQL Server 2005 development Paul was the lead developer/manager of one the core dev teams in the Storage Engine, responsible for data access and storage (DBCC, allocation, indexes & heaps, pages/records, text/LOB storage, snapshot isolation, etc). He also spent several years rewriting DBCC CHECKDB and repair. For SQL Server 2008, Paul managed the Program Management team for the core Storage Engine to become more focused on customer/partner engagement and feature set definition. In 2007, after 8.5 years on the SQL Server team, Paul left Microsoft to join his wife, Kimberly Tripp, running SQLskills.com and pursuing his passion for presenting and consulting. Paul regularly presents at conferences and user groups around the world on high-availability, disaster recovery and Storage Engine internals. His popular blog is at http://www.sqlskills.com/blogs/paul/.

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker