Fragmentation (part 4): what are heaps?

Fragmentation (part 4): what are heaps?

Rate This
  • Comments 9
Ok - really catching up with the various blog post series I started back in June/July - time to bang out the next few posts in the series on index fragmentation. Remember I'm starting from first principles and covering
before getting into the details of fragmentation.
 
So what is a heap?
  • A heap is the simplest storage arrangement and is an unordered table (or you could think of it as a collection of unordered pages).
  • This means that rows will be inserted into the heap anywhere where there is space (don't let this statement confuse you - for a table that has nothing but inserts, records will always be appended to the last allocated page).
  • A heap consists entirely of data pages - as there is no b-tree, there are no index pages.
  • As a heap is unordered, the data pages are not linked in any way. (Ah - there are some exceptions. In SQL Server 2000 and 2005 (and maybe in 7.0 - I don't remember), the sysfiles1 table is a linked-heap. This table contains the locations of the files comprising the database and so for safety's sake we want the pages to be linked, but not with the complexity of having a b-tree because we read the pages directly at startup time. In SQL Server 2005, some other system tables are also linked-heaps).
  • If you want to do a singleton lookup from a heap, the whole heap has to be scanned to find the row(s) matching the search predicate.
  • Doing a select (*) of the contents of a heap will not guarantee to return the rows in the order they were inserted (as it did in 6.5 when all heaps were linked as sysfiles1 is and the space from singleton deletes was not reclaimed). This is because the pages are accessed in allocation order.
  • Non-clustered index records contain the physical RID (record ID or record locator) of the corresponding heap record.
Heaps have one interesting feature - forwarded records. If a record needs to be updated, the updated record size is greater than the current record size, and there is no space on the page to fit the new record in then we have two choices:
  1. move the record to a new page and change all the non-clustered index records that point to it to point to the new location of the record
  2. move the record to a new page and leave a forwarding record in the original location to point to the new location
Guess which approach we took? Right, #2. Approach #1 has enormous performance implications because of the extra IOs and logging involved. The forwarding record has the physical location of the new record and the forwarded record has a back-pointer to the forwarding record (so that if its deleted, the forwarding record can be deleted too).
 
This is one drawback of using heaps - all the extra space that's "wasted" with the forwarding/forwarded records. Another drawback is that when scanning through the heap, forwarding records have to followed immediately (as opposed to ignoring them and just reading the forwarded records when they're encountered). This is to vastly reduce the possiblity of read anomalies (such as non-repeatable reads or missed rows if a row moves before the scan point during a scan).
 
And that's it. Next up is clustered indexes...
Leave a Comment
  • Please add 3 and 3 and type the answer here:
  • Post
  • Greg Linwood, a fellow SQL Server MVP, has started a series of articles in which he attempts to prove...

  • Paul - I just wante to make sure I understand correctly your second bullet point on "what is a heap":

    "This means that rows will be inserted into the heap anywhere where there is space (don't let this statement confuse you - for a table that has nothing but inserts, records will always be appended to the last allocated page)"

    Microsoft kb article 297861 indicates:

    "In SQL Server 7.0, and later, SQL Server generally optimizes inserts into a heap with the assumption that saving space is more important than performance. That is the tradeoff you choose to make when you decide to leave a table as a heap. Therefore, an insert into a heap often spends time searching for a location to insert a new row."

    Are you saying that if I have a heap which has never had any deletes or updates, SQL Server 7.0 and later do not try to optimize for space usage - they simnply insert the new rows at the last allocated page? Does the same apply if I start from an empty heap and insert millions of rows with a single INSERT/SELECT FROM statement?

  • Hi Gent,

    The KB article really only applies when there is free space in the pages that comprise the heap. For a heap that is append-only, with no deletes, the free space will always be in the last allocated page - so there's nothing to optimize - the only place to insert the record is in the last allocated page. The same applies to the insert/select example you give.

    Thanks

  • Paul - thank you for the reply.

    If I create a variable length table then, depending on the data inserted on a page, it might happen that there is free space in pages other then the last allocated one. For example:

    create table T1(col1 varchar(8000))

    insert into T1 select replicate('a', 100)

    insert into T1 select replicate('a', 8000)

    The above two inserts allocate two pages with free space left on both of them.

    I used DBCC PAGE to see where would data go if I ran a third insert (by the way, thank you for the great posting on DBCC PAGE):

    insert into T1 select replicate('a', 10)

    Although there should be enough free space on the second page, in all my tests SQL Server consistently inserted the data in the first one. I ran into the same behaviour if I changed the second insert (the one that allocates the second page) to insert only 7800 characters:

    insert into T1 select replicate('a', 7800)

    Again, the third insert (inserting a 10 character row) would consistently use the first page, although there should be more than 200 bytes free on the second one.

    I also ran a loop as in the following and watched the FreeSpace Scan/sec counter being high for the duration of the loop (I was running these tests on my local machine with nothing else running on SQL Server):

    declare @counter int

    set @counter = 1

    while (@counter < 10000)

    begin

    insert into T1 select replicate('a', 1000)

    set @counter = @counter + 1

    end

    I ran my tests both on SQL Server 2000 and SQL Server 2005 with similar results.

    All the above seems to indicate that even in the case of 'insert only' heaps, SQL Server still tries to optimize for space and will utilize a page which might not necessarily be the last allocated one. What am I missing?

    Thanks again for your time and thank you for the great information and knowledge you are sharing in your posts.

    Gent

  • Yes, when the records are randomly sized then it will make use of any free space in earlier pages.

  • Greg Linwood, a fellow SQL Server MVP, has started a series of articles in which he attempts to prove

  • Hi, Paul

    Very nice blog on those fragmentation series..but where is you next one on the clustered index and non-clustered index?

    My another question,

    if all the rows from a page are deleted, that page gets deallocated and reclaimed by the database and if all pages are empty due to range deleteion in an extent, the extent will get deallocated and reclaimed by the database. Is that correct statement? Is that the same for the heap? I dont think so as "empty" page(due to deletion) do not reclaimed and deallocated for the heap.

    Can you explain in that area in respect to the when a range deletion occurs on the page, what happens to the page?

    You can ping me at billkan@microsoft.com

    Thank you,

    Bill

  • Hi Bill,

    I haven't done them yet - I'll probably redo most of the series at the end of the summer.

    When pages and extents are deallocated is complicated, and the subject of a whole blog post. I'll get to it...

    Thanks

  • MVP Hugo Kornelius once reported that he encountered a situation in which it was possible to perform

Page 1 of 1 (9 items)