Builder (write data to the in-build index)

                           |

                     Sort (order by index key)

                           |

                     Scan (read data from source)

 

In order to build the b-tree for the index we have to first sort the data from source.  The flow is to scan the source, sort it (if possible - in memory*), then build the b-tree from the sort. 

Why do we need to sort first before building the b-tree?  In theory we don’t have to sort, we could use regular DML and directly insert data into the in-build index (no sort), but in this case we would be doing random inserts, random inserts in a b-tree require searching the b-tree for the correct leaf node first and then inserting the data. And while searching a b-tree is fairly fast, doing so before each insert is far from optimal. So for index build operation, we sort the data using the sort order for the new index, so when we insert data into the in-build index, it is not a random insert, it is actually an append operation and this is why the operation can be much faster than random inserts. 

While inserting data between sort and index builder we free each extent from the sort table as soon as all of its rows are copied.  In this way we reduce the overall disk space consumption from a possible 3*Index Size (source + sort table + b-tree) to just 2.2*Index Size (approximately).

 

*We do not guarantee in memory sort; the decision of whether we can do in memory sort or not depends on memory available and actual row count. ‘In-memory’ sort is, naturally, fast (also disk space requirements will be more relaxed in this case, because we don’t have to allocate space for the sort table on the disk), but it is not required; we can always spill data to disk, although the performance is much slower than in-memory sort.

 

For an index build operation, we use the user database (the database where we build the index) by default for sort to spill data, but if the user specifies sort_in_tempdb, then we use tempdb for spill.

Each sort table (even when we have very little data to sort) requires at least 40 pages (3200KB) to run (later we will see that in case of parallelism we can have several sort tables at the same time). When calculating memory for sort, we try to allocate enough memory to have an in-memory sort.  For large index build operations it is not likely that we will be able to fit the entire sort in memory. If we can’t provide at least 40 pages for the Index Build operation, it will fail.

 

The last step of index build is to always build full statistics. Good statistics information helps the query optimizer to generate better plan, users can issue ‘create’ or ‘update’ stats commands to force SQL Server generate or refresh stats on a certain object.  When we are building a new index, since we need to touch every row, we use this opportunity to build full stats at the same time as a side benefit.

 

Conclusion:

To be able to build a non partitioned Index offline with serial plan we will need free disk space (in user’s or in tempdb database) of approximately 2.2*IndexSize and at least 40 pages of memory available for the query executor to be able to start the process.

 

 

Read in next post: Index Build Scenario 2: Offline, Parallel, No Partitioning

 

Posted by: Lyudmila Fokina