Last year we worked with a customer on a proof of concept that involved populating a 20TB database and running performance tests against it.  The POC was conducted using the following software and hardware:

  • Windows Server 2003 Datacenter Edition SP1
  • SQL Server 2005 Enterprise Edition SP2
  • Unisys ES7000 (64 x64 processor cores, 256GB RAM)
  • HP XP12000 SAN with 24TB of storage spread across 860+ spindles

As it turned out, the bulk of the data was contained in one partitioned table in particular, and we ended up reloading that table a few times as we optimized the system and made changes to the disk layout.

During the POC, we documented a few tips and lessons along the way in a list that we've been maintaining internally.  Now that we've started this blog, I thought it might be useful to publish this information for others who are building or working on very large database systems with SQL Server.  Note that this is not intended to be a complete list of VLDB considerations, just the configuration settings, commands, and tips that were helpful in our POC.

Configuration

  • Use the -E startup option so that SQL Server allocates 4 extents per file instead of the default of 1. This must be done before loading data into the database.
  • Grant the "Lock Pages in Memory" privilege to the SQL Server service account in order to keep the process memory in physical RAM.
  • Turn on trace flag 834 to enable large page support for the buffer pool. Note that this requires the "Lock Pages in Memory" privilege.
  • Make sure that the SQL Server service account has the "Perform Volume Maintenance Tasks" privilege (SE_MANAGE_VOLUME_NAME) in order to enable instant file initialization.
  • Set the affinity mask (and affinity64 mask if the system has more than 32 processors) on a NUMA system to bind schedulers and threads to specific CPUs. In our case, we set both of these configuration options to 0xFFFFFFFF in order to use all 64 CPU cores.
  • Do not rely on autogrow to size the data or transaction log files. Pre-allocate space so that SQL Server can grab contiguous space on disk. A sketch of the last two items follows this list.
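As a rough illustration of the affinity and pre-allocation items above, here is a minimal T-SQL sketch. The database and logical file names are hypothetical, and the -E and -T834 flags are set as sqlservr.exe startup parameters (for example, through SQL Server Configuration Manager), not through T-SQL:

-- Enable both affinity masks so that all 64 CPU cores are used.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'affinity mask', 0xFFFFFFFF;    -- CPUs 0-31
EXEC sp_configure 'affinity64 mask', 0xFFFFFFFF;  -- CPUs 32-63
RECONFIGURE;

-- Pre-allocate data and log space instead of relying on autogrow
-- (database and file names here are hypothetical).
ALTER DATABASE VLDB_POC MODIFY FILE (NAME = VLDB_POC_data1, SIZE = 500GB);
ALTER DATABASE VLDB_POC MODIFY FILE (NAME = VLDB_POC_log, SIZE = 100GB);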

Partitioning

  • ALTER TABLE ... SWITCH can invalidate statistics, causing an automatic statistics update to run.
  • Updating statistics on a very large table with several indexes can take a long time to run. Besides query execution, a statistics update can be triggered by requesting an estimated execution plan if there have been sufficient data changes.
  • In our case, we were doing a lot of partition switching but not changing the data so we turned off auto update statistics so that the stats updates did not affect query performance. In a production environment it would most likely be more appropriate to set the AUTO_UPDATE_STATISTICS_ASYNC option.
  • To split a partition and move data, it is more efficient to switch the partition out and move the data manually using an INSERT ... SELECT statement rather than running ALTER PARTITION FUNCTION ... SPLIT RANGE against the populated partition. ALTER PARTITION FUNCTION ... SPLIT RANGE can then be used on the partitioned table to split the now-empty partition into two (see the sketch after this list). In our case that made a difference of several hours when splitting a 200GB partition.
  • It is a good practice to always keep an empty partition at either end of the table so it can be split before loading new data.
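Here is a hedged sketch of that switch-out-and-split pattern; the table, partition function, partition scheme, filegroup names, and boundary value are all hypothetical:

-- 1. Switch the populated partition out to an empty staging table that has
--    the same structure and indexes and sits on the same filegroup.
ALTER TABLE dbo.FactSales SWITCH PARTITION 5 TO dbo.FactSales_Staging;

-- 2. Split the now-empty partition; with no rows to move, this is fast.
ALTER PARTITION SCHEME ps_Sales NEXT USED FG_Sales_New;
ALTER PARTITION FUNCTION pf_Sales() SPLIT RANGE ('20070101');

-- 3. Reload the staged rows; they distribute across the two new partitions.
INSERT INTO dbo.FactSales SELECT * FROM dbo.FactSales_Staging;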

Parallelism / Performance

  • When doing a parallel CREATE INDEX operation, use the SORT_IN_TEMPDB option to reduce fragmentation (see the sketch after the monitoring query below). Be sure to size tempdb adequately so its data files do not have to grow.
  • If you have long-running queries or CREATE INDEX commands using large amounts of memory, you can monitor their memory usage with the DMV sys.dm_exec_query_memory_grants.
  • Running CREATE INDEX with a MAXDOP larger than the number of processor cores in a single NUMA node will cause SQL Server to spread the tasks across NUMA nodes. This can cause excessive overhead in the form of CXPACKET waits as the threads communicate across nodes.
  • When running several long-running operations in parallel, you may need to pay attention to NUMA node affinity to ensure that tasks are being evenly distributed across nodes. We used the following query to monitor this:

select t.session_id, r.command, s.parent_node_id,
       count(*) as tasks,
       sum(t.pending_io_count) as [IO],
       sum(case t.task_state when 'RUNNING' then 1 else 0 end) as [RunningTasks]
from sys.dm_os_tasks t
join sys.dm_exec_requests r on t.session_id = r.session_id
join sys.dm_os_schedulers s on t.scheduler_id = s.scheduler_id
where t.session_id > 50
group by s.parent_node_id, t.session_id, r.command
order by s.parent_node_id
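Tying the CREATE INDEX points together, here is a minimal sketch; the table, column, and index names are hypothetical, and MAXDOP = 16 assumes a NUMA node with 16 cores:

-- Sort in tempdb and cap the operation at one (hypothetical) 16-core NUMA node.
CREATE INDEX IX_FactSales_OrderDateKey
ON dbo.FactSales (OrderDateKey)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 16);

-- From another session, watch the memory grants of long-running operations:
select session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
from sys.dm_exec_query_memory_grants
order by granted_memory_kb desc;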

  • We used SQL LiteSpeed for backups and were able to achieve backup speeds of over 700MB/sec using all 64 CPU cores.  Of course, your mileage will vary based on compression settings and data.  We have recently published some impressive numbers regarding the new SQL Server 2008 compressed backups; the native syntax is sketched below.
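For reference, native backup compression in SQL Server 2008 is a single option on the BACKUP statement; a minimal sketch, assuming a hypothetical database name and backup path:

BACKUP DATABASE VLDB_POC
TO DISK = N'X:\Backups\VLDB_POC.bak'
WITH COMPRESSION, STATS = 10;  -- report progress every 10 percent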

 

-Scott Hulke, Microsoft Technology Center – Dallas