Every once in a while, a process gets too complex for humans to manage consistently. For example, when Boeing’s B-17 (the “flying fortress”) was first being evaluated, it crashed during initial tests due to pilot error. The consensus became that it “was too much airplane to fly.” Fortunately (for Boeing), someone came up with the idea of using checklists for each stage of flight. This ensured that the correct steps were taken at the appropriate time. Ultimately, this saved the program and the B-17 subsequently flew 1.8 million miles without an accident.

 

This SQL Server checklist is a work in process. If you have any additions or suggestions, send it to me and I'll trying to include it. For example, there’s not much here about setting up a cluster because I don’t have a lot of experience with clustering. Most of these items should be considered suggestions, rather than mandates. There may be valid reasons to skip an item or implement alternatives; the point is that you should at least consider each entry.

 

Initial SAN Configuration

q Use different storage for data files and log files

Data files typically use a random access pattern while log file writes are usually sequential. You will get much better performance by storing these files on different physical media (not just different volumes on the same drives!) because it allows the disk heads on the log drives to maintain their position for the next write.

q Configure Storage System to RAID 10 (aka RAID 1+0).

RAID 5 is likely to deliver poor write performance. RAID 0+1 is more vulnerable to failure and will require a longer recovery in the event of failure. You need at least four drives to implement RAID 10, so if you have less than four drives, just use RAID 1 (redundant disks).

q Use DISKPART.EXE to partition new LUNs

If you are not using partitions created with Windows Server 2008, use DISKPART to specify disk offset to ensure volume alignment. (You need Windows 2003 SP1 or later). You can find alignment instructions here. This step is critical to ensure that the 64k reads and writes align with the 64k stripe elements of the RAID media.

q Format the partitions with a 64k allocation unit size.

Since SQL Server reads and writes 64k at a time, the default 8k allocation size just creates unnecessary overhead.

q Balance Disks between service controllers.

If using EMC Clariion (or a similar mid-level SAN), ensure that the disk access is balanced between the service controllers.

SQL Server Setup

q Size and split tempdb and turn off auto-grow

Before creating other databases, create 1 tempdb file for each CPU (or core) and allocate these files across all data LUNs (if you have more than one data LUN). Size tempdb to largest table, plus 10% or better, then turn Autogrow off.

q Turn off features you don’t need

In SQL 2005 you can run the Surface Area Configuration Tool to do this. SQL 2008 has a better policy-based approach, but you can do this manually also: Start SQL Server Management Studio, right-click on server in the object explorer window, then select “Facets”. When the View Facets window appears, select the Surface Area Configuration facet.

q Execute SQL Server’s services with a domain account

Using domain accounts (rather than local accounts) can make inter-server communication much simpler. Using different accounts for different services is even better, as it allows fine-grained control over access and permissions. Importantly, this is the sort of thing that is hard to change once a server is in production. You can configure this at setup or with Configuration Manager.

q Get a CA signed certificate for SQL Server

SQL Server 2005 and 2008 will automatically generate a self-signed certificate for use in securing connection handshakes. This is not an optimal security strategy, since anybody can generate a self-signed certification; and consequently, anybody capable of presenting such a certificate to the client is also capable of decrypting the logon information. A better approach is to obtain a certificate signed by the enterprise Certificate Authority. The CA signature informs the client that the certificate is indeed authentic.

q Change and block default ports (1433, 1434)

These ports are well-known and subject to attack. One way to address this is to use named instances of SQL Server, another way is to select a different default port number.

q Limit data on Primary Filegroup

Isolate a minimum set of data onto Primary filegroup to get database available ASAP on recovery; everything else goes to one or more other filegroups. Upon recovery, the database will be available when the Primary filegroup is loaded, although requests for data in the remaining filegroups will block until they are also loaded.

q Keep first and last partitions empty

When you create partitions, always keep the first and last partition empty. Before you start populating the empty partition (via an INSERT or SWAP), create a new one so that the outer partitions are always empty. Empty partitions are quick and easy for the database engine to split; partitions with data are not.

q Isolate and static data

Put static (or very slowly changing) tables into their own filegroup and mark te filegroup read-only (If the data needs to be updated, you can flip the state back to read-write, change it, then flip it back). This provides two advantages: the filegroup does not need to be included on regular backups and read-only filegroups can be accessed by multiple database servers concurrently.

Post Installation

q Turn off HyperThreading

HyperThreading (which is a feature of some Intel processors) makes a single processor appear to be two processors to the OS, which can make some activities more efficient. Since SQL Server’s CPU scheduler is already very efficient at optimizing processor usage, HyperThreading simply causes the processor to switch between two sets of scheduler tasks. This context switch is not more efficient than allowing the scheduler to simply keep the CPU busy. Generally, turning off HyperThreading is done through the BIOS.

q Ensure multiple HBAs are teamed, not failover

Just because you have two HBA cards does not mean you are getting twice the bandwidth. Check your configuration and ensure that they are teamed and not set to simple failover.

q Enable SAN Write Caching

If you are using a battery-backed SAN controller (which most of them are), ensure that database is not waiting for disk commits, but only cache commits. If possible allocate almost all cache for writes (say, 80% writes, 20% reads).

q NUMA hardware (or Quad-core): Set MaxDOP

Setting the “Maximum Degree of Parallelism” server property ensures that a single task is not spread across multiple NUMA nodes. This is certain to be more efficient than having NUMA node tasks waiting for other nodes. When using nodes with dual-core CPUs, set the MaxDOP to 4 (such a node would have 4 cores); with quad-core CPUs, set MaxDOP to 8. Note that a Quad-core x64 CPU is itself treated as a NUMA node even on “non-NUMA” multi-processor hardware; in this case set, MaxDOP to 4.

q Enable teamed NICs on the client network.

Since network connectivity is often a bottleneck, if your driver supports this functionality, enable teamed NICs on your server. Each should be wired to a different switch.

q Set “Queue Depth” (or “Execution Throttle”) on the HBA to 64 or 128, or so

Ensure that other machines on the same SAN are also adjusted or this machine may dominate the SAN.

q Turn on CheckSums in all databases.

This is the default for new databases, but upgraded databases may not have this feature enabled. This will not cause the engine to rewrite the database, but as each page is eventually rewritten, a checksum value will be added. The checksum is a valuable feature and it worth the minimal overhead.

q Enable Read Committed Snapshots

This allows the engine to support multiple “Read Committed” isolation-level transactions without blocking, even when they affect the same records. If you routinely use the NOLOCK hint, you need to stop; simply enable read committed snapshots on the server instead.

q Give engine service account "Lock Pages in Memory" permission in group policy.

This will allow the engine to use Windows’ Address Windowing Extensions (AWE) to set memory locks. Complete instructions can be found here. (Although AWE is an explicit option on 32-bit systems, on 64-bit systems AWE is always on — as long as this permission is granted).

q Give engine service account "Manage Volume Name" permission in group policy.

Ideally you will not need to grow the database files, but if the need arises, having this permission can dramatically improve performance. Without this permission, the OS must “zero” (wipe clean) the newly allocated space. If the previous data on the disk was extremely sensitive, you might reconsider this option.

q If this is a very large database, consider Trace Flag 1224

When a query acquires a certain number of locks, the database engine “escalates” the lock to the table level. This provides substantial efficiencies and is in most cases the desired behavior. However, very large systems may find the substantial lock contention at the table level as multiple queries routinely trigger this lock escalation. To disable lock escalation, you can use TF 1224. See http://support.microsoft.com/kb/323630/en-us for more information.

q Manually set speed setting for each NIC.

If you have a 1 Gbps or 10 Gbps network, you do not want the NICs negotiating down.

q Enable jumbo frames on the network.

Enabling this will enable more efficient network communication between servers (this assumes that the data center network switches supports jumbo frames — which it almost certainly does). Ideally, jumbo frames should also be enabled on the other servers that communicate with the database. Enabling jumbo frames on the NIC is device specific, but can usually be accomplished by accessing the properties dialog for the network card (usually under “advanced” settings). Once this is done, you must change the Network Packet Size property in SQL Server. The easiest way to change the this property is to use Microsoft SQL Server Management Studio: right-click on the server object in the Object Explorer frame, then select Properties. In the Server Properties dialog, select the “advanced” page and you will see the Network Packet Size option under the Network heading. Change this value from 4096 (i.e. 4k) to 8192 (i.e. 8k).