Cindy Gross: Small Bites of Big Data, Small Data, All Data


General Hardware/OS/Network Guidelines for a SQL Box


I have put together some general guidelines for how you want a server to be delivered to the DBA team for a new SQL Server install. You won't necessarily use all of them, but consider them a starting point for your SQL Server install standards. Statements that you will commonly need to change for your environment are in [square brackets]. Always run the SQL Server Best Practices Analyzer or an equivalent tool to check for the latest recommendations before releasing the system to production. I'm sure some of you will disagree with some of the points for various reasons, and I'm ok with that because I am very fond of saying "it depends". :-) Even so, I've found them to be a good baseline for discussion and a great starting point for standards documents.

The guidelines below are specific to SQL Server 2008/2008 R2 on Windows 2008/2008 R2.

OS Specifications (things often controlled by a Windows team)

 

·         Power saving features: If you want consistent, predictable, high performance from a SQL Server box, you need to either fine-tune the power setting parameters for each individual workload and/or time of day, or simply set the power options to high performance. Database workloads do not fit well into the default power saving profiles. [If your environment requires that you favor power savings over performance, change this statement and be aware of the impact.]

·         You should double check that your virus scanner is certified for Windows 2008 R2. Older scanners use the TDI model, but on the newer OSs a scanner needs to use the WFP model to work properly. The older type of anti-virus scanners can cause serious IO problems.
981889 A Windows Filtering Platform (WFP) driver hotfix rollup package is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;en-US;981889  
979278 Using two Windows Filtering Platform (WFP) drivers causes a computer to crash when the computer is running Windows Vista, Windows 7, or Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979278  
979223 A nonpaged pool memory leak occurs when you use a WFP callout driver in Windows Vista, Windows 7, Windows Server 2008, or in Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979223  
976759 WFP drivers may cause a failure to disconnect the RDP connection to a multiprocessor computer that is running Windows Vista, Windows Server 2008, Windows 7 or Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;976759  
Windows Filtering Platform
http://www.microsoft.com/whdc/device/network/WFP.mspx

·         Virus scanners and spyware detection should not scan SQL Server data and log files (usually mdf/ldf/ndf extensions) and other SQL related files because the scanning significantly degrades performance. [Note that this is a tradeoff with security and you must decide on performance vs. security based on your own security guidelines.]
REASON: Performance, smoother setup. See 309422 Guidelines for choosing antivirus software to run on the computers that are running SQL Server
http://support.microsoft.com/default.aspx?scid=kb;EN-US;309422

·         Firmware, BIOS, network adapter drivers, storport drivers, etc. will be at their most recent, stable versions before the server is released to the DBAs.
REASON: There are common SQL Server performance, usability, and supportability problems caused by older firmware, BIOS, network adapter drivers, etc.

·         For Windows 2008 and Windows 2008 R2 you can download a Windows storport enhancement (packaged as a hotfix). This enhancement can lead to faster root cause analysis for slow IO issues. Once you apply this Windows hotfix you can use Event Tracing for Windows (ETW) via perfmon or xperf to capture more detailed IO information that you can share with your storage team.  http://blogs.msdn.com/b/cindygross/archive/2010/07/13/windows-storport-enhancement-to-help-troubleshoot-io-issues.aspx

·         Do not install SQL Server on a domain controller (DC).
REASON:  A busy DC can take resources away from SQL Server. There are also negative security implications from installing SQL Server on a DC.

·         Grant SE_MANAGE_VOLUME_NAME to the SQL Server group to allow instant file initialization of data (but not log) files. There is a small security risk associated with this but it can greatly improve the performance of CREATE/ALTER data (but not log) files. [Decide as a company whether this performance enhancement is worth the small risk] http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx

·         Critical updates for Windows will be tested and applied ASAP after their release.
REASON: Security that affects Windows often affects SQL Server as well.

·         Resource intensive screensavers will be disabled and replaced with low resource consumption security to lock the consoles.
REASON: Performance – Resource intensive screen savers can steal resources from SQL Server.

·         Files will be secured: All copies of the data and log files as well as all copies of the backup files will be secured with access given only to those documented in the SQL Server Disaster Recovery plan.
REASON: Data and log files can be copied and attached to another instance of SQL Server, thereby exposing the information to the sysadmins of the new instance. Therefore access to these files must be very limited. However enough access must be granted to allow for recovery.

·         EFS: SQL Server will not be installed on disk that is encrypted with EFS.
REASON: 922121 You may experience decreased performance in some features of SQL Server 2005 when you use EFS to encrypt database files
http://support.microsoft.com/default.aspx?scid=kb;EN-US;922121

Storage Specifications (often configured by a Windows and/or storage team)

 

·         Battery backup must be enabled for all controllers or storage media which do write caching.
REASON: This is required by the WAL protocol to ensure stable media for SQL Server. See
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqliobasics.mspx#EYCAE

·         For SQL Server disks, performance is more important than conserving space. This means there may be what would be considered “wasted space” on a file server and that the overall cost per stored MB will be higher for a database system than for a file server. [This is a general guideline. If your environment prefers cost savings and space usage maximization over performance, change this statement.]
REASON: High performance is generally a major requirement of a database system, and is much more important than on most file systems. Higher performance requires that disk be laid out, configured, and managed in particular ways.

  • Disk alignment must be done to a multiple of 64KB. Some vendors may express a preference for a particular value, but most mainstream hardware vendors have agreed that 1024KB is acceptable. That is the default for Windows 2008+. If you use dynamic disks it is difficult to see the alignment from Windows.
    REASON: If the disk is not aligned, performance can suffer as much as 30-40% because some read/write activity may be to/from two blocks instead of one. See 929491 Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;929491 and Disk Partition Alignment Best Practices for SQL Server http://msdn.microsoft.com/en-us/library/dd758814.aspx
  • Disk allocation unit should be 64KB for SQL Server boxes.
    REASON: See
    Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx

NTFS Allocation Unit Size

When formatting the partition that will be used for SQL Server data files, it is recommended that you use a 64-KB allocation unit size for data, logs, and tempdb. Be aware however, that using allocation unit sizes greater than 4 KB results in the inability to use NTFS compression on the volume. SQL Server, although it is not recommended that you use this, does support read-only data on compressed volumes.
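The alignment rule above is simple arithmetic, so it is easy to script a check. Here is a minimal sketch in Python, assuming you already have each partition's starting offset in bytes (for example from `wmic partition get StartingOffset`). The function name and the sample offsets are my own illustration, not part of any Microsoft tool.

```python
KB = 1024


def is_aligned(starting_offset_bytes: int, boundary_kb: int = 64) -> bool:
    """Return True if the partition starting offset is a multiple of the boundary."""
    if starting_offset_bytes < 0 or boundary_kb <= 0:
        raise ValueError("offset must be >= 0 and boundary must be > 0")
    return starting_offset_bytes % (boundary_kb * KB) == 0


if __name__ == "__main__":
    # Illustrative offsets: 1024KB, and 63 sectors of 512 bytes each.
    samples = {
        "1024KB offset": 1024 * KB,
        "63 x 512-byte sectors": 63 * 512,
    }
    for label, offset in samples.items():
        status = "aligned" if is_aligned(offset) else "NOT aligned"
        print(f"{label}: {offset} bytes is {status} to 64KB")
```

If your storage vendor prefers a different boundary, pass it as `boundary_kb` instead of relying on the 64KB default.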

·         Drive Compression: Drives will not be compressed.
REASON: Compression has a big negative performance impact on SQL Server.

·         NTFS file system will be used instead of FAT or Raw partitions.
REASON: NTFS allows features such as database snapshots, online DBCC checks, instant file initialization, mount points, and additional security. It has larger file size limits (16 exabytes) than FAT (4 GBs). Raw partitions limit your recoverability options.

·         Often you will need one or more of these to achieve optimal performance for a database [Decide which of these you will deploy for each tier of storage and whether each can be requested by a DBA at server configuration time.]

1.       HBA queue depth for SQL Server is often best at 64 or 128; testing will determine the optimal value.
REASON: See Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx
HBA Queue Depth Settings

When configuring HBAs on the host, ensure that the Queue Depth is set to an optimal value. SQL Server applications are generally I/O-intensive, with many concurrent outstanding I/O requests. As a result, the default values for Queue Depth on HBAs are usually not high enough to support optimal performance. Currently the default value for the major HBA vendors is in the range of 8 to 32.

In our SQL Server testing, we have seen substantial gains in I/O performance when increasing this to 64 or even higher. It is worth noting that in these tests SQL Server was usually the only application using the storage array. It is important to discuss with your storage administrator the appropriate values for this setting, as the setting may affect other applications in sharing the same storage environment. When Queue Depth is set too low, a common symptom is increasing latency and less-than-expected throughput given the bandwidth between host/storage and the number of spindles in a particular configuration.

2.       RAID 10 or its equivalent will be used for the highest performance and best recoverability. Read-only data (no updates from users, replication, batch jobs, or anything else) can see acceptable performance on RAID 5. RAID 5 systems will have slower write performance and less recoverability but might be allowed for lower tiered systems with a signoff that high performance is not guaranteed.
REASON: RAID 10 is the fastest disk for SQL Server data and logs. It also provides the best recoverability options.

o   See Physical Database Storage Design http://www.microsoft.com/technet/prodtechnol/sql/2005/physdbstor.mspx
“For excellent performance and high reliability of both read and write data patterns, use RAID10.”

o   “RAID10 (stripe of mirrors): RAID10 is essentially many sets of RAID1 or mirrored drives in a RAID0 configuration. This configuration combines the best attributes of striping and mirroring: high performance and good fault tolerance. For these reasons, we recommend using this RAID level. However, the high performance and reliability level is the trade-off for storage capacity.”

o   RAID 10 is recommended for “Data requiring high performance for both read and write and excellent reliability while trading off storage efficiency and cost.“

3.       Follow hardware vendor recommendations for configuring the storage for a database. Often this is very different from configuring storage for non-database systems.

4.       Keep the physical disks no more than about 80% full (avoid full stroking, get closer to short stroking). Some SAN configurations may make this difficult to determine from Windows with concepts such as thin provisioning.

5.       Use multiple HBA controllers with a high throughput capacity. The same applies for other components such as switch ports, NICs, Fibre Channel array ports, storage array service processors, etc.

6.       Favor (battery backed) write cache over read cache for an OLTP system. Often 80/20 or 90/10 in favor of writes is beneficial. It is relatively easy for a busy SQL Server to flood the cache.

1.       Log writes have the lowest allowable latency of any SQL activity on an OLTP system.

2.       Write cache can help absorb checkpoint bursts as they write data to the disks.

3.       Maintenance operations can be write intensive and long running.

4.       SQL Server’s internal data organization rarely matches the physical layout of the data on disk so IO subsystem level read ahead through the read cache is rarely effective for a database.

7.       Performance will be more predictable, IO troubleshooting will be easier, and in many cases overall performance can be higher if SQL Server is isolated to an IO path not shared with other systems. If it does share with others (which is very common), it is better to share with other databases than with file servers or other systems that have different needs.

·         Data and log files will not go on the same drive (they can have the same drive letter if on different mount points). SANs often hide the physical layer behind the drive letter/mount point by mixing data on the back end but it is still important to keep them separate in case you later move to totally separate IO paths for each.

o   Where cost/benefit analysis allows, each database’s transaction log file(s) will get a separate drive.  For maximum performance this would be a truly separate IO path.
REASON: Writes to the transaction log are sequential (even if there are multiple log files for a given database, only one is written to at a time). If multiple databases’ log files, or a mix of data and log files, share one drive, the disk head has to move between them, which decreases performance.

o   Where cost/benefit analysis allows it, multiple, separate IO paths may be requested for each data file.
REASON: The more drives you can use for data files, the more the IO can be spread out to increase performance. For certain types of activities guaranteeing that the IO does not overlap for two different data files can be advantageous.

·         Before putting each SQL box into production, run SQLIO under various loads and gather data to see whether the IO subsystem is likely to be capable of handling the expected load. If you have numbers for the typical and current peak loads, use those numbers and then add some headroom (for example, test at 150% of the expected peak batch cycle).

·         Before putting each SQL box into production, run SQLIOSim to test for data correctness.

·         Disk IO Subsystem Performance Requirements:

o   Log writes will not exceed [put your values here] ms per write under the expected peak load (typical is 1-5ms for log writes)

o   Non-log writes will not exceed [put your values here] ms per write under the expected peak load (typical is 1-20ms for OLTP and 25-30ms for warehouses)

o   Reads will not exceed [put your values here] ms per read under the expected peak load (typical is 5-20ms for OLTP and 25-30ms for warehouses)

o   These are general standards; some systems may require faster IO subsystems and some may be ok with slower IO subsystems.

o   Measures of read and write latency are from the Performance Monitor counters Avg. Disk sec/Read and Avg. Disk sec/Write and/or from SQLIO as these are the speeds seen by SQL Server.
REASON: Databases require fast disk access in order to deliver high performance. Many DBA and application developer hours can be spent troubleshooting performance problems that are later tracked to slow IO.
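Once you have filled in your own values, the requirements above can be turned into a simple pass/fail check against measured latencies. The sketch below is only an illustration: the limits use the upper end of the typical OLTP ranges quoted above as placeholders, and the names `EXAMPLE_LIMITS_MS` and `latency_violations` are hypothetical. Substitute the numbers from your own standards document.

```python
from typing import Dict, List, Optional

# Placeholder limits in milliseconds (upper end of the typical OLTP ranges
# quoted above). These are assumptions; replace them with your own values.
EXAMPLE_LIMITS_MS: Dict[str, float] = {
    "log_write": 5.0,
    "data_write": 20.0,
    "read": 20.0,
}


def latency_violations(measured_ms: Dict[str, float],
                       limits_ms: Optional[Dict[str, float]] = None) -> List[str]:
    """Return one message per measured latency that exceeds its limit."""
    limits = EXAMPLE_LIMITS_MS if limits_ms is None else limits_ms
    violations = []
    for name, limit in limits.items():
        value = measured_ms.get(name)
        if value is not None and value > limit:
            violations.append(f"{name}: {value:.1f} ms exceeds {limit:.1f} ms")
    return violations


if __name__ == "__main__":
    # Performance Monitor reports these counters in seconds, so convert to ms.
    peak_load_ms = {"log_write": 0.0075 * 1000, "data_write": 12.0, "read": 8.0}
    for problem in latency_violations(peak_load_ms):
        print(problem)
```

An empty result means every measured value is within its limit. A measurement that is missing from the input is skipped rather than treated as a failure.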

 

Network Specifications

 

·         SNP/TCP Chimney settings will depend on whether your NIC vendor supports it. If you have NICs/drivers that support it, turn it on. Otherwise disable it to avoid known problems with SQL Server and other products.
REASON: Performance and usability. When TCP Chimney is enabled on a NIC that doesn’t support it, you will often see failed connectivity to SQL Server and/or dropped packets and connections that affect SQL Server. See Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008 http://support.microsoft.com/kb/951037  and 942861 Error message when an application connects to SQL Server on a server that is running Windows Server 2003: "General Network error," "Communication link failure," or "A transport-level error" http://support.microsoft.com/default.aspx?scid=kb;EN-US;942861

·         Put a firewall between your server and the internet. Block TCP port 1433, UDP port 1434, and static ports used by named instances on the perimeter firewalls but not necessarily on individual internal servers. Be careful which ports you block in the other firewalls, because SQL Server uses various ports to communicate.
REASON: Security – Hackers may try to use well known ports to attack SQL Server.

·         Open ports used by SQL Server in the individual server firewalls. [If you define a specific port range for all instances include it here.]
REASON: The DBAs have defined this range of ports as what each SQL Server uses for incoming connections.
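A quick way to confirm that the firewall rules above behave as intended is to try a TCP connection to the SQL Server port from each side of the firewall. The following is a rough sketch using only the Python standard library. The function name and the host name in the usage example are hypothetical, and it covers TCP only, so it says nothing about UDP port 1434.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical server name; 1433 is the default instance port named above.
    host, port = "sqlhost.example.com", 1433
    state = "reachable" if tcp_port_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Run from outside the perimeter firewall, you would expect "not reachable". Run from an internal application server that should be allowed through, you would expect "reachable".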

Windows Cluster

 

·         Identical hardware: While Windows 2008 clusters are not required to be identical, to have a greater chance of predictability no matter which node owns each SQL Server group it is recommended that they be configured as close to the same as possible.

·         Windows policies and rights: Windows policies and rights should be the same on all nodes.
REASON: The behavior of SQL Server must be the same on all nodes. Policies can change SQL Server behavior.

·         Mount points: Do not install SQL Server 2000 on any Windows cluster with mount points.  The mount points must have an associated drive letter and must be cluster resources in the group where SQL Server will reside. SQL Server must “depend on” all mount points that it uses.
REASON: SQL Server 2005+ supports mount points but SQL Server 2000 setup, including service packs and hotfixes, will fail when it tries to enumerate the mount points, even if they are not in the SQL Server 2000 group. For 2005+ instances, the mount points must be in the SQL Server group in order for SQL Server to access them.

·         The Cluster service account must be a login in the SQL Server instance and a simple user in the master database, but should NOT be a sysadmin role member.
REASON: Avoid elevated privileges.

·         MS DTC: Choose a consistent standard for how you configure DTC. On Windows 2008+ clusters you can have more than one DTC per cluster and there are pros/cons to various configurations on how SQL Server uses one or more of those DTCs. http://blogs.msdn.com/cindygross/archive/2009/02/22/how-to-configure-dtc-for-sql-server-in-a-windows-2008-cluster.aspx DTC must be enabled for network access. If you choose to put DTC in a group with SQL Server and choose to have a DTC failure cause the group to fail, you should be aware that DTC can cause a SQL Server failover. This may be appropriate in some environments.

·         Cluster Group Names will not contain any special characters such as <, >, ‘, “, &
REASON: Special characters in any group name may cause SQL Server setup to fail.

·         NIC names will not have any special characters or trailing spaces.
REASON: Special characters in any network name may cause SQL Server setup to fail.

·         Auto start must NOT be on for any clustered resource in a Windows cluster.
REASON: The cluster administrator needs to bring the resources online (start the services). If Windows startup has already started the service the cluster service cannot bring it online which results in errors.

·         Use SIDs in a Windows cluster. If you choose not to use Service SIDs, create unique domain groups: Each individual service installed with SQL Server needs a unique domain group created for it unless you choose the default of SIDs. The following naming standard will be used: [put your company standard here:  Examples: myserver1_instance1_SQLServer, myserver1_instance1_SQLAgent  ]     

http://technet.microsoft.com/en-us/library/ms143504.aspx

Best practices that you can use to set up domain groups and solutions to problems that may occur when you set up a domain group when you install a SQL Server 2005 failover cluster http://support.microsoft.com/default.aspx?scid=kb;EN-US;915846
REASON: On a cluster domain groups are used to manage permissions for the SQL Server components. Each service needs a unique group to reduce the attack surface.
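The naming rules earlier in this section (no special characters in cluster group names or NIC names, and no trailing spaces in NIC names) are easy to check in a script before you run SQL Server setup. This is a minimal sketch in Python. The function name is hypothetical, and the character set contains only the examples listed above, in both straight and curly quote forms, so treat it as a starting point rather than the complete list setup might reject.

```python
from typing import List

# Only the example characters named in this post; this list is an assumption
# and is not exhaustive.
FORBIDDEN_CHARS = set("<>&'\"\u2018\u2019\u201c\u201d")


def name_problems(name: str) -> List[str]:
    """Return the problems found in a cluster group name or NIC name."""
    problems = []
    bad = sorted({ch for ch in name if ch in FORBIDDEN_CHARS})
    if bad:
        problems.append("contains special characters: " + " ".join(bad))
    if name != name.rstrip():
        problems.append("has trailing whitespace")
    return problems


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for candidate in ["SQL Group 1", "SQL&Reports", "Public LAN "]:
        issues = name_problems(candidate)
        print(repr(candidate), "->", "; ".join(issues) if issues else "ok")
```

The same check applies to both kinds of names, because a trailing space is not something you want in a group name either.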

Remote Control

·         Remote control to the server will only be done when absolutely necessary. No SQL Server tools will be opened on the production server itself unless there is no other way to access the server.
REASON: It adds overhead to the server and can cause performance problems. Most access will be done from client tools installed on desktops. The overhead of the GUI interfaces is not acceptable on a production server. Some SQL Server client tools are not available on 64-bit systems and all servers will be x64 going forward.

Hardware Specifications

 

·         Expected Life: Servers are spec’d with an expectation of being in service for up to [put your policy here, 3-4 is common] years unless otherwise stated.
REASON: We need to know in advance how long the hardware is expected to stay in service so we can predict the resources needed for that time period. An uncertainty factor will be considered as well, so systems with high uncertainty in the predictions may need more expansion opportunity (have the ability to add more hardware resources as needed).

·         Architecture: All new servers will be based on x64 hardware with an x64 version/edition of Windows.
REASON: 64-bit systems allow much more memory to be used than 32-bit systems.

 

 Cindy Gross, Microsoft Dedicated Support Engineer for SQL Server and Microsoft Certified Master : SQL Server 2008

  • Very good, comphensive list! Thanks.

  • Very usefull! Thanks a lot!
