This SQL meditation provides a general approach to troubleshooting SQL Server I/O-related issues. My belief is that if you truly understand this, you are empowered to solve disk I/O challenges yourself. I offer to teach you how to fish. What kind of fish you catch, where you catch it, and how you catch it is your path.

 

Define slow disk performance:

First, let's define how we measure slow disk performance. The best metric we use at Microsoft, and specifically within the SQL Server support groups, is disk latency - in other words, how fast the I/O subsystem services each I/O request on average. The specific Perfmon counters that measure latency are Avg. Disk sec/Read, Avg. Disk sec/Write, and Avg. Disk sec/Transfer (cumulative of both reads and writes).
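To make the definition concrete: latency here is simply the total time spent servicing I/O requests divided by the number of requests in the interval, which is what the Avg. Disk sec/* counters report. A minimal sketch, using made-up per-I/O service times for illustration:

```python
# Average disk latency = total service time / number of I/O requests.
# The per-I/O service times below are hypothetical sample data (in seconds).
io_service_times_sec = [0.004, 0.006, 0.120, 0.005, 0.007]  # one entry per I/O

avg_disk_sec_per_transfer = sum(io_service_times_sec) / len(io_service_times_sec)
print(f"Avg. Disk sec/Transfer: {avg_disk_sec_per_transfer:.4f} sec "
      f"({avg_disk_sec_per_transfer * 1000:.1f} ms)")
```

Note how a single slow I/O (120 ms) pulls the average well above the healthy range even when most requests complete quickly - which is why we also look at how long elevated values persist.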

 

The Methodology:

Methodology Summary:

 

All you really need to isolate the problem is Performance Monitor:

Physical Disk and/or Logical Disk Object:

Avg Disk Sec/Transfer

Avg Disk Sec/Read (optional)

Avg Disk Sec/Write (optional)

Disk Bytes/Sec

Disk Read Bytes/Sec (optional)

Disk Write Bytes/Sec (optional)

SQL Server Buffer Manager Object:

Page Reads/Sec

Page Writes/Sec

Process Object:

IO Data Bytes/Sec (all instances)

 

1. Determine whether there is an I/O problem at all by examining the latency counter - Avg Disk Sec/Transfer. If the values of this counter are consistently above 15-20 ms, then you need to look at the issue further. (Occasional spikes don't count in most cases, but be sure to double-check the duration of a spike - if it lasted a minute or more, then it is more of a plateau than a spike.)
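Step 1 can be expressed as a small check: flag the disk only if latency stays above the threshold long enough to be a plateau rather than a spike. The sample data and the one-sample-per-second interval are assumptions for illustration:

```python
LATENCY_THRESHOLD_SEC = 0.015   # 15 ms; 15-20 ms is the range of concern
PLATEAU_SEC = 60                # a "spike" lasting a minute or more is a plateau

def has_latency_plateau(samples, interval_sec=1):
    """Return True if latency stays above threshold for PLATEAU_SEC or longer."""
    run = 0
    for latency in samples:
        run = run + interval_sec if latency > LATENCY_THRESHOLD_SEC else 0
        if run >= PLATEAU_SEC:
            return True
    return False

# Hypothetical Avg Disk Sec/Transfer samples, one per second:
spike = [0.005] * 100 + [0.050] * 10 + [0.005] * 100   # 10 s of high latency
plateau = [0.005] * 10 + [0.030] * 90                  # 90 s of high latency
print(has_latency_plateau(spike))    # brief spike: not flagged
print(has_latency_plateau(plateau))  # sustained plateau: flagged
```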

2. If the I/O subsystem is slow, find out whether the cause is the system being pushed beyond capacity by looking at I/O volumes - Disk Bytes/Sec, or Disk Read Bytes/Sec and Disk Write Bytes/Sec. Be sure to check with your system administrator or hardware vendor on the expected throughput for your SAN (or I/O subsystem). For example, you can push no more than about 200 MB/sec through a 2 Gb/sec HBA card or a 2 Gb/sec dedicated port on a SAN switch.
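The 2 Gb/sec → ~200 MB/sec figure is simple unit conversion: divide bits by 8 to get bytes, then allow for protocol overhead. The ~20% overhead fraction below is an assumption for illustration; your fabric may differ:

```python
def link_throughput_mb_per_sec(gigabits_per_sec, overhead_fraction=0.2):
    """Rough usable throughput of a Gb/sec link, in MB/sec."""
    raw_mb = gigabits_per_sec * 1000 / 8        # gigabits -> megabytes
    return raw_mb * (1 - overhead_fraction)     # minus assumed protocol overhead

print(link_throughput_mb_per_sec(2))  # a 2 Gb/sec HBA: roughly 200 MB/sec usable
```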

3. If the I/O subsystem is overwhelmed beyond capacity, find out whether SQL Server is the culprit (it commonly is if you are at this stage) by looking at Buffer Manager: Page Reads/Sec (the most common culprit) and Page Writes/Sec (a lot less common). If it is, you need to work with the application development teams (or the application vendor) to tune queries - add better indexes, update statistics, rewrite queries, etc.
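The link between the Buffer Manager counters and disk throughput is just page-size arithmetic: SQL Server data pages are 8 KB, so Page Reads/Sec times 8 KB gives the read volume SQL Server is requesting. A quick sketch:

```python
PAGE_SIZE_KB = 8  # SQL Server data pages are 8 KB

def pages_per_sec_to_mb_per_sec(pages_per_sec):
    """Convert Buffer Manager Page Reads/Sec (or Page Writes/Sec) to MB/sec."""
    return pages_per_sec * PAGE_SIZE_KB / 1024

print(pages_per_sec_to_mb_per_sec(20000))  # 20,000 pages/sec -> 156.25 MB/sec
```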

 

Further Comments:

In general, there are two high-level reasons why a disk I/O subsystem is slow to respond in relation to SQL Server performance (Avg. Disk sec/Transfer is high):

  1. Hardware issues: there is a misconfiguration, exceeded capacity, a driver/firmware bug, etc. with the disk I/O subsystem, and that is where the hardware vendor needs to be engaged.

  2. Query issues: SQL Server (or, in some cases, some other process) on the system is saturating the disks with I/O requests, and that is why latency is high. In this case we likely need to find the queries that are driving tons of reads (or writes) and tune them to minimize the disk I/O.

 

Therefore, before jumping to conclusions about faulty or misconfigured SANs, HBAs, ports, storage arrays, etc., I look at a few Perfmon counters to isolate the problem area. I look at Disk Bytes/sec (Physical or Logical Disk) and then compare that to Process: IO Data Bytes/Sec for the sqlservr.exe instance - or, better yet, to Buffer Manager: Page Reads/sec + Page Writes/sec. That way you can see how much of the total bytes/sec on the system is driven by the SQL Server process.

If you find that SQL Server is driving the majority of the I/O (the bytes/sec values for both counters are pretty close), then look at how high Bytes/sec is. If it is, say, 30 MB/sec while latency is high, then we clearly have problem 1 (a hardware problem), because any enterprise I/O subsystem configured for SQL Server should be able to handle such a volume. But if the bytes/sec is, say, 500 MB/sec, then we clearly have problem 2 - SQL Server is saturating the drives. I say this because, from my rough understanding, the best SANs out there can realistically provide around 150-300 MB/sec of throughput (these capacity characteristics change constantly as hardware advances). Thus, if SQL Server is asking for 500 MB/sec, which is beyond the capacity of most systems, then it is time to tune queries.

To confirm SQL Server is doing a lot of reads/writes, look at the SQLServer:Buffer Manager object and the Page Reads/Sec and Page Writes/Sec counters. If you see those values consistently above, say, 20,000 (20,000 * 8 KB page size ≈ 156 MB/sec), then clearly queries are driving high volumes of data and pushing the I/O system to or beyond capacity.
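The isolation logic above can be sketched as a simple decision function. The 150 MB/sec capacity figure and the "pretty close" tolerance of 80% are assumptions for illustration - substitute your own subsystem's rated throughput:

```python
def triage_io_problem(disk_mb_per_sec, sql_mb_per_sec,
                      latency_sec, subsystem_capacity_mb=150):
    """Classify a slow-disk scenario as a hardware or a query problem.

    disk_mb_per_sec: Disk Bytes/Sec for the whole system, in MB
    sql_mb_per_sec:  Process: IO Data Bytes/Sec for sqlservr.exe, in MB
    latency_sec:     Avg. Disk sec/Transfer
    """
    if latency_sec <= 0.015:
        return "no latency problem"
    # Is SQL Server driving the majority of the I/O ("pretty close" counters)?
    if sql_mb_per_sec < 0.8 * disk_mb_per_sec:
        return "another process is driving the I/O"
    if disk_mb_per_sec < subsystem_capacity_mb:
        return "hardware problem: low volume, high latency"
    return "query problem: SQL Server is saturating the disks"

print(triage_io_problem(30, 28, 0.040))    # low volume + high latency -> hardware
print(triage_io_problem(500, 480, 0.040))  # SQL pushing beyond capacity -> queries
```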

Keep in mind that occasionally you may encounter a scenario where someone is using an I/O subsystem with very low capacity (say, a single drive or a couple of drives). In those cases the expected throughput will be much lower.

 

Namaste!

Joseph