Most of what we do here at Testing Services and Labs is determine the performance limit of a customer’s application. To accomplish this, we have to stage the application on systems that we are certain can handle it. After all, we want to know the limit of the application, not of the system it’s running on. Pushing the limits typically means testing at volumes where we can’t be sure what’s going to happen or why. As the engineers running the test lab, we are often engaged when a resource constraint is suspected. The number one culprit is typically the disk. This post covers the method that we use to determine whether the disk is slow. We will discuss designing storage for high performance in a later post.
When we are asked if a disk is slow, there are a few steps we take to investigate.
Is the disk really slow?
To find the answer to this question we must measure latency. This is the amount of time that a system waits for a disk I/O request to be satisfied. Specifically, we measure two counters on the physical disk objects:
The Windows Core Team gets into greater depth regarding these measures here:
Once we capture this data, we can compare the latency that the operating system is actually experiencing against the latency the application can tolerate. Yes, applications do accept some latency. You can usually find the acceptable latency for a product by searching for “Avg. Disk sec” together with the product’s name.
For example, the acceptable latency for SQL Server 2012 is a sustained Avg. Disk sec/Read of 20 ms or less. Anything beyond that indicates a disk that is not suitable for SQL Server 2012. The TechNet Wiki has more information on this.
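As a rough illustration of the comparison described above, here is a minimal Python sketch. It assumes the latency samples have already been exported from Performance Monitor as a list of values in seconds (the unit the Avg. Disk sec/Read counter reports). The function name `sustained_breach` and the `window` parameter are hypothetical, and treating "sustained" as a run of consecutive samples over the threshold is our assumption for the example, not an official definition.

```python
from typing import Sequence


def sustained_breach(samples_sec: Sequence[float],
                     threshold_ms: float = 20.0,
                     window: int = 5) -> bool:
    """Return True if `window` consecutive samples all exceed the threshold.

    samples_sec  -- latency samples in seconds, in capture order
    threshold_ms -- acceptable latency in milliseconds (20 ms is the
                    SQL Server 2012 read figure quoted in the text)
    window       -- assumed number of consecutive slow samples that we
                    treat as "sustained" (illustrative choice)
    """
    run = 0  # length of the current streak of slow samples
    for sample in samples_sec:
        if sample * 1000.0 > threshold_ms:
            run += 1
            if run >= window:
                return True
        else:
            run = 0  # a fast sample breaks the streak
    return False


# Made-up samples: one isolated spike, then five slow samples in a row.
samples = [0.004, 0.031, 0.006, 0.025, 0.027, 0.030, 0.022, 0.026]
is_slow = sustained_breach(samples)
```

An isolated spike, like the second sample here, does not count on its own. Only the run of slow samples at the end trips the check.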
Okay, the disk really is slow. Why?
To answer this question conclusively, we need to understand the underlying physical components making up the disk. It can be as simple as a controller and a physical disk. However, in enterprise scenarios, it is most frequently a storage area network (SAN) providing a logical disk to the operating system. The following components may need to be measured within the storage area network:
*Diagnosing direct-attached or iSCSI storage is not covered in this post.
Most of the time, the resource contention resides in the storage array itself, so, as a matter of practice, we look there first. This is a vendor-specific task, but the goal is to confirm, using the vendor’s tools and measures, that the latency Windows sees is also being seen on the storage array. Storage arrays can be configured very flexibly, allowing for varying levels of performance and reliability.
If the latency is not evident in the storage array, then the next logical suspect is the host bus adapter. Here we check that bandwidth utilization does not exceed the adapter’s rated threshold. Typically, data travels over multiple paths through multiple host bus adapters from the host system to the storage array, so we verify with vendor-specific tools that none of the relevant adapters is exceeding its throughput capacity.
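The adapter check comes down to simple arithmetic once the numbers are in hand. The sketch below assumes you have already read each adapter's observed throughput and rated capacity (both in MB/s) from the vendor's tools. The function name `saturated_adapters`, the adapter names, and the 80% cutoff are illustrative assumptions, not vendor guidance.

```python
from typing import Dict, List, Tuple


def saturated_adapters(paths: Dict[str, Tuple[float, float]],
                       limit: float = 0.8) -> List[str]:
    """Return the names of adapters at or above `limit` utilization.

    paths -- adapter name -> (observed MB/s, rated capacity MB/s)
    limit -- fraction of rated capacity treated as saturated
             (0.8 is an assumed cutoff for illustration)
    """
    flagged = []
    for name, (observed, capacity) in paths.items():
        if capacity <= 0:
            raise ValueError(f"capacity for {name} must be positive")
        if observed / capacity >= limit:
            flagged.append(name)
    return flagged


# Made-up readings for two paths to the same array.
readings = {"hba0": (720.0, 800.0), "hba1": (310.0, 800.0)}
busy = saturated_adapters(readings)
```

Checking each path separately matters because traffic is not always balanced across adapters. One path can be saturated while the aggregate still looks healthy.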
Lastly, if it’s not the array and it’s not the adapter, then it must be the network. We use vendor-specific tools and techniques to confirm the throughput of the relevant host adapter ports, trunked ports between switches, and storage array ports.
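The order of elimination described in this section can be summarized in a few lines of Python. This is only a sketch of the reasoning: the function name `next_suspect` and its two boolean inputs are hypothetical, and each input stands for the outcome of a check made with vendor-specific tools.

```python
def next_suspect(array_shows_latency: bool, adapter_saturated: bool) -> str:
    """Return the component to investigate, in the order used in this post."""
    if array_shows_latency:
        # The array confirms the latency Windows sees: look there first.
        return "storage array"
    if adapter_saturated:
        # The array looks healthy but an adapter is past its throughput capacity.
        return "host bus adapter"
    # Neither the array nor the adapter explains it, so the network is left.
    return "network"


suspect = next_suspect(array_shows_latency=False, adapter_saturated=False)
```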
Alright, I know why my disk is slow. What can I do to fix it?
Look for our forthcoming post about designing storage for performance.