My name is HarshDeep Singh, and I am a SQL Server engineer with Microsoft CSS. I've been working on SQL Server for quite some time, and the intention behind starting this blog is to try and share my knowledge with the SQL Server community. Any feedback or suggestions on the blog are both welcome and solicited. Please feel free to drop me a line on firstname.lastname@example.org anytime.
One of the most common reason for server performance issues with respect to SQL Server is the presence of an I/O bottleneck on the system. When I say I/O bottleneck, it can mean issues like slow disks, other processes hogging I/O, out-dated drivers, etc. In this blog, I will seek to outline the approach for identifying and troubleshooting I/O bottlenecks on SQL Server.
The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:
If you see the symptoms outlined above quite frequently on your SQL Server installation, then it will be safe to draw the conclusion that your instance is suffering from a disk subsystem or I/O bottleneck. Let's look at the data collection and troubleshooting approach pertaining to the same:
I normally recommend my clients to run the perfmon collector set for at least one business day, so that it has captured data for the load exerted by at least one standard business cycle.
Now that we have the data, we can start the analysis. After stopping the collector set, you can open the blg file generated (the path is displayed under the output column, on the right hand side in perfmon) using perfmon (a simple double click works, as the file type is associated with perfmon by default). Once open, is should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.
I will list out the important counters here, along with their expected values:
It's important to keep 2 things in mind. One, make sure your data capture is not skewed or biased in any way (for example, do not run a capture at the time of a monthly data load or something). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.
Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is "But, it's been working just fine for x years, how can it create a bottleneck now?". The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it's inevitable that the business grows as a whole, and so do the number of transactions, as well as the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come to office one fine day, see high latency on the disks during normal working hours, and are sure that
Then it's highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause that can cause high latency is that your disk drivers and/or BIOS are out of date. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.
Hope this helps. As always, comments, feedbacks and suggestions are welcome.