To measure the performance of a common web application (a web frontend connected to a database server), which performance counters do we have to pay attention to? If we are trying to diagnose a performance problem, where do we start? What does each performance counter mean? When should I worry about a specific counter?
Well, there are hundreds of performance counters available, most of them related to others, and it can be a little tricky to select which ones to gather. The performance of any application is determined by the throughput of the following computing elements: CPU, memory, I/O and network. This also means that if an application is not performing well, the bottlenecks will be located in one or more of these elements. So, in general, these are the elements we have to monitor on any machine where our application is deployed.
Each of the machines involved in the application deployment also plays a different role, so different counters must be gathered on each one. Our sample web application uses a web frontend and a database server, so we need to pay attention to the counters specific to each of these roles.
Below is a compilation of performance counters and their meanings, gathered from different sources, that can be used as a starting point to measure the performance of your web application or to diagnose an issue in it. There are three blocks: General Counters, Web Server Counters and SQL Server Counters.
Counters to be gathered in all servers.
Processor : % Processor Time
Processor : % Total User Time
The value of this counter helps to determine the kind of processing that is affecting the system. The resulting value is the total amount of non-idle time that was spent on user-mode operations, which generally means application code.
System : % Total Privileged Time
This is the amount of time the processor was busy with kernel-mode operations. If the processor is very busy and this value is high, it usually indicates that some NT service is having difficulty, although user-mode programs that call kernel-mode NT components can occasionally cause this type of performance issue.
System : Processor Queue Length
Oddly enough, this processor counter shows up under the System object, but not without good reason. There is only one queue for tasks that need to go to the processor, even if there is more than one CPU.
The resulting value is a measure of how many threads are in the Ready state waiting to be processed. When dealing with queues, if the value exceeds 2 for a sustained period, you are definitely having a problem with the resource in question.
Memory : Pages/sec
The number of pages read from or written to disk to resolve hard page faults. This counter serves as a primary indicator of the types of faults that cause system-wide delays.
Although it is normal to have some spikes, this counter generally remains at or close to zero.
Memory : Page Faults/sec
This counter gives a general idea of how many times information being requested is not where the application (and VMM) expects it to be. The information must either be retrieved from another location in memory or from the pagefile.
Recall that while a sustained value may indicate trouble here, you should be more concerned with hard page faults that represent actual reads or writes to the disk. Remember that the disk access is much slower than RAM.
Memory : Page Reads/sec
This counter is probably the best indicator of a memory shortage because it indicates how often the system is reading from disk because of hard page faults. The system always uses the pagefile, even if there is enough RAM to support all of the applications, so some number of page reads will always be encountered.
A sustained value over 5 Page Reads/sec is often a strong indicator of a memory shortage. Be careful to understand what this counter is telling you: it is the number of read operations done against the disk to satisfy page faults, not the number of pages read. The number of pages read on each trip to the disk may vary, depending on the application and the proximity of the data on the hard drive. Regardless, a sustained value of over 5 is still a strong indicator of a memory problem. Remember the importance of "sustained": system operations often fluctuate, sometimes widely, so a Page Reads/sec of 24 for a couple of seconds does not mean you have a memory shortage.
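To make the idea of "sustained" concrete, here is a minimal Python sketch. It assumes the counter has already been sampled at a fixed interval into a list; the function name, the sample values and the choice of six consecutive samples are my own assumptions for illustration, not part of any monitoring tool.

```python
from typing import Sequence


def is_sustained(samples: Sequence[float], threshold: float,
                 min_consecutive: int) -> bool:
    """Return True if the samples stay above `threshold` for at least
    `min_consecutive` consecutive readings."""
    run = 0
    for value in samples:
        # Extend the current run while above the threshold, otherwise reset.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical Page Reads/sec readings, one per sampling interval.
short_spike = [1, 2, 24, 24, 1, 0, 2, 1]
real_pressure = [6, 8, 7, 9, 12, 8, 7, 6]

print(is_sustained(short_spike, threshold=5, min_consecutive=6))    # False
print(is_sustained(real_pressure, threshold=5, min_consecutive=6))  # True
```

The same check works for any of the queue or rate counters in this list; only the threshold and the length of the window change.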
Memory : Page Writes/sec
Much like Page Reads/sec, this counter indicates how many times the disk was written to in an effort to clear unused items out of memory.
The number of pages per write may vary. Increasing values in this counter often indicate a building tension in the battle for memory resources.
Memory : Available MB
This counter indicates the amount of memory that is left after nonpaged pool allocations, paged pool allocations, process working sets, and the file system cache have all taken their piece.
PhysicalDisk : Current Disk Queue Length
This counter provides a primary measure of disk congestion. Just as the processor queue was an indication of waiting threads, the disk queue is an indication of the number of transactions that are waiting to be processed.
Recall that the queue is an important measure for services that operate on a transaction basis. Just like the line at the supermarket, the queue will be representative of not only the number of transactions, but also the length and frequency of each transaction
PhysicalDisk : % Disk Time
Much like % Processor time, this counter is a general mark of how busy the disk is. You will see many similarities between the disk and processor since they are both transaction-based services
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the % Disk Time reaching 100%.
PhysicalDisk : Avg. Disk Queue Length
This counter is actually strongly related to the %Disk Time counter. This counter converts the %Disk Time to a decimal value and displays it. This counter will be needed in times when the disk configuration employs multiple controllers for multiple physical disks. In these cases, the overall performance of the disk I/O system, which consists of two controllers, could exceed that of an individual disk.
If you were looking at the %Disk Time counter, you would only see a value of 100%, which wouldn't represent the total potential of the entire system, but only that it had reached the potential of a single disk on a single controller. The real value may be 120% which the Avg. Disk Queue Length counter would display as 1.2.
The number of requests should not exceed two times the number of spindles constituting the physical disk. If the number of requests is too high, you can add additional disks or replace the existing disks with faster disks.
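The arithmetic in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration of the rule of thumb as stated above; the function names and the example numbers are assumptions of mine.

```python
def avg_queue_from_disk_time(percent_disk_time: float) -> float:
    """Express a % Disk Time value as the equivalent decimal queue length,
    e.g. a real value of 120% corresponds to 1.2."""
    return percent_disk_time / 100.0


def disk_queue_ok(avg_queue_length: float, spindles: int) -> bool:
    """Rule of thumb from the text: the queue should not exceed
    two times the number of spindles behind the physical disk."""
    return avg_queue_length <= 2 * spindles


print(avg_queue_from_disk_time(120))       # 1.2
print(disk_queue_ok(1.2, spindles=1))      # True
print(disk_queue_ok(9.0, spindles=4))      # False, 9 > 2 * 4
```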
PhysicalDisk: Disk Transfers/sec
The rates of read and write operations on the disk. Define a counter for each physical disk on the server.
Network Interface\Bytes Total/sec
Number of bytes traveling over the network interface per second. This counter only reflects the local network connection.
If this value stays below 50 percent of your available network bandwidth, the network adapter on the server running SQL Server 2000 is not likely to cause any performance bottlenecks.
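As a quick way to apply the 50 percent guideline, the sketch below converts Bytes Total/sec into a fraction of the link speed. Note the unit conversion: the counter is in bytes, while link speeds are normally quoted in bits per second. The function name and the sample figures are hypothetical.

```python
def network_utilization(bytes_total_per_sec: float,
                        link_speed_bits_per_sec: float) -> float:
    """Return the fraction (0.0 to 1.0) of the link that is in use."""
    if link_speed_bits_per_sec <= 0:
        raise ValueError("link speed must be positive")
    # The counter is in bytes; link speed is in bits, hence the factor of 8.
    return (bytes_total_per_sec * 8) / link_speed_bits_per_sec


# Hypothetical example: 5,000,000 bytes/sec on a 100 Mbit/s adapter.
usage = network_utilization(5_000_000, 100_000_000)
print(usage)          # 0.4
print(usage < 0.5)    # True, below the 50 percent guideline
```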
Counters to be gathered in the servers acting as web frontend in addition to the general counters.
Process(aspnet_wp)\% Processor Time
Process(aspnet_wp)\ Private Bytes
Process(aspnet_wp)\ Virtual Bytes
Process(aspnet_wp)\ Handle Count
.NET CLR Exceptions\# Exceps thrown / sec
The total number of managed exceptions thrown per second
As this number increases, performance degrades. Exceptions should not be thrown as part of normal processing. Note, however, that Response.Redirect, Server.Transfer, and Response.End all cause a ThreadAbortException to be thrown multiple times, and a site that relies heavily upon these methods will incur a performance penalty. If you must use Response.Redirect, call Response.Redirect(url, false), which does not call Response.End, and hence does not throw. The downside is that the user code that follows the call to Response.Redirect(url, false) will execute. It is also possible to use a static HTML page to redirect. Microsoft Knowledge Base Article 312629 provides further detail.
Threshold: 5% of RPS. Values greater than this should be investigated, and a new threshold should be set as necessary
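Because this threshold is relative to the request rate, it has to be evaluated against a requests-per-second reading taken over the same interval. A minimal sketch, assuming both values are already available as numbers (the function name and the default fraction parameter are my own):

```python
def exception_rate_too_high(exceptions_per_sec: float,
                            requests_per_sec: float,
                            fraction: float = 0.05) -> bool:
    """Return True if exceptions/sec exceeds `fraction` of requests/sec."""
    if requests_per_sec <= 0:
        # With no traffic, any thrown exception is worth a look.
        return exceptions_per_sec > 0
    return exceptions_per_sec > fraction * requests_per_sec


# At 200 requests/sec the 5% threshold is 10 exceptions/sec.
print(exception_rate_too_high(8, 200))    # False
print(exception_rate_too_high(15, 200))   # True
```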
.NET CLR Security(_Global_)\% Time in RT checks
Displays the percentage of elapsed time spent performing runtime code access security checks since the last sample. This counter is updated at the end of a .NET Framework security check. It is not an average; it represents the last observed value
.NET CLR Memory\% Time in GC
The percentage of time spent performing the last garbage collection. An average value of 5% or less would be considered healthy, but spikes larger than this are not uncommon. Note that all threads are suspended during a garbage collection
Threshold: an average of 5% or less; short-lived spikes larger than this are common. Average values greater than this should be investigated. A new threshold should be set as necessary
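Since this threshold applies to the average and not to individual spikes, it helps to look at both numbers side by side. The following sketch assumes a list of sampled % Time in GC values; the function name and the sample data are made up for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_gc_time(samples: Sequence[float],
                      avg_threshold: float = 5.0) -> Dict[str, object]:
    """Summarize % Time in GC samples: average, peak, and whether the
    average (not the peak) exceeds the threshold."""
    average = mean(samples)
    return {
        "average": average,
        "peak": max(samples),
        "investigate": average > avg_threshold,
    }


# One short spike to 20% with a healthy average of 4%.
spiky = summarize_gc_time([1, 2, 20, 1, 2, 1, 3, 2])
# No large spike, but the average sits at 7.25%.
steady = summarize_gc_time([7, 8, 6, 9, 7, 8, 6, 7])

print(spiky["investigate"])    # False
print(steady["investigate"])   # True
```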
ASP.NET\Application Restarts
The number of application restarts. Recreating the application domain and recompiling pages takes time, therefore unforeseen application restarts should be investigated. The application domain is unloaded when one of the following occurs:
Threshold: 0. In a perfect world, the application domain will survive for the life of the process. Excessive values should be investigated, and a new threshold should be set as necessary.
ASP.NET\Requests Rejected
The number of requests currently rejected.
ASP.NET\Worker Process Restarts
The number of aspnet_wp process restarts.
Threshold: 1. Process restarts are expensive and undesirable. Values are dependent upon the process model configuration settings, as well as unforeseen access violations, memory leaks, and deadlocks. If aspnet_wp restarts, an Application Event Log entry will indicate why. Requests will be lost if an access violation or deadlock occurs. If process model settings are used to preemptively recycle the process, it will be necessary to set an appropriate threshold.
ASP.NET\Request Execution Time
The number of milliseconds taken to execute the last request. In version 1.0 of the Framework, the execution time begins when the worker process receives the request, and stops when the ASP.NET ISAPI sends HSE_REQ_DONE_WITH_SESSION to IIS. For IIS version 5, this includes the time taken to write the response to the client, but for IIS version 6, the response buffers are sent asynchronously, and so the time taken to write the response to the client is not included. Thus on IIS version 5, a client with a slow network connection will increase the value of this counter considerably.
ASP.NET\Requests Current
The number of requests currently handled by the ASP.NET ISAPI. This includes those that are queued, executing, or waiting to be written to the client.
ASP.NET\Requests Queued
The number of requests currently queued.
Threshold: 0. The value of this counter should be 0. Values greater than this should be investigated
ASP.NET\Request Wait Time
The number of milliseconds that the most recent request spent waiting in the queue, or named pipe, that exists between inetinfo and aspnet_wp (see description of Requests Queued). This does not include any time spent waiting in the application queues.
Threshold: 1000. The average request should spend 0 milliseconds waiting in the queue.
ASP.NET Applications(__Total__)\Requests Total
The number of requests since the application was started
ASP.NET Applications(__Total__)\Requests/Sec
The number of requests executed per second. I prefer "Web Service\ISAPI Extension Requests/sec" because it is not affected by application restarts.
ASP.NET Applications(__Total__)\Errors Total
The sum of Errors During Preprocessing, Errors During Compilation, and Errors During Execution
ASP.NET Applications(__Total__)\Errors Total/Sec
The total of Errors During Preprocessing, Errors During Compilation, and Errors During Execution per second.
ASP.NET Applications(__Total__)\Cache API Entries
The number of entries currently in the user cache.
ASP.NET Applications(__Total__)\Cache API Hit Ratio
The total hit-to-miss ratio of User Cache requests.
ASP.NET Applications(__Total__)\Cache API Turnover Rate
The number of additions and removals to the user cache per second. A high turnover rate indicates that items are being quickly added and removed, which can be expensive.
ASP.NET Applications(__Total__)\Cache Total Entries
The current number of entries in the cache (both User and Internal). Internally, ASP.NET uses the cache to store objects that are expensive to create, including configuration objects, preserved assembly entries, paths mapped by the MapPath method, and in-process session state objects.
Note: The "Cache Total" family of performance counters is useful for diagnosing issues with in-process session state. Storing too many objects in the cache is often the cause of memory leaks.
ASP.NET Applications(__Total__)\Cache Total Hit Ratio
The total hit-to-miss ratio of all cache requests (both user and internal).
ASP.NET Applications(__Total__)\Cache Total Turnover Rate
The number of additions and removals to the cache per second (both user and internal). A high turnover rate indicates that items are being quickly added and removed, which can be expensive.
Web Service(_Total)\Current Connections
A threshold for this counter is dependent upon many variables, such as the type of requests (ISAPI, CGI, static HTML, and so on), CPU utilization, and so on. A threshold should be developed through experience.
Web Service(_Total)\ISAPI Extension Requests/sec
Used primarily as a metric for diagnosing performance issues. It can be interesting to compare this with "ASP.NET Applications\Requests/sec" and "Web Service\Total Method Requests/sec." Note that this includes requests to all ISAPI extensions, not just aspnet_isapi.dll
Counters to be gathered in the servers hosting the SQL Server service in addition to the general counters.
SQLServer:Memory Manager\Total Server Memory
The total memory in use by SQL Server.
Add memory to the server if this value is generally higher than the amount of physical memory in the server.
SQLServer:Access Methods\Full Scans/sec
The number of unrestricted full scans. These can either be base table or full index scans.
SQLServer:Buffer Manager\Buffer Cache Hit Ratio
The percentage of pages that were found in the buffer pool without having to incur a read from disk.
When this percentage is high, your server is operating at optimal disk I/O efficiency. If this value decreases over time, you might consider adding physical memory to your server.
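One simple way to tell whether the ratio "decreases over time" is to fit a straight line through the collected samples and look at the sign of the slope. This is only a sketch under the assumption that the samples are evenly spaced; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical daily averages of Buffer Cache Hit Ratio.
hit_ratio = [99.5, 99.1, 98.4, 97.8, 96.9]
slope = trend_slope(hit_ratio)
print(slope < 0)   # True, the ratio is trending downwards
```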
SQLServer:Databases Application Database\Log Growths
The total number of log growths for the selected database.
Run against your application database instance
SQLServer:Databases Application Database\Percent Log Used
The percent of space in the log that is in use.
SQLServer:Databases Application Database\Transactions/sec
The number of transactions started for the database.
Run against your application database instance.
SQLServer:General Statistics\User Connections
The number of users connected to the system.
Research any dramatic shifts in this value
SQLServer:Latches\Average Latch Wait Time
The average latch wait time, in milliseconds, for latch requests that had to wait.
If this number is high, your server might have resource limitations.
SQLServer:Locks\Average Wait Time
The average amount of wait time, in milliseconds, for each lock request that resulted in a wait.
SQLServer:Locks\Lock Waits/sec
The number of lock requests that could not be satisfied immediately and required the caller to wait before the lock was granted.
SQLServer:Locks\Number of Deadlocks/sec
The number of lock requests that resulted in a deadlock.
SQLServer:Memory Manager\Memory Grants Pending
The current number of processes waiting for a workspace memory grant.
SQLServer: SQL Statistics: Batch Requests/Sec
This counter measures the number of batch requests that SQL Server receives per second, and it generally tracks how busy your server's CPUs are.
Generally speaking, over 1000 batch requests per second indicates a very busy SQL Server, and could mean that if you are not already experiencing a CPU bottleneck, you may very well experience one soon. Of course, this is a relative number, and the bigger your hardware, the more batch requests per second SQL Server can handle.
SQLServer: SQL Statistics: SQL Compilations/Sec
How many compilations are performed by SQL Server per second
If you find that your server is performing over 100 compilations per second, you should take the time to investigate if the cause of this is something that you can control. Too many compilations will hurt your SQL Server’s performance
SQLServer: SQL Statistics: SQL Re-Compilations/Sec
Number of statement recompiles per second. Counts the number of times statement recompiles are triggered.
Generally, you want the recompiles to be low
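Several of the counters in this list come with a numeric rule of thumb, so they can be collected in one place and checked against averaged samples. The sketch below is just one possible way to do it: the dictionary only contains thresholds quoted in this post, and the Object\Counter key format, the function name and the sample values are assumptions of mine.

```python
from typing import Dict, List

# Rules of thumb quoted in this post. The key format is an assumption
# about how the collected samples are labelled.
THRESHOLDS: Dict[str, float] = {
    r"System\Processor Queue Length": 2,
    r"Memory\Page Reads/sec": 5,
    r"ASP.NET\Request Wait Time": 1000,
    r"SQLServer:SQL Statistics\Batch Requests/sec": 1000,
    r"SQLServer:SQL Statistics\SQL Compilations/sec": 100,
}


def counters_over_threshold(averages: Dict[str, float],
                            thresholds: Dict[str, float] = THRESHOLDS
                            ) -> List[str]:
    """Return the sorted names of counters whose sustained average
    exceeds its rule-of-thumb threshold. Missing counters are ignored."""
    return sorted(
        name for name, limit in thresholds.items()
        if name in averages and averages[name] > limit
    )


# Hypothetical averages taken over a load test.
sample = {
    r"System\Processor Queue Length": 4.5,
    r"Memory\Page Reads/sec": 1.2,
    r"SQLServer:SQL Statistics\SQL Compilations/sec": 140,
}
for name in counters_over_threshold(sample):
    print("investigate:", name)
```

A counter showing up here is only a starting point: as with every threshold in this list, it should be adjusted to your own hardware and workload.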
Good post, Raul. But you've made the assumption that the hardware is what is causing the bottleneck rather than the application itself. If it was the application, how would one monitor this instead and work out where the root cause is?
I'd be willing to bet that more often than not it is the code that could be improved to realise performance gains.
Saludos, by the way!
Thanks for your comment, Daniel. I wasn't trying to make any assumption about where the issue itself is. Most probably, as you mention, the application can be optimized... or maybe you can use the perf counters just to see how your application scales up.
What I was trying to say is that the bottleneck is always located in one of those HW elements, and that is why you need to monitor them. But I'm not assuming that the root cause is in the HW itself... after you find a bottleneck, you need to determine the root cause (by understanding why). Let's say you find that during your perf tests your DB server reaches 100% CPU. Of course, you need to ask yourself why; then you need to determine which DB queries cause the high CPU consumption, then where in the application those queries are executed, and only then will you be in a position to determine whether your application is behaving fine or not.