How do you measure CPU pressure?


It is important to understand whether CPU pressure is affecting SQL Server performance.  This is true even when SQL Server is the only application running on a particular box.  The System object Perfmon counter Processor Queue Length is not necessarily an effective way of measuring CPU pressure in SQL Server.  To see why, we first need to take a brief (and simplified) look at the SQL Server execution model.

 

SQL Server uses a User Mode Scheduler (UMS) to control the execution of SQL Server user requests (SPIDs or session_ids).  The UMS does not replace the Windows scheduler; rather, it manages the execution of SQL Server requests without returning control to Windows.  So when SQL Server gets its time slice from the Windows scheduler, the SQL Server UMS decides which user requests run during that time.   In a 4-proc scenario, there will be four User Mode Schedulers, one for each CPU.  Each UMS uses a number of constructs (queues, lists and worker threads) to govern execution.  At any given time, each UMS has at most one running request, a runnable queue of requests that are waiting for CPU, a waiter list (for resources such as IO, locks and memory), and a work queue (user requests that are waiting for worker threads).
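
On SQL Server 2005 you can see these per-scheduler structures directly.  Here is a minimal sketch (not part of the original text) against the sys.dm_os_schedulers DMV, where runnable_tasks_count corresponds to the runnable queue and work_queue_count to the work queue:

select scheduler_id
          ,current_tasks_count     -- tasks currently assigned to this scheduler
          ,runnable_tasks_count    -- tasks waiting in the runnable queue for CPU
          ,work_queue_count        -- requests waiting for a worker thread
          ,pending_disk_io_count   -- pending IOs queued on this scheduler
from sys.dm_os_schedulers
where scheduler_id < 255           -- visible user schedulers only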

 

The runnable queue can be likened to the checkout lines at a grocery store.  The register clerk is the CPU, and there is just one customer checking out, i.e. “running”, at any given register.  The time spent in the checkout line represents CPU pressure: the longer the line, the longer the waits, hence more CPU pressure.

 

OK, back to SQL Server.  Assume the following for a single UMS:  SPID 51 is currently running.  The runnable queue consists of SPIDs 60, 55, 87 & 79.  The waiter list includes SPIDs 52, 84 & 72.  The work queue is empty.  Now, assume the running SPID 51 needs physical IO.  Two things happen: SPID 51 is moved to the waiter list until the IO completes, and the next session_id in the runnable queue, SPID 60, begins to run.  When the IO completes, SPID 51 is moved to the bottom of the runnable queue, which now consists of SPIDs 55, 87, 79 & 51.
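
On SQL Server 2005 you can watch a snapshot of this movement: sys.dm_exec_requests reports each request's state ('running', 'runnable' for the runnable queue, 'suspended' for the waiter list).  A sketch, assuming user requests have session_ids above 50:

select session_id
          ,scheduler_id
          ,status        -- running | runnable (runnable queue) | suspended (waiter list)
          ,wait_type     -- the resource being waited for when suspended
          ,wait_time     -- ms spent in the current wait
from sys.dm_exec_requests
where session_id > 50    -- skip system sessions
order by scheduler_id, status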

 

Given this scenario, the total time spent waiting consists of resource waits and signal waits.  The time spent waiting for a resource is shown as resource waits; the time spent waiting in the runnable queue for CPU is called signal waits.  In SQL Server 2005, waits are exposed in the Dynamic Management View (DMV) sys.dm_os_wait_stats.  The query to measure CPU pressure is as follows:

 

---- Total waits are wait_time_ms; signal waits are the portion spent waiting in the runnable queue for CPU
Select signal_wait_time_ms=sum(signal_wait_time_ms)
          ,'%signal (cpu) waits' = cast(100.0 * sum(signal_wait_time_ms) / sum(wait_time_ms) as numeric(20,2))
          ,resource_wait_time_ms=sum(wait_time_ms - signal_wait_time_ms)
          ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum(wait_time_ms) as numeric(20,2))
From sys.dm_os_wait_stats

You can initialize (clear out) the SQL Server 2005 wait statistics with dbcc sqlperf ([sys.dm_os_wait_stats], clear) with no_infomsgs.  In SQL Server 2000, waits can be seen with DBCC SQLPERF (WAITSTATS).
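
Putting these together, one way to measure signal waits over a controlled interval rather than since instance startup is the following sketch (note that clearing the stats discards the accumulated history, so use it with care on a server others are monitoring):

dbcc sqlperf ([sys.dm_os_wait_stats], clear) with no_infomsgs   -- reset accumulated waits
waitfor delay '00:01:00'                                        -- sample for one minute

select signal_wait_time_ms=sum(signal_wait_time_ms)
          ,'%signal (cpu) waits' = cast(100.0 * sum(signal_wait_time_ms) / sum(wait_time_ms) as numeric(20,2))
from sys.dm_os_wait_stats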

 

If you have a SQL-dedicated box, it is conceivable that you could have high signal waits indicating CPU pressure and still have a low processor queue length.   In such cases, a faster CPU could reduce signal waits, but focusing on the low processor queue length as the measure of CPU contention may lead to the erroneous conclusion that CPU is fine.

 

In conclusion, if signal waits are a significant percentage of total waits, you have CPU pressure, which may be alleviated by faster or more CPUs.  Alternatively, CPU pressure can be reduced by eliminating unnecessary sorts (indexes can avoid sorts in ORDER BY and GROUP BY), joins, and compilations (and re-compilations).  If signal waits are not significant, a faster CPU will not appreciably improve performance.
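
If signal waits do look significant, a natural follow-up is to see which wait types carry them.  A sketch against sys.dm_os_wait_stats; the excluded idle wait types are illustrative, not an exhaustive list:

select top 10 wait_type
          ,wait_time_ms
          ,signal_wait_time_ms
          ,'%signal' = cast(100.0 * signal_wait_time_ms / wait_time_ms as numeric(20,2))
from sys.dm_os_wait_stats
where wait_time_ms > 0
  and wait_type not in ('LAZYWRITER_SLEEP','SQLTRACE_BUFFER_FLUSH'
                       ,'WAITFOR','BROKER_TASK_STOP')            -- illustrative idle waits only
order by signal_wait_time_ms desc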

 

Tom Davidson

Comments
  • Tom,

    Very interesting and timely in our case! We are right now looking at CPU utilization in our organization and this is sure helpful. One question though, I know it’s hard to give specifics, but in orders of magnitude how much is a significant percentage? Is 80% too high, 50%...more than 30%?

    Thanks for the insight!

    Darryl
  • Hello Darryl,
    It is difficult to say what CPU utilization is the right number. For example, if you are running an active/active cluster then you definitely want to run at less than 45% CPU so that when you fail over you have no problem handling the workload. A better approach is to use CPU pressure as a guide, as Tom Davidson states above. Also, you should look at what I call good CPU vs. bad CPU. CPU time is either user time (good) or privilege time (also known to some as kernel time). User time is the CPU doing actual work, like sorting. Privilege time is CPU wait time: you could be waiting on IO completion, context switching, or some other activity outside of user time.
    So, look at user time and look at CPU pressure as a better guide. Also, don't run at 100% :)
  • Mark,

    Thanks for the update and the additional information. I will definitely use this for additional analysis. However, my original comment was misleading, my bad. What I was actually talking about was CPU Pressure, not CPU utilization. Tom writes in his post “if Signal Waits are a significant percentage of total waits” then you may be experiencing CPU Pressure. This is great information, but I am not sure what “significant” means with respect to Signal Waits and Total waits. For example, monitoring my SQL Server with DBCC SQLPERF (WAITSTATS), I see that 31% of my total waits are Signal Waits. How do I tell if that is significant for my system?

    Thanks again guys!

    Darryl
  • Darryl --

    The percentage of the total wait time is not very helpful in deciding whether signal wait is significant. To assess that, compare total signal wait time to the total processing time available per CPU in the interval.

    You will need to look at the total signal wait time on the system during a specific sample period of time (say, 1 minute or 5 minutes). If you see Signal Wait Time total 100% of the sample time (say, 60,000 msec during a 60-second sample), that tells you that you could fill up an additional CPU with work. If it is 200% then you could probably put 2 additional CPUs to work. Under 10% is probably below the radar.

    Stuart
  • Hi Darryl,

    Signal waits represent the time everyone waits for CPU and are thus unavoidable to some extent in a high concurrency system.

    When you compare signal to total wait time this just establishes how much of your total is completely dependent on CPU availability as opposed to other resources.

    The best questions to ask when looking at CPU waits are:
    1. How much cpu latency is acceptable? Say your signal waits are 30% for a given INTERVAL of time. Is it acceptable to have performance variations that AVERAGE 30% based on CPU? What about the interval? If you measure the busiest 15 minute interval, you may have high signal waits. But if you measure over a longer period say 2 hours, you may have a much lower average.

    2. Is the total ms or percentage significant?
    For example, 50% looks like a high percentage but if the number of milliseconds is very small to start with, you discount the significance of it.

    The old sanity test still applies as to whether the percentage is significant or not. It is an interesting question, but absolute numbers are hard to give because you have to factor in both significance and the time interval.

    Regards,
    Tom Davidson
  • For SQL 2000, would the above script be:

    [code]
    -- Capture a one-minute sample of SQL Server 2000 wait statistics
    create table #tempwaits
    (type varchar(40),
    requests numeric(19,3),
    waittime numeric(19,3),
    signalwaittime numeric(19,3))

    dbcc sqlperf (waitstats, clear)   -- reset the counters
    waitfor delay '000:01:00'         -- sample for one minute

    insert into #tempwaits
    exec ('dbcc sqlperf (waitstats)')

    -- exclude the final 'Total' summary row returned by WAITSTATS so the sums are not doubled
    select sum(signalwaittime) as "Signal Wait Time",
    100.0 * (sum(signalwaittime)/sum(waittime)) as "%cpu waits",
    sum(waittime - signalwaittime) as "Resource Wait Time",
    100.0 * (sum(waittime - signalwaittime)/sum(waittime)) as "% Resource Waits"
    from #tempwaits
    where type <> 'Total'

    drop table #tempwaits
    [/code]
  • On a related note, I've heard much conflicting information about Intel CPUs with their Hyperthreading architecture versus SQL Server. Whether to leave it on or disable it, etc. Can you address this in a future posting?
  • And what about the deviation caused by wait types such as LAZYWRITER_SLEEP? Clearly, this wait should not be included in the calculation (as you can read in the BOL description of the sys.dm_os_wait_stats view).

    My question is: how many other wait types should be excluded from this calculation? Perhaps also waits related to backup devices?
