Jimmy May's Blog

SQL Server Performance, Best Practices, & Productivity

Case Study: Part 2: CXPACKET Wait Stats & 'max degree of parallelism' Option: Suppressing Query Parallelism Eliminated CXPACKET Waits & Liberated 30% of CPU

Case Study: Part 2: CXPACKET Wait Stats & 'max degree of parallelism' Option: Suppressing Query Parallelism Eliminated CXPACKET Waits & Liberated 30% of CPU

Rate This
  • Comments 7

Introduction

This is the second of a four-part series:

  1. Introduction to Query Parallelism
  2. Parallelism Suppression: Analysis, Remediation, & Results (this post)
  3. Parallel Query Identification & Remediation
  4. Summary

Executive Summary

In my first post in this series, I documented introductory information related to query parallelism in SQL Server.

Be clear!  This series is not an indictment of query parallelism. 

SQL Server Customer Advisory Team (SQL CAT) Principal Program Manager Lead Lindsey Allen recently shared the following with me, "CXPACKET waits are just a symptom of either consumer or provider is waiting on some other resources, which likely are IO, CPU, or memory grant.  It is misleading to jump to the conclusion that CXPACKET waits are a bad thing in general."

As I stated previously, properly designed reporting, DSS, & warehouse systems use query parallelism to advantage.  Yet many OLTP workloads benefit from suppressing query parallelism by reconfiguring the value for max degree of parallelism from 0 (the default value) to 1.  MVP Andrew J. Kelly stated in a session he hosted at the PASS 2008 Community Summit, "CXPACKET waits are a direct result of delays associated with inefficiencies of parallel processing".

Inefficiencies indeed.  In this post I document the following:

  • Discovery of a pathological level of CXPACKET waits on an OLTP system
  • Suppression of query parallelism by changing max degree of parallelism from 0 to 1
  • The compelling results:
      1. CXPACKET wait stats were virtually eliminated, reduced from 73.8% to 0.000002% of total waits.
      2. 30% lower levels of CPU utilization were achieved & sustained.

CXPACKET Waits: Analysis

Recently during a routine performance analysis at a customer site on an OLTP server, wait stats analysis revealed that query parallelism was a significant SQL Server bottleneck.

See my post Drum Roll, Please...The Debut of The SQL DMV All-Stars Dream Team! for syntax.

73.8% of SQL Server waits were attributable to CXPACKET waits, i.e., inefficiencies related to query parallelism sys.dm_os_wait_stats consisted of 81,781,945ms of 110,825,734ms total per day.  73.8% surpasses the threshold of 5% for a CXPACKET wait bottleneck as documented by Tom Davidson, et al. in SQL Server 2005 Waits & Queues.  (See my previous post for additional details.)

CXPACKET Waits: Remediation

We notified appropriate stakeholders & per best practice for OLTP systems set the configuration option for max degree of parallelism from 0 to 1, re-initialized wait stats, then compared before-&-after results. (Note: I don't recommend modifying this setting on production systems without thorough testing, yet this server had plenty of CPU headroom—besides which the customer had, er, ah, a rich & courageous heritage related to change management.)

Holy correlated wait stats!  After suppressing parallelism, total relevant waits per day were reduced to only 12.5% of their former levels—a reduction of almost a full order of magnitude.  CXPACKET waits were trivial, responsible for only 0.000002% compared to their previous levels, i.e., inefficiencies related to query parallelism sys.dm_os_wait_stats consisted of 2ms of a reduced total of 13,836,740ms.  The following figure documents these results.

image

Figure 1.  Total wait stats and CXPACKET waits before & after reconfiguring max degree of parallelism from 0 (the default value) to 1 (the value generally recommended for OLTP systems).

 

Anatomy of an Option Change: Same Day Before-&-After CPU Analysis

"Cool", you say, "waits are down, & CXPACKET waits are ¡hasta la vista, baby!  But, Jimmy May, what does that mean for us admins & our users, eh?"

You want change you can really believe in?  Then check out the results in terms of CPU utilization which were also compelling.  Average CPU Load (95th percentile) decreased from 37% to 26%, an enhancement in CPU of 30%.  Though CPU wasn’t previously a bottleneck on this system, this was considered a significant engagement win & demonstrates the value of compliance with best practices. 

I can't speak for you, but I'll celebrate the liberation of an additional 30% of CPU every time.  So how about a big w00t-w00t, eh?  Consider what the impact would be for a server for which CPU were a bottleneck.

The following numbers refer to the balloons in Figure 2 representing data captured on 24 September 2008. The mitigating effect the configuration change had on CPU is clear.

    1. Prior to changing the max degree of parallelism from the default value of 0 to 1, CPU typically peaks in excess of 40%.
    2. Users leave for lunch during which CPU naturally ebbs to low levels, then spikes to 36% as users return.
    3. At 1:05pm, the max degree of parallelism option is changed from 0 to 1; CPU utilization drops immediately from 37% to 26%.
    4. With max degree of parallelism set to 1, average CPU utilization in the afternoon is significantly less compared to morning levels; afternoon spikes don't even cross 30% (compare these values to those cited in balloon 1).

Note: The following three figures contain "pinkish" spikes.  These are artifacts of the monitoring utility.  They are present in the images which were given to me & were not modified in any way for this post.

image

Figure 2.  CPU before & after reconfiguring max degree of parallelism from 0 (the default value) to 1 (the value generally recommended for OLTP systems).  See the text for details.

Sustained Success

The following figures confirm the initial success was not an aberration.  The Figure 3 is from 18 September 2008, the week prior to the configuration change.   Several spikes exceed 40% & one spike exceeds 50%; Average CPU Load (95th percentile) is 37%—just as during the morning of the day of the configuration change shown in Figure 2.

 clip_image001

Figure 3.  Average CPU Load (95th percentile) is 37% one week prior to changing max degree of parallelism from 0 to 1.

The last figure is from seven days later, 25 September 2008 (the day after the max degree of parallelism was changed); CPU spikes don't cross 40%; Average CPU Load (95th percentile) is 26%—just as it was during the afternoon that max degree of parallelism was changed from 0 to 1.  Thus, after suppressing query parallelism, not only were CXPACKET waits eliminated, 30% lower levels of CPU usage were sustained.

clip_image002

Figure 4.  Average CPU Load (95th percentile) is sustained at 26% one week after changing max degree of parallelism from 0 to 1.

Next Post: Identification & Remediation of Erstwhile Parallel Queries

We've made great progress so far, yet we're far from finished.  Now that we've suppressed query parallelism, the optimizer has no choice but to execute queries serially—and thus some queries which leveraged parallelism now take longer to execute.  The next post in this series, part 3 of 4 describes:

  • Techniques used to identify erstwhile parallel queries
  • How one such a query was remediated for additional reductions in CPU & I/O

My Related Posts

Case Study: Part 1: CXPACKET Wait Stats & 'max degree of parallelism' Option: Introduction to Using Wait Stats to Identify & Remediate Query Parallelism Bottlenecks

SQL Server: 4-Step Performance Troubleshooting Methodology--Introduction

PerfMon Objects, Counters, Thresholds, & Utilities for SQL Server

Drum Roll, Please...The Debut of The SQL DMV All-Stars Dream Team!

Acknowledgements

Thanks much to Brent Downing & Tom Wempe of the State of South Dakota for their contributions.

Administrivia

Jimmy May, MCDBA, MCSE, MCITP: DBA + DB Dev | Senior Performance Consultant: SQL Server
A.C.E.: Assessment Consulting & Engineering Services
http://blogs.msdn.com/jimmymay
Performance is paramount: Asking users to wait is like asking them to leave.

This post was written with the PracticeThis.com plugin for Windows Live Writer

Leave a Comment
  • Please add 3 and 6 and type the answer here:
  • Post
  • PingBack from http://blog.a-foton.ru/index.php/2008/12/02/case-study-part-2-cxpacket-wait-stats-max-degree-of-parallelism-option-suppressing-query-parallelism-eliminated-cxpacket-waits-liberated-30-of-cpu/

  • Performance & the 20% Rule I was reading the post from Ed Glas , Make Your Performance Work Count:

  • It’s quite often at clients that I spend time working on issues revolving around parallelism.

  • when you show an example of cxpacket wait time for an OLTP, did you use a database with row versioning enabled(I mean read_committed_snapshot isolation level)? I guess that if you use row versioning, there should be a lot of cxpacket wait in an OLTP.

  • No, row versioning was not enabled.

    GG, it isn't obvious to me that versioning would generated CXPACKET waits.  The section "Understanding Concurrency Control" in this article has some interesting info:

    SQL Server 2005 Row Versioning-Based Transaction Isolation

    http://msdn.microsoft.com/en-us/library/ms345124(SQL.90).aspx

  • There is a misunderstanding here. You can reduce CPU utilization even more by turning off the computer. It should be understood that parallel processing is inherently "inefficient", but you use it because it executes faster.

    CXPACKET waits don't necessarily indicate a problem, and neither does high CPU utilization. Setting MAXDOP to 1 is essentially throttling your server. Having said that, a high CXPACKET count may indicate that parallelism is being used inefficiently (meaning no performance gain for the extra CPUs). This is usually the case if you have a server with more than about half a dozen CPUs and MAXDOP is set for the server at zero. In an OLTP environment MAXDOP of zero may be inefficient on just 4 CPUs, because SQL Server, when executing a parallel exec plan takes all available (idle at that moment) CPUs up to MAXDOP. It does not calculate the most efficient MAXDOP for a given query (at least it didn't with 2005; don't know if it changed with 2008).

    This is still an oversimplification, but the bottomline is, if your OLTP transaction executes faster as measured by ET (not CPU time) with a parallel exec plan, then it should be allowed to execute that way if the CPU is not totally maxed out to where other processes are waiting for the CPU (the solution to this problem is more CPUs; MAXDOP = 0 is a band-aid). My guess is that the system in this example could probably benefit from MAXDOP > 1 and MAXDOP != 0. My own testing with 2005 has indicated that there is a point MAXDOP > 1 where performance degrades; in the environment I tested with a complex query on a very large table it was consistently around 5 CPUs, don't know if it is better with 2008, a different test query or a different server.

  • Please re-read the post. I think you'll see there's no misunderstanding. Indeed, your experience provides validation.  Thanks much.

Page 1 of 1 (7 items)