Today's Annoyingly-Unwieldy Term: "Intra-Query Parallel Thread Deadlocks"


I've received a couple of questions in email and in comments about deadlocks involving mysterious-sounding non-lock resources like "exchangeEvent" and "threadpool".  There are a couple of examples in the comments for post http://blogs.msdn.com/bartd/archive/2006/09/25/deadlock-troubleshooting-part-3.aspx, and here's a forum post on the topic: http://forums.microsoft.com/Forums/ShowPost.aspx?PostID=3913233&SiteID=1.  

 

Here's one example (note that I've omitted the "inputbuf" and "executionStack" nodes for the sake of brevity and clarity):

 

__________________________________________________________________

deadlock-list

deadlock victim=process38316d8

 

 process-list

  process id=process3808478 schedulerid=1 kpid=216 status=suspended  spid=51 sbid=0 ecid=8 

 

  process id=process3809ac8 schedulerid=1 kpid=5672 status=suspended spid=51 sbid=0 ecid=17

  process id=process38136d8 schedulerid=2 kpid=5644 status=suspended spid=51 sbid=0 ecid=16

  process id=process3813828 schedulerid=2 kpid=6064 status=suspended spid=51 sbid=0 ecid=9 

  process id=process381c478 schedulerid=3 kpid=5292 status=suspended spid=51 sbid=0 ecid=10

  process id=process381d2e8 schedulerid=3 kpid=4372 status=suspended spid=51 sbid=0 ecid=19

  process id=process38265c8 schedulerid=4 kpid=5552 status=suspended spid=51 sbid=0 ecid=11

  process id=process3827ac8 schedulerid=4 kpid=5716 status=suspended spid=51 sbid=0 ecid=18

  process id=process38309b8 waittime=609 schedulerid=5 kpid=0

  process id=process38312e8 schedulerid=5 kpid=3204 status=suspended spid=51 sbid=0 ecid=6 

  process id=process38316d8 schedulerid=5 kpid=5108 status=suspended spid=51 sbid=0 ecid=13

  process id=process383a718 schedulerid=6 kpid=5216 status=suspended spid=51 sbid=0 ecid=7 

  process id=process383ada8 waittime=609 schedulerid=6 kpid=0

  process id=process383beb8 schedulerid=6 kpid=5852 status=suspended spid=51 sbid=0 ecid=14

  process id=process3845588 schedulerid=7 kpid=6096 status=suspended spid=51 sbid=0 ecid=15

  process id=process38456d8 schedulerid=7 kpid=760 status=suspended  spid=51 sbid=0 ecid=0 

  process id=process3845c18 schedulerid=7 kpid=5992 status=suspended spid=51 sbid=0 ecid=12

 

 resource-list

  threadpool id=scheduleree6080

   owner-list

    owner id=process38316d8

    owner id=process38312e8

   waiter-list

    waiter id=process38309b8

  exchangeEvent id=port80140950 nodeId=9

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=consumer id=process3813828

    waiter event=e_waitPortOpen type=consumer id=process3808478

    waiter event=e_waitPortOpen type=consumer id=process381c478

    waiter event=e_waitPortOpen type=consumer id=process38265c8

    waiter event=e_waitPortOpen type=consumer id=process3845c18

    waiter event=e_waitPortOpen type=consumer id=process38316d8

    waiter event=e_waitPortOpen type=consumer id=process383beb8

    waiter event=e_waitPortOpen type=producer id=process3845588

    waiter event=e_waitPortOpen type=producer id=process38136d8

    waiter event=e_waitPortOpen type=producer id=process3809ac8

    waiter event=e_waitPortOpen type=producer id=process3827ac8

    waiter event=e_waitPortOpen type=producer id=process381d2e8

  exchangeEvent id=port80140690 nodeId=5

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=consumer id=process38456d8

  exchangeEvent id=port80140c10 nodeId=12

   owner-list

    owner event=pending id=process383ada8

    owner event=pending id=process38309b8

   waiter-list

    waiter event=e_waitPortOpen type=producer id=process38312e8

    waiter event=e_waitPortOpen type=producer id=process383a718

  threadpool id=scheduleref6080

   owner-list

    owner id=process383beb8

    owner id=process383a718

   waiter-list

    waiter id=process383ada8

__________________________________________________________________

 

"exchangeEvent" Deadlock Resources

Some terminology, to better understand the trace flag 1222 deadlock output shown above:

  • spid = system process ID, AKA "session_id" -- to oversimplify slightly, this represents a connection to SQL
  • sbid = system batch ID, also called "request_id" -- a query that a spid is running
  • ecid = execution context ID -- a worker thread running part of a query

 

 

There's one thing that you should note about this deadlock right off the bat: all of the participants (the "process" nodes in the -T1222 output) are from the same session identifier (spid) and the same batch (sbid).  They each have a different thread ID (kpid and ecid).  Each "process" entry in this deadlock represents a different worker thread, but all of these worker threads are running part of a single large parallel query that was submitted by spid 51. 
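
If you catch one of these queries while it's still running, you can see the same spid/ecid relationship live in the task DMVs.  Here's a minimal sketch (SQL 2005 and later); session_id 51 is just the example spid from the output above:

  -- List every worker thread (ecid) a single session is running, and what each one is waiting on.
  SELECT t.session_id      AS spid,
         t.request_id      AS sbid,
         t.exec_context_id AS ecid,
         t.scheduler_id,
         t.task_state,
         wt.wait_type,
         wt.wait_duration_ms
  FROM sys.dm_os_tasks AS t
  LEFT JOIN sys.dm_os_waiting_tasks AS wt
         ON wt.waiting_task_address = t.task_address
  WHERE t.session_id = 51
  ORDER BY t.exec_context_id;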

 

Another interesting thing about this deadlock is in the resource list: most deadlocks involve lock resources ("pagelock", "keylock", etc), but this one only deals with "exchangeEvent" and "threadpool" resources. 

 

Deadlocks centering around exchangeEvent resources have been given the name "intra-query parallelism deadlock".  (I know -- it just drips 'sexy', doesn't it?)  They may be accompanied by this error message -- sent to the client app only, not logged in the SQL errorlogs:

Server: Msg 8650, Level 13, State 1, Line 1 Intra-query parallelism caused your server command (process ID #51) to deadlock. Rerun the query without intra-query parallelism by using the query hint option (maxdop 1).

An "exchangeEvent" resource indicates the presence of parallelism operators in a query plan.  The idea is that the work for an operation like a large scan, sort, or join is divided up so that it can be executed on multiple child threads.  There are "producer" threads that do the grunt work and feed sets of rows to "consumers".  Intra-query parallel requires signaling between these worker threads: the consumers may have to wait on producers to hand them more data, and the producers may have to wait for consumers to finish processing the last batch of data.  Parallelism-related waits show up in SQL DMVs as CXPACKET or EXCHANGE wait types (note that the presence of these wait types is normal and simply indicates the presence of parallel query execution -- by themselves, these waits don't indicate that this type or any other type of deadlock is occurring).

 

Wherever you have threads waiting for resources, there is a risk that they will end up in a circular blocking chain (thread A holding resource X and waiting for resource Y, while thread B holds resource Y and waits for resource X).  The synchronization objects used in parallel query execution are no exception; in rare cases, the threads running a single query can end up deadlocking with one another.  Most intra-query parallelism deadlocks are considered bugs, although some of them can be risky bugs to fix, so a fix may not always be possible.  If you run into one and you're already on the latest SQL service pack, your best bet may be to investigate workarounds.  Luckily, this type of deadlock is relatively uncommon, and in most cases it's possible to work around the problem by eliminating parallelism in the query.  Try one of these two approaches:

 

Workaround #1: Add an index or improve the query to eliminate the need for parallelism.  In most cases, the use of parallelism in a query indicates that you have a very large scan, sort, or join that isn't supported by proper indexes.  If you tune the query, you will often find that you end up with a much quicker and more efficient plan that doesn't use parallelism, and therefore isn't subject to this type of problem. Of course, in some queries (DSS/OLAP-type queries, in particular) it may be difficult to eliminate all large scans.
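
As a purely hypothetical illustration of Workaround #1 (the table, columns, and index name below are made up, not taken from the deadlock above), a covering index can let the optimizer seek instead of running a large parallel scan:

  -- Hypothetical example only: a covering index that supports the join and filter,
  -- so the optimizer no longer needs a large parallel scan.
  CREATE INDEX IX_Orders_CustomerId_OrderDate
      ON dbo.Orders (CustomerId, OrderDate)
      INCLUDE (TotalDue);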


Workaround #2:  Force single-threaded execution with an "OPTION (MAXDOP 1)" query hint at the end of the query.  If you can't modify the query, you can apply the hint to any query with a plan guide (assuming that you're running SQL 2005 or later).
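
Here's a rough sketch of both forms of Workaround #2; the statement text and guide name are placeholders, not the query from the deadlock above:

  -- (a) If you can edit the query, append the hint:
  SELECT CustomerId, SUM(TotalDue)
  FROM dbo.Orders
  GROUP BY CustomerId
  OPTION (MAXDOP 1);

  -- (b) If you can't edit the query (SQL 2005+), attach the hint with a plan guide.
  --     For a SQL-type guide, @stmt must match the submitted statement text exactly.
  EXEC sp_create_plan_guide
       @name            = N'Guide_ForceMaxdop1',
       @stmt            = N'SELECT CustomerId, SUM(TotalDue) FROM dbo.Orders GROUP BY CustomerId',
       @type            = N'SQL',
       @module_or_batch = NULL,
       @params          = NULL,
       @hints           = N'OPTION (MAXDOP 1)';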

 

 

"threadpool" Deadlock Resources

A process waiting for a "threadpool" resource is actually waiting for a worker thread.  There are a finite number of threads in SQL's thread pool, and if they are all in use, new requests must wait for an in-progress task to complete and free up a thread.  Thread pool waits (in DMVs, "THREADPOOL" or "UMSTHREAD" waittype) are typically a side effect of a massive resource contention problem -- most commonly, a large blocking chain.  You should investigate what is tying up all of your worker threads, and eliminate that bottleneck.  While it's not clear from this deadlock output alone, I suspect that in this case there may have been many other large untuned queries using lots of parallel threads, so excessive parallelism itself may have been the cause of the thread starvation. 
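
If you suspect worker thread starvation on a live system, a couple of quick checks (just a sketch) are the THREADPOOL waiters and the server's worker thread ceiling:

  -- Minimal sketch: tasks currently queued waiting for a worker thread...
  SELECT session_id, wait_duration_ms, wait_type
  FROM sys.dm_os_waiting_tasks
  WHERE wait_type = N'THREADPOOL';

  -- ...and the worker thread ceiling (the "max worker threads" setting, or the
  -- default calculated from the number of CPUs).
  SELECT max_workers_count
  FROM sys.dm_os_sys_info;

Keep in mind that if starvation is severe enough, a normal connection may not be able to run even this; the dedicated admin connection (DAC) can help in that case.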

 

 

 

"resourceWait" Deadlock Resources

(UPDATE: The following info on "resourceWait" was added to this post 15 Oct 2009.)

A process waiting for a "resourceWait" resource is waiting for a "resource semaphore".  Resource semaphores are typically used to govern memory used for query sorts and hashes.  So the following wait graph:

resource-list
   keylock hobtid=72057594038845440 dbid=6 objectname=XXXTABLE indexname=YYYINDEX id=lockffffffff81314cc0 mode=X associatedObjectId=72057594038845440
    owner-list
     owner id=processebb108 mode=X
    waiter-list
     waiter id=processebae38 mode=S requestType=wait
   resourceWait
    owner-list
     owner id=processebae38
    waiter-list
     waiter id=processebb108

could be read as:

     Spid A is waiting for a shared key lock, but is blocked by Spid B, who holds an exclusive lock on this key. 
     Spid B is waiting for more memory to run his query (and, eventually, to release the X key lock), but he is blocked by other spids, including Spid A, who are currently holding all of the memory available for this type of operation. 

 

You could tackle this by looking for tuning opportunities in the queries run by both deadlock participants.  They are probably running a query plan that involves a hash or sort operation.  Remove this through indexing or query changes and you should eliminate the query's need to wait for a query memory grant.  You could also try throwing RAM at the problem, but keep in mind that query workspace memory, the memory used for sorts and hashes, must be drawn from "visible buffer pool".  On a 32-bit box, visible bpool is limited to approximately 2GB (3GB if you are running with /3GB).  If SQL already has this much memory available to it, adding more won't help. 
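
To see who is waiting on, and who is holding, query workspace memory, the memory grant DMVs are the place to look.  A minimal sketch (sys.dm_exec_query_memory_grants is available in SQL 2005 and later; the resource semaphore view requires SQL 2008 or later):

  -- Requests waiting for a workspace memory grant
  -- (grant_time is NULL while a request is still waiting)...
  SELECT session_id, request_time, grant_time,
         requested_memory_kb, granted_memory_kb, query_cost
  FROM sys.dm_exec_query_memory_grants
  ORDER BY request_time;

  -- ...and the overall state of the resource semaphores themselves (SQL 2008+).
  SELECT resource_semaphore_id, total_memory_kb, available_memory_kb,
         grantee_count, waiter_count
  FROM sys.dm_exec_query_resource_semaphores;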

 

 

 

 

Caveats

Just because you see “exchangeEvent” resources in your deadlock graph doesn’t necessarily mean that you are facing an intra-query parallelism deadlock.  Sometimes the engine includes extraneous resources in the deadlock graph.  This makes it important to find out how the waiters relate to one another so that you can determine which of the resources is an essential part of the circular blocking chain.   The post http://blogs.msdn.com/bartd/archive/2006/09/09/Deadlock-Troubleshooting_2C00_-Part-1.aspx steps you through a deconstruction of -T1222 output so that you can get a clearer understanding of the relationships.  As a rule of thumb, if there are any lock resources in your deadlock output (pagelock, keylock, rowlock) along with the exchangeEvent resources, you should suspect that the exchangeEvent resources are non-essential and that you are probably facing a “normal” deadlock. 

 

 

If you're interested in more background info on parallel query execution, there's a great presentation by Craig Freedman attached to this blog post: http://blogs.msdn.com/craigfr/archive/2007/04/17/parallel-query-execution-presentation.aspx

 

Comments

  • I have seen a few users out there who think that deadlocks in SQL Server are a bug that has not been corrected.

  • Hi Bart,

    I am experiencing a strange parallelization deadlock and I cannot pinpoint what's going on.

    I have a query that is updating a table using a self-join, and it is causing a parallelization deadlock -- but not always.

    This is the query along with the errorlog with flags 1222 and 1204 enabled. MAXDOP on the server is 4.

    update cf

    set cf.salednP = (case when b.handn=0 then 0 else (1. * cf.saledn / b.handn) * 100 end)  

    from ##CalcFacts_31478 cf

    join ##CalcFacts_31478 b

    on (cf.MBDID=b.MBDID and cf.NumericDistributionsBase=b.ItemLineID)

    Deadlock encountered .... Printing deadlock information

    Wait-for graph

    NULL

    Node:1  

    NULL

    Node:2  

    NULL

    Node:3  

    NULL

    Node:4  

    Task waiting for thread (first in queue) on scheduler 0000000000F06080:

    Task 0000000003C1D198

    NULL

    Node:5  

    NULL

    -- next branch --

    NULL

    NULL

    Node:5  

    NULL

    -- next branch --

    NULL

    NULL

    Node:2  

    NULL

    -- next branch --

    NULL

    NULL

    Node:1  

    deadlock-list

    deadlock victim=process3c1d2e8

     process-list

      process id=process3c09d68 taskpriority=0 logused=20006 waittime=2781 schedulerid=1 kpid=13036 status=suspended spid=54 sbid=0 ecid=6 priority=0 transcount=0 lastbatchstarted=2009-11-23T10:28:25.257 lastbatchcompleted=2009-11-23T10:28:25.257 clientapp=.Net SqlClient Data Provider hostname=ACN357NICVPS14 hostpid=6180 isolationlevel=read committed (2) xactid=53844564 currentdb=5 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

       executionStack

        frame procname=unknown line=1 sqlhandle=0x02000000ed34c514d3a585aacc0312a34bfdb8088c95d7c3

    unknown    

        frame procname=unknown line=1 sqlhandle=0x0400ff7fbe80662601000000000000000000000000000000

    unknown    

        frame procname=unknown line=848 stmtstart=75496 stmtend=75578 sqlhandle=0x03000500b982ee5ad3c102018d9c00000100000000000000

    unknown    

       inputbuf

      process id=process3c12b08 taskpriority=0 logused=20006 waittime=2781 schedulerid=2 kpid=13092 status=suspended spid=54 sbid=0 ecid=5 priority=0 transcount=0 lastbatchstarted=2009-11-23T10:28:25.257 lastbatchcompleted=2009-11-23T10:28:25.257 clientapp=.Net SqlClient Data Provider hostname=ACN357NICVPS14 hostpid=6180 isolationlevel=read committed (2) xactid=53844564 currentdb=5 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

       executionStack

        frame procname=unknown line=1 sqlhandle=0x02000000ed34c514d3a585aacc0312a34bfdb8088c95d7c3

    unknown    

        frame procname=unknown line=1 sqlhandle=0x0400ff7fbe80662601000000000000000000000000000000

    unknown    

        frame procname=unknown line=848 stmtstart=75496 stmtend=75578 sqlhandle=0x03000500b982ee5ad3c102018d9c00000100000000000000

    unknown    

       inputbuf

      process id=process3c1d198 waittime=2781 schedulerid=3 kpid=0

      process id=process3c1d2e8 taskpriority=0 logused=20006 waittime=2781 schedulerid=3 kpid=9436 status=suspended spid=54 sbid=0 ecid=4 priority=0 transcount=0 lastbatchstarted=2009-11-23T10:28:25.257 lastbatchcompleted=2009-11-23T10:28:25.257 clientapp=.Net SqlClient Data Provider hostname=ACN357NICVPS14 hostpid=6180 isolationlevel=read committed (2) xactid=53844564 currentdb=5 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

       executionStack

        frame procname=unknown line=1 sqlhandle=0x02000000ed34c514d3a585aacc0312a34bfdb8088c95d7c3

    unknown    

        frame procname=unknown line=1 sqlhandle=0x0400ff7fbe80662601000000000000000000000000000000

    unknown    

        frame procname=unknown line=848 stmtstart=75496 stmtend=75578 sqlhandle=0x03000500b982ee5ad3c102018d9c00000100000000000000

    unknown    

       inputbuf

      process id=process3c27d68 taskpriority=0 logused=20006 waittime=2781 schedulerid=4 kpid=11376 status=suspended spid=54 sbid=0 ecid=7 priority=0 transcount=0 lastbatchstarted=2009-11-23T10:28:25.257 lastbatchcompleted=2009-11-23T10:28:25.257 clientapp=.Net SqlClient Data Provider hostname=ACN357NICVPS14 hostpid=6180 isolationlevel=read committed (2) xactid=53844564 currentdb=5 lockTimeout=4294967295 clientoption1=538968096 clientoption2=128056

       executionStack

        frame procname=unknown line=1 sqlhandle=0x02000000ed34c514d3a585aacc0312a34bfdb8088c95d7c3

    unknown    

        frame procname=unknown line=1 sqlhandle=0x0400ff7fbe80662601000000000000000000000000000000

    unknown    

        frame procname=unknown line=848 stmtstart=75496 stmtend=75578 sqlhandle=0x03000500b982ee5ad3c102018d9c00000100000000000000

    unknown    

       inputbuf

     resource-list

      threadpool id=schedulerf06080

       owner-list

        owner id=process3c1d2e8

       waiter-list

        waiter id=process3c1d198

      exchangeEvent id=port8019aed0 nodeId=6

       owner-list

        owner event=pending id=process3c1d198

       waiter-list

        waiter event=e_waitPortOpen type=consumer id=process3c1d2e8

        waiter event=e_waitPortOpen type=producer id=process3c09d68

        waiter event=e_waitPortOpen type=producer id=process3c12b08

        waiter event=e_waitPortOpen type=producer id=process3c27d68

    NULL

    Victim Resource Owner:

    ResType:ExchangeId Stype:'AND' SPID:54 BatchID:0 ECID:4 TaskProxy:(0x00000001FB9D1540) Value:0x39614390 Cost:(0/20006)  

    Thanks

    Michael

  • Michael, I recommend trying Workaround #1 or Workaround #2 described in the post.  

  • This article was very helpful to me and saved me a great deal of time.  Thanks for taking the time to write it up.

  • Has anyone heard of or experienced whether this is still a problem with SQL 2012?
