CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

Strange Sch-S / Sch-M Deadlock on Machines with 16 or More Schedulers

Strange Sch-S / Sch-M Deadlock on Machines with 16 or More Schedulers

  • Comments 7

Since it took me several days to track down this bug, and I did learn a couple of new things along the way, I thought I would share some of my work.

16 or More CPUS

When a system presents SQL Server with 16 or more CPUs, and you are using a high end SQL Server SKU, SQL Server will enable lock partitioning.   (Lock partitioning can be disabled using startup parameter, trace flag -T1229.)

Lock Partitioning

Lock partitioning optimizes locking structures by adding additional, per scheduler structures and actions.   This design has similarities to Sub/Super Latching (http://blogs.msdn.com/b/psssql/archive/2009/01/28/hot-it-works-sql-server-superlatch-ing-sub-latches.aspx)

As a quick overview, if the query needs to obtain a Shared lock it only needs to acquire the shared lock on the local partition.  For an exclusive lock the query acquires the lock on each partition, always progressing from partition 0 to n to avoid deadlocks.   This allows the SQL Server to utilize the local partition when appropriate and improves scalability on larger systems.

Deadlock from Shared Lock on a Different Partition - What?

The problem I was presented with was the following deadlock output.   (This was from trace flag 1222 and 3605 to add deadlock information to the error log.  You could get similar information using the trace events.)

objectlock lockPartition=8 objid=1765581328 subresource=FULL dbid=8 objectname=Test id=lock47b821a00 mode=Sch-M associatedObjectId=1765581328

Notice the partition is 8 and the mode held is Sch-M.

owner-list

owner id=process46c276188 mode=Sch-M

The process is the task address that can be mapped to sys.dm_os_tasks, who owns the lock.

waiter-list

waiter id=process47b07dc38 mode=Sch-S requestType=wait

This is the close of the deadlock cycle by the second process.

Note: The waiter list is usually printed in ascending order based on how the victims will be selected; usually work investment based.
objectlock lockPartition=13 objid=1765581328 subresource=FULL dbid=8 objectname=Test id=lock47b821f80 mode=Sch-S associatedObjectId=1765581328 Partition 13 is showing the process that already holds the same Sch-S and is attempting a new acquire on partition 8.

owner-list

owner id=process47b07dc38 mode=Sch-S

Owner of the Sch-S lock.

waiter-list

waiter id=process46c276188 mode=Sch-M requestType=wait

Blocked process attempting to acquire the Sch-M lock.  This is expected as the Sch-M is attempting to acquire the lock on all partitions.

 

Under a rare condition SQL Server may not associate the proper lock partition with the lock request, leading to additional locking overhead or possible deadlocks.   This bug does not expose any locking problems that would lead to data integrity issues. This is a very small window during compile, before a user transaction is started.

The problem is that when using lock partitioning the Sch-S lock should be acquired on the transaction associated, local partition.  However, the same process is attempting to acquire the Sch-S lock on 2 different partitions leading to the deadlock.  Why?

  • The lock partition hint is stored with the connection object (sys.dm_exec_sessions - physical connection internal object to be more precise.)  
  • SQL Server assigns new batches to one of the active schedulers on the same NUMA node based on active task load for the schedulers.

In this case the login took place on scheduler 8 and the lock partition, hint is cached.  When the batch is processed it is assigned to scheduler 13 and the second partition becomes involved; triggering the unexpected behavior.

Bob Dorr - Principal SQL Server Escalation Engineer

Leave a Comment
  • Please add 4 and 4 and type the answer here:
  • Post
  • Hi Bob,

    Do we have fix for it, I think I am seeing same issue on SQLServer 2008 on windows 2003.

    Thanks.

  • Investigating the fix.   Since it is limited to the core compile phase reducing compiles will reduce your pain.

  • Hi Bob,

    I believe we are experiencing this deadlock bug as well on SQL Server Standard 64bit SP2 / Windows Server 2008 R2 Entreprise 64bits SP1.

    It occurs on very rare instances when web front (LINQ) access a source table that is being refreshed in the same time.

    Global Isolation level is RCSI and online Scheduler Count is 16.

    Your blog entry is the only description of this issue that I have found around. Really not sure how to proceed next.

    Do you recommend to use Trace Flag 1229 ? Can we expect a fix for SQL Server 2008 or is it possible to keep track on the progress ?

    -- Error log (TraceOn 1222/1204):

    objectlock lockPartition=2 objid=2091870519 subresource=FULL dbid=27 objectname=Test id=lock26ed41100 mode=Sch-S associatedObjectId=2091870519

    owner-list

    owner id=process3a4d4c8 mode=Sch-S

    waiter-list

    waiter id=process4aac748 mode=Sch-M requestType=wait

    objectlock lockPartition=1 objid=2091870519 subresource=FULL dbid=27 objectname=Test id=lockb1a2f2500 mode=Sch-M associatedObjectId=2091870519

    owner-list

    owner id=process4aac748 mode=Sch-M

    waiter-list

    waiter id=process3a4d4c8 mode=Sch-S requestType=wait  

    Thanks,

    Emeric

  • HI Bob,

    I've detected the same issue within my PROD Server so I'd like to know if, besides the improves that I could implement in both involved queries, there is some ticket raised for this issue that you comment, or a "feedback" registered in Connect site.

    My 1222 result:

    resource-list

      objectlock lockPartition=2 objid=2037582297 subresource=FULL dbid=5 objectname=DB.dbo.TBLName id=lock94caeef00 mode=Sch-M associatedObjectId=2037582297

       owner-list

        owner id=process4c814c8 mode=Sch-M

       waiter-list

        waiter id=process4d03288 mode=Sch-S requestType=wait

      objectlock lockPartition=9 objid=2037582297 subresource=FULL dbid=5 objectname=DB.dbo.TBLName id=lock192add380 mode=Sch-S associatedObjectId=2037582297

       owner-list

        owner id=process4d03288 mode=Sch-S

       waiter-list

        waiter id=process4c814c8 mode=Sch-M requestType=wait

    Thanks in advance for the info and your time.

    Alejandro.

  • That sounds reasonable for deadlock  Sch-M and Sch-S involved in on object, SP or batch. Any suggestion to resolve it? How about setting the Maximun Degree of Parallell to 1?

  • Hi Bob, do we have any fix for this bug

  • Researching this issue I came across this KBa.

    FIX: Deadlocks occur when you execute a stored procedure to alter a temporary table if lock partitioning is enabled in SQL Server 2008 R2 or in SQL Server 2012

    support.microsoft.com/.../2776344

    SQL Server 2012 SP1 CU2

    SQL Server 2012 SP0 CU6

    Is this issue and the fix a tempdb-only subset of the issue - is another fix needed for the general case of similar deadlocks with lock partitioning, sch-s, sch-m contributors in persistent user databases?

Page 1 of 1 (7 items)