Troubleshooting SQL Server deadlocks and performance when running SharePoint

Troubleshooting SQL Server deadlocks and performance when running SharePoint

Rate This
  • Comments 2

I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links

Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)

Plan for performance and capacity (Office SharePoint Server)

In addition to this, you need to consider what you can and can’t do from a SQL Server perspective, if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay detailed attention to these articles:

http://support.microsoft.com/kb/932744/

http://support.microsoft.com/kb/841057/

In simple terms, don’t go amending the schema or the objects! So back to my main point…

Things to do first:

1. Collect some decent SQL Server diagnostic data – my preferred option,as ever, SQLDIAG with perfstats scripts

2. Turn on trace flag 1222 to output deadlock information to the errorlog, or capture deadlock graphs in your profiler traces.

3. Collect this data at period of both good and bad performance, so you have a benchmark to compare against

4. Aggregate the profiler output. Personally I use readtrace.exe directly from RMLUtilities, but other like to use SQLNexus as well.

Here are the 2 real examples of problems I encountered recently, and diagnosed using the above data collection:

1. Are you experiencing significant parallel waits – SQL Server installations will have a default Max Degree of Parallelism set to 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, dependent upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:

-- requests --
runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
2009-05-27T15:06:38.060                68           0           4                   0             0 RUNNING         CXPACKET 
2009-05-27T15:06:38.060                68           0           0                   0             4 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060                96           0           0                   0             0 SUSPENDED       WAITFOR  
2009-05-27T15:06:38.060               116           0           0                   0             0 RUNNING                  
2009-05-27T15:06:38.060               128           0          12                   0             0 RUNNING         CXPACKET 
2009-05-27T15:06:38.060               128           0           9                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               128           0           0                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           5                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0          11                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               128           0           6                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           2                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           3                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           4                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           7                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           8                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           1                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0          10                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               138           0          10                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0          12                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               138           0          11                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           9                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           0                   0             3 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           3                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           1                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           2                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           4                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           7                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           5                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           6                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           8                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               163           0           0                   0             0 RUNNABLE                 
2009-05-27T15:06:38.060               170           0           0                 163             0 SUSPENDED       LCK_M_X  

This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same sessionid is reported on multiple lines, but the ecid column changes. This indicates that the batch (sessionid) is being processed in parallel by more than one worked (ecid). This extract is very early on in the problem, from here the chains grow and grow, and eventually can result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST ) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server wide setting. Currently according to the above KB articles there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the fixed requirements of your version. In this scenario we saw significant other number of procedures which encountered the same issue, procedures such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID amongst others. Typical errors you might see in the application logs for SharePoint would be

Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

A global change of MAXDOP = 1 resolved this specific situation.

2. Are you encountering long blocking chains accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which firstly throw the above 1205 error to the SharePoint application log, and then in addition they throw the following error:

InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))     at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps

This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:

a) long blocking chains waiting on LCK_M_U

-- requests --
runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type                                          wait_duration_ms     
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- -------------------- 
2009-05-28T10:01:40.973                60           0           0                 197             0 SUSPENDED       LCK_M_X                                                            3201 
2009-05-28T10:01:40.973                72           0           0                  60             0 SUSPENDED       LCK_M_S                                                            3224 
2009-05-28T10:01:40.973                77           0           0                   0             0 SUSPENDED       WAITFOR                                                            9087 
2009-05-28T10:01:40.973                84           0           0                  72             0 SUSPENDED       LCK_M_S                                                            1018 
2009-05-28T10:01:40.973               197           0           0                   0             0 RUNNING                                                                               0 
2009-05-28T10:01:40.973               268           0           0                 270             0 SUSPENDED       LCK_M_U                                                           36299 
2009-05-28T10:01:40.973               270           0           0                 197             0 SUSPENDED       LCK_M_U                                                           96753 
2009-05-28T10:01:40.973               277           0           0                 270             0 SUSPENDED       LCK_M_U                                                           44290 
2009-05-28T10:01:40.973               280           0           0                 270             0 SUSPENDED       LCK_M_U                                                           60461 
2009-05-28T10:01:40.973               292           0           0                 270             0 SUSPENDED       LCK_M_U                                                           52389 

b) the notableactivequeries part of the perfstats script will potentially note that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This will be all the more likely if your system was running well, and then suddenly started degrading performance and encountering errors

-- notableactivequeries --
runtime                        session_id request_id  plan_total_exec_count plan_total_cpu_ms    plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads 
------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------ 
2009-05-28T10:01:40.973               197           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               268           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               270           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               277           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               280           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               292           0                     5                33057                 406079                      2550                         0                  2352015 

 

 

 

 

c) when reviewing the error logs you note that the same statement shown below, being executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking

UPDATE
    Docs
SET
    Docs.BuildDependencySet = NULL
FROM
    Docs INNER JOIN BuildDependencies
ON
    BuildDependencies.SiteId = Docs.SiteId AND
    BuildDependencies.LeafName = Docs.LeafName AND
    BuildDependencies.DirName = Docs.DirName
WHERE
    BuildDependencies.SiteId = @SiteId AND
    BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
    BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
    Docs.BuildDependencySet IS NOT NULL

The reason for this problem, in this particular scenario, is that a new plan has been compiled recently for this statement which is using a HASH join, as opposed to a nested loop join which it was using earlier. This is caused by a lack of up to date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.

The bottom line here being that although SharePoint does now do database maintenance, as shown in KB932744, there are still times potentially where you will need to do more detailed targeted maintenance.

Leave a Comment
  • Please add 7 and 8 and type the answer here:
  • Post
  • How out of date is too much for the Statistics on the AllDocs table? We're experiencing blocking issues but the stats are at most a couple of days out of date. From a profiler, i can see that the stats on BuildDependencies table are updated several times throughout the day, is there a reason why this would require auto stat updates quite frequently? We tend to be experiencing Resource_Semaphore_query_compile wait types on 100's of procedure calls, all various ones.....do you think this is related?

  • There's a number of different points here which are hard to cover specifically in just comments field:

    1. The answer to how of of date is, as often in such situations, "it depends"! From a purely practicle standpoint,  they're out of date when you are getting bad degraded plans where the plan problem is caused by scans in AllDocs, and which can be fixed by updating the stats manually. Each system varies so you can't say a fixed number, I've seen some that need updating every 3-4 hours and some every fews weeks. Look for plans which change their usage of AllDocs when stats are not recently updated.

    2. Auto update stats will trigger on any table based on a number of system controller algorhythms surrounding the amount of changes to the rows in the table. Basically it will trigger when SQL believes it appropriate and you shouldn;t consider this a problem unless the auto stats function itself starts blocking other processes or causing other perf issues itself.

    3. If you're having loads of Resource_Semaphore_query_compile waits though, it's very likely that this is unrelated. You might want to look at this article about this wait type: blogs.msdn.com/.../compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx. These are compilation bottlenecks and need to be addressed as such.

    If you want to discuss in more detail or provide any data drop me an email through the contact section of the blog.

    Graham

Page 1 of 1 (2 items)