SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.


    Troubleshooting SQL Server deadlocks and performance when running SharePoint


I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links:

    Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)

    Plan for performance and capacity (Office SharePoint Server)

In addition to this, you need to consider what you can and can’t do from a SQL Server perspective if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay close attention to these articles:

    http://support.microsoft.com/kb/932744/

    http://support.microsoft.com/kb/841057/

    In simple terms, don’t go amending the schema or the objects! So back to my main point…

    Things to do first:

1. Collect some decent SQL Server diagnostic data – my preferred option, as ever, is SQLDIAG with the perfstats scripts

    2. Turn on trace flag 1222 to output deadlock information to the errorlog, or capture deadlock graphs in your profiler traces.

3. Collect this data at periods of both good and bad performance, so that you have a benchmark to compare against

4. Aggregate the profiler output. Personally I use ReadTrace.exe directly from the RML Utilities, but others like to use SQL Nexus as well.
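For step 2 above, a quick sketch of enabling the deadlock trace flag globally on a running instance (alternatively, add -T1222 as a startup parameter so it survives a restart):

```sql
-- Enable trace flag 1222 globally so deadlock graphs are written
-- to the SQL Server error log (-1 = apply to all sessions)
DBCC TRACEON (1222, -1);

-- Verify which trace flags are currently active globally
DBCC TRACESTATUS (-1);
```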

Here are 2 real examples of problems I encountered recently, and diagnosed using the above data collection:

1. Are you experiencing significant parallel waits? SQL Server installations have a default Max Degree of Parallelism of 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, depending upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
    2009-05-27T15:06:38.060                68           0           4                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060                68           0           0                   0             4 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060                96           0           0                   0             0 SUSPENDED       WAITFOR  
    2009-05-27T15:06:38.060               116           0           0                   0             0 RUNNING                  
    2009-05-27T15:06:38.060               128           0          12                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060               128           0           9                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           0                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           5                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          11                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           6                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           2                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           3                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           4                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           7                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           8                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           1                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          10                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          10                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0          12                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          11                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           9                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           0                   0             3 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           3                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           1                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           2                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           4                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           7                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           5                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           6                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           8                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               163           0           0                   0             0 RUNNABLE                 
    2009-05-27T15:06:38.060               170           0           0                 163             0 SUSPENDED       LCK_M_X  

This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same session_id is reported on multiple lines, but the ecid column changes. This indicates that the batch (session_id) is being processed in parallel by more than one worker (ecid). This extract is from very early on in the problem; from here the chains grow and grow, and can eventually result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server-wide setting. Currently, according to the above KB articles, there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the fixed requirements of your version. In this scenario we saw a significant number of other procedures that encountered the same issue, such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID, amongst others. Typical errors you might see in the application logs for SharePoint would be:

    Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

    A global change of MAXDOP = 1 resolved this specific situation.
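For reference, a server-wide MAXDOP change like the one we applied can be made with sp_configure, along these lines (a sketch; test carefully first, since it affects every query on the instance):

```sql
-- 'max degree of parallelism' is an advanced option,
-- so expose the advanced options first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Limit every query on the instance to a single scheduler
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```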

2. Are you encountering long blocking chains accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which firstly throw the above 1205 error to the SharePoint application log and then, in addition, throw the following error:

    InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))     at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps

This second error occurs because SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:

    a) long blocking chains waiting on LCK_M_U

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type                                          wait_duration_ms     
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- -------------------- 
    2009-05-28T10:01:40.973                60           0           0                 197             0 SUSPENDED       LCK_M_X                                                            3201 
    2009-05-28T10:01:40.973                72           0           0                  60             0 SUSPENDED       LCK_M_S                                                            3224 
    2009-05-28T10:01:40.973                77           0           0                   0             0 SUSPENDED       WAITFOR                                                            9087 
    2009-05-28T10:01:40.973                84           0           0                  72             0 SUSPENDED       LCK_M_S                                                            1018 
    2009-05-28T10:01:40.973               197           0           0                   0             0 RUNNING                                                                               0 
    2009-05-28T10:01:40.973               268           0           0                 270             0 SUSPENDED       LCK_M_U                                                           36299 
    2009-05-28T10:01:40.973               270           0           0                 197             0 SUSPENDED       LCK_M_U                                                           96753 
    2009-05-28T10:01:40.973               277           0           0                 270             0 SUSPENDED       LCK_M_U                                                           44290 
    2009-05-28T10:01:40.973               280           0           0                 270             0 SUSPENDED       LCK_M_U                                                           60461 
    2009-05-28T10:01:40.973               292           0           0                 270             0 SUSPENDED       LCK_M_U                                                           52389 

b) the notableactivequeries part of the perfstats script will potentially show that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This is all the more likely if your system was running well, and then performance suddenly degraded and errors appeared:

    -- notableactivequeries --
    runtime                        session_id request_id  plan_total_exec_count plan_total_cpu_ms    plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads 
    ------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------ 
    2009-05-28T10:01:40.973               197           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               268           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               270           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               277           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               280           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               292           0                     5                33057                 406079                      2550                         0                  2352015 
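A sketch of checking this yourself outside the perfstats output: sys.dm_exec_query_stats records the execution count and plan creation time per cached plan, so a recently recompiled plan shows up with a fresh creation_time and a low execution_count:

```sql
-- Recently compiled plans, with execution counts and read volumes:
-- a fresh plan with few executions but heavy reads is a candidate
-- for a "new, worse plan" problem
SELECT TOP (20)
       qs.creation_time,
       qs.execution_count,
       qs.total_logical_reads,
       st.text AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.creation_time DESC;
```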


c) when reviewing the error logs you note that the same statement shown below, executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking:

    UPDATE
        Docs
    SET
        Docs.BuildDependencySet = NULL
    FROM
        Docs INNER JOIN BuildDependencies
    ON
        BuildDependencies.SiteId = Docs.SiteId AND
        BuildDependencies.LeafName = Docs.LeafName AND
        BuildDependencies.DirName = Docs.DirName
    WHERE
        BuildDependencies.SiteId = @SiteId AND
        BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
        BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
        Docs.BuildDependencySet IS NOT NULL

The reason for this problem, in this particular scenario, is that a new plan had recently been compiled for this statement which used a HASH join, as opposed to the nested loop join it was using earlier. This is caused by a lack of up-to-date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run UPDATE STATISTICS with the FULLSCAN option on the AllDocs table.
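The manual statistics update we ran was along these lines (a sketch; run it in the affected content database, and note that the two-part name assumes the table lives in the dbo schema):

```sql
-- Rebuild statistics on AllDocs by scanning every row,
-- rather than relying on the default sampled update
UPDATE STATISTICS dbo.AllDocs WITH FULLSCAN;
```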

The bottom line here is that although SharePoint does now do database maintenance, as shown in KB 932744, there may still be times when you need to do more detailed, targeted maintenance.


    Download location for SQLDIAG Perfstats DMV scripts


I refer to this many times in my other posts, so here’s a direct link to the download location of the perfstats scripts for SQLDIAG, which you can use yourself. I use this every day, and you should too!

    http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-2005-performance-statistics-script.aspx


    SQL Server 2005 Error: 701, Severity: 17, State: 193


    Here’s another interesting example which we worked recently in the Stockholm office, which relates also back to one of my previous entries. We had a customer whose error log was filled with a variety of critical errors such as these:

    Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query.

    Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

    Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]

This was a fairly high-end system with 64 GB RAM and 16 logical CPUs, and it was generally well configured in terms of its Max Server Memory and other basic sp_configure settings. As you may well be aware, there are innumerable underlying causes for a memory-related error such as this, and some of the more basic troubleshooting tips are shown in the following MSDN article:

    http://msdn.microsoft.com/en-us/library/aa337311(SQL.90).aspx

However, in this case none of the basics were really of any help. Typical starting questions to ask yourself in this scenario are:

1. Is there a pattern to when this occurs?

2. Do I know of any particular workloads which are running when I see this problem?

…just to understand whether you have any particular potential causes in your own code. Our customer in this scenario was not aware of any, and unfortunately we didn’t have any dumps which might have allowed us to see retrospectively what batches were running when the problem occurred.

However, we did have the partial output of DBCC MEMORYSTATUS, which SQL Server had kindly written to the error log for us. An examination of this showed the following suspicious-looking section:

    OBJECTSTORE_LOCK_MANAGER (node 2)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 29398856 KB
        MultiPage Allocator = 0 KB

    Global Memory Objects:
            Resource=    709
            Locks=    3676149
            SE Schema Mgr=    1646
            SQLCache=    261
            Replication=    2
            ServerGlobal=    109
            XPGlobal=    2
            Xact=    101
            SETLS=    16
            DatasetMemObjs=    32
            SubpDescPmos=    16
            SortTables=    2

    From this we can infer that we seemed to experience an equivalent problem to the one I wrote about here

    http://blogs.msdn.com/grahamk/archive/2009/02/03/compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx

although it manifested with distinctly different symptoms.

At the end of the day, one of the NUMA nodes has taken up about 29 GB of memory just managing lock resources, and this is confirmed by the number of locks reported in the global memory objects. Remember that the majority of the output from DBCC MEMORYSTATUS is publicly documented here:

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

    http://support.microsoft.com/kb/907877

This 29 GB of memory purely managing locks appears to be placing an undue load on the memory resources of the system, and what we appear to have here is a runaway or huge batch of some type, or maybe a very large open transaction.
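You can also watch this particular memory clerk directly, without waiting for a DBCC MEMORYSTATUS dump. A sketch for SQL Server 2005/2008 (note that the single/multi page columns were consolidated into a single pages_kb column in later versions):

```sql
-- Memory consumed by the lock manager clerk, per NUMA node
SELECT memory_node_id,
       SUM(single_pages_kb) AS single_pages_kb,
       SUM(multi_pages_kb)  AS multi_pages_kb
FROM sys.dm_os_memory_clerks
WHERE type = 'OBJECTSTORE_LOCK_MANAGER'
GROUP BY memory_node_id;
```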

The way to resolve this problem is to identify the batch and stop it from running. One good way to identify this problem proactively might be to use the tiny DMV query I wrote in my previous post about this type of situation and attach a monitor to it, warning you if the value goes above, say, 1 million, or some other high number of your choice.

    select request_session_id, count(*)
    from sys.dm_tran_locks
    group by request_session_id

You could then easily identify the problem session_id before it got completely out of hand. There are of course many alternative ways of capturing such data or doing such monitoring; maybe you’d prefer to create a custom SQLDIAG package to monitor your server in more detail.
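As one possible monitoring sketch, the query above can be restricted to return only sessions past a threshold, which makes it easy to hang an alert off a non-empty result set (the 1,000,000 threshold here is just an example; pick a number appropriate to your workload):

```sql
-- Sessions currently holding an unusually large number of lock resources
SELECT request_session_id,
       COUNT(*) AS lock_count
FROM sys.dm_tran_locks
GROUP BY request_session_id
HAVING COUNT(*) > 1000000
ORDER BY lock_count DESC;
```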

The moral of this story, though, is that even in the most critical system failures such as this, there may well be enough information to get a hint of where the problem lies, if you make sure you go through the error logs in detail. Remember not to be overwhelmed by the sheer volume of errors that might be reported; go through all the lines, removing the duplicates, to see whether something of interest is hidden amongst the noise.
