SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.


    No SQL posts this week - skiing again!


    I am fortunate enough to be continuing the testing of my hypothesis that skiing in Sweden is better than SQL Server support. This week I am testing this theory in Sälen, near the Sweden - Norway border. So far it's all going very much to plan.

    View from my bedroom window in Sälen

    Enjoy the rest of your February. I shall return to SQL work on 2nd March :-)


    Kill with rollback estimation


    It's always amazing when you find something new in something so common that you thought you knew so well. When resolving the compilation bottleneck issue I mentioned in the previous 2 posts, I had to use the KILL command. It's actually incredibly rare that I use this command, as I really don't believe in it on principle in production if at all avoidable. Anyway, in that particular situation it just had to be done. The session we were killing was doing a huge update, so it had to roll back, and that was going to take a while (I could guess this from the fact that it held 39 million locks!). However, I'd never noticed that the KILL command now comes with a rollback estimation syntax. Definitely a good feature, but not one to abuse :-)

    From BOL, it looks like this:


    --This is the progress report.
    spid 54: Transaction rollback in progress. Estimated rollback completion: 80% Estimated time left: 10 seconds.
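    The syntax that produces that progress report is KILL with the WITH STATUSONLY option, issued against a session that is already rolling back. A minimal sketch; session id 54 is just an example:

```sql
-- Start the rollback (only if you really have to!)
KILL 54;

-- Then, from another connection, poll the rollback progress.
-- STATUSONLY does not kill anything further; it only reports status.
KILL 54 WITH STATUSONLY;
```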


    Compilation Bottlenecks – Error: 8628, Severity: 17, State: 0. – Part 2


    [follow up to RESOURCE_SEMAPHORE_QUERY_COMPILE diagnosis]

    Last week I wrote about the initial diagnosis of this problem, and now here's the follow up. So if you remember, we had DBCC MEMORYSTATUS output and a load of other PSSDIAG data collected, and we knew that the procedure cache seemed to be constantly flushing, as fast as we put stuff into it.

    Now when you come to a situation like this, there's no perfect way to say "here's where to look for the answer". I certainly checked a few different things before I noticed the problem. I guess the bottom line here is that we have some data, we've collected a benchmark of when the system is operating normally, and we just need to review all of this to find something abnormal. We know in this situation that the cache is being flushed, so we're looking for events or circumstances which potentially use up lots of buffer pool memory, or cause a memory flush. I referred to one example of this in this earlier post, so I checked the error logs for this type of situation - nothing doing.

    Since we're thinking memory related problems, I used some text tools to compare the output from the current and previous DBCC MEMORYSTATUS. Out of interest, my preferred text tool at the moment is notepad++, I can't say enough good things about this tool.....

    In doing so I noticed the following significant change in one of the many values that the DBCC command collects.

    Problem occurring snippet:

    OBJECTSTORE_LOCK_MANAGER (Total)                                  KB
    ---------------------------------------------------------------- --------------------
    VM Reserved                                                     131072
    VM Committed                                                    131072
    AWE Allocated                                                   0
    SM Reserved                                                     0
    SM Commited                                                     0
    SinglePage Allocator                                            5769576
    MultiPage Allocator                                             0

    Baseline snippet:

    OBJECTSTORE_LOCK_MANAGER (Total)                                  KB
    ---------------------------------------------------------------- --------------------
    VM Reserved                                                     131072
    VM Committed                                                    131072
    AWE Allocated                                                   0
    SM Reserved                                                     0
    SM Commited                                                     0
    SinglePage Allocator                                            33232
    MultiPage Allocator                                             0

    In our baselines, the single page allocator for the lock manager consistently shows less than 50000KB, even in times of extreme user load. However now we suddenly had 5769576KB, or about 5.5GB....so this would pretty much explain where all our memory went. We monitored this situation for a few minutes and this value was consistently climbing, so it seemed likely that a rogue process or user was out there issuing a fairly crazy query.
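    If you'd rather not diff the full DBCC MEMORYSTATUS output by hand, the same lock manager figure can be watched through the sys.dm_os_memory_clerks DMV. A sketch; note the column names below are the SQL Server 2005/2008 ones (later versions renamed them to a single pages_kb column):

```sql
-- Memory consumed by the lock manager clerk, in KB.
-- Poll this repeatedly to see whether it is climbing.
SELECT type,
       SUM(single_pages_kb) AS single_pages_kb,
       SUM(multi_pages_kb)  AS multi_pages_kb
FROM sys.dm_os_memory_clerks
WHERE type = 'OBJECTSTORE_LOCK_MANAGER'
GROUP BY type;
```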

    So all we had to do now was to identify this person or batch. There were about 3000 connections to this system, and so the easiest way I could think of, which I mentioned in a very short post at the time, was to run a query against the sys.dm_tran_locks DMV and count the total number of locks by sessionid. This showed that one particular session was holding over 39 million key locks, which pretty much explained the problem!

    After this it was pretty simple to track down this user, ask them what they were doing, and stop them doing it. Once this batch was terminated the lock manager released its memory and everything soon went back to normal.
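    The tracking down itself can be done in one pass by joining the lock counts to sys.dm_exec_sessions, which carries the login, host and application names. A hedged sketch along those lines:

```sql
-- Lock counts per session, with enough session detail
-- to identify the user or batch responsible.
SELECT l.request_session_id,
       COUNT(*)        AS lock_count,
       s.login_name,
       s.host_name,
       s.program_name
FROM sys.dm_tran_locks AS l
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = l.request_session_id
GROUP BY l.request_session_id, s.login_name, s.host_name, s.program_name
ORDER BY lock_count DESC;
```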


    SQL Server Compilation Bottlenecks – Error: 8628, Severity: 17, State: 0. – Part 1


    [alternative title: Memory Troubleshooting using DBCC MEMORYSTATUS]

    The following samples are taken from a live problem I was working on for most of last week. The steps might well help you resolve an occurrence of the above error, but the methodology might help you with some general performance and memory troubleshooting as well, hence the alternative title.


    Problem Description:

    • A previously well running server starts exhibiting extremely slow performance for all users during the online day.
    • CPU time is maxed out at 100%
    • The issue is intermittent and has no correlation to batch business processing or periods of peak usage

    Initial Actions:

    • Configure a PSSDIAG (but it could just as easily have been SQLDIAG) to collect the perfstats script and perfmon counters
    • Don’t include a profiler trace to begin with: the server is already maxed out on CPU, and you might skew the results or make the situation worse, especially as, going into the problem, you don’t actually know what it is related to
    • Configure the collection for long term running, as you don’t know when the error will next occur

    Initial Results:

    • When reviewing the error logs you note that Error: 8628, Severity: 17, State: 0. was logged repeatedly when the problem occurred in the past.
    • RESOURCE_SEMAPHORE_QUERY_COMPILE  is the top aggregate wait stat in sys.dm_os_wait_stats
    • The server was clustered and a failover occurred when the problem last happened
    • No other items of note were recorded in the past
    • When the issue is not occurring everything looks fine on the server, there are no indications of dangerously long running queries or memory pressure etc.

    This is a common scenario that I come across. You know something bad has happened, you can’t reproduce it on demand, but you suspect it will happen again. At this point I always like to do the above: review the general health of the server with PSSDIAG / SQLDIAG, and then do another round of monitoring once I know what I’m looking for. In this case I’m targeting the 8628 error as it seems pretty conclusively tied to the outage.
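    Checking the aggregate wait statistics mentioned above is a one-liner against sys.dm_os_wait_stats; the TOP 10 cut-off below is just a convenient choice:

```sql
-- Top waits by total wait time since the instance started
-- (or since the wait stats were last cleared).
SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```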

    One of the principal hits you get if you search microsoft.com sites for error 8628 is the following page

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005


    and it points out that this error is thrown when there are too many queries trying to compile simultaneously, and that the gateway structures are throttling the compilations. You can read more background about this theory in the above article and there’s also some reference to it in the following white paper

    Troubleshooting Performance Problems in SQL Server 2005


    which includes an excellent section on internal physical memory pressure.

    So now we’re on the track of the problem. We reconfigure the PSSDIAG to include a script which runs DBCC MEMORYSTATUS regularly so that we can be sure to capture this output when the problem next occurs. We also make sure that we keep a copy of the data from this query when the system is running well, so that we can have a baseline to compare against. At this point I still didn’t have a profiler running as I didn’t want to have to manage the disk space and have the performance overhead, and the perfstats script add-in gives you so much information anyway.
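    For reference, the script which runs DBCC MEMORYSTATUS regularly can be as simple as a loop in its own session. A minimal sketch; the one-minute interval is just an assumption, pick whatever granularity your disk space allows:

```sql
-- Capture DBCC MEMORYSTATUS output at a regular interval,
-- with a timestamp so snapshots can be compared later.
WHILE 1 = 1
BEGIN
    PRINT '=== Snapshot at ' + CONVERT(varchar(30), GETDATE(), 121) + ' ===';
    DBCC MEMORYSTATUS;
    WAITFOR DELAY '00:01:00';   -- sample once a minute (assumed interval)
END;
```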

    So the problem eventually occurred again, and here are some key outputs from the DBCC MEMORYSTATUS related to the compilation gateways:


    Baseline:

    Procedure Cache                Value
    ------------------------------ -----------
    TotalProcs                            7614
    TotalPages                          570060
    InUsePages                           32736

    Small Gateway                  Value              
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                  32
    Acquires                                          0
    Waiters                                           0
    Threshold Factor                             500000
    Threshold                                    500000

    Problem Occurring:

    Procedure Cache                Value
    ------------------------------ -----------
    TotalProcs                             397
    TotalPages                           66888
    InUsePages                           44924

    Small Gateway                  Value
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                   0
    Acquires                                         32
    Waiters                                          59
    Threshold Factor                             500000
    Threshold                                    500000

    So what we can see here is that we are (as expected) encountering a big bottleneck on the gateways (the small one in this case), with 59 compilations queued at this moment. We can also see why: the procedure cache is significantly smaller than its normal size, about 85% smaller in fact.

    This, therefore, is why so many queries are suddenly trying to compile. For reasons unknown at this point, 85% of the cached plans have been thrown away from the cache, and so the many thousands of connections on this system are now all having to compile new plans.

    Now you might be thinking, sometimes you see this type of behaviour when the cache is flushed. This is true, but in this case this wasn’t the problem. We sat and monitored the output above for quite a while; there were constant compilation queues, but the cache was not growing. More to the point, there were no cache flush events in the error logs, and we had not performed any maintenance operations which cause a cache flush. I talked about a classic example of this behaviour in this post.
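    A quick way to watch whether the cache is actually growing back is to poll sys.dm_exec_cached_plans (available from SQL Server 2005 onwards). A minimal sketch:

```sql
-- Number of cached plans and their total size in KB;
-- run repeatedly to see whether the cache is recovering.
SELECT COUNT(*)                  AS cached_plans,
       SUM(size_in_bytes) / 1024 AS total_size_kb
FROM sys.dm_exec_cached_plans;
```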

    So we need to look elsewhere, and ask ourselves the question, “Why is my procedure cache 85% of its normal size?”

    That’s for my next post.


    Which sessionid has the most locks in a SQL Server instance


    Quick post today. I was playing around with something on Friday night, thinking how to answer this question quickly whilst looking at a live performance problem. I have a huge selection of ready to go DMV based queries, aimed at targeting all types of performance problems, but this was one I didn't have. The answer of course is beautifully simple through the magic of DMVs. Do you remember when these didn't exist? Sometimes it's easy to forget :-) The answer, and a good one to have in your troubleshooting code toolbox:

    select request_session_id, count(*) as lock_count
    from sys.dm_tran_locks
    group by request_session_id
    order by count(*) desc
