[follow up to RESOURCE_SEMAPHORE_QUERY_COMPILE diagnosis]
Last week I wrote about the initial diagnosis of this problem, and now here's the follow up. So if you remember we had a DBCC MEMORYSTATUS and a load of other PSSDIAG data collected and we knew that the procedure cache seemed to be constantly flushing as fast as we put stuff into it.
Now when you come to a situation like this, there's no perfect way to say, here's where to look for the answer. I certainly checked a few different things before I noticed the problem. I guess the bottom line here is that we have some data, we've collected a benchmark of when the system is operating normally, and we just need to review all of this to find something abnormal. We know in this situation that the cache is being flushed, and so we're looking for events or circumstances which potentially use up lots of buffer pool memory, or causes memory flush. One example of this I referred to in this earlier post, so I checked the error logs for this type of situation - nothing doing.
So being as we're thinking memory related problems, I used some text tools to compare the output from current and previous DBCC MEMORYSTATUS. Out of interest my preferred text tool of note currently is notepad++, I can't say enough good things about this tool.....
So in doing so I noticed the following significant change in one of the many values that the DBCC command collects.
Problem occurring snippet:
OBJECTSTORE_LOCK_MANAGER (Total) KB ---------------------------------------------------------------- -------------------- VM Reserved 131072 VM Committed 131072 AWE Allocated 0 SM Reserved 0 SM Commited 0 SinglePage Allocator 5769576 MultiPage Allocator 0
Baseline snippet:
In our baselines, the single page allocator for the lock manager consistently shows unless 50000KB, even in times of extreme user load. However now we suddenly have 5769576KB or about 5.5GB....so this would pretty much explain where all our memory went. We monitored this situation for a few minutes and this value was consistently climbing, so it seemed likely that a rogue process or user was out there issuing one a fairly crazy query.
So all we had to do now was to identify this person or batch. There were about 3000 connections to this system, and so the easiest way I could think of, which I mentioned in a very short post at the time, was to run a query against the sys.dm_tran_locks DMV and count the total number of locks by sessionid. This showed that one particular session was holding over 39 million key locks, which pretty much explained the problem!
After this it was pretty simple to track down this user, ask them what they were doing, and stop them doing it. Once this batch was terminated the lock manager released its memory and everything soon went back to normal.
PingBack from http://pooltoysite.info/story.php?id=6659