SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.
Here's something I was looking at yesterday which is worth a quick look. It's quite a common scenario and one that's relatively easy to address. Your SQL Server 2005 instance suddenly throws the following to the error log (apparently without warning):
2008-09-09 08:30:44.810 Server Using 'dbghelp.dll' version '4.0.5'
2008-09-09 08:30:44.960 Server **Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
2008-09-09 08:30:45.000 Server ***Stack Dump being sent to F:\MSSQL.1\MSSQL\LOG\SQLDump0001.txt
2008-09-09 08:30:45.000 Server * *******************************************************************************
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * BEGIN STACK DUMP:
2008-09-09 08:30:45.010 Server *   09/09/08 08:30:45 spid 0
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * Non-yielding IOCP Listener
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * *******************************************************************************
2008-09-09 08:30:45.010 Server * -------------------------------------------------------------------------------
2008-09-09 08:30:45.010 Server * Short Stack Dump
2008-09-09 08:30:45.320 Server Stack Signature for the dump is 0x000000000000020B
2008-09-09 04:47:06.180 spid3s LazyWriter: warning, no free buffers found.
Following this, several hundred lines of output in a format similar to the following are shown:
2008-09-09 08:31:08.750 spid3s Memory Manager
    VM Reserved = 16963064 KB
    VM Committed = 16942872 KB
    AWE Allocated = 0 KB
    Reserved Memory = 1024 KB
    Reserved Memory In Use = 0 KB
2008-09-09 08:31:08.750 spid3s Memory node Id = 0
    VM Reserved = 16957304 KB
    VM Committed = 16937200 KB
    AWE Allocated = 0 KB
    SinglePage Allocator = 2519016 KB
    MultiPage Allocator = 39632 KB
2008-09-09 08:31:08.750 spid3s MEMORYCLERK_SQLGENERAL (Total)
    VM Reserved = 0 KB
    VM Committed = 0 KB
    AWE Allocated = 0 KB
    SM Reserved = 0 KB
    SM Committed = 0 KB
    SinglePage Allocator = 48576 KB
    MultiPage Allocator = 5136 KB
You may well recognize that this latter output is basically the same as what you would get if you ran:

DBCC MEMORYSTATUS
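If you want to compare the output captured in the error log with the live state of an instance, you can produce the same report yourself at any time; this is a standard documented command:

```sql
-- Dump the full memory status report for the instance.
-- In Management Studio the output appears on the Messages tab;
-- it can be lengthy, so capture it to a file if you want to keep it.
DBCC MEMORYSTATUS;
```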
You'll also have noted that a small dump file has been produced in the default error log directory of this SQL instance (the path of which is noted in the error log output above).
In normal circumstances you can submit this dump file to us in support, and we can take a look at it and, all being well, tell you what the problem is. The thing to remember here is that the production of a dump file does not necessarily mean that you have encountered a bug in SQL Server (as some people believe). In the more recent versions of SQL Server there are numerous conditions, well known to us, which the engine detects and handles itself, producing a dump file automatically.
This particular scenario was caused by inappropriate configuration of SQL Server memory settings and is easily rectified, so I thought it a good example to walk through.
Although there is normally little that can be gleaned from debugging a SQL Server dump in the public domain, this one does contain a few hints. We don't publish private symbols for SQL Server, so the amount you can do yourself with a dump file is quite limited (you only have access to retail symbols); however, you can look at the thread stacks and see whether anything of note stands out. Anyone who has read Ken Henderson's fantastic book about SQL Server internals will have used these techniques before. Sometimes it's just interesting to look a little deeper, after all.
I'll assume that you're already familiar with installing and configuring the debugger, but if you're not, here's a good place to start: Debugging Tools and Symbols: Getting Started. This example just uses the public (retail) symbol server address given on that page.
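As a sketch of that setup (the address below is the standard Microsoft public symbol server, not something taken from this particular dump, and the local cache path is just an example), in WinDbg you would set something like:

```
.sympath srv*c:\websymbols*http://msdl.microsoft.com/download/symbols
.reload
```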
If you open this dump, you'll find yourself starting with the context set to the thread that generated the dump, in this case:
0:004> kn
 # Child-SP          RetAddr           Call Site
00 00000000`0474de58 00000000`77d704bf ntdll!NtWaitForSingleObject+0xa
01 00000000`0474de60 00000000`016be0f9 kernel32!WaitForSingleObjectEx+0x130
02 00000000`0474df00 00000000`016bdaeb sqlservr!CDmpDump::DumpInternal+0x4a3
03 00000000`0474dff0 00000000`01c7c7b3 sqlservr!CDmpDump::Dump+0x3b
04 00000000`0474e040 00000000`01d3af67 sqlservr!CImageHelper::DoMiniDump+0x413
05 00000000`0474e1a0 00000000`025f8cf5 sqlservr!stackTrace+0x6e7
06 00000000`0474f6d0 00000000`01a3b025 sqlservr!SQL_SOSNonYieldIOCPCallback+0x1ef
07 00000000`0476f7e0 00000000`018662b6 sqlservr!SOS_OS::ExecuteNonYieldIOCPCallbacks+0x44
08 00000000`0476f830 00000000`010b9628 sqlservr!alloca_probe+0x220815
09 00000000`0476f990 00000000`029002c9 sqlservr!SchedulerMonitor::Run+0x164
0a 00000000`0476fac0 00000000`010079fe sqlservr!SchedulerMonitor::EntryPoint+0x9
0b 00000000`0476faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
0c 00000000`0476fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
0d 00000000`0476fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
0e 00000000`0476fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
0f 00000000`0476fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
10 00000000`0476fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
11 00000000`0476fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
12 00000000`0476ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
13 00000000`0476ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
This is the thread that is actually doing the monitoring and generating the dump; you can see the name of the function responsible in frame 02. If you had the private symbols at this point, you could look through some of the earlier frames and determine which thread caused this condition to occur, but I can't show that here :-)
The thread in question here was actually thread 8; we can switch to it and take a look:
0:008> kn 50
 # Child-SP          RetAddr           Call Site
00 00000000`04f6f008 00000000`01750bd0 sqlservr!NumaNode::RampUp
01 00000000`04f6f010 00000000`012f1523 sqlservr!alloca_probe+0xbf420
02 00000000`04f6f110 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
03 00000000`04f6f1c0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
04 00000000`04f6f200 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
05 00000000`04f6f4a0 00000000`0103f7ba sqlservr!MemoryClerkInternal::AllocatePages+0x6e
06 00000000`04f6f500 00000000`010d4bb4 sqlservr!CMemThread::PbGetNewPages+0x6a
07 00000000`04f6f540 00000000`01068bba sqlservr!CFixedPageMgr::PbAllocate+0xcd
08 00000000`04f6f5b0 00000000`01004d37 sqlservr!CMemThread::Alloc+0x1d7
09 00000000`04f6f640 00000000`0147dc3a sqlservr!operator new+0x26
0a 00000000`04f6f680 00000000`01045589 sqlservr!SOS_Node::EnqueueTaskDirectInternal+0x587
0b 00000000`04f6f730 00000000`01045b03 sqlservr!SOS_Node::EnqueueTask+0x49
0c 00000000`04f6f780 00000000`01046581 sqlservr!CNetConnection::EnqueuePacket+0x8f
0d 00000000`04f6f890 00000000`01045cab sqlservr!TDSSNIClient::ReadHandler+0x171
0e 00000000`04f6f900 00000000`010459bc sqlservr!SNIReadDone+0xb7
0f 00000000`04f6f960 00000000`010079fe sqlservr!SOS_Node::ListenOnIOCompletionPort+0x251
10 00000000`04f6faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
11 00000000`04f6fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
12 00000000`04f6fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
13 00000000`04f6fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
14 00000000`04f6fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
15 00000000`04f6fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
16 00000000`04f6fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
17 00000000`04f6ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
18 00000000`04f6ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
19 00000000`04f6ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
So again, just by looking at the function names, we can make a guess that internal memory routines related to the buffer pool are at work here, particularly in frames 02 and 03. If you ever attach a debugger to a running SQL Server instance (obviously never in production!) you'll see functions like these appearing regularly as SQL Server manages the buffer pool pages.
There's not much more that can be seen from this stack, so we'll take a look through the rest of the threads, in this case by dumping the call stacks of all threads in the process.
When browsing through these threads, we notice many that also appear to be involved in memory-related operations; just one example is:
53  Id: 554.a6c Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 00000000`1521e140 00000000`012f1523 sqlservr!BPool::ReplenishFreeList+0x10d
01 00000000`1521e240 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
02 00000000`1521e2f0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
03 00000000`1521e330 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
04 00000000`1521e5d0 00000000`0103f676 sqlservr!MemoryClerkInternal::AllocatePages+0x6e
05 00000000`1521e630 00000000`0103f737 sqlservr!CVarPageMgr::PviNewVarPage+0x56
06 00000000`1521e6a0 00000000`0100ca58 sqlservr!CVarPageMgr::PbAllocate+0x3ce
07 00000000`1521e710 00000000`01004d37 sqlservr!CMemObj::Alloc+0xcc
08 00000000`1521e780 00000000`0109b203 sqlservr!operator new+0x26
09 00000000`1521e7c0 00000000`014607a1 sqlservr!operator new+0x23
0a 00000000`1521e800 00000000`0145d4c8 sqlservr!CErrorReportingManager::CwchFormatToBuffer+0x180
0b 00000000`1521e970 00000000`0128d202 sqlservr!CErrorReportingManager::CwchCallFormatToBuffer+0x18
0c 00000000`1521e9b0 00000000`0128d0f8 sqlservr!CLoginClientInfo::CLoginClientInfo+0x71
0d 00000000`1521ea30 00000000`01285265 sqlservr!IssueLoginSuccessReport+0x67
0e 00000000`1521eb40 00000000`0128494f sqlservr!login+0x584
0f 00000000`1521f480 00000000`01285610 sqlservr!process_login+0xee
10 00000000`1521f840 00000000`010079fe sqlservr!process_commands+0x40d
11 00000000`1521faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
12 00000000`1521fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
13 00000000`1521fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
We note that sqlservr!BPool::ReplenishFreeList is common to a lot of the threads, and other BPool functions crop up frequently as well.
At this point there is not much else to be gleaned from the dump without private symbols, but it is clear that a significant number of threads are waiting on what appear to be internal memory routines. In this case we also know that SQL Server itself detected that memory allocations were the cause of the problem, which is why it dumped out the DBCC MEMORYSTATUS output to help us. The error log specifically states that it has "no free buffers".
If you were to look through the whole of the DBCC MEMORYSTATUS output and decode some of the latter items, you would note that the buffer pool had recently been grown, and that it was issuing a notification that it wished to grow further.
So the question is: what next? You've noted that SQL Server appears to be having memory management problems, so maybe you think you should add some more memory to alleviate this? In a word: no.
The answer to this problem is that we have to look first at the configuration of the hardware and of SQL Server itself (which in any troubleshooting scenario is always a good idea).
What we saw here at the time of the dump was:
MemoryLoad = 95%
Total Physical = 16382 MB
Available Physical = 792 MB
And in fact, in the minutes preceding the dump we saw even lower memory availability than this, with physical utilization up to 98% (something we can extract from the dump with private symbols).
So we do appear to be short of physical memory, but when we run sp_configure on the server, we see this:
name                                minimum     maximum     config_value run_value
----------------------------------- ----------- ----------- ------------ -----------
max server memory (MB)              16          2147483647  2147483647   2147483647
Therefore the max server memory setting has not been set. Although SQL Server will attempt to manage its memory dynamically, on large-memory installations running unlimited can lead to situations where it ends up fighting with other processes or the OS for memory, and conditions like this one can occur. This is why we recommend setting "max server memory" in this type of scenario, to limit the size of the buffer pool to an amount lower than the total physical memory on the server. The most referenced article on this is probably here
We see numerous cases revolving around this scenario and derivatives of it, and many could simply be avoided by limiting the SQL Server buffer pool with this setting, so that the server never becomes memory-bound in this way. The bottom line here: change the max server memory setting to something around the 13000 MB mark, then monitor your memory usage to see whether you still encounter this type of problem, making further adjustments if necessary.
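For the 16 GB server discussed above, a sketch of that change looks like this (the 13000 MB figure is the ballpark from this scenario; tune it to your own workload and what else runs on the box):

```sql
-- 'max server memory (MB)' is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap the buffer pool below total physical memory (16 GB server here),
-- leaving headroom for the OS and other processes.
EXEC sp_configure 'max server memory (MB)', 13000;
RECONFIGURE;
```

The change takes effect without a restart, although SQL Server may take a little while to release memory down to the new cap.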
If you have any questions surrounding this topic, feel free to post them.
So, no posts for 10 weeks, but that's because sometimes there are other things going on in Stockholm as opposed to SQL Server support :-) Here's what I've been doing over the summer....
I've been fortunate enough to have had several weeks of paternity leave to spend with the family and enjoy the Swedish summer. But I'm back to the technical stuff this week, and of course to the release of SQL 2008.
This week I've been working on some proactive advisory cases with Swedish customers, reviewing their installations for best practices and providing some tips and tricks. We do this quite frequently for Premier customers and it's good for all involved. Customers get some of their servers checked out in advance of any potential technical problems, and I get to stop things happening in advance, as opposed to my normal "fix it urgently" role :-)
I looked back over my past few reports of this nature today, and I thought I'd mention the one point that has come up in every review I've done this year. This may just be coincidence, but maybe the word hasn't spread as well as it could have.
In SQL 2005 there were a significant number of enhancements to tempdb, from both an implementation and a supportability standpoint. We now recommend that by default you create one tempdb data file per visible scheduler within the SQLOS, or in other words (for most people) one per logical processor core that you expose or affinitize to SQL Server. Then test and benchmark the results, and do some fine tuning if need be.
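As an illustrative sketch only (the file names, drive paths, and sizes below are hypothetical; use your own layout, and keep the files equally sized so SQL Server's proportional fill distributes work evenly):

```sql
-- Add extra tempdb data files, working towards one per visible
-- scheduler / logical core. Equal SIZE and FILEGROWTH on every file
-- keeps allocations balanced across them.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdb2.ndf',
          SIZE = 1024MB, FILEGROWTH = 256MB);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'T:\tempdb\tempdb3.ndf',
          SIZE = 1024MB, FILEGROWTH = 256MB);
```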
Remember, you can look at the schedulers that have been created by looking in the following DMV
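The original link hasn't survived here, but the DMV in question is presumably sys.dm_os_schedulers; a query along these lines shows the schedulers that count for this purpose:

```sql
-- Visible online schedulers are the ones running user workload;
-- their count is the number to match tempdb data files against.
SELECT scheduler_id, cpu_id, status, current_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```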
Here are the 2 main articles that I always point customers to on this tempdb issue:
I'll be back in coming weeks with "proper troubleshooting".