With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.” When an Event ID 5010 appears in your system log and WAS is the source, you can be assured of two things. The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS). The second is that WAS attempted to shut down the offending process. Generally, this means that either the worker process hung or it did not have enough resources to respond to the ping request. The resolution in the documentation for this Event is listed as:
A worker process that fails to respond may exhibit one of the following symptoms:
There are some links provided as well that provide guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process. In each of those 3 cases, there is a deterministic approach with which you can take a memory dump and investigate the issue. However, in practice there are times when the worker process may simply hang without displaying any of those behaviors. So the question becomes, how do we know when the worker process is hung? The answer is, WAS knows, so we just have to ask WAS.
I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere. For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below. If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.
When working with an issue like this, our first step is always to clarify what we know and what we need to know.
To get to the answers we need, we’ll investigate the WAS process with WinDbg. Our goal will be to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure. We’ll start by attaching the debugger to the WAS process and reloading symbols. Symbols are important here because we’ll need them to investigate the method calls. Once we have the debugger attached, we can take a look at the loaded modules in an effort to make some guesses at where we should look. We’ll do this by running lm, for list modules. We can append 1m and sm to only list the module names and sort them alphabetically.
0:000> lm 1m sm ADVAPI32 bcryptPrimitives clbcatq combase CRYPT32 CRYPTBASE CRYPTSP DPAPI GDI32 HTTPAPI IISRES iisutil iisw3adm KERNEL32 KERNELBASE ktmw32 logoncli mlang MSASN1 msvcrt nativerd NSI ntdll ntmarta ole32 OLEAUT32 pcwum profapi RPCRT4 rsaenh sechost Secur32 SSPICLI svchost USER32 USERENV w3ctrlps W3TP WS2_32 XmlLite
Now that we have the modules, we can make some guesses as to which ones might be related to the failed ping request processing. It seems likely that the module responsible will either be one of the IIS processes or a process with w3 in the name. Next in the debugger we can search for method calls in those modules that could likely be related. We can use the x command to search for the methods using wild cards.
0:000> x *iis*!*ping* 000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>) 000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information> 000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>) 000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>) 000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information> 000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>) 000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>) 000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>) 000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information> 000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information> 000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>) 000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information> 000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>) 000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information> 000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>) 000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information> 000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>) 000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information> 000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information> 000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information> 000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information> 000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>) 000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>) 000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information> 000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information> 000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>
The first method call returned looks very promising and is a good place to start. We can set a breakpoint on that line item, tell the process to continue, then intentionally cause a failed ping request to see if the breakpoint trips.
0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem 0:000> g
To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.
After the time specified in the Application Pool’s configuration, the WAS process should identify the worker process as hung at attempt to shut it down. If our breakpoint is correct, the debugger, still attached to the WAS process, should break in. As it turns out, that is exactly what happens in this case.
Breakpoint 0 hit iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000
The breakpoint has been hit, and if we check Process Explorer, we can see the process is still alive. Further, we know because of the calling convention that we’re dealing with a WORKER_PROCESS object. At this point, we could absolutely write a script to dump all of the worker processes on the machine and we would know we got a dump of the hung process. If however, we’re working on a machine that has a lot of worker processes, it may be more beneficial to find the actual worker process being referred to and just dump that object. The WORKER_PROCESS object will almost undoubtedly contain a PID for the worker process with which we could use with a script to dump just that worker process. We can investigate the current thread and try to find the worker process object. We can then look to see if there is an offset from the object that points to the PID.
We’ll start by dumping the call stack.
0:002> k Child-SP RetAddr Call Site 000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem 000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41 000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13 000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2 000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35 000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf 000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94 000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a 000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
This gives us an idea of the calling order, but the public symbols we’re using won’t give us local variables and in this instance they won’t tell us if and what parameters are passed. x86 Calling conventions dictate that in a 64 bit process, the first 4 parameters are stored in the CPU registers RCX, RDX, R8, then R9 in that order. Any parameters in addition to that will be stored on the stack. If this were a 32 bit process, the parameters would most likely all be on the call stack. Knowing this, we can start by searching the call stack and the 4 registers mentioned for the PID we identified. In fact, any of the non null registers displayed with the r command are probably a good place to look, excluding RIP which will point to code.
0:002> r rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520 rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520 rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0 r8=000000cda86cfb28 r9=000000cda86cfc10 r10=0000000000000000 r11=0000000000000246 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl zr na po nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
We can see that RAX and RBX are null, but RCX and RDX are not. We’ll start with the call stack then move on to the other registers. For the call stack, we’ll search between 100 bytes under the top of the stack pointer (RSP) and up to the stack base (RBP). Remember, stacks call stacks start at a higher address and grow down as they move up the stack. We’re searching for a word value because that’s plenty of space for the PID to be stored in and searching for our known PID, 8012.
0:002> s -[1]w @rsp-100 @rbp 0n8012
This returned no results. So, we don’t have the PID on the call stack. Next we’ll search from RCX to 300 Bytes greater than RCX.
0:002> s -[1]w @rcx L300 0n8012 0x000000cd`a910e568 0x000000cd`a910e56c
In this case we have two results. We can see how far from RCX the first result is.
0:002> ?0x000000cd`a910e568-@rcx Evaluate expression: 72 = 00000000`00000048
We get an offset of 72 bytes, which is very close. We can do a couple things to verify that we can use this offset in the future. First, we can check to see if we have any symbol information near RCX.
0:002> dps @rcx 000000cd`a910e520 000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable' 000000cd`a910e528 000000cd`43525057 000000cd`a910e530 000000cd`a90f4998 000000cd`a910e538 000000cd`a90f4998 000000cd`a910e540 00000000`00000005 000000cd`a910e548 000000cd`a9390db0 000000cd`a910e550 00000000`00000006 000000cd`a910e558 000000cd`a90f4930 000000cd`a910e560 000000cd`a90f4b40 000000cd`a910e568 00001f4c`00001f4c 000000cd`a910e570 00000000`0000031c 000000cd`a910e578 00000000`00000000 000000cd`a910e580 000000cd`a7f7ed20 000000cd`a910e588 00000000`00000001 000000cd`a910e590 000000cd`00000000 000000cd`a910e598 00000000`00000000
It seems RCX is a pointer to a WORKER_PROCESS virtual function table. This means it is essentially the address for the WORKER_PROCESS object. It also means our offset of RCX+72 is probably correct and could be used consistently. We can be sure with one last check. We need to let the debugger continue, create and pause another worker process, let the debugger break in at our earlier breakpoint, then validate that RCX+72 points to the new PID.
With the worker process paused, we note our new PID is 4356. In Hex that comes to 1104, so that’s what we’re looking for.
0:002> g Breakpoint 0 hit iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000 0:002> dw @rcx+0n72 l1 000000cd`a910e568 1104
Clearly, our offset works as it points to the correct new PID. This means we can use our breakpoint and this offset to create a script to dump the hung worker process. Below, I’ve listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus or even CDB and WinDbg directly.
(On a side note, review of the disassembly for the PingResponseTimerExpiredWorkItem showed that it did not have any parameters. RCX was clearly not a parameter, but it’s a common enough register that it is still good to check. It definitely paid off in this instance.)
The assumption is made that the production machine with the issue does not have internet access as this is a common practice. If it’s not the case and internet access is available, you can complete Step 1 on the production machine.
Dim w3wpProcessIdrcx = CausingThread.Register("rcx")w3wpProcessId = Debugger.ReadDWORD(rcx + 72)WriteToLog "w3wp process id = " & w3wpProcessId Debugger.CreateDumpForProcessID w3wpProcessID,"Failed Ping",False
[9/30/2012 2:46:53 PM] Initializing control script [9/30/2012 2:46:53 PM] Clearing any existing breakpoints [9/30/2012 2:46:53 PM] [9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem [9/30/2012 2:46:53 PM] bpID = 0 [9/30/2012 2:46:53 AM] Current Breakpoint List(BL) 0 e 000007fb`45efb2d4 0001 (0001) 0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
You can test the above set up by suspending a worker process with Process Explorer and making sure a memory dump is created. That’s all there is to it. Of course, when you have the memory dump, that’s just the beginning as you will have to debug that dump to identify the cause of the hang.
-Jarrod