I ran into a case recently where there was a large amount of memory being used in a 32 bit IIS process leading to an out of memory exception. The steps to getting the correct data are documented all over the place. Here’s the link to Tess’ site that I learned on: http://blogs.msdn.com/b/tess/archive/2008/02/15/net-debugging-demos-lab-3-memory.aspx
I’ll skip the setup and go straight to the debug of the high memory usage dump. To start with, as always when I’m debugging, we start with what we know:
Our next step is to list what we need to know:
The first thing to do is figure out where the memory usage is, and whether it’s managed or Native.
0:000> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal Free 280 355ac000 ( 853.672 Mb) 41.68% <unknown> 830 34808000 ( 840.031 Mb) 70.34% 41.02% Image 1013 12479000 ( 292.473 Mb) 24.49% 14.28% Heap 218 3189000 ( 49.535 Mb) 4.15% 2.42% Stack 129 ac0000 ( 10.750 Mb) 0.90% 0.52% Other 26 14e000 ( 1.305 Mb) 0.11% 0.06% TEB 43 2b000 ( 172.000 kb) 0.01% 0.01% PEB 1 1000 ( 4.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal MEM_PRIVATE 892 36629000 ( 870.160 Mb) 72.86% 42.49% MEM_IMAGE 1314 13df7000 ( 317.965 Mb) 26.62% 15.53% MEM_MAPPED 54 624000 ( 6.141 Mb) 0.51% 0.30%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal MEM_COMMIT 1823 369e6000 ( 873.898 Mb) 73.17% 42.67% MEM_FREE 280 355ac000 ( 853.672 Mb) 41.68% MEM_RESERVE 437 1405e000 ( 320.367 Mb) 26.83% 15.64%
The memory usage we’re concerned about is any committed memory as that will usually be close to the Private Bytes that value in PerfMon. This is shown in the State Summary section as MEM_COMMIT and is currently at approximately 874 MB. We can tell a few other things from the results above by comparing State, Type, and Usage. For one, generally the majority of Image memory will be of state MEM_COMMIT. Right off the bat, we can extract about 292 MB from MEM_COMMIT leaving us with about 582 MB to place still. If we look at the Usage summary, we can see that 840 MB is from <unknown>. That memory usage most often comes from Virtual Allocations. It’s also where any managed memory will be. Going back to our MEM_COMMIT memory, it seems likely that the majority of the 58 2MB will be handled fall under this bucket. We don’t see much memory at all in Heap, so we can conclude that the memory is either allocated by managed memory, or by a native component making use of a managed allocator.
The next step is to look at managed memory and see if we have around 500 MB of memory usage. We can do that by loading psscor and running !clrusage.
0:000> .load psscor2 0:000> !clrusage Native Heap for aspnet_filter: 0x01220000 Native Heap for .NET: 0x01fd0000 Number of GC Heaps: 8 Heap Size 0x62fb198(103,788,952) Heap Size 0x303c5b8(50,578,872) Heap Size 0x41e9e58(69,115,480) Heap Size 0x2ca3010(46,805,008) Heap Size 0x3f8209c(66,592,924) Heap Size 0x25b3b68(39,533,416) Heap Size 0x20b5a18(34,298,392) Heap Size 0x17a65a8(24,798,632) May contain Reserved segments, run with -v switch to see them. ------------------------------ GC Heap Size 0x19f5617c(435,511,676) Total Commit Size 1f9f8000 (505 MB) Total Reserved Size 11608000 (278 MB) Initial reservation type: All at once Initial Allocation Size: 18000000 (402,653,184) (384 MB) Reserved Memory Size: 31000000 (822,083,584) (784 MB) Reserved Memory Limit Size: 31000000 (822,083,584) (784 MB)
The managed committed memory is 505MB. This lines up with what we were expecting to see if the memory pressure came from managed memory. That means we’re investigating a managed memory issue. We can now move on to seeing what is taking up all of the managed memory by using !dumpheap -stat
0:000> !dumpheap -stat Using our cache to search the heap. 0x7aa50c98 1 12 System.Diagnostics.CounterCreationDataCollection 0x7aa40c38 1 12 System.ComponentModel.Design.RuntimeLicenseContext 0x7a5eed4c 1 12 System.ComponentModel.WeakHashtable+WeakKeyComparer 0x7a5eeb38 1 12 System.ComponentModel.BooleanConverter
<snipped for brevity>
0x000e8600 521 24,483,528 Free 0x02af0c6c 119,732 74,368,372 System.String 0x02af36dc 11,634 313,028,160 System.Byte[] Total 672,044 objects, Total size: 435,460,280
Right away, those Byte arrays stand out. It should be stated, that we usually expect to see Byte arrays at the bottom of the list, but as there are 313 MB of them here, it’s definitely something we want to look into.
There are a couple ways to proceed. We could dump all byte arrays and look for a pattern in the sizes, or we could look for large byte arrays. We’ll start with large byte arrays. By large, we mean anything that would have ended up in the large object heap. Since we know LoH objects are all greater than or equal to 85,000 bytes, we can specify a minimum size when we view the heap to limit to these large objects.
0:000> !dumpheap -mt 0x02af36dc -min 85000 -stat Using our cache to search the heap. Address MT Size Gen Statistics: MT Count TotalSize Class Name 0x02af36dc 139 305,074,736 System.Byte[] Total 139 objects, Total size: 305,074,736
This shows us we have 139 Byte Arrays in the large object heap that account for 290.94 MB of our managed space. Since there were over 11 thousand byte arrays total and all together they accounted for 298 MB, it’s fair to say the LoH Byte Arrays should be the focus of our investigation. Our next step then is to dump out the Byte Arrays and begin looking to see what they are and if they’re still rooted.
0:000> !dumpheap -mt 0x02af36dc -min 85000 Using our cache to search the heap. Address MT Size Gen 0x202fb838 0x02af36dc 2,077,524 3 System.Byte[] 0x205dc320 0x02af36dc 3,704,228 3 System.Byte[] 0x209648d8 0x02af36dc 4,990,036 3 System.Byte[] 0x20e26d40 0x02af36dc 125,592 3 System.Byte[] 0x3b1f0038 0x02af36dc 3,449,224 3 System.Byte[] 0x3b53a1d0 0x02af36dc 109,140 3 System.Byte[] 0x3b554c38 0x02af36dc 2,625,400 3 System.Byte[]
I ran !gcroot on several of the address listed, but did not find any roots. As it turns out, the dump I was looking at was taken on a terminate process breakpoint and many of the managed structures were already in the process of being torn down. Since we couldn’t find any roots explaining why the Byte Arrays were still around, we have to look at the Arrays themselves to figure out what they are and hopefully tie them back to some source code.
A quick db on one of the Byte Arrays shows
0:000> db 0x202fb838 202fb838 dc 36 af 02 46 b3 1f 00-ff d8 ff e1 63 1d 45 78 .6..F.......c.Ex 202fb848 69 66 00 00 4d 4d 00 2a-00 00 00 08 00 0a 01 0f if..MM.*........ 202fb858 00 02 00 00 00 16 00 00-01 b2 01 10 00 02 00 00 ................ 202fb868 00 25 00 00 01 c8 01 12-00 03 00 00 00 01 00 01 .%.............. 202fb878 00 00 01 1a 00 05 00 00-00 01 00 00 01 ee 01 1b ................ 202fb888 00 05 00 00 00 01 00 00-01 f6 01 28 00 03 00 00 ...........(.... 202fb898 00 01 00 02 00 00 01 32-00 02 00 00 00 14 00 00 .......2........ 202fb8a8 01 fe 02 13 00 03 00 00-00 01 00 01 00 00 47 46 ..............GF
Most of what we see is unhelpful, but we see the string “Exif” suggesting this might be a digital photo. Dumping further into the array we see another interesting string:
202fb9f8 00 00 00 00 00 00 45 41 53 54 4d 41 4e 20 4b 4f ......EASTMAN KO 202fba08 44 41 4b 20 43 4f 4d 50 41 4e 59 00 4b 4f 44 41 DAK COMPANY.KODA 202fba18 4b 20 45 41 53 59 53 48 41 52 45 20 56 31 30 37 K EASYSHARE V107 202fba28 33 20 44 49 47 49 54 41 4c 20 43 41 4d 45 52 41 3 DIGITAL CAMERA
That certainly supports the idea that we have a digital photo taken with a digital camera (a Kodak Easyshare to be precise). Checking a few of the other byte arrays, we can see the same pattern.
0:000> db 0x202fb838 202fb838 dc 36 af 02 46 b3 1f 00-ff d8 ff e1 63 1d 45 78 .6..F.......c.Ex 202fb848 69 66 00 00 4d 4d 00 2a-00 00 00 08 00 0a 01 0f if..MM.*........ 0:000> db 0x3c478e80 3c478e80 dc 36 af 02 1c c6 49 00-ff d8 ff e0 00 10 4a 46 .6....I.......JF 3c478e90 49 46 00 01 01 01 00 60-00 60 00 00 ff e1 19 78 IF.....`.`.....x 3c478ea0 45 78 69 66 00 00 49 49-2a 00 08 00 00 00 11 00 Exif..II*....... 0:000> db 0x3a1f0038 3a1f0038 dc 36 af 02 0a ff 32 00-ff d8 ff e0 00 10 4a 46 .6....2.......JF 3a1f0048 49 46 00 01 01 01 00 48-00 48 00 00 ff e1 23 27 IF.....H.H....#' 3a1f0058 45 78 69 66 00 00 49 49-2a 00 08 00 00 00 10 00 Exif..II*.......
At this point, we’ll assume they were all digital photos. We can revise that assumption if source review doesn’t agree with us. At this point, we really need to look at the source to see why the pictures are staying around. I don’t have the source available as this was given to me as a dump taken during shutdown, so exporting the modules didn’t work. The only thing left to really get from the dump is the pictures themselves. All that’s left to do is export a few of the byte arrays to disk so we can open them and maybe narrow down what we need to look at in the source (are these employee photos? a family outing? product photos?)
We’ll use the writemem command in windbg and simply write the byte array from the address of the array plus 8 to avoid the metadata.
0:000> !do 0x202fb838 Name: System.Byte[] MethodTable: 02af36dc EEClass: 028aeb7c Size: 2077522(0x1fb352) bytes GC Generation: 3 Array: Rank 1, Number of elements 2077510, Type Byte Element Type: System.Byte Fields: None 0:000> .writemem f:\output.jpg 0x202fb838+8 L?0x1fb352 Writing 1fb352 bytes........................................
Then we can open it up and view our result. I dumped several out, all of which were pictures of kids on what looked to be a field trip. That helped narrow down the user activity and focus the source review for the resolution. If the source code didn’t provide a resolution, our next step would have been to take another dump when the memory was high but the process wasn’t shutting down. This should have shown us how the Byte Arrays were rooted at which point we would have been able to handle them.
In this case, I used the following command to export all of the pictures at once:
.foreach (var {!dumpheap -mt 0x02af36dc -min 85000 -short}) {.writemem f:\Pictures\${var}.jpg ${var}+8 L?poi(${var}+4)}
-Jarrod
With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.” When an Event ID 5010 appears in your system log and WAS is the source, you can be assured of two things. The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS). The second is that WAS attempted to shut down the offending process. Generally, this means that either the worker process hung or it did not have enough resources to respond to the ping request. The resolution in the documentation for this Event is listed as:
A worker process that fails to respond may exhibit one of the following symptoms:
There are some links provided as well that provide guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process. In each of those 3 cases, there is a deterministic approach with which you can take a memory dump and investigate the issue. However, in practice there are times when the worker process may simply hang without displaying any of those behaviors. So the question becomes, how do we know when the worker process is hung? The answer is, WAS knows, so we just have to ask WAS.
I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere. For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below. If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.
When working with an issue like this, our first step is always to clarify what we know and what we need to know.
To get to the answers we need, we’ll investigate the WAS process with WinDbg. Our goal will be to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure. We’ll start by attaching the debugger to the WAS process and reloading symbols. Symbols are important here because we’ll need them to investigate the method calls. Once we have the debugger attached, we can take a look at the loaded modules in an effort to make some guesses at where we should look. We’ll do this by running lm, for list modules. We can append 1m and sm to only list the module names and sort them alphabetically.
0:000> lm 1m sm ADVAPI32 bcryptPrimitives clbcatq combase CRYPT32 CRYPTBASE CRYPTSP DPAPI GDI32 HTTPAPI IISRES iisutil iisw3adm KERNEL32 KERNELBASE ktmw32 logoncli mlang MSASN1 msvcrt nativerd NSI ntdll ntmarta ole32 OLEAUT32 pcwum profapi RPCRT4 rsaenh sechost Secur32 SSPICLI svchost USER32 USERENV w3ctrlps W3TP WS2_32 XmlLite
Now that we have the modules, we can make some guesses as to which ones might be related to the failed ping request processing. It seems likely that the module responsible will either be one of the IIS processes or a process with w3 in the name. Next in the debugger we can search for method calls in those modules that could likely be related. We can use the x command to search for the methods using wild cards.
0:000> x *iis*!*ping* 000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>) 000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information> 000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>) 000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>) 000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information> 000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>) 000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>) 000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>) 000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information> 000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information> 000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>) 000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information> 000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>) 000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information> 000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>) 000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information> 000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>) 000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information> 000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information> 000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information> 000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information> 000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>) 000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>) 000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information> 000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information> 000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>
The first method call returned looks very promising and is a good place to start. We can set a breakpoint on that line item, tell the process to continue, then intentionally cause a failed ping request to see if the breakpoint trips.
0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem 0:000> g
To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.
After the time specified in the Application Pool’s configuration, the WAS process should identify the worker process as hung at attempt to shut it down. If our breakpoint is correct, the debugger, still attached to the WAS process, should break in. As it turns out, that is exactly what happens in this case.
Breakpoint 0 hit iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000
The breakpoint has been hit, and if we check Process Explorer, we can see the process is still alive. Further, we know because of the calling convention that we’re dealing with a WORKER_PROCESS object. At this point, we could absolutely write a script to dump all of the worker processes on the machine and we would know we got a dump of the hung process. If however, we’re working on a machine that has a lot of worker processes, it may be more beneficial to find the actual worker process being referred to and just dump that object. The WORKER_PROCESS object will almost undoubtedly contain a PID for the worker process with which we could use with a script to dump just that worker process. We can investigate the current thread and try to find the worker process object. We can then look to see if there is an offset from the object that points to the PID.
We’ll start by dumping the call stack.
0:002> k Child-SP RetAddr Call Site 000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem 000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41 000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13 000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2 000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35 000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf 000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94 000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a 000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
This gives us an idea of the calling order, but the public symbols we’re using won’t give us local variables and in this instance they won’t tell us if and what parameters are passed. x86 Calling conventions dictate that in a 64 bit process, the first 4 parameters are stored in the CPU registers RCX, RDX, R8, then R9 in that order. Any parameters in addition to that will be stored on the stack. If this were a 32 bit process, the parameters would most likely all be on the call stack. Knowing this, we can start by searching the call stack and the 4 registers mentioned for the PID we identified. In fact, any of the non null registers displayed with the r command are probably a good place to look, excluding RIP which will point to code.
0:002> r rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520 rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520 rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0 r8=000000cda86cfb28 r9=000000cda86cfc10 r10=0000000000000000 r11=0000000000000246 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl zr na po nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
We can see that RAX and RBX are null, but RCX and RDX are not. We’ll start with the call stack then move on to the other registers. For the call stack, we’ll search between 100 bytes under the top of the stack pointer (RSP) and up to the stack base (RBP). Remember, stacks call stacks start at a higher address and grow down as they move up the stack. We’re searching for a word value because that’s plenty of space for the PID to be stored in and searching for our known PID, 8012.
0:002> s -[1]w @rsp-100 @rbp 0n8012
This returned no results. So, we don’t have the PID on the call stack. Next we’ll search from RCX to 300 Bytes greater than RCX.
0:002> s -[1]w @rcx L300 0n8012 0x000000cd`a910e568 0x000000cd`a910e56c
In this case we have two results. We can see how far from RCX the first result is.
0:002> ?0x000000cd`a910e568-@rcx Evaluate expression: 72 = 00000000`00000048
We get an offset of 72 bytes, which is very close. We can do a couple things to verify that we can use this offset in the future. First, we can check to see if we have any symbol information near RCX.
0:002> dps @rcx 000000cd`a910e520 000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable' 000000cd`a910e528 000000cd`43525057 000000cd`a910e530 000000cd`a90f4998 000000cd`a910e538 000000cd`a90f4998 000000cd`a910e540 00000000`00000005 000000cd`a910e548 000000cd`a9390db0 000000cd`a910e550 00000000`00000006 000000cd`a910e558 000000cd`a90f4930 000000cd`a910e560 000000cd`a90f4b40 000000cd`a910e568 00001f4c`00001f4c 000000cd`a910e570 00000000`0000031c 000000cd`a910e578 00000000`00000000 000000cd`a910e580 000000cd`a7f7ed20 000000cd`a910e588 00000000`00000001 000000cd`a910e590 000000cd`00000000 000000cd`a910e598 00000000`00000000
It seems RCX is a pointer to a WORKER_PROCESS virtual function table. This means it is essentially the address for the WORKER_PROCESS object. It also means our offset of RCX+72 is probably correct and could be used consistently. We can be sure with one last check. We need to let the debugger continue, create and pause another worker process, let the debugger break in at our earlier breakpoint, then validate that RCX+72 points to the new PID.
With the worker process paused, we note our new PID is 4356. In Hex that comes to 1104, so that’s what we’re looking for.
0:002> g Breakpoint 0 hit iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem: 000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000 0:002> dw @rcx+0n72 l1 000000cd`a910e568 1104
Clearly, our offset works as it points to the correct new PID. This means we can use our breakpoint and this offset to create a script to dump the hung worker process. Below, I’ve listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus or even CDB and WinDbg directly.
(On a side note, review of the disassembly for the PingResponseTimerExpiredWorkItem showed that it did not have any parameters. RCX was clearly not a parameter, but it’s a common enough register that it is still good to check. It definitely paid off in this instance.)
The assumption is made that the production machine with the issue does not have internet access as this is a common practice. If it’s not the case and internet access is available, you can complete Step 1 on the production machine.
Dim w3wpProcessIdrcx = CausingThread.Register("rcx")w3wpProcessId = Debugger.ReadDWORD(rcx + 72)WriteToLog "w3wp process id = " & w3wpProcessId Debugger.CreateDumpForProcessID w3wpProcessID,"Failed Ping",False
[9/30/2012 2:46:53 PM] Initializing control script [9/30/2012 2:46:53 PM] Clearing any existing breakpoints [9/30/2012 2:46:53 PM] [9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem [9/30/2012 2:46:53 PM] bpID = 0 [9/30/2012 2:46:53 AM] Current Breakpoint List(BL) 0 e 000007fb`45efb2d4 0001 (0001) 0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
You can test the above set up by suspending a worker process with Process Explorer and making sure a memory dump is created. That’s all there is to it. Of course, when you have the memory dump, that’s just the beginning as you will have to debug that dump to identify the cause of the hang.
With the recent release of Visual Studio 2012, there are many new features and updates to explore. One of the new features is the concept of Caller Information, available in both C# and VB. Caller Information allows you to obtain certain information about the caller to a method. More specifically, it allows you to get the name of the calling method as well as the source file information and line number.
Many customers currently use catch blocks to document the name of the current method and maybe even the source file from which an exception is thrown. But Caller Information provides a way to get this information (and more – source line info), even if no exception occurs. The way it works is that you include attributes as optional parameters in your function’s signature. There are three of these, and they’re defined in the System.Runtime.CompilerOptions namespace. You should supply a default value to whichever ones you include in your function’s signature.
CallerFilePath – The full path and filename (during compile time) of the caller.
CallerLineNumber – The line number (in the source code) where the caller calls the callee.
CallerMembername – The method or property name of the caller.
Caller Information can be of particular value for profiling or tracing tools. However, you don’t need to write a profiler in order to take advantage of this new feature. One example where an application could make use of Caller Information would be if you have a function (say, MyFunc) which is called from many different functions. Let’s say MyFunc takes an argument or two, and occasionally MyFunc gets an exception when using the arguments – for example, a NullReferenceException. But the fact that MyFunc doesn’t reproduce this exception consistently may make it difficult to determine what the code path is that results in the exception.
If MyFunc captured the calling function, you’d have a much better idea of how to reproduce the problem and, therefore, find root cause and a resolution.
Here’s a demonstration of how to use CallerInformation. I created a console application with the following code. (Sincere apologies as to the format shown below. I struggled for a while to remove the individual boxes around each block of code, but doing so only removed all the indentation, which made the code more difficult to read).
using System.Runtime.CompilerServices;namespace CallerInfoSample{
class Program{
static void Main(string[] args){
string str = "";str = Func1();Console.WriteLine(str);Console.ReadLine();
} static string Func1([CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = ""){
string s = "";s = "Before calling Func2, Func1 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";Console.WriteLine(s);Console.WriteLine("=====");bool b = Func2();s = "After calling Func2, Func1 reports the line number where " + sMemberName + " called the current function is still " + iLineNbr + " since we haven't finished executing Func1.";Console.WriteLine("Func2 returns " + b.ToString() + ".");Console.WriteLine(s);Console.WriteLine("=====");return "Done Processing";
} static bool Func2([CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = ""){
string s = "";s = "Before calling Func3, Func2 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";Console.WriteLine(s);Console.WriteLine("=====");int j = 0;int i = Func3(j);return true;
} static int Func3(int i, [CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = ""){
string s = "";s = "Before returning to Func2, Func3 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";Console.WriteLine(s);Console.WriteLine("=====");return (i + 1);
}
It results in the following output:
Before calling Func2, Func1 reports the line number where Main called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 16.=====Before calling Func3, Func2 reports the line number where Func1 called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 29.=====Before returning to Func2, Func3 reports the line number where Func2 called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 47.=====Func2 returns True.After calling Func2, Func1 reports the line number where Main called the current function is still 16 since we haven't finished executing Func1.=====Done Processing
For more information on Caller Information, see MSDN at http://msdn.microsoft.com/en-us/library/hh534540.aspx.
NOTE: I could only make use of Caller information when I configured VS2012 to compile against 4.5 of the .NET Framework; the functionality isn't there in 4.0.
With the recent release of Visual Studio 2012, there are many new features and updates to explore. One of the new features is Exploratory Testing, which allows you to test your software without a pre-defined test or script. It’s perfect for simply “exploring” your application to see what issues you may encounter. Exploratory Testing comes with the Test Professional, Premium, and Ultimate editions of VS 2012.
Technically, Exploratory Testing is part of Microsoft Test Manager (MTM). In VS2010, one could achieve exploratory testing by filing an exploratory bug in Test Runner. From there, users/testers could execute unscripted actions until a bug was found. But in VS2012, exploratory testing is pulled into MTM along with your other tests, plans, suites, tracks, work items, etc. You can learn about some of the other new features in MTM 2012 in MSDN Magazine. In order to use Exploratory Testing in MTM, you must be connected to Team Foundation Server (TFS) 2012. If you don’t have access to a TFS 2012 instance, you can try out TFS in the cloud free, for a limited time, at http://tfspreview.com. The instructions at that site are simple and include everything you need to create an account, project, upload source, automate builds, etc.
NOTE: At the time of this writing, no information was available as to how long tfspreview.com would be available (or at least, available for free). While the end goal of TFS Azure is to have all the features of TFS, you may find that not all features of TFS - such as Reporting Services, AD Federation, etc. - are available at tfspreview.com.
With exploratory testing, you can literally browse through your application at your leisure, until you find a bug or some other part of the software that should be considered for change – editing text in a label, background color change, adding or removing a feature, etc. And the whole time you’re “exploring” the application, MTM records your actions. By default, actions you execute in MTM or Office applications are not recorded, though you can change this in your test plan properties (Testing Center, Plan, Properties). Exploratory testing in MTM allows you to do the following:
Exploratory testing can be used on web applications, windows applications, and if it’s a client/server application, some steps on the server side can also be recorded. More details on this, as well as the steps to run an exploratory test can be found at http://msdn.microsoft.com/en-us/library/tfs/hh191621(v=vs.110).aspx. While I have no desire to copy what can already be found on MSDN, I also know that sometimes links change, content is moved, etc. Therefore, I’ll repeat some of those steps in this blog entry.
b. If you do have an existing Work Item, you can perform exploratory testing on a specific work item by going to Test, Do Exploratory Testing. Then right-click on your work item of choice and select Explore.
2. When you select Explore, the Exploratory Testing window opens, waiting for you to Start your test. Once clicked, MTM records your actions.
3. You can pause or end the test, and this pauses or ends recording of your actions. At any time you can enter comments and/or add screen captures in the Exploratory Testing window. Additionally, if you run into a bug you can click the Create Bug button during recording, and fill out details of your newfound bug. When you save your new bug, you’ll have the option to create a Test Case for the bug you just created.
These are just the steps to get you started, and there are many other options for saving Test Cases, assigning & editing bugs, and more. Exploratory testing is a great option when you don’t have a defined test, a defined script as to how to navigate directly to a specific bug, or even when you don’t have a specific idea of what problem you seek to uncover in your application.
DebugDiag and Procdump are two of my favorite tools. They're both incredibly useful and I rely on them heavily to gather MemoryDumps when debugging production issues. They have a myriad of overlapping features, but they each have (at least) one feature that makes them stand apart from each other:
The catch is, DebugDiag currently can't create reflected process dumps, and ProcDump doesn't have a way to inject the LeakTrack dll to track allocations. We can get around this by working with both tools. Assuming both tools are installed on the server with a leaky application, the process is as follows:
Step 1 is easy and documented elsewhere. Step 2 is easy if we have direct GUI access to the machine. We can simply go to the Processes tab in DebugDiag, right click the process, and chose "Start Monitory for Leaks." For the sake of argument, let's say we want to script the whole process. We can do that by scripting DebugDiag and ProcDump to do the individual tasks we've set out for them. Once we have the PID of the troubled process, we can use a script to inject the LeakTrack dll into the process. The following vbs script will, when used by DebugDiag, start the native leak tracking process:
Sub Debugger_OnInitialBreakpoint() Debugger.Write "Injecting LeakTrack" & vbLf Debugger.InjectLeakTrack Debugger.Write "Detaching2" & vbLf Debugger.DetachEnd Sub
The above code can be saved into a vbs file. For this post we'll assume it was saved to c:\scripts\InjectLeakTrack.vbs. We'll also assume our PID was 1234
With the PID known and the script created, we can launch DebugDiag from a command line as such (if you installed DebugDiag elsewhere, you'll have to adjust the install location): C:\PROGRA~1\DEBUGD~1\DbgHost.exe -script "c:\scripts\InjectLeakTrack.vbs" -attach 1234
This will cause DbgHost, the debugger process DebugDiag uses, to attach to the process with PID 1234 and run the script we provided. The script simply injects the LeakTrack dll and detaches the debugger. Now that LeakTrack is injected, we can launch ProcDump and tell it to reflect the process and create a memory dump when we've crossed a memory threshold. Let's say we know our application has leaked if it ever crosses 600MB. We can kick off ProcDump as follows: procdump 1234 C:\Dumps\Over600.dmp -accepteula -r -m 600 -ma
ProcDump will then monitor the process and create a reflected process dump when the committed bytes of the process breach 600 MB. This memory dump can be analyzed by DebugDiag on the Advanced Analysis tab, and the offending callstacks can be identified.
In this type of scenario, it’s generally not immediately clear what steps we should take for a resolution. This means we often have to guess. Guessing is somewhat of a misnomer in this situation. I’m not advocating random guessing by any means, but rather taking a logical approach to guessing. When selecting an action based on guess work, you should be able to explain each step along the way. This means we need a way to guess well.
The key to guessing well is leveraging what we do know about a system to in an attempt to make some logical assumptions to fill in the gaps of information that we don’t know. These are still guesses, but they’re logical guesses based on reason and sound thinking.
Below is a simple case study of a guess the number game written in Silverlight. The goal of the case study is not to demonstrate how to hack a guess the number game, but rather to illustrate the reasoning behind the guesswork needed to beat the game without relying on…well, guessing.
The game works by prompting the user to guess a number and click a button. If the user guesses incorrectly, they see the following screen:
Our goal is to figure out what number to enter in order to guess correctly. To start with, we want to aggregate what we know about the application:
Next we have to figure out what we want to know.
With these questions, the logical first step is to talk to one of the developers that support the project. We’ll assume we got the following back:
Ok, we have a couple of good answers now. We’ve expanded what we know about the application. Now it’s time to take some action. Since the developer told us this is a Silverlight application hosted in a browser we also know that to debug this we can connect a debugger to the browser.
So, we hook up to the browser, and now we have to decide what to look for. We don’t know much about how the application works. We don’t know where the application stores the correct value. We don’t know enough about the application to know what method does the comparison of values. We don’t even know if the comparison code is hosted locally or if a call is made to a server.
That’s a lot of things we don’t know, so let’s go back to what we do know. We know that to kick off the comparison operation the user has to click a button. We also know this is a Silverlight application. We can guess that the button will be a Silverlight button.
This means we have a first step. We can use the debugger to search for the button in the managed heap.
First step is to load SOS for Silverlight:
0:038> .loadby sos coreclr
With SOS loaded, we can look on the managed heap for a button object. We might not know going into to it exactly what type is needed to create a Silverlight button, but we can guess that it probably contains the word “Button.”
0:038> !dumpheap -stat -type Button
Statistics:
MT Count TotalSize Class Name
50878ec8 1 32 MS.Internal.FXCallbackDelegates+PasswordBox_AttachRevealButtonListenerDelegate
50878c1c 1 32 MS.Internal.FXCallbackDelegates+TextBox_AttachDeleteButtonClickHandlerDelegate
508ad1c0 1 128 System.Windows.Controls.Button
Total 3 objects
System.Windows.Controls.Button looks promising. There is only 1 of them and it ‘feels’ right. Next step is to find the address for that type:
0:038> !dumpheap -mt 508ad1c0
Address MT Size
0c862790 508ad1c0 128
Now we have an address, so clearly the next step is to dump the object:
0:038> !do 0c862790
Name: System.Windows.Controls.Button
MethodTable: 508ad1c0
EEClass: 504f3fd4
Size: 128(0x80) bytes
File: C:\Program Files (x86)\Microsoft Silverlight\5.0.61118.0\System.Windows.dll
Fields:
MT Field Offset Type VT Attr Value Name
50898494 4000499 4 ...eObjectSafeHandle 0 instance 0c86281c m_nativePtr
508986f4 400049a 8 ... System.Windows]] 0 instance 0c862ad4 _valueTable
50e36b40 400049b 24 System.Boolean 1 instance 0 _propagateInheritanceChanged
50894fa4 400049c c ....DependencyObject 0 instance 00000000 _inheritanceParent
508989fc 400049d 10 ...reTypeEventHelper 0 instance 0c862810 _coreTypeEventHelper
508777b8 400049e 14 ...angedEventHandler 0 instance 00000000 DPChanged
50e139ec 400049f 18 System.EventHandler 0 instance 00000000 _inheritaneContextChanged
50875ab4 40004a1 1c ...bject, mscorlib]] 0 instance 0c8628a8 _treeChildren
508735b0 40004a2 20 ...rnal.IManagedPeer 0 instance 0c836118 _treeParent
50e42e50 40004a0 10 ...flection.Assembly 0 shared static _executingAssembly
…
50e36b40 4001120 64 System.Boolean 1 instance 1 _isLoaded
50e36b40 4001121 65 System.Boolean 1 instance 0 _isMouseCaptured
50e36b40 4001122 66 System.Boolean 1 instance 0 _isSpaceOrEnterKeyDown
50e36b40 4001123 67 System.Boolean 1 instance 0 _isMouseLeftButtonDown
508aa2a4 4001124 6c System.Windows.Point 1 instance 0c8627fc _mousePosition
5089acf4 4001125 58 ...mation.Storyboard 0 instance 00000000 _currentState
50e36b40 4001126 68 System.Boolean 1 instance 0 _suspendStateChanges
50e36b40 4001127 69 System.Boolean 1 instance 0 _isSuspendingIsEnabled
5089eb88 4001128 5c ...tArgs, mscorlib]] 0 instance 00000000 _weakCanExecuteChangedListener
50873a68 4001129 60 ...outedEventHandler 0 instance 0c862fb0 Click
508987dc 400111a e80 ...ependencyProperty 0 shared static ClickModeProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84dc6c <<
508987dc 400111b e84 ...ependencyProperty 0 shared static IsFocusedProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84dd4c <<
508987dc 400111c e88 ...ependencyProperty 0 shared static IsMouseOverProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84dd88 <<
508987dc 400111d e8c ...ependencyProperty 0 shared static IsPressedProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84ddf4 <<
508987dc 400111e e90 ...ependencyProperty 0 shared static CommandProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84df18 <<
508987dc 400111f e94 ...ependencyProperty 0 shared static CommandParameterProperty
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:0c84df78 <<
5088fe30 400112a e98 ...tArgs, mscorlib]] 0 shared static CS$<>9__CachedAnonymousMethodDelegate5
>> Domain:Value 0b0b0cd0:NotInit 0b0807c0:00000000 <<
Some of the above output was trimmed for brevity. We have some more information. We might not have the internals of the System.Windows.Controls.Button memorized, but we do know enough to know that there is very likely a click event. We can guess that event will have the word click in it. Looking in the output above, we find this line:
Dumping this object, we find:
0:038> !do 0c862fb0
Name: System.Windows.RoutedEventHandler
MethodTable: 50873a68
EEClass: 504ddb60
Size: 32(0x20) bytes
50e36684 400002d 4 System.Object 0 instance 0c8163e8 _target
50e36684 400002e 8 System.Object 0 instance 00000000 _methodBase
50e3dfd8 400002f c System.IntPtr 1 instance 803c430 _methodPtr
50e3dfd8 4000030 10 System.IntPtr 1 instance 0 _methodPtrAux
50e36684 4000031 14 System.Object 0 instance 00000000 _invocationList
50e3dfd8 4000032 18 System.IntPtr 1 instance 0 _invocationCount
Again, we have an object whose internals may not be familiar to us, so we have to guess. _target and _methodPtr both seem like good guesses. We’ll start by dumping _target:
0:038> !do 0c8163e8
Name: GuessTheNumber.MainPage
MethodTable: 080356b8
EEClass: 0835374c
Size: 120(0x78) bytes
File: GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
50898494 4000499 4 ...eObjectSafeHandle 0 instance 0c81646c m_nativePtr
508986f4 400049a 8 ... System.Windows]] 0 instance 00000000 _valueTable
508989fc 400049d 10 ...reTypeEventHelper 0 instance 0c816460 _coreTypeEventHelper
50875ab4 40004a1 1c ...bject, mscorlib]] 0 instance 0c8637c8 _treeChildren
508735b0 40004a2 20 ...rnal.IManagedPeer 0 instance 00000000 _treeParent
08035848 4000002 54 ...er.MyRandomObject 0 instance 0c8164a8 randomObject
5089f2f4 4000003 58 ...ows.Controls.Grid 0 instance 0c836118 LayoutRoot
08035b18 4000004 5c ...ws.Controls.Label 0 instance 0c81dd78 label1
08035b18 4000005 60 ...ws.Controls.Label 0 instance 0c836730 label2
50897f28 4000006 64 ....Controls.TextBox 0 instance 0c83779c guessTextBox
508ad1c0 4000007 68 ...s.Controls.Button 0 instance 0c862790 submitButton
08035b18 4000008 6c ...ws.Controls.Label 0 instance 0c863124 resultLabel
50e36b40 4000009 70 System.Boolean 1 instance 1 _contentLoaded
The abbreviated content above looks promising. The controls listed as members of the object look promising. We still don’t have a click event though. We’re left guessing one more time. This time, we’ll guess that the click method is associated with the MethodTable for GuessTheNumber.MainPage. Based on this guess, we’ll dump the method descriptors for the MethodTable:
0:038> !dumpmt -md 080356b8
Module: 080346e4
mdToken: 02000003
BaseSize: 0x78
ComponentSize: 0x0
Slots in VTable: 63
Number of IFaces in IFaceMap: 4
--------------------------------------
MethodDesc Table
Entry MethodDe JIT Name
50c32330 50a1b150 PreJIT System.Object.ToString()
50c32350 50a1b158 PreJIT System.Object.Equals(System.Object)
50c323c0 50a1b178 PreJIT System.Object.GetHashCode()
50c323d0 50a1b190 PreJIT System.Object.Finalize()
5074cd34 50508d44 PreJIT
50785800 50514cbc PreJIT System.Windows.Controls.Control.OnManipulationCompleted(System.Windows.Input.ManipulationCompletedEventArgs)
50785828 50514cc4 PreJIT System.Windows.Controls.Control.OnTap(System.Windows.Input.GestureEventArgs)
50785850 50514ccc PreJIT System.Windows.Controls.Control.OnDoubleTap(System.Windows.Input.GestureEventArgs)
50785878 50514cd4 PreJIT System.Windows.Controls.Control.OnHold(System.Windows.Input.GestureEventArgs)
08370240 0803566c JIT GuessTheNumber.MainPage..ctor()
0803c071 08035674 NONE GuessTheNumber.MainPage.submitButton_Click(System.Object, System.Windows.RoutedEventArgs)
083703d0 08035680 JIT GuessTheNumber.MainPage.InitializeComponent()
Looks like we’re on to something. The submitButton_Click event looks promising. We can guess that is the method we’ve been looking for and dump out the IL:
0:038> !dumpil 08035674
ilAddr = 083302c8
IL_0000: nop
IL_0001: ldarg.0
IL_0002: ldfld GuessTheNumber.MainPage::guessTextBox
IL_0007: callvirt System.Windows.Controls.TextBox::get_Text
IL_000c: ldloca.s VAR OR ARG 0
IL_000e: call System.Int32::TryParse
IL_0013: pop
IL_0014: ldloc.0
IL_0015: ldarg.0
IL_0016: ldfld GuessTheNumber.MainPage::randomObject
IL_001b: callvirt GuessTheNumber.MyRandomObject::get_randomValue
IL_0020: ceq
IL_0022: ldc.i4.0
IL_0023: ceq
IL_0025: stloc.1
IL_0026: ldloc.1
IL_0027: brtrue.s IL_003c
IL_0029: ldarg.0
IL_002a: ldfld GuessTheNumber.MainPage::resultLabel
IL_002f: ldstr "You chose wisely."
IL_0034: callvirt System.Windows.Controls.ContentControl::set_Content
IL_0039: nop
IL_003a: br.s IL_0058
IL_003c: ldarg.0
IL_003d: ldfld GuessTheNumber.MainPage::resultLabel
IL_0042: ldstr "Your guess of {0} was incorrect."
IL_0047: ldloc.0
IL_0048: box System.Int32
IL_004d: call System.String::Format
IL_0052: callvirt System.Windows.Controls.ContentControl::set_Content
IL_0057: nop
IL_0058: ret
IL isn’t always easy to follow without practice, but the above makes it clear we are in the right place since we can see both, “You chose wisely,” and “Your guess of {0} was incorrect.” To make the code easier to read, we can extract the module and decompile the IL into C#
0:038> !dumpmodule 080346e4
Name: GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
Attributes: PEFile
Assembly: 0b0c2310
LoaderHeap: 00000000
TypeDefToMethodTableMap: 08030124
TypeRefToMethodTableMap: 0803013c
MethodDefToDescMap: 0803020c
FieldDefToDescMap: 08030248
MemberRefToDescMap: 0803027c
FileReferencesMap: 0803035c
AssemblyReferencesMap: 08030360
MetaData start address: 08330d00 (4512 bytes)
0:038> lmm GuessTheNumber
start end module name
08330000 08332600 GuessTheNumber C (no symbols)
0:038> !savemodule 08330000 d:\GuessTheNumberModule.dll
3 sections in file
section 0 - VA=2000, VASize=1da4, FileAddr=200, FileSize=1e00
section 1 - VA=4000, VASize=370, FileAddr=2000, FileSize=400
section 2 - VA=6000, VASize=c, FileAddr=2400, FileSize=200
Using a decompiler such as Redgate’s reflector, we see the following C#:
private void submitButton_Click(object sender, RoutedEventArgs e)
{
int num;
int.TryParse(this.guessTextBox.get_Text(), out num);
if (num == this.randomObject.randomValue)
this.resultLabel.set_Content("You chose wisely.");
else
this.resultLabel.set_Content(string.Format("Your guess of {0} was incorrect.", num));
Now it’s clear what’s going on here. The value entered by the user into the text box is parsed into an int then compared to another int stored in this.randomObject.randomValue. All that’s left for us to do is to find the current value for randomValue. We know that in this case, this refers to a MainPage object. Interestingly, that’s the exact type we dumped previously as the _target from the event handler that we found. Reviewing our previous output, we find:
Dumping the object we see:
0:038> !do 0c8164a8
Name: GuessTheNumber.MyRandomObject
MethodTable: 08035848
EEClass: 08353800
Size: 12(0xc) bytes
50e371d4 4000001 4 System.Int32 1 instance 174 <randomValue>k__BackingField
We see a member variable named randomValue which is exactly what we were looking for. Using the value current value of 174, we’ll try another guess in the application:
Good guess.
To recap, we were tasked with troubleshooting an application that we knew very little about. We asked a few questions, used a bit of prior knowledge, and a handful of reason driven guesses to get us to a solution.
Although it’s unlikely you will ever need to cheat at a Guess the Number game in a production environment, the approach here is applicable to any situation in which you are approaching with limited domain specific knowledge.
Guess wisely my friends.
My name is Brad and I’ve been on the PFE team here at Microsoft for many years. Suffice to say, I’m overdue for contributing to the team blog. I’ve seen lots of interesting (and not-so-interesting) issues with customers all over the world in my time at PFE. What follows is an issue I worked earlier this year. For me, the most interesting part of this issue was not so much in finding root cause as it was the process of discovering who was behind root cause.
The Problem
It all started with a customer who reported their ASP.NET application had an OutOfMemory issue. These kinds of issues are not at all uncommon in the .NET world, and the trick usually comes down to finding what object(s) are rooted so that the .NET Garbage Collector can’t reclaim the memory associated with said object(s).
Getting data from the problem
They sent me a dump of the problematic application pool, mentioning that they dumped it two hours after they received notice about the OutOfMemoryException (OOM). My initial thought on this was that they had recycled the process and then obtained a dump of the fresh new process. This would obviously be no good since the w3wp instance exhibiting the OOM was gone, and the dump instead represented a new process instance with no memory pressure. Unfortunately, this wouldn’t be the first time I’d had this problem.
However, when I received the dump, I was pleasantly surprised to see that they had dumped the correct process instance. It was over 1GB in size, representing a 32-bit process with significant memory pressure. And when I looked at the length of time the process had been alive, that verified this was the same instance that threw the OOM.
Windows Server 2003 Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible Product: Server, suite: TerminalServer SingleUserTS Machine Name: Debug session time: Thu Feb 25 15:07:20.000 2010 (UTC - 5:00) System Uptime: 194 days 18:11:06.875 Process Uptime: 0 days 15:03:28.000
Analyzing the data, Part 1
After searching through the dump I found the problematic object that led to the OOM. And after talking with the right folks at the customer, we pieced back together how the problem arose in addition to a resolution to this problem.
So one question was answered, but another one remained: How did this process manage to stay alive for a full two hours after getting the OOM? I’ve been debugging issues like this since .NET 1.0 was released, and I can tell you that this wasn’t a normal set of circumstances.
The thread that threw the OOM turned out to be the same thread that answered the question.
# ChildEBP RetAddr 00 0c3aedb8 7c82775b ntdll!KiFastSystemCallRet 01 0c3aedbc 773d7a4b ntdll!NtRaiseHardError+0xc 02 0c3aee18 773b8377 user32!ServiceMessageBox+0x145 03 0c3aef74 7739eec9 user32!MessageBoxWorker+0x13e 04 0c3aefcc 773d7d0d user32!MessageBoxTimeoutW+0x7a 05 0c3af000 773c42c8 user32!MessageBoxTimeoutA+0x9c 06 0c3af020 773c42a4 user32!MessageBoxExA+0x1b 07 0c3af03c 7c34c224 user32!MessageBoxA+0x45 08 0c3af070 7c348e6c msvcr71!__crtMessageBoxA+0xf4 09 0c3af294 7c34cf83 msvcr71!_NMSG_WRITE+0x12e 0a 0c3af2cc 7a09bea7 msvcr71!abort+0x7 0b 0c3af2d8 77e761b7 mscorwks!InternalUnhandledExceptionFilter+0x16 0c 0c3af530 77e792a3 kernel32!UnhandledExceptionFilter+0x12a 0d 0c3af538 77e61ac1 kernel32!BaseThreadStart+0x4a 0e 0c3af560 7c828752 kernel32!_except_handler3+0x61 0f 0c3af584 7c828723 ntdll!ExecuteHandler2+0x26 10 0c3af62c 7c82863c ntdll!ExecuteHandler+0x24 11 0c3af90c 77e4bee7 ntdll!RtlRaiseException+0x3d 12 0c3af96c 78158e89 kernel32!RaiseException+0x53 13 0c3af9a4 7a14fa70 msvcr80!_CxxThrowException+0x46 14 0c3af9b8 7a108013 mscorwks!ThrowOutOfMemory+0x24 15 0c3afae4 7a109f7d mscorwks!Thread::RaiseCrossContextException+0x408 16 0c3afb98 79fd878b mscorwks!Thread::DoADCallBack+0x2a2 17 0c3afbb4 79e983fb mscorwks!ManagedThreadBase_DispatchInner+0x35 18 0c3afc48 79e98321 mscorwks!ManagedThreadBase_DispatchMiddle+0xb1 19 0c3afc84 79e984ad mscorwks!ManagedThreadBase_DispatchOuter+0x6d <clipped for brevity>
As seen in the stack above, after the OOM was thrown in frame 0x14, a message box was thrown in frame 0x8. A message box will keep the process alive until someone clicks OK, Cancel, etc. and the message box goes away. In short, message boxes in server-side processes are never a good thing since they will hang your application!
Once again, we have answered one question, but another question remains: the process stayed alive for two hours after it threw the OOM because a message box was thrown, but who threw the message box, and why?
Analyzing the data, Part 2
As is common when finding a message box that’s popped up in a process, I wanted to see what the message box said. From MSDN, we learn that the second parameter to user32!MessageBoxExA is the text of the message, and the third parameter is the text of the caption in the message box. Using the kb command, we can retrieve these parameters, and then dump them to find the values:
0:023> kbn7 # ChildEBP RetAddr Args to Child 00 0c3aedb8 7c82775b 773d7a4b 50000018 00000004 ntdll!KiFastSystemCallRet 01 0c3aedbc 773d7a4b 50000018 00000004 00000003 ntdll!NtRaiseHardError+0xc 02 0c3aee18 773b8377 0dbec7a8 0bcda228 00012010 user32!ServiceMessageBox+0x145 03 0c3aef74 7739eec9 0c3aef80 00000028 00000000 user32!MessageBoxWorker+0x13e 04 0c3aefcc 773d7d0d 00000000 0dbec7a8 0bcda228 user32!MessageBoxTimeoutW+0x7a 05 0c3af000 773c42c8 00000000 0c3af0a4 7c37f480 user32!MessageBoxTimeoutA+0x9c 06 0c3af020 773c42a4 00000000 0c3af0a4 7c37f480 user32!MessageBoxExA+0x1b 0:023> da 7c37f480 7c37f480 "Microsoft Visual C++ Runtime Lib" 7c37f4a0 "rary" 0:023> da 0c3af0a4 0c3af0a4 "Runtime Error!..Program: c:\wind" 0c3af0c4 "ows\system32\inetsrv\w3wp.exe..." 0c3af0e4 ".This application has requested " 0c3af104 "the Runtime to terminate it in a" 0c3af124 "n unusual way..Please contact th" 0c3af144 "e application's support team for" 0c3af164 " more information..."
I’ve seen this message before, and it can have different underlying causes. The bottom line – at this point in the troubleshooting phase – is that it isn’t a custom message box thrown carelessly by application code. The trick now is to determine how the message box was thrown.
I made a few unusual observations about the call stack in question.
- First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common. - Secondly, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb that appears to call the 7.1 CRunTime’s abort() function in frame 0xa. Finally, a message box results. What a wild ride!! - Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!
- First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common.
- Secondly, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb that appears to call the 7.1 CRunTime’s abort() function in frame 0xa. Finally, a message box results. What a wild ride!!
- Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!
Fortunately, I’d been fooled by such an appearance before, so I knew to take a look at the raw stack first, before I went reading through the .NET source to see if mscorwks!InternalUnhandledExceptionFilter really does call for a message box to be thrown. When I looked at the raw stack around the frames 0xb through 0x8, this is what I found in between all the frames displayed from a kb command:
0:023> dds 0c3af070 L90 0c3af070 0c3af208 0c3af074 7c348e6c msvcr71!_NMSG_WRITE+0x12e 0c3af078 0c3af0a4 0c3af07c 7c37f480 msvcr71!`string' <clip> 0c3af26c 77e66ebd kernel32!VirtualQueryEx+0x1d 0c3af270 000d0f48 0c3af274 0000003a 0c3af278 0008b788 0c3af27c 02403a0a 0c3af280 0000003a 0c3af284 013af29c*** WARNING: Unable to verify checksum for ThirdParty.dll *** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdParty.dll - ThirdParty!FooClass::m_bVar+0x9cc8 0c3af288 0c3af2c4 0c3af28c 7c349600 msvcr71!_beginthreadex+0x7 0c3af290 90a99269 0c3af294 0c3af2c4 0c3af298 7c34cf83 msvcr71!abort+0x7 <—frame 0xa of the kn callstack above 0c3af29c 0000000a 0c3af2a0 00000000 0c3af2a4 0c3af558 0c3af2a8 7a3b2000 mscorwks!COMUnhandledExceptionFilter 0c3af2ac 0c3af2a0 0c3af2b0 00000103 0c3af2b4 0c3af520 0c3af2b8 7c34240d msvcr71!_except_handler3 0c3af2bc 7c380ea0 msvcr71!type_info::`vftable'+0x94 0c3af2c0 ffffffff 0c3af2c4 0c3af2d8 0c3af2c8 7c35f0a8 msvcr71!__CxxUnhandledExceptionFilter+0x2b 0c3af2cc 77ecb7c0 kernel32!gcsUEFInfo 0c3af2d0 7a09bea7 mscorwks!InternalUnhandledExceptionFilter+0x16 <—frame 0xb of the kn callstack above
Aha! So there is, in fact, someone else in between the .NET Framework and the call to throw the message box.
Now, let’s say you still want some kind of proof that the .NET Framework doesn’t make the call to msvcr71!abort(), which results in a call to show the message box. I had my doubts that mscorwks.dll had the 7.1 CRT as a dependency. On a tip from fellow PFE Zach Kramer, you can prove this by running dumpbin /imports on the binary (Dumpbin is an old – but very handy - utility that still ships with Visual Studio). I could always obtain the binary by asking my customer for their mscorwks.dll, but it’s much easier to just use psscor2!savemodule.
0:023> lmvmmscorwks start end module name 79e70000 7a400000 mscorwks Loaded symbol image file: mscorwks.dll 0:023> !savemodule 79e70000 F:\modules\mscorwks.dll 5 sections in file at 79e70000 Successfully saved file: F:\Modules\mscorwks.dll
Then from a command window:
f:\Modules>dumpbin /imports mscorwks.dll >Imports.txt
Searching Imports.txt for the string ‘msvcr7’ comes up empty. In fact, the build of mscorwks used by my customer’s application was msvcr80.dll. This makes sense when you look at frames 0x14-0x13 in our call stack.
What if we employ the same technique on this ThirdParty.dll – will it show the assembly depends on the 7.1 CRT? Scanning the output for dumpbin /imports on ThirdParty dll shows the following:
Microsoft (R) COFF/PE Dumper Version 9.00.30729.01 Copyright (C) Microsoft Corporation. All rights reserved. Dump of file ThirdParty.dll File Type: DLL Section contains the following imports: MSVCR71.dll 1000E154 Import Address Table 10010554 Import Name Table 0 time date stamp 0 Index of first forwarder reference <clip>
So we know this ThirdParty.dll has the 7.1 CRT as a dependency, but that alone doesn’t prove this component throws the message box. And according to my customer, the vendor of ThirdParty.dll had been approached many times in the past regarding these “phantom message boxes” hanging production applications. But the vendor had denied any involvement without explicit proof. So the presence of their dll on the stack next to a call to msvcr71!abort() might not suffice when I confronted them with this issue. I felt I still had some work ahead of me.
Deeper dive
Before continuing, let’s do a quick review of what we know and don’t know:
1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned root cause of the OOM. 2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application. 3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework. 4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, and we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit). 5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()? 6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?
1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned root cause of the OOM.
2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application.
3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework.
4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, and we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit).
5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()?
6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?
First, let’s tackle the question about connecting the dots seen in the call stack frames. There’s a pretty clear and concise explanation for this in an MSDN Magazine article from 2008.
Unhandled Exception Processing In The CLR When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing
Unhandled Exception Processing In The CLR
When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing
In other words, ThirdParty.dll had registered its UEF before the .NET Framework. And its UEF took the default road of calling abort() and throwing a message box. After ThirdParty.dll registered its UEF, the .NET Framework then registered its UEF callback. But it didn’t want to be rude and step over the UEF that ThirdParty.dll had registered first, so it chains back to it. Therefore, the result of msvcr71!abort() being called is due to the UEF registered by ThirdParty.dll.
Next, let’s tackle the question about msvcr71!abort() throwing a message box. Since this ThirdParty.dll was using the 7.1 build of the CRT, let’s look in MSDN for the information on msvcr71!abort():
abort determines the destination of the message based on the type of application that called the routine. Console applications always receive the message through stderr. In a single or multithreaded Windows application, abort calls the Windows MessageBox API to create a message box to display the message with an OK button. When the user clicks OK, the program aborts immediately.
To influence the behavior of abort(), simply call _set_error_mode in the DllMain so that it doesn’t exercise the default behavior of throwing a message box. MSDN’s documentation on _set_error_mode states you can use _OUT_TO_STDERR for the lone parameter, and this will avoid the message box when abort() is called.
A colleague of mine, Senior Escalation Engineer Bret Bentzinger, offered to write some sample code that would load ThirdParty.dll and test this proposed resolution of passing _OUT_TO_STDERR to _set_error_mode. Doing this confirmed that no message box was thrown, and the thread exits without hanging the process.
In the end, this problem of throwing a message box from the CRT’s call to abort() isn’t a new one. This issue has been around for ages. But it was my first opportunity to drive such an issue and see it through to a resolution. My customer got a fix from the vendor, and we were able to help the vendor write better, more stable code. It was a win-win situation!