Posts
  • PFE - Developer Notes for the Field

    Extracting Pictures from a Byte Array in a Memory Dump

    • 0 Comments

     

    I ran into a case recently where there was a large amount of memory being used in a 32 bit IIS process leading to an out of memory exception.  The steps to getting the correct data are documented all over the place.  Here’s the link to Tess’ site that I learned on: http://blogs.msdn.com/b/tess/archive/2008/02/15/net-debugging-demos-lab-3-memory.aspx

     

    I’ll skip the setup and go straight to the debug of the high memory usage dump.  To start with, as always when I’m debugging, we start with what we know:

    • Managed application running in IIS
    • 32 bit
    • Higher than expected memory usage

     

    Our next step is to list what we need to know:

    • How much memory is being used?
    • Manged or Native?
    • Leak or heavy usage?
    • What can we do about it?

     

    The first thing to do is figure out where the memory usage is, and whether it’s managed or Native.

     

    0:000> !address -summary

    --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    Free                                    280          355ac000 ( 853.672 Mb)           41.68%
    <unknown>                     830          34808000 ( 840.031 Mb)  70.34%   41.02%
    Image                                 1013          12479000 ( 292.473 Mb)  24.49%   14.28%
    Heap                                    218           3189000 (  49.535 Mb)   4.15%    2.42%
    Stack                                   129            ac0000 (  10.750 Mb)   0.90%    0.52%
    Other                                    26            14e000 (   1.305 Mb)   0.11%    0.06%
    TEB                                      43             2b000 ( 172.000 kb)   0.01%    0.01%
    PEB                                       1              1000 (   4.000 kb)   0.00%    0.00%

    --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    MEM_PRIVATE                             892          36629000 ( 870.160 Mb)  72.86%   42.49%
    MEM_IMAGE                              1314          13df7000 ( 317.965 Mb)  26.62%   15.53%
    MEM_MAPPED                               54            624000 (   6.141 Mb)   0.51%    0.30%

    --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    MEM_COMMIT                             1823          369e6000 ( 873.898 Mb)  73.17%   42.67%
    MEM_FREE                                280          355ac000 ( 853.672 Mb)           41.68%
    MEM_RESERVE                             437          1405e000 ( 320.367 Mb)  26.83%   15.64%

    The memory usage we’re concerned about is any committed memory as that will usually be close to the Private Bytes that value in PerfMon.  This is shown in the State Summary section as MEM_COMMIT and is currently at approximately 874 MB.  We can tell a few other things from the results above by comparing State, Type, and Usage.    For one, generally the majority of Image memory will be of state MEM_COMMIT.  Right off the bat, we can extract about 292 MB from MEM_COMMIT leaving us with about 582 MB to place still.   If we look at the Usage summary, we can see that 840 MB is from <unknown>.  That memory usage most often comes from Virtual Allocations.  It’s also where any managed memory will be.  Going back to our MEM_COMMIT memory, it seems likely that the majority of the 58 2MB will be handled fall under this bucket.  We don’t see much memory at all in Heap, so we can conclude that the memory is either allocated by managed memory, or by a native component making use of a managed allocator.

    The next step is to look at managed memory and see if we have around 500 MB of memory usage.  We can do that by loading psscor and running !clrusage.

    0:000> .load psscor2
    0:000> !clrusage
    Native Heap for aspnet_filter: 0x01220000
    Native Heap for .NET: 0x01fd0000
    Number of GC Heaps: 8
    Heap Size  0x62fb198(103,788,952)
    Heap Size  0x303c5b8(50,578,872)
    Heap Size  0x41e9e58(69,115,480)
    Heap Size  0x2ca3010(46,805,008)
    Heap Size  0x3f8209c(66,592,924)
    Heap Size  0x25b3b68(39,533,416)
    Heap Size  0x20b5a18(34,298,392)
    Heap Size  0x17a65a8(24,798,632)
    May contain Reserved segments, run with -v switch to see them.
    ------------------------------
    GC Heap Size  0x19f5617c(435,511,676)
    Total Commit Size  1f9f8000 (505 MB)
    Total Reserved Size  11608000 (278 MB)
    Initial reservation type: All at once
    Initial Allocation Size: 18000000 (402,653,184) (384 MB)
    Reserved Memory Size: 31000000 (822,083,584) (784 MB)
    Reserved Memory Limit Size: 31000000 (822,083,584) (784 MB)

    The managed committed memory is 505MB.  This lines up with what we were expecting to see if the memory pressure came from managed memory.  That means we’re investigating a managed memory issue.  We can now move on to seeing what is taking up all of the managed memory by using !dumpheap -stat

    0:000> !dumpheap -stat
    Using our cache to search the heap.
    0x7aa50c98          1            12 System.Diagnostics.CounterCreationDataCollection
    0x7aa40c38          1            12 System.ComponentModel.Design.RuntimeLicenseContext
    0x7a5eed4c          1            12 System.ComponentModel.WeakHashtable+WeakKeyComparer
    0x7a5eeb38          1            12 System.ComponentModel.BooleanConverter

    <snipped for brevity>

    0x000e8600        521    24,483,528      Free
    0x02af0c6c    119,732    74,368,372 System.String
    0x02af36dc     11,634   313,028,160 System.Byte[]
    Total 672,044 objects, Total size: 435,460,280

     

    Right away, those Byte arrays stand out.  It should be stated, that we usually expect to see Byte arrays at the bottom of the list, but as there are 313 MB of them here, it’s definitely something we want to look into.

    There are a couple ways to proceed.  We could dump all byte arrays and look for a pattern in the sizes, or we could look for large byte arrays.  We’ll start with large byte arrays.  By large, we mean anything that would have ended up in the large object heap.  Since we know LoH objects are all greater than or equal to 85,000 bytes, we can specify a minimum size when we view the heap to limit to these large objects.

    0:000> !dumpheap -mt 0x02af36dc -min 85000 -stat
    Using our cache to search the heap.
       Address         MT     Size  Gen
    Statistics:
            MT    Count    TotalSize Class Name
    0x02af36dc      139  305,074,736 System.Byte[]
    Total 139 objects, Total size: 305,074,736

    This shows us we have 139 Byte Arrays in the large object heap that account for 290.94 MB of our managed space.  Since there were over 11 thousand byte arrays total and all together they accounted for 298 MB, it’s fair to say the LoH Byte Arrays should be the focus of our investigation.  Our next step then is to dump out the Byte Arrays and begin looking to see what they are and if they’re still rooted.

    0:000> !dumpheap -mt 0x02af36dc -min 85000 
    Using our cache to search the heap.
       Address         MT     Size  Gen
    0x202fb838 0x02af36dc 2,077,524 3 System.Byte[]
    0x205dc320 0x02af36dc 3,704,228 3 System.Byte[]
    0x209648d8 0x02af36dc 4,990,036 3 System.Byte[]
    0x20e26d40 0x02af36dc  125,592 3 System.Byte[]
    0x3b1f0038 0x02af36dc 3,449,224 3 System.Byte[]
    0x3b53a1d0 0x02af36dc  109,140 3 System.Byte[]
    0x3b554c38 0x02af36dc 2,625,400 3 System.Byte[]

    <snipped for brevity>

     

    I ran !gcroot on several of the address listed, but did not find any roots.  As it turns out, the dump I was looking at was taken on a terminate process breakpoint and many of the managed structures were already in the process of being torn down.  Since we couldn’t find any roots explaining why the Byte Arrays were still around, we have to look at the Arrays themselves to figure out what they are and hopefully tie them back to some source code.

    A quick db on one of the Byte Arrays shows

    0:000> db 0x202fb838
    202fb838  dc 36 af 02 46 b3 1f 00-ff d8 ff e1 63 1d 45 78  .6..F.......c.Ex
    202fb848  69 66 00 00 4d 4d 00 2a-00 00 00 08 00 0a 01 0f  if..MM.*........
    202fb858  00 02 00 00 00 16 00 00-01 b2 01 10 00 02 00 00  ................
    202fb868  00 25 00 00 01 c8 01 12-00 03 00 00 00 01 00 01  .%..............
    202fb878  00 00 01 1a 00 05 00 00-00 01 00 00 01 ee 01 1b  ................
    202fb888  00 05 00 00 00 01 00 00-01 f6 01 28 00 03 00 00  ...........(....
    202fb898  00 01 00 02 00 00 01 32-00 02 00 00 00 14 00 00  .......2........
    202fb8a8  01 fe 02 13 00 03 00 00-00 01 00 01 00 00 47 46  ..............GF

    Most of what we see is unhelpful, but we see the string “Exif” suggesting this might be a digital photo.  Dumping further into the array we see another interesting string:

    202fb9f8 00 00 00 00 00 00 45 41 53 54 4d 41 4e 20 4b 4f  ......EASTMAN KO
    202fba08 44 41 4b 20 43 4f 4d 50 41 4e 59 00 4b 4f 44 41  DAK COMPANY.KODA
    202fba18 4b 20 45 41 53 59 53 48 41 52 45 20 56 31 30 37  K EASYSHARE V107
    202fba28 33 20 44 49 47 49 54 41 4c 20 43 41 4d 45 52 41  3 DIGITAL CAMERA

    That certainly supports the idea that we have a digital photo taken with a digital camera (a Kodak Easyshare to be precise).  Checking a few of the other byte arrays, we can see the same pattern.

    0:000> db 0x202fb838
    202fb838  dc 36 af 02 46 b3 1f 00-ff d8 ff e1 63 1d 45 78  .6..F.......c.Ex
    202fb848  69 66 00 00 4d 4d 00 2a-00 00 00 08 00 0a 01 0f  if..MM.*........

    0:000> db 0x3c478e80
    3c478e80  dc 36 af 02 1c c6 49 00-ff d8 ff e0 00 10 4a 46  .6....I.......JF
    3c478e90  49 46 00 01 01 01 00 60-00 60 00 00 ff e1 19 78  IF.....`.`.....x
    3c478ea0  45 78 69 66 00 00 49 49-2a 00 08 00 00 00 11 00  Exif..II*.......

    0:000> db 0x3a1f0038
    3a1f0038  dc 36 af 02 0a ff 32 00-ff d8 ff e0 00 10 4a 46  .6....2.......JF
    3a1f0048  49 46 00 01 01 01 00 48-00 48 00 00 ff e1 23 27  IF.....H.H....#'
    3a1f0058  45 78 69 66 00 00 49 49-2a 00 08 00 00 00 10 00  Exif..II*.......

    At this point, we’ll assume they were all digital photos.  We can revise that assumption if source review doesn’t agree with us.  At this point, we really need to look at the source to see why the pictures are staying around.  I don’t have the source available as this was given to me as a dump taken during shutdown, so exporting the modules didn’t work.  The only thing left to really get from the dump is the pictures themselves.  All that’s left to do is export a few of the byte arrays to disk so we can open them and maybe narrow down what we need to look at in the source (are these employee photos? a family outing? product photos?)

    We’ll use the writemem command in windbg and simply write the byte array from the address of the array plus 8 to avoid the metadata.

    0:000> !do 0x202fb838
    Name: System.Byte[]
    MethodTable: 02af36dc
    EEClass: 028aeb7c
    Size: 2077522(0x1fb352) bytes
    GC Generation: 3
    Array: Rank 1, Number of elements 2077510, Type Byte
    Element Type: System.Byte
    Fields:
    None
    0:000> .writemem f:\output.jpg 0x202fb838+8 L?0x1fb352
    Writing 1fb352 bytes........................................

    Then we can open it up and view our result.  I dumped several out, all of which were pictures of kids on what looked to be a field trip.  That helped narrow down the user activity and focus the source review for the resolution.  If the source code didn’t provide a resolution, our next step would have been to take another dump when the memory was high but the process wasn’t shutting down.  This should have shown us how the Byte Arrays were rooted at which point we would have been able to handle them.

     

    In this case, I used the following command to export all of the pictures at once:

    .foreach (var {!dumpheap -mt 0x02af36dc -min 85000 -short}) {.writemem f:\Pictures\${var}.jpg ${var}+8 L?poi(${var}+4)}

     

    -Jarrod

  • PFE - Developer Notes for the Field

    Troubleshooting Event ID 5010 — IIS Application Pool Availability

    • 2 Comments

    With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.”  When an Event ID 5010  appears in your system log and WAS is the source, you can be assured of two things.  The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS).  The second is that WAS attempted to shut down the offending process.  Generally, this means that either the worker process hung or it did not have enough resources to respond to the ping request.  The resolution in the documentation for this Event is listed as:

    Diagnose an unresponsive worker process

    A worker process that fails to respond may exhibit one of the following symptoms:

    • Ping failure.
    • Communication failure with WAS.
    • Worker process startup time-out.
    • Worker process shutdown time-out.
    • Worker process orphan action failure.

     

    There are some links provided as well that provide guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process.   In each of those 3 cases, there is a deterministic approach with which you can take a memory dump and investigate the issue.  However, in practice there are times when the worker process may simply hang without displaying any of those behaviors.  So the question becomes, how do we know when the worker process is hung?  The answer is, WAS knows, so we just have to ask WAS.

    I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere.  For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below.  If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.

    When working with an issue like this, our first step is always to clarify what we know and what we need to know.

    What we know:

    • WAS knows when a worker process hangs
    • WAS knows enough about the worker process to shut it down if it appears hung

     

    What we need to know:

    • How to tell when WAS thinks a worker process is hung
    • How to identify the hung process
    • How to create a dump of the worker process before it’s terminated by WAS

     

    To get to the answers we need, we’ll investigate the WAS process with WinDbg.  Our goal will be to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure.  We’ll start by attaching the debugger to the WAS process and reloading symbols.  Symbols are important here because we’ll need them to investigate the method calls.  Once we have the debugger attached, we can take a look at the loaded modules in an effort to make some guesses at where we should look.  We’ll do this by running lm, for list modules.  We can append 1m and sm to only list the module names and sort them alphabetically.

    0:000> lm 1m sm
    ADVAPI32
    bcryptPrimitives
    clbcatq
    combase
    CRYPT32
    CRYPTBASE
    CRYPTSP
    DPAPI
    GDI32
    HTTPAPI
    IISRES
    iisutil
    iisw3adm
    KERNEL32
    KERNELBASE
    ktmw32
    logoncli
    mlang
    MSASN1
    msvcrt
    nativerd
    NSI
    ntdll
    ntmarta
    ole32
    OLEAUT32
    pcwum
    profapi
    RPCRT4
    rsaenh
    sechost
    Secur32
    SSPICLI
    svchost
    USER32
    USERENV
    w3ctrlps
    W3TP
    WS2_32
    XmlLite

    Now that we have the modules, we can make some guesses as to which ones might be related to the failed ping request processing.  It seems likely that the module responsible will either be one of the IIS processes or a process with w3 in the name.  Next in the debugger we can search for method calls in those modules that could likely be related.  We can use the x command to search for the methods using wild cards.

    0:000> x *iis*!*ping*
    000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>)
    000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information>
    000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>)
    000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>)
    000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information>
    000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>)
    000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>)
    000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>)
    000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information>
    000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information>
    000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>)
    000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information>
    000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>)
    000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information>
    000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>)
    000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information>
    000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>)
    000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information>
    000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information>
    000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information>
    000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information>
    000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>)
    000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>)
    000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information>
    000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information>
    000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>

    The first method call returned looks very promising and is a good place to start.  We can set a breakpoint on that line item, tell the process to continue, then intentionally cause a failed ping request to see if the breakpoint trips.

     

    0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    0:000> g

    To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.

    image

     

    After the time specified in the Application Pool’s configuration, the WAS process should identify the worker process as hung at attempt to shut it down.  If our breakpoint is correct, the debugger, still attached to the WAS process, should break in.  As it turns out, that is exactly what happens in this case.

    Breakpoint 0 hit
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000

    The breakpoint has been hit, and if we check Process Explorer, we can see the process is still alive.  Further, we know because of the calling convention that we’re dealing with a WORKER_PROCESS object.  At this point, we could absolutely write a script to dump all of the worker processes on the machine and we would know we got a dump of the hung process.  If however, we’re working on a machine that has a lot of worker processes, it may be more beneficial to find the actual worker process being referred to and just dump that object.  The WORKER_PROCESS object will almost undoubtedly contain a PID for the worker process with which we could use with a script to dump just that worker process.  We can investigate the current thread and try to find the worker process object.  We can then look to see if there is an offset from the object that points to the PID.

    We’ll start by dumping the call stack.

    0:002> k
    Child-SP          RetAddr           Call Site
    000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41
    000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13
    000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2
    000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35
    000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf
    000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94
    000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a
    000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    This gives us an idea of the calling order, but the public symbols we’re using won’t give us local variables and in this instance they won’t tell us if and what parameters are passed.  x86 Calling conventions dictate that in a 64 bit process, the first 4 parameters are stored in the CPU registers RCX, RDX, R8, then R9 in that order.  Any parameters in addition to that will be stored on the stack.  If this were a 32 bit process, the parameters would most likely all be on the call stack.  Knowing this, we can start by searching the call stack and the 4 registers mentioned for the PID we identified.  In fact, any of the non null registers displayed with the r command are probably a good place to look, excluding RIP which will point to code.

    0:002> r
    rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520
    rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520
    rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0
      r8=000000cda86cfb28  r9=000000cda86cfc10 r10=0000000000000000
    r11=0000000000000246 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0         nv up ei pl zr na po nc
    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000


    We can see that RAX and RBX are null, but RCX and RDX are not.  We’ll start with the call stack then move on to the other registers. For the call stack, we’ll search between 100 bytes under the top of the stack pointer (RSP) and up to the stack base (RBP).  Remember, stacks call stacks start at a higher address and grow down as they move up the stack.  We’re searching for a word value because that’s plenty of space for the PID to be stored in and searching for our known PID, 8012.

    0:002> s -[1]w @rsp-100 @rbp 0n8012

    This returned no results.  So, we don’t have the PID on the call stack.  Next we’ll search from RCX to 300 Bytes greater than RCX.

    0:002> s -[1]w @rcx L300 0n8012
    0x000000cd`a910e568
    0x000000cd`a910e56c

    In this case we have two results.  We can see how far from RCX the first result is.

    0:002> ?0x000000cd`a910e568-@rcx
    Evaluate expression: 72 = 00000000`00000048

    We get an offset of 72 bytes, which is very close.  We can do a couple things to verify that we can use this offset in the future.  First, we can check to see if we have any symbol information near RCX.

    0:002> dps @rcx
    000000cd`a910e520  000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable'
    000000cd`a910e528  000000cd`43525057
    000000cd`a910e530  000000cd`a90f4998
    000000cd`a910e538  000000cd`a90f4998
    000000cd`a910e540  00000000`00000005
    000000cd`a910e548  000000cd`a9390db0
    000000cd`a910e550  00000000`00000006
    000000cd`a910e558  000000cd`a90f4930
    000000cd`a910e560  000000cd`a90f4b40
    000000cd`a910e568  00001f4c`00001f4c
    000000cd`a910e570  00000000`0000031c
    000000cd`a910e578  00000000`00000000
    000000cd`a910e580  000000cd`a7f7ed20
    000000cd`a910e588  00000000`00000001
    000000cd`a910e590  000000cd`00000000
    000000cd`a910e598  00000000`00000000

     

    It seems RCX is a pointer to a WORKER_PROCESS virtual function table.  This means it is essentially the address for the WORKER_PROCESS object.  It also means our offset of RCX+72 is probably correct and could be used consistently.  We can be sure with one last check.  We need to let the debugger continue, create and pause another worker process, let the debugger break in at our earlier breakpoint, then validate that RCX+72 points to the new PID.

    With the worker process paused, we note our new PID is 4356.  In Hex that comes to 1104, so that’s what we’re looking for.

    0:002> g
    Breakpoint 0 hit
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
    0:002> dw @rcx+0n72 l1
    000000cd`a910e568  1104

     

    Clearly, our offset works as it points to the correct new PID.  This means we can use our breakpoint and this offset to create a script to dump the hung worker process.  Below, I’ve listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus or even CDB and WinDbg directly.

    (On a side note, review of the disassembly for the PingResponseTimerExpiredWorkItem showed that it did not have any parameters.  RCX was clearly not a parameter, but it’s a common enough register that it is still good to check.  It definitely paid off in this instance.)

     

    Steps to get a memory dump using DebugDiag from the faulting worker process when a 5010 is raised.

    The assumption is made that the production machine with the issue does not have internet access as this is a common practice.  If it’s not the case and internet access is available, you can complete Step 1 on the production machine.

     

    Step 1: Get Symbols for WAS Service

    1. Open DebugDiag on a production server
    2. Navigate to the Processes tab
    3. Sort by the Service Column Descending and identify the line containing the process for the WAS service
    4. Right click the WAS service and choose create full memory dump
    5. Zip the dump and copy it to a machine with external internet access
    6. On the new machine use the LoadSymbols script found here: http://blogs.msdn.com/b/puneetgupta/archive/2010/05/14/debugdiag-script-to-load-all-symbols-in-a-dump-file.aspx to collect symbols for the WAS service
    7. On the analysis tab of DebugDiag, highlight the line for LoadSymbols
    8. Click Add Data File
    9. Navigate to the dump from the prod server and click ok
    10. Now click Start Analysis
    11. By default, DebugDiag will save the symbol files to c:\symcache.  Copy the contents of that folder to the same path on the production machine.  (If you’ve changed the symbol settings in DebugDiag previously, you must make sure you’re pointing to the Microsoft symbol server and make note of your local path)

     

    Step 2: Set up DebugDiag on Production machine

    1. On the Rules tab, click Add Rule and select a Crash Rule
    2. Select “A specific NT Service” and click Next
    3. Highlight the WAS service and click Next
    4. Click Breakpoints then Add Breakpoint
    5. In Breakpoint Expression enter: iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    6. Under Action Type select Custom
    7. Add the following script in the window that pops up:
    Dim w3wpProcessId
    rcx = CausingThread.Register("rcx")
    w3wpProcessId = Debugger.ReadDWORD(rcx + 72)
    WriteToLog "w3wp process id = " & w3wpProcessId
    Debugger.CreateDumpForProcessID w3wpProcessID,"Failed Ping",False

     

    1. Set the Action Limit to 5 then click ok
    2. Click Save & Close
    3. Click Next all the way through.  You can specify the location of the dump file on one of the screens before you click finish.
    4. Check the log for svchost under c:\program files\DebugDiag and make sure it shows that the breakpoint was set.  If it was not set, try restarting the Debug Diagnostic Service.  The log should have something like the following:

     

    [9/30/2012 2:46:53 PM] Initializing control script
    [9/30/2012 2:46:53 PM] Clearing any existing breakpoints
    [9/30/2012 2:46:53 PM]
    [9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    [9/30/2012 2:46:53 PM] bpID = 0
    [9/30/2012 2:46:53 AM] Current Breakpoint List(BL)
      0 e 000007fb`45efb2d4     0001 (0001)  0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem

     

    You can test the above set up by suspending a worker process with Process Explorer and making sure a memory dump is created.  That’s all there is to it.  Of course, when you have the memory dump, that’s just the beginning as you will have to debug that dump to identify the cause of the hang.

     

     

    -Jarrod

  • PFE - Developer Notes for the Field

    Caller Information – a new concept in Visual Studio 2012

    • 2 Comments

    With the recent release of Visual Studio 2012, there are many new features and updates to explore.  One of the new features is the concept of Caller Information, available in both C# and VB.  Caller Information allows you to obtain certain information about the caller to a method.  More specifically, it allows you to get the name of the calling method as well as the source file information and line number.

     

    Many customers currently use catch blocks to document the name of the current method and maybe even the source file from which an exception is thrown.  But Caller Information provides a way to get this information (and more – source line info), even if no exception occurs.   The way it works is that you include attributes as optional parameters in your function’s signature.  There are three of these, and they’re defined in the System.Runtime.CompilerOptions namespace.  You should supply a default value to whichever ones you include in your function’s signature.

    CallerFilePath – The full path and filename (during compile time) of the caller.

    CallerLineNumber – The line number (in the source code) where the caller calls the callee.

    CallerMembername – The method or property name of the caller.

     

    Caller Information can be of particular value for profiling or tracing tools.  However, you don’t need to write a profiler in order to take advantage of this new feature.  One example where an application could make use of Caller Information would be if you have a function (say, MyFunc) which is called from many different functions.  Let’s say MyFunc takes an argument or two, and occasionally MyFunc gets an exception when using the arguments – for example, a NullReferenceException.  But the fact that MyFunc doesn’t reproduce this exception consistently may make it difficult to determine what the code path is that results in the exception.

    If MyFunc captured the calling function, you’d have a much better idea of how to reproduce the problem and, therefore, find root cause and a resolution.

     

    Here’s a demonstration of how to use CallerInformation.  I created a console application with the following code. (Sincere apologies as to the format shown below.  I struggled for a while to remove the individual boxes around each block of code, but doing so only removed all the indentation, which made the code more difficult to read).

    using System.Runtime.CompilerServices;

    namespace CallerInfoSample
    {
    class Program
    {
    static void Main(string[] args)
    {
    string str = "";
    str = Func1();

    Console.WriteLine(str);
    Console.ReadLine();
    }

    static string Func1([CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = "")
    {
    string s = "";
    s = "Before calling Func2, Func1 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";
    Console.WriteLine(s);
    Console.WriteLine("=====");

    bool b = Func2();

    s = "After calling Func2, Func1 reports the line number where " + sMemberName + " called the current function is still " + iLineNbr + " since we haven't finished executing Func1.";
    Console.WriteLine("Func2 returns " + b.ToString() + ".");
    Console.WriteLine(s);
    Console.WriteLine("=====");

    return "Done Processing";
    }

    static bool Func2([CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = "")
    {
    string s = "";
    s = "Before calling Func3, Func2 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";
    Console.WriteLine(s);
    Console.WriteLine("=====");

    int j = 0;
    int i = Func3(j);

    return true;
    }

    static int Func3(int i, [CallerFilePath] string sFilePath = "", [CallerLineNumber] int iLineNbr = 0, [CallerMemberName] string sMemberName = "")
    {
    string s = "";
    s = "Before returning to Func2, Func3 reports the line number where " + sMemberName + " called the current function is " + sFilePath + ", Line " + iLineNbr + ".";
    Console.WriteLine(s);
    Console.WriteLine("=====");

    return (i + 1);
    }
    }
    }

    It results in the following output:

    Before calling Func2, Func1 reports the line number where Main called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 16.
    =====
    Before calling Func3, Func2 reports the line number where Func1 called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 29.
    =====
    Before returning to Func2, Func3 reports the line number where Func2 called the current function is c:\Samples\CallerInfoSample\CallerInfoSample\Program.cs, Line 47.
    =====
    Func2 returns True.
    After calling Func2, Func1 reports the line number where Main called the current function is still 16 since we haven't finished executing Func1.
    =====
    Done Processing
     

    For more information on Caller Information, see MSDN at http://msdn.microsoft.com/en-us/library/hh534540.aspx.

    NOTE: I could only make use of Caller information when I configured VS2012 to compile against 4.5 of the .NET Framework; the functionality isn't there in 4.0.

  • PFE - Developer Notes for the Field

    Exploratory Testing in Visual Studio 2012

    • 3 Comments

    With the recent release of Visual Studio 2012, there are many new features and updates to explore.  One of the new features is Exploratory Testing, which allows you to test your software without a pre-defined test or script.  It’s perfect for simply “exploring” your application to see what issues you may encounter.  Exploratory Testing comes with the Test Professional, Premium, and Ultimate editions of VS 2012.

    Technically, Exploratory Testing is part of Microsoft Test Manager (MTM).  In VS2010, one could achieve exploratory testing by filing an exploratory bug in Test Runner.  From there, users/testers could execute unscripted actions until a bug was found.  But in VS2012, exploratory testing is pulled into MTM along with your other tests, plans, suites, tracks, work items, etc. You can learn about some of the other new features in MTM 2012 in MSDN Magazine.  In order to use Exploratory Testing in MTM, you must be connected to Team Foundation Server (TFS) 2012.  If you don’t have access to a TFS 2012 instance, you can try out TFS in the cloud free, for a limited time, at http://tfspreview.com.  The instructions at that site are simple and include everything you need to create an account, project, upload source, automate builds, etc. 

    NOTE: At the time of this writing, no information was available as to how long tfspreview.com would be available (or at least, available for free).  While the end goal of TFS Azure is to have all the features of TFS, you may find that not all features of TFS - such as Reporting Services, AD Federation, etc. - are available at tfspreview.com.

     

    With exploratory testing, you can literally browse through your application at your leisure, until you find a bug or some other part of the software that should be considered for change – editing text in a label, background color change, adding or removing a feature, etc.  And the whole time you’re “exploring” the application, MTM records your actions.  By default, actions you execute in MTM or Office applications are not recorded, though you can change this in your test plan properties (Testing Center, Plan, Properties).  Exploratory testing in MTM allows you to do the following:

    • Pause and resume recording of your exploratory test case
    • Enter comments and screenshots for your exploratory test.  This can help clarify recorded activities in your test.
    • Create a bug
    • Generate a Test Case for the bug you just created

     

    Exploratory testing can be used on web applications, windows applications, and if it’s a client/server application, some steps on the server side can also be recorded.  More details on this, as well as the steps to run an exploratory test can be found at http://msdn.microsoft.com/en-us/library/tfs/hh191621(v=vs.110).aspx.  While I have no desire to copy what can already be found on MSDN, I also know that sometimes links change, content is moved, etc.  Therefore, I’ll repeat some of those steps in this blog entry.

    1. Open MTM.
      1. You can perform exploratory testing with no existing work items.  To do this, simply right-click a Test Plan or Test Suite and select Explore.

    b. If you do have an existing Work Item, you can perform exploratory testing on a specific work item by going to Test, Do Exploratory Testing.  Then right-click on your work item of choice and select Explore.

     

    2. When you select Explore, the Exploratory Testing window opens, waiting for you to Start your test.  Once clicked, MTM records your actions.

     

    3. You can pause or end the test, and this pauses or ends recording of your actions. At any time you can enter comments and/or add screen captures in the Exploratory Testing window.  Additionally, if you run into a bug you can click the Create Bug button during recording, and fill out details of your newfound bug.  When you save your new bug, you’ll have the option to create a Test Case for the bug you just created.

     

    These are just the steps to get you started, and there are many other options for saving Test Cases, assigning & editing bugs, and more.  Exploratory testing is a great option when you don’t have a defined test, a defined script as to how to navigate directly to a specific bug, or even when you don’t have a specific idea of what problem you seek to uncover in your application. 

  • PFE - Developer Notes for the Field

    Using DebugDiag's LeakTrack with ProcDump's Reflected Process Dumps

    • 0 Comments

    DebugDiag and Procdump are two of my favorite tools.  They're both incredibly useful and I rely on them heavily to gather MemoryDumps when debugging production issues.  They have a myriad of overlapping features, but they each have (at least) one feature that makes them stand apart from each other:

    • DebugDiag has the LeakTrack feature.  This lets users inject a module that tracks callstacks for allocations and stores them to heuristically identify the cause of native memory leaks in an application.  I use the feature whenever I have a native memory leak and find it indispensable.
    • ProcDump has the ability to reflect a process before creating a memory dump.  This substantially reduces the amount of time the process is interrupted when a memory dump is taken.  When dealing with production issues, this is an extremely efficacious feature.

    The catch is, DebugDiag currently can't create reflected process dumps, and ProcDump doesn't have a way to inject the LeakTrack dll to track allocations.  We can get around this by working with both tools.  Assuming both tools are installed on the server with a leaky application, the process is as follows:

    1. Identify the application that has a native memory leak: http://blogs.microsoft.co.il/blogs/sasha/archive/2008/07/13/is-it-a-managed-or-a-native-memory-leak.aspx
    2. Use DebugDiag to inject the LeakTrack dll and begin tracking allocations
    3. Set up ProcDump to create a dump of the reflected process when a memory threshold is breached

    Step 1 is easy and documented elsewhere.  Step 2 is easy if we have direct GUI access to the machine.  We can simply go to the Processes tab in DebugDiag, right click the process, and chose "Start Monitory for Leaks."  For the sake of argument, let's say we want to script the whole process.  We can do that by scripting DebugDiag and ProcDump to do the individual tasks we've set out for them.  Once we have the PID of the troubled process, we can use a script to inject the LeakTrack dll into the process.  The following vbs script will, when used by DebugDiag, start the native leak tracking process:

    Sub Debugger_OnInitialBreakpoint()
       Debugger.Write "Injecting LeakTrack" & vbLf
       Debugger.InjectLeakTrack
       Debugger.Write "Detaching2" & vbLf
       Debugger.Detach
    End Sub

    The above code can be saved into a vbs file.  For this post we'll assume it was saved to c:\scripts\InjectLeakTrack.vbs.  We'll also assume our PID was 1234

    With the PID known and the script created, we can launch DebugDiag from a command line as such (if you installed DebugDiag elsewhere, you'll have to adjust the install location): C:\PROGRA~1\DEBUGD~1\DbgHost.exe -script "c:\scripts\InjectLeakTrack.vbs" -attach 1234

    This will cause DbgHost, the debugger process DebugDiag uses, to attach to the process with PID 1234 and run the script we provided.  The script simply injects the LeakTrack dll and detaches the debugger.  Now that LeakTrack is injected, we can launch ProcDump and tell it to reflect the process and create a memory dump when we've crossed a memory threshold.  Let's say we know our application has leaked if it ever crosses 600MB.  We can kick off ProcDump as follows: procdump 1234 C:\Dumps\Over600.dmp -accepteula -r -m 600 -ma

    ProcDump will then monitor the process and create a reflected process dump when the committed bytes of the process breach 600 MB.  This memory dump can be analyzed by DebugDiag on the Advanced Analysis tab, and the offending callstacks can be identified.

     

    -Jarrod

     

     

  • PFE - Developer Notes for the Field

    The Art of Guessing

    • 2 Comments
    Often when debugging, we’re presented with a problem scenario where our application specific knowledge is limited.  This could mean we don’t have access to the source code or even that we do have access to the source code but simply don’t have time to read millions of lines of code to completely grasp the situation at hand.

    In this type of scenario, it’s generally not immediately clear what steps we should take for a resolution.  This means we often have to guess.  Guessing is somewhat of a misnomer in this situation.  I’m not advocating random guessing by any means, but rather taking a logical approach to guessing.  When selecting an action based on guess work, you should be able to explain each step along the way.  This means we need a way to guess well.

     The key to guessing well is leveraging what we do know about a system to in an attempt to make some logical assumptions to fill in the gaps of information that we don’t know.   These are still guesses, but they’re logical guesses based on reason and sound thinking.

    Below is a simple case study of a guess the number game written in Silverlight.  The goal of the case study is not to demonstrate how to hack a guess the number game, but rather to illustrate the reasoning behind the guesswork needed to beat the game without relying on…well, guessing. 

    The game works by prompting the user to guess a number and click a button.  If the user guesses incorrectly, they see the following screen:

    Our goal is to figure out what number to enter in order to guess correctly.  To start with, we want to aggregate what we know about the application:

    1. The user enters a number into a text box and clicks a button.
    2. The number is likely evaluated against something in the application and if they match, we expect to get a message indicating success.  Currently we are trying randomly and only getting failures.

    Next we have to figure out what we want to know.

    1. What happens if the user enters the correct value?
    2. How was the application created and where is it hosted?
    3. How can we figure out what the correct value is?

    With these questions, the logical first step is to talk to one of the developers that support the project. We’ll assume we got the following back:

    1. What happens if the user enters the correct value? The documentation says the label at the bottom should read, “You chose wisely.”
    2. How was the application created and where is it hosted? All I know is that it’s a Silverlight application that is viewed through the browser.
    3. How can we figure out what the correct value is?  I have no idea.  The code is all on Tommy’s computer and Tommy is on vacation in Jamaica.

    Ok, we have a couple of good answers now.  We’ve expanded what we know about the application.  Now it’s time to take some action.  Since the developer told us this is a Silverlight application hosted in a browser we also know that to debug this we can connect a debugger to the browser.

    So, we hook up to the browser, and now we have to decide what to look for.   We don’t know much about how the application works.  We don’t know where the application stores the correct value.  We don’t know enough about the application to know what method does the comparison of values.  We don’t even know if the comparison code is hosted locally or if a call is made to a server.

    That’s a lot of things we don’t know, so let’s go back to what we do know.  We know that to kick off the comparison operation the user has to click a button.  We also know this is a Silverlight application.  We can guess that the button will be a Silverlight button.

    This means we have a first step.  We can use the debugger to search for the button in the managed heap.

    First step is to load SOS for Silverlight:

    0:038> .loadby sos coreclr

    With SOS loaded, we can look on the managed heap for a button object.  We might not know going into to it exactly what type is needed to create a Silverlight button, but we can guess that it probably contains the word “Button.”

    0:038> !dumpheap -stat -type Button

    Statistics:

          MT    Count    TotalSize Class Name

    50878ec8        1           32 MS.Internal.FXCallbackDelegates+PasswordBox_AttachRevealButtonListenerDelegate

    50878c1c        1           32 MS.Internal.FXCallbackDelegates+TextBox_AttachDeleteButtonClickHandlerDelegate

    508ad1c0        1          128 System.Windows.Controls.Button

    Total 3 objects

     

    System.Windows.Controls.Button looks promising.  There is only 1 of them and it ‘feels’ right.  Next step is to find the address for that type:

    0:038> !dumpheap -mt 508ad1c0       

     Address       MT     Size

    0c862790 508ad1c0      128

    Now we have an address, so clearly the next step is to dump the object:

    0:038> !do 0c862790

    Name:        System.Windows.Controls.Button

    MethodTable: 508ad1c0

    EEClass:     504f3fd4

    Size:        128(0x80) bytes

    File:        C:\Program Files (x86)\Microsoft Silverlight\5.0.61118.0\System.Windows.dll

    Fields:

          MT    Field   Offset                 Type VT     Attr    Value Name

    50898494  4000499        4 ...eObjectSafeHandle  0 instance 0c86281c m_nativePtr

    508986f4  400049a        8 ... System.Windows]]  0 instance 0c862ad4 _valueTable

    50e36b40  400049b       24       System.Boolean  1 instance        0 _propagateInheritanceChanged

    50894fa4  400049c        c ....DependencyObject  0 instance 00000000 _inheritanceParent

    508989fc  400049d       10 ...reTypeEventHelper  0 instance 0c862810 _coreTypeEventHelper

    508777b8  400049e       14 ...angedEventHandler  0 instance 00000000 DPChanged

    50e139ec  400049f       18  System.EventHandler  0 instance 00000000 _inheritaneContextChanged

    50875ab4  40004a1       1c ...bject, mscorlib]]  0 instance 0c8628a8 _treeChildren

    508735b0  40004a2       20 ...rnal.IManagedPeer  0 instance 0c836118 _treeParent

    50e42e50  40004a0       10 ...flection.Assembly  0   shared   static _executingAssembly

    … 

    50e36b40  4001120       64       System.Boolean  1 instance        1 _isLoaded

    50e36b40  4001121       65       System.Boolean  1 instance        0 _isMouseCaptured

    50e36b40  4001122       66       System.Boolean  1 instance        0 _isSpaceOrEnterKeyDown

    50e36b40  4001123       67       System.Boolean  1 instance        0 _isMouseLeftButtonDown

    508aa2a4  4001124       6c System.Windows.Point  1 instance 0c8627fc _mousePosition

    5089acf4  4001125       58 ...mation.Storyboard  0 instance 00000000 _currentState

    50e36b40  4001126       68       System.Boolean  1 instance        0 _suspendStateChanges

    50e36b40  4001127       69       System.Boolean  1 instance        0 _isSuspendingIsEnabled

    5089eb88  4001128       5c ...tArgs, mscorlib]]  0 instance 00000000 _weakCanExecuteChangedListener

    50873a68  4001129       60 ...outedEventHandler  0 instance 0c862fb0 Click

    508987dc  400111a      e80 ...ependencyProperty  0   shared   static ClickModeProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84dc6c <<

    508987dc  400111b      e84 ...ependencyProperty  0   shared   static IsFocusedProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84dd4c <<

    508987dc  400111c      e88 ...ependencyProperty  0   shared   static IsMouseOverProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84dd88 <<

    508987dc  400111d      e8c ...ependencyProperty  0   shared   static IsPressedProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84ddf4 <<

    508987dc  400111e      e90 ...ependencyProperty  0   shared   static CommandProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84df18 <<

    508987dc  400111f      e94 ...ependencyProperty  0   shared   static CommandParameterProperty

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:0c84df78 <<

    5088fe30  400112a      e98 ...tArgs, mscorlib]]  0   shared   static CS$<>9__CachedAnonymousMethodDelegate5

        >> Domain:Value  0b0b0cd0:NotInit  0b0807c0:00000000 <<

     

    Some of the above output was trimmed for brevity.  We have some more information.  We might not have the internals of the System.Windows.Controls.Button memorized, but we do know enough to know that there is very likely a click event.  We can guess that event will have the word click in it.  Looking in the output above, we find this line:

    50873a68  4001129       60 ...outedEventHandler  0 instance 0c862fb0 Click

     

    Dumping this object, we find:

    0:038> !do 0c862fb0

    Name:        System.Windows.RoutedEventHandler

    MethodTable: 50873a68

    EEClass:     504ddb60

    Size:        32(0x20) bytes

    File:        C:\Program Files (x86)\Microsoft Silverlight\5.0.61118.0\System.Windows.dll

    Fields:

          MT    Field   Offset                 Type VT     Attr    Value Name

    50e36684  400002d        4        System.Object  0 instance 0c8163e8 _target

    50e36684  400002e        8        System.Object  0 instance 00000000 _methodBase

    50e3dfd8  400002f        c        System.IntPtr  1 instance  803c430 _methodPtr

    50e3dfd8  4000030       10        System.IntPtr  1 instance        0 _methodPtrAux

    50e36684  4000031       14        System.Object  0 instance 00000000 _invocationList

    50e3dfd8  4000032       18        System.IntPtr  1 instance        0 _invocationCount

     

    Again, we have an object whose internals may not be familiar to us, so we have to guess.   _target and _methodPtr both seem like good guesses.  We’ll start by dumping _target:

    0:038> !do 0c8163e8

    Name:        GuessTheNumber.MainPage

    MethodTable: 080356b8

    EEClass:     0835374c

    Size:        120(0x78) bytes

    File:        GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

    Fields:

          MT    Field   Offset                 Type VT     Attr    Value Name

    50898494  4000499        4 ...eObjectSafeHandle  0 instance 0c81646c m_nativePtr

    508986f4  400049a        8 ... System.Windows]]  0 instance 00000000 _valueTable

    50e36b40  400049b       24       System.Boolean  1 instance        0 _propagateInheritanceChanged

    50894fa4  400049c        c ....DependencyObject  0 instance 00000000 _inheritanceParent

    508989fc  400049d       10 ...reTypeEventHelper  0 instance 0c816460 _coreTypeEventHelper

    508777b8  400049e       14 ...angedEventHandler  0 instance 00000000 DPChanged

    50e139ec  400049f       18  System.EventHandler  0 instance 00000000 _inheritaneContextChanged

    50875ab4  40004a1       1c ...bject, mscorlib]]  0 instance 0c8637c8 _treeChildren

    508735b0  40004a2       20 ...rnal.IManagedPeer  0 instance 00000000 _treeParent

    50e42e50  40004a0       10 ...flection.Assembly  0   shared   static _executingAssembly

    08035848  4000002       54 ...er.MyRandomObject  0 instance 0c8164a8 randomObject

    5089f2f4  4000003       58 ...ows.Controls.Grid  0 instance 0c836118 LayoutRoot

    08035b18  4000004       5c ...ws.Controls.Label  0 instance 0c81dd78 label1

    08035b18  4000005       60 ...ws.Controls.Label  0 instance 0c836730 label2

    50897f28  4000006       64 ....Controls.TextBox  0 instance 0c83779c guessTextBox

    508ad1c0  4000007       68 ...s.Controls.Button  0 instance 0c862790 submitButton

    08035b18  4000008       6c ...ws.Controls.Label  0 instance 0c863124 resultLabel

    50e36b40  4000009       70       System.Boolean  1 instance        1 _contentLoaded

     

    The abbreviated content above looks promising.  The controls listed as members of the object look promising.  We still don’t have a click event though.  We’re left guessing one more time.  This time, we’ll guess that the click method is associated with the MethodTable for GuessTheNumber.MainPage. Based on this guess, we’ll dump the method descriptors for the MethodTable:

    0:038> !dumpmt -md 080356b8

    EEClass:         0835374c

    Module:          080346e4

    Name:            GuessTheNumber.MainPage

    mdToken:         02000003

    File:            GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

    BaseSize:        0x78

    ComponentSize:   0x0

    Slots in VTable: 63

    Number of IFaces in IFaceMap: 4

    --------------------------------------

    MethodDesc Table

       Entry MethodDe    JIT Name

    50c32330 50a1b150 PreJIT System.Object.ToString()

    50c32350 50a1b158 PreJIT System.Object.Equals(System.Object)

    50c323c0 50a1b178 PreJIT System.Object.GetHashCode()

    50c323d0 50a1b190 PreJIT System.Object.Finalize()

    5074cd34 50508d44 PreJIT

    50785800 50514cbc PreJIT System.Windows.Controls.Control.OnManipulationCompleted(System.Windows.Input.ManipulationCompletedEventArgs)

    50785828 50514cc4 PreJIT System.Windows.Controls.Control.OnTap(System.Windows.Input.GestureEventArgs)

    50785850 50514ccc PreJIT System.Windows.Controls.Control.OnDoubleTap(System.Windows.Input.GestureEventArgs)

    50785878 50514cd4 PreJIT System.Windows.Controls.Control.OnHold(System.Windows.Input.GestureEventArgs)

    08370240 0803566c    JIT GuessTheNumber.MainPage..ctor()

    0803c071 08035674   NONE GuessTheNumber.MainPage.submitButton_Click(System.Object, System.Windows.RoutedEventArgs)

    083703d0 08035680    JIT GuessTheNumber.MainPage.InitializeComponent() 

    Looks like we’re on to something.  The submitButton_Click event looks promising.  We can guess that is the method we’ve been looking for and dump out the IL:

    0:038> !dumpil 08035674  

    ilAddr = 083302c8

    IL_0000: nop

    IL_0001: ldarg.0

    IL_0002: ldfld GuessTheNumber.MainPage::guessTextBox

    IL_0007: callvirt System.Windows.Controls.TextBox::get_Text

    IL_000c: ldloca.s VAR OR ARG 0

    IL_000e: call System.Int32::TryParse

    IL_0013: pop

    IL_0014: ldloc.0

    IL_0015: ldarg.0

    IL_0016: ldfld GuessTheNumber.MainPage::randomObject

    IL_001b: callvirt GuessTheNumber.MyRandomObject::get_randomValue

    IL_0020: ceq

    IL_0022: ldc.i4.0

    IL_0023: ceq

    IL_0025: stloc.1

    IL_0026: ldloc.1

    IL_0027: brtrue.s IL_003c

    IL_0029: ldarg.0

    IL_002a: ldfld GuessTheNumber.MainPage::resultLabel

    IL_002f: ldstr "You chose wisely."

    IL_0034: callvirt System.Windows.Controls.ContentControl::set_Content

    IL_0039: nop

    IL_003a: br.s IL_0058

    IL_003c: ldarg.0

    IL_003d: ldfld GuessTheNumber.MainPage::resultLabel

    IL_0042: ldstr "Your guess of {0} was incorrect."

    IL_0047: ldloc.0

    IL_0048: box System.Int32

    IL_004d: call System.String::Format

    IL_0052: callvirt System.Windows.Controls.ContentControl::set_Content

    IL_0057: nop

    IL_0058: ret

    IL isn’t always easy to follow without practice, but the above makes it clear we are in the right place since we can see both, “You chose wisely,” and “Your guess of {0} was incorrect.” To make the code easier to read, we can extract the module and decompile the IL into C#

    0:038> !dumpmodule 080346e4

    Name:       GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

    Attributes: PEFile

    Assembly:   0b0c2310

    LoaderHeap:              00000000

    TypeDefToMethodTableMap: 08030124

    TypeRefToMethodTableMap: 0803013c

    MethodDefToDescMap:      0803020c

    FieldDefToDescMap:       08030248

    MemberRefToDescMap:      0803027c

    FileReferencesMap:       0803035c

    AssemblyReferencesMap:   08030360

    MetaData start address:  08330d00 (4512 bytes)

    0:038> lmm GuessTheNumber

    start    end        module name

    08330000 08332600   GuessTheNumber C (no symbols)          

    0:038> !savemodule 08330000 d:\GuessTheNumberModule.dll

    3 sections in file

    section 0 - VA=2000, VASize=1da4, FileAddr=200, FileSize=1e00

    section 1 - VA=4000, VASize=370, FileAddr=2000, FileSize=400

    section 2 - VA=6000, VASize=c, FileAddr=2400, FileSize=200

     

    Using a decompiler such as Redgate’s reflector, we see the following C#:

    private void submitButton_Click(object sender, RoutedEventArgs e)

    {

        int num;

        int.TryParse(this.guessTextBox.get_Text(), out num);

        if (num == this.randomObject.randomValue)

        {

            this.resultLabel.set_Content("You chose wisely.");

        }

        else

        {

            this.resultLabel.set_Content(string.Format("Your guess of {0} was incorrect.", num));

        }

    }

    Now it’s clear what’s going on here.  The value entered by the user into the text box is parsed into an int then compared to another int stored in this.randomObject.randomValue.  All that’s left for us to do is to find the current value for randomValue.  We know that in this case, this refers to a MainPage object.  Interestingly, that’s the exact type we dumped previously as the _target from the event handler that we found.  Reviewing our previous output, we find:

    08035848  4000002       54 ...er.MyRandomObject  0 instance 0c8164a8 randomObject

    Dumping the object we see:

    0:038> !do 0c8164a8

    Name:        GuessTheNumber.MyRandomObject

    MethodTable: 08035848

    EEClass:     08353800

    Size:        12(0xc) bytes

    File:        GuessTheNumber, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

    Fields:

          MT    Field   Offset                 Type VT     Attr    Value Name

    50e371d4  4000001        4         System.Int32  1 instance      174 <randomValue>k__BackingField

    We see a member variable named randomValue which is exactly what we were looking for.  Using the value current value of 174, we’ll try another guess in the application:

     

    Good guess.

    To recap, we were tasked with troubleshooting an application that we knew very little about.  We asked a few questions, used a bit of prior knowledge, and a handful of reason driven guesses to get us to a solution.

    Although it’s unlikely you will ever need to cheat at a Guess the Number game in a production environment, the approach here is applicable to any situation in which you are approaching with limited domain specific knowledge.

    Guess wisely my friends.

     

    -Jarrod

     

  • PFE - Developer Notes for the Field

    PInvoke Error in .NET 4: Array size control parameter index is out of range

    So in a code-base I was working in yesterday, we use PInvoke to call out to the Performance Data Helper (PDH) API’s to collect performance information for machines without using Perfmon.  One of those PInvoke calls looked like this: /* PDH_STATUS...
  • PFE - Developer Notes for the Field

    Starting a new Chapter in the Career of Mike Huguet

    A couple of months ago an opportunity unexpectedly made its way into my life that would offer something different for my career.  At first I ignored it, but after a while I started thinking about the possibilities before me.  There were a couple of job...
  • PFE - Developer Notes for the Field

    Research for the Week-Memory Management Part 1

    Do you know what happens when you create a new instance of a class?  Why do we need a value type and a reference type?  Do you know what is meant when the .NET framework is referred to as a “managed” platform?  What is managed versus unmanaged code? ...
  • PFE - Developer Notes for the Field

    Research for the Week–How IIS Works

    In the world of custom web application development a critical component of our success is the hosting environment and services.  We can choose to deploy our web applications using 3rd party hosting providers such as GoDaddy.com, Rackspace, DiscountASP...
  • PFE - Developer Notes for the Field

    Getting more information from the Word error box when troubleshooting OpenXML / WordML issues

    So, many apologies for dropping off the face of the blogosphere lately.  Fortunately (or unfortunately, depending on your perspective), I’ve been really busy at work.  I’ve been working on some really cool things that I hope I’ll be able to...
  • PFE - Developer Notes for the Field

    Upcoming Speaking Schedule for 2011 Q2

    I have stepped it up a notch and am speaking at numerous user groups and technical conferences in the South East in the coming months.  These include .NET & SharePoint user groups as well as SQL Saturday and SharePoint Saturday all day conferences...
  • PFE - Developer Notes for the Field

    Silverlight Error - “Failed to create a system.double from text ‘xx’”

    I encountered the error “Failed to create a system.double from text ‘50’” when trying to develop a SketchFlow project.  After doing some digging I found that it was just that I had a space after my number “50.”  After removing the space, SketchFlow was...
  • PFE - Developer Notes for the Field

    SketchFlow Build Error - “The Name ‘InitializeComponent’ does not exist in the current context”

    This is my first real go round with Expression Blend and SketchFlow and it is taking quite a while to get used to it.  After several iterations of names and screen dragging, dropping in the SketchFlow Map, and adding a few elements to some screens, I...
  • PFE - Developer Notes for the Field

    Visual Studio 2010 Type or namespace ‘xyz’ does not exist in…

    It pains me to write this post as I feel like an idiot for having wasted my time on this “problem.”  Hopefully in posting this, I can keep some other poor lost soul working at 4 AM in the morning from spending wasteful minutes scratching his head and...
  • PFE - Developer Notes for the Field

    The (potentially) dark side of parallelism…

    Read a great blog entry by Scott Hanselman recently talking about the parallel dilemma that I’m sure we’ll see folks face in the future with the (old/new) Parallel classes.  I wanted to add a few things to this discussion as he focused on the...
  • PFE - Developer Notes for the Field

    Yield Return…a little known but incredible language feature

    “We shall neither fail nor falter; we shall not weaken or tire…give us the tools and we will finish the job.” – Winston Churchill I don’t often blog about specific language features but over the past few weeks I’ve spoken to a few folks that...
  • PFE - Developer Notes for the Field

    Linking your pivot collections the fun and easy way

    In my previous post , I added to my series of entries on making sense of your ASP.NET event log error messages.  Note that this is entry #4 in this series.  The previous three entries can be found here: Part 1:  Parsing ASP...
  • PFE - Developer Notes for the Field

    Pivoting ASP.NET event log error messages

    Unless you’ve been hiding under the proverbial rock, you’ve probably seen the recent Pivot hoopla .  If you’re not familiar with it, it’s a way to visualize a large amount of data in a nice filterable format.  The nice thing about it is that...
  • PFE - Developer Notes for the Field

    Visualizing Azure performance counters…

    I’ve blogged before about how some things are different in the cloud, namely Azure .  That post dealt with finding out why your Azure Role crashed by using some of the logging facilities and storage available in Azure.  That’s kind of a worst...
  • PFE - Developer Notes for the Field

    Don’t guess when it comes to performance…a RegEx story.

    Many times, when I work with a customer, it’s because they’ve tried to accomplish something and need a little extra help.  Often, this falls into the application optimization area.  For example, a few years ago, I had a customer that was...
  • PFE - Developer Notes for the Field

    Optimizing your life: E-Mail Edition

    As with most people, I hate to do the same thing more than once.  In-fact, I had a job in college where I audited expense reports and then had to enter them into an application so they could be paid.  The auditing of the expense report was...
  • PFE - Developer Notes for the Field

    Exposing any class as a WCF endpoint…

    Often, I’ll have a customer that wants to utilize a new feature in the .NET framework but they don’t necessarily know where to start or they have a large investment in another area and they’re afraid of the amount of rework required to get something...
  • PFE - Developer Notes for the Field

    Parsing ASP.NET event log error messages for fun and profit

    Sometimes a customer will ask me to look at their site and make some recommendations on what can be improved.  One of the many things I’ll look at is their event logs.  One of the nice things about ASP.NET is that when you encounter an unhandled...
  • PFE - Developer Notes for the Field

    Whodunit: Who threw the message box, and why?

    • 1 Comments

    My name is Brad and I’ve been on the PFE team here at Microsoft for many years. Suffice to say, I’m overdue for contributing to the team blog. I’ve seen lots of interesting (and not-so-interesting) issues with customers all over the world in my time at PFE. What follows is an issue I worked earlier this year. For me, the most interesting part of this issue was not so much in finding root cause as it was the process of discovering who was behind root cause.

     

    The Problem

    It all started with a customer who reported their ASP.NET application had an OutOfMemory issue. These kinds of issues are not at all uncommon in the .NET world, and the trick usually comes down to finding what object(s) are rooted so that the .NET Garbage Collector can’t reclaim the memory associated with said object(s).

     

     Getting data from the problem

    They sent me a dump of the problematic application pool, mentioning that they dumped it two hours after they received notice about the OutOfMemoryException (OOM). My initial thought on this was that they had recycled the process and then obtained a dump of the fresh new process. This would obviously be no good since the w3wp instance exhibiting the OOM was gone, and the dump instead represented a new process instance with no memory pressure. Unfortunately, this wouldn’t be the first time I’d had this problem.

    However, when I received the dump, I was pleasantly surprised to see that they had dumped the correct process instance. It was over 1GB in size, representing a 32-bit process with significant memory pressure. And when I looked at the length of time the process had been alive, that verified this was the same instance that threw the OOM.


    Windows Server 2003 Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible
    Product: Server, suite: TerminalServer SingleUserTS
    Machine Name:
    Debug session time: Thu Feb 25 15:07:20.000 2010 (UTC - 5:00)
    System Uptime: 194 days 18:11:06.875 Process Uptime: 0 days 15:03:28.000

     

    Analyzing the data, Part 1

    After searching through the dump I found the problematic object that led to the OOM. And after talking with the right folks at the customer, we pieced back together how the problem arose in addition to a resolution to this problem.

    So one question was answered, but another one remained: How did this process manage to stay alive for a full two hours after getting the OOM? I’ve been debugging issues like this since .NET 1.0 was released, and I can tell you that this wasn’t a normal set of circumstances.

    The thread that threw the OOM turned out to be the same thread that answered the question.

    # ChildEBP RetAddr
    00 0c3aedb8 7c82775b ntdll!KiFastSystemCallRet
    01 0c3aedbc 773d7a4b ntdll!NtRaiseHardError+0xc
    02 0c3aee18 773b8377 user32!ServiceMessageBox+0x145
    03 0c3aef74 7739eec9 user32!MessageBoxWorker+0x13e
    04 0c3aefcc 773d7d0d user32!MessageBoxTimeoutW+0x7a
    05 0c3af000 773c42c8 user32!MessageBoxTimeoutA+0x9c
    06 0c3af020 773c42a4 user32!MessageBoxExA+0x1b
    07 0c3af03c 7c34c224 user32!MessageBoxA+0x45
    08 0c3af070 7c348e6c msvcr71!__crtMessageBoxA+0xf4
    09 0c3af294 7c34cf83 msvcr71!_NMSG_WRITE+0x12e
    0a 0c3af2cc 7a09bea7 msvcr71!abort+0x7
    0b 0c3af2d8 77e761b7 mscorwks!InternalUnhandledExceptionFilter+0x16
    0c 0c3af530 77e792a3 kernel32!UnhandledExceptionFilter+0x12a
    0d 0c3af538 77e61ac1 kernel32!BaseThreadStart+0x4a
    0e 0c3af560 7c828752 kernel32!_except_handler3+0x61
    0f 0c3af584 7c828723 ntdll!ExecuteHandler2+0x26
    10 0c3af62c 7c82863c ntdll!ExecuteHandler+0x24
    11 0c3af90c 77e4bee7 ntdll!RtlRaiseException+0x3d
    12 0c3af96c 78158e89 kernel32!RaiseException+0x53
    13 0c3af9a4 7a14fa70 msvcr80!_CxxThrowException+0x46
    14 0c3af9b8 7a108013 mscorwks!ThrowOutOfMemory+0x24
    15 0c3afae4 7a109f7d mscorwks!Thread::RaiseCrossContextException+0x408
    16 0c3afb98 79fd878b mscorwks!Thread::DoADCallBack+0x2a2
    17 0c3afbb4 79e983fb mscorwks!ManagedThreadBase_DispatchInner+0x35
    18 0c3afc48 79e98321 mscorwks!ManagedThreadBase_DispatchMiddle+0xb1
    19 0c3afc84 79e984ad mscorwks!ManagedThreadBase_DispatchOuter+0x6d

    <clipped for brevity>

    As seen in the stack above, after the OOM was thrown in frame 0x14, a message box was thrown in frame 0x8. A message box will keep the process alive until someone clicks OK, Cancel, etc. and the message box goes away. In short, message boxes in server-side processes are never a good thing since they will hang your application!

    Once again, we have answered one question, but another question remains: the process stayed alive for two hours after it threw the OOM because a message box was thrown, but who threw the message box, and why?

     

    Analyzing the data, Part 2

    As is common when finding a message box that’s popped up in a process, I wanted to see what the message box said. From MSDN, we learn that the second parameter to user32!MessageBoxExA is the text of the message, and the third parameter is the text of the caption in the message box. Using the kb command, we can retrieve these parameters, and then dump them to find the values:

    0:023> kbn7
    # ChildEBP RetAddr Args to Child
    00 0c3aedb8 7c82775b 773d7a4b 50000018 00000004 ntdll!KiFastSystemCallRet
    01 0c3aedbc 773d7a4b 50000018 00000004 00000003 ntdll!NtRaiseHardError+0xc
    02 0c3aee18 773b8377 0dbec7a8 0bcda228 00012010 user32!ServiceMessageBox+0x145
    03 0c3aef74 7739eec9 0c3aef80 00000028 00000000 user32!MessageBoxWorker+0x13e
    04 0c3aefcc 773d7d0d 00000000 0dbec7a8 0bcda228 user32!MessageBoxTimeoutW+0x7a
    05 0c3af000 773c42c8 00000000 0c3af0a4 7c37f480 user32!MessageBoxTimeoutA+0x9c
    06 0c3af020 773c42a4 00000000 0c3af0a4 7c37f480 user32!MessageBoxExA+0x1b

    0:023> da 7c37f480
    7c37f480 "Microsoft Visual C++ Runtime Lib"
    7c37f4a0 "rary"

    0:023> da 0c3af0a4
    0c3af0a4 "Runtime Error!..Program: c:\wind"
    0c3af0c4 "ows\system32\inetsrv\w3wp.exe..."
    0c3af0e4 ".This application has requested "
    0c3af104 "the Runtime to terminate it in a"
    0c3af124 "n unusual way..Please contact th"
    0c3af144 "e application's support team for"
    0c3af164 " more information..."

    I’ve seen this message before, and it can have different underlying causes. The bottom line – at this point in the troubleshooting phase – is that it isn’t a custom message box thrown carelessly by application code. The trick now is to determine how the message box was thrown.

    I made a few unusual observations about the call stack in question.

    - First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common.

    - Secondly, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb that appears to call the 7.1 CRunTime’s abort() function in frame 0xa. Finally, a message box results.
    What a wild ride!!

    - Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!

    Fortunately, I’d been fooled by such an appearance before, so I knew to take a look at the raw stack first, before I went reading through the .NET source to see if mscorwks!InternalUnhandledExceptionFilter really does call for a message box to be thrown. When I looked at the raw stack around the frames 0xb through 0x8, this is what I found in between all the frames displayed from a kb command:

    0:023> dds 0c3af070 L90
    0c3af070 0c3af208
    0c3af074 7c348e6c msvcr71!_NMSG_WRITE+0x12e
    0c3af078 0c3af0a4
    0c3af07c 7c37f480 msvcr71!`string'
    <clip>
    0c3af26c 77e66ebd kernel32!VirtualQueryEx+0x1d
    0c3af270 000d0f48
    0c3af274 0000003a
    0c3af278 0008b788
    0c3af27c 02403a0a
    0c3af280 0000003a
    0c3af284 013af29c*** WARNING: Unable to verify checksum for ThirdParty.dll
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdParty.dll -
    ThirdParty!FooClass::m_bVar+0x9cc8
    0c3af288 0c3af2c4
    0c3af28c 7c349600 msvcr71!_beginthreadex+0x7
    0c3af290 90a99269
    0c3af294 0c3af2c4
    0c3af298 7c34cf83 msvcr71!abort+0x7 <—frame 0xa of the kn callstack above
    0c3af29c 0000000a
    0c3af2a0 00000000
    0c3af2a4 0c3af558
    0c3af2a8 7a3b2000 mscorwks!COMUnhandledExceptionFilter
    0c3af2ac 0c3af2a0
    0c3af2b0 00000103
    0c3af2b4 0c3af520
    0c3af2b8 7c34240d msvcr71!_except_handler3
    0c3af2bc 7c380ea0 msvcr71!type_info::`vftable'+0x94
    0c3af2c0 ffffffff
    0c3af2c4 0c3af2d8
    0c3af2c8 7c35f0a8 msvcr71!__CxxUnhandledExceptionFilter+0x2b
    0c3af2cc 77ecb7c0 kernel32!gcsUEFInfo
    0c3af2d0 7a09bea7 mscorwks!InternalUnhandledExceptionFilter+0x16
    <—frame 0xb of the kn callstack above

    Aha! So there is, in fact, someone else in between the .NET Framework and the call to throw the message box.

    Now, let’s say you still want some kind of proof that the .NET Framework doesn’t make the call to msvcr71!abort(), which results in a call to show the message box. I had my doubts that mscorwks.dll had the 7.1 CRT as a dependency. On a tip from fellow PFE Zach Kramer, you can prove this by running dumpbin /imports on the binary (Dumpbin is an old – but very handy - utility that still ships with Visual Studio). I could always obtain the binary by asking my customer for their mscorwks.dll, but it’s much easier to just use psscor2!savemodule.

    0:023> lmvmmscorwks
    start end module name
    79e70000 7a400000 mscorwks
    Loaded symbol image file: mscorwks.dll

    0:023> !savemodule 79e70000 F:\modules\mscorwks.dll
    5 sections in file at 79e70000
    Successfully saved file: F:\Modules\mscorwks.dll

    Then from a command window:

    f:\Modules>dumpbin /imports mscorwks.dll >Imports.txt

     

     Searching Imports.txt for the string ‘msvcr7’ comes up empty. In fact, the build of mscorwks used by my customer’s application was msvcr80.dll. This makes sense when you look at frames 0x14-0x13 in our call stack.

    What if we employ the same technique on this ThirdParty.dll – will it show the assembly depends on the 7.1 CRT? Scanning the output for dumpbin /imports on ThirdParty dll shows the following:

    Microsoft (R) COFF/PE Dumper Version 9.00.30729.01
    Copyright (C) Microsoft Corporation. All rights reserved.

    Dump of file ThirdParty.dll
    File Type: DLL

      Section contains the following imports:
       MSVCR71.dll
            1000E154 Import Address Table
            10010554 Import Name Table 
               0 time date stamp
               0 Index of first forwarder reference

    <clip>

    So we know this ThirdParty.dll has the 7.1 CRT as a dependency, but that alone doesn’t prove this component throws the message box. And according to my customer, the vendor of ThirdParty.dll had been approached many times in the past regarding these “phantom message boxes” hanging production applications. But the vendor had denied any involvement without explicit proof. So the presence of their dll on the stack next to a call to msvcr71!abort() might not suffice when I confronted them with this issue. I felt I still had some work ahead of me.

     

    Deeper dive

    Before continuing, let’s do a quick review of what we know and don’t know:

    1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned root cause of the OOM.

    2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application.

    3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework.

    4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, and we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit).

    5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()?

    6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?

    First, let’s tackle the question about connecting the dots seen in the call stack frames. There’s a pretty clear and concise explanation for this in an MSDN Magazine article from 2008.

    Unhandled Exception Processing In The CLR

    When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing

    In other words, ThirdParty.dll had registered its UEF before the .NET Framework. And its UEF took the default road of calling abort()  and throwing a message box. After ThirdParty.dll registered its UEF, the .NET Framework then registered its UEF callback. But it didn’t want to be rude and step over the UEF that ThirdParty.dll had registered first, so it chains back to it. Therefore, the result of msvcr71!abort() being called is due to the UEF registered by ThirdParty.dll.

    Next, let’s tackle the question about msvcr71!abort() throwing a message box. Since this ThirdParty.dll was using the 7.1 build of the CRT, let’s look in MSDN for the information on msvcr71!abort():

    abort determines the destination of the message based on the type of application that called the routine. Console applications always receive the message through stderr. In a single or multithreaded Windows application, abort calls the Windows MessageBox API to create a message box to display the message with an OK button. When the user clicks OK, the program aborts immediately.

    To influence the behavior of abort(), simply call _set_error_mode in the DllMain so that it doesn’t exercise the default behavior of throwing a message box. MSDN’s documentation on _set_error_mode states you can use _OUT_TO_STDERR for the lone parameter, and this will avoid the message box when abort() is called.

    A colleague of mine, Senior Escalation Engineer Bret Bentzinger, offered to write some sample code that would load ThirdParty.dll and test this proposed resolution of passing _OUT_TO_STDERR to _set_error_mode. Doing this confirmed that no message box was thrown, and the thread exits without hanging the process.

     

    In the end, this problem of throwing a message box from the CRT’s call to abort() isn’t a new one. This issue has been around for ages. But it was my first opportunity to drive such an issue and see it through to a resolution. My customer got a fix from the vendor, and we were able to help the vendor write better, more stable code. It was a win-win situation!

Page 1 of 3 (75 items) 123