If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips    by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

.NET Debugging Demos Lab 1: Hang - Review

.NET Debugging Demos Lab 1: Hang - Review

Rate This
  • Comments 54

Earlier this week I published the first .NET Debugging Demos Lab. If you haven't downloaded and tried it out you can get it here.

This is a review of this lab with answers to the questions and comments inline in red.

Note: results and thread IDs will vary depending on how fast you click and what machine you have and what threads the request happen to run on, so I will answer with the results from my dual-proc W2k3 machine.

Reproduce the problem:

1. Browse to http://localhost/BuggyBits/FeaturedProducts.aspx
This should take about 5 seconds to show, you can see the start time and execution time the bottom of the page.

2. Open up 5 browsers, all browsing to this site and refresh them simultaneously

Note the execution time for each of them and make sure that the start time is pretty much the same on all (otherwise you probably didn’t run the reg file)

Q:  What are the execution times?

A:   Browser 1 - 5.0 s, Browser 2 - 9.1 s, Browser 3 - 12.57 s, Browser 4 - 16.07 s, Browser 5 - 18.61 s (if you don't see increasing execution times check that you ran the reg. file, otherwise there will be a 5 sec delay between the starttimes for all requests)

Q:  What is the CPU usage of the w3wp.exe process when reproing the problem? High or low CPU?

A:  Very low CPU usage, 0-5%

Q:  What are some potential reasons for a hang with these symptoms?

A:  Since the request times are increasing and CPU usage is low it appears that a) we are waiting on some common resource which can only be used by one thread at a time, and b) we are not in a tight loop so we are likely blocked waiting for some external resource or similar

Get a memory dump:

1. Start a command window and browse to your debuggers directory.  Type the following command to prepare taking the dump but don’t hit enter quiet yet.
adplus –hang –pn w3wp.exe –quiet

Note: The quiet switch is used so that we won't get popup boxes asking for more info since we don't have that much time to get the dump before the requests finish.  You will typically get a message box asking you to set cscript as your default script, one telling you that you haven't set up your symbol path in the environment variables and one informing you that you are about to gather a hang dump.  You can simply click yes to all of these.

2. Reproduce the problem either by browsing with 5 browsers as you did before or by stressing the site with tinyget with the following command line

tinyget -srv:localhost -uri:/BuggyBits/FeaturedProducts.aspx -threads:30 -loop:50

Note: This will start 30 threads and send 50 requests to the FeaturedProducts.aspx

3. Hit enter in the adplus command window to take the memory dump while the requests are still executing.

Q: In adplus hang mode, what triggers the generation of the memory dump?

A:  This is a bit of a trick question. The generation of the dump is triggered by you running adplus,  i.e. it doesn't wait for a hang condition of any kind (unlike the -crash switch) so you can take a dump with -hang anytime without the process actually hanging and this is what you do in high memory situations for example.


Q: What permissions do you need to take a memory dump of a process?

A: Debugging with windbg/adplus is no different than debugging with other debuggers in terms of permissions.  You need to own the process or have higher permissions.  In hang mode dumps are taken in non-invasive mode meaning that we don't actually break in to the process.  Therefore you don't need to be on the same winstation where the process started to take a -hang dump which is useful when debugging services through terminal server.  In crash mode on the other hand you need to be in the same winstation where the process started so for services this means you need to log in with /console if you log in with terminal server or remote desktop.


Q: Where are the dumps created? Hint: check the windbg help for adplus/hang mode

A: The default location for dumps is in the debuggers directory where a new folder will be created everytime you run adplus with -hang or -crash.  The folder will be named something like C:\debuggers\32bit\Hang_Mode__Date_02-06-2008__Time_11-15-0505. 

The folder will contain the following items

  • CDBScripts - a folder with a config file used to instruct windbg/cdb on what commands to run
  • ADPlus_report.txt - information about what happened, i.e. adplus attached to the process etc.
  • PID-5656__W3WP.EXE__Date_02-06-2008__Time_11-15-0505.log - log file of the run, in hang mode this includes stacks of all threads, loaded modules info and !runaway output.  If no dumpfile was created you should look at the bottom of this log to see why
  • PID-5656__W3WP.EXE__full_1c38_2008-02-06_11-15-08-005_1618.dmp - the dump file
  • Process_List.txt - tlist.exe output, showing all processes running on the system at the time the dump was taken

Open the dump in Windbg.exe

1. Open windbg and open the memory dump (.dmp file) with File/Open Crash dump.

2. Set up the symbol path (see Information and Setup Instructions for more info)

3. Load sos (see Information and Setup Instructions for more info)

Examine the stacks

1. Examine the native callstacks

~* kb 2000

2. Examine the .net callstacks

~* e !clrstack

Q:  Do you see any patterns or recognize any of the callstacks that suggests a thread is waiting for a synchronization mechanism?

A:  There are a lot of threads with the following native callstacks where the frames in italic are frames that windbg could not interpret because they are managed/.net

0:066> k 2000
ChildEBP RetAddr  
1156e7b0 7d4e286c ntdll!ZwWaitForMultipleObjects+0x15 
1156e858 79ed98fd kernel32!WaitForMultipleObjectsEx+0x11a 
1156e8c0 79ed9889 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x6f
1156e8e0 79ed9808 mscorwks!Thread::DoAppropriateAptStateWait+0x3c
1156e964 79ed96c4 mscorwks!Thread::DoAppropriateWaitWorker+0x13c
1156e9b4 79ed9a62 mscorwks!Thread::DoAppropriateWait+0x40
1156ea10 79e78944 mscorwks!CLREvent::WaitEx+0xf7
1156ea24 79ed7b37 mscorwks!CLREvent::Wait+0x17
1156eab0 79ed7a9e mscorwks!AwareLock::EnterEpilog+0x8c
1156eacc 79f59024 mscorwks!AwareLock::Enter+0x61
1156eb34 79fc6352 mscorwks!AwareLock::Contention+0x199
1156ebd4 0ff10dd3 mscorwks!JITutil_MonContention+0xa3
WARNING: Frame IP not in any known module. Following frames may be wrong.
1156ec34 793ae896 0xff10dd3
1156ee8c 6614d8c3 mscorlib_ni+0x2ee896
1156eec4 6614d80f System_Web_ni+0x22d8c3
1156ef00 6614d72f System_Web_ni+0x22d80f
1156ef20 65fe6bfb System_Web_ni+0x22d72f
1156ef54 65fe3f51 System_Web_ni+0xc6bfb
1156ef90 65fe7733 System_Web_ni+0xc3f51
1156efe4 65fccbfe System_Web_ni+0xc7733
1156eff8 65fd19c5 System_Web_ni+0xacbfe
1156f01c 7938111c System_Web_ni+0xb19c5
00000000 00000000 mscorlib_ni+0x2c111c

And the following .NET callstacks

0:066> !clrstack
OS Thread Id: 0x960 (66)
ESP       EIP     
1156ea4c 7d61d051 [GCFrame: 1156ea4c] 
1156eb88 7d61d051 [HelperMethodFrame: 1156eb88] System.Threading.Monitor.Enter(System.Object)
1156ebdc 0ff10dd3 DataLayer.GetFeaturedProducts()
1156ec18 0ff10c6c FeaturedProducts.Page_Load(System.Object, System.EventArgs)
1156ec58 66f12980 System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
1156ec68 6628efd2 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
1156ec78 6613cb04 System.Web.UI.Control.OnLoad(System.EventArgs)
1156ec88 6613cb50 System.Web.UI.Control.LoadRecursive()
1156ec9c 6614e12d System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
1156ee98 6614d8c3 System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
1156eed0 6614d80f System.Web.UI.Page.ProcessRequest()
1156ef08 6614d72f System.Web.UI.Page.ProcessRequestWithNoAssert(System.Web.HttpContext)
1156ef10 6614d6c2 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
1156ef24 0ff10205 ASP.featuredproducts_aspx.ProcessRequest(System.Web.HttpContext)
1156ef28 65fe6bfb System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
1156ef5c 65fe3f51 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
1156ef9c 65fe7733 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
1156efec 65fccbfe System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
1156f008 65fd19c5 System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
1156f03c 65fd16b2 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
1156f048 65fcfa6d System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
1156f258 79f047fd [ContextTransitionFrame: 1156f258] 
1156f28c 79f047fd [GCFrame: 1156f28c] 
1156f3e8 79f047fd [ComMethodFrame: 1156f3e8] 

If we put them together, the stacks for these threads would look something like this

0:066> k 2000
ChildEBP RetAddr  
1156e7b0 7d4e286c ntdll!ZwWaitForMultipleObjects+0x15 
1156e858 79ed98fd kernel32!WaitForMultipleObjectsEx+0x11a 
1156e8c0 79ed9889 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x6f
1156e8e0 79ed9808 mscorwks!Thread::DoAppropriateAptStateWait+0x3c
1156e964 79ed96c4 mscorwks!Thread::DoAppropriateWaitWorker+0x13c
1156e9b4 79ed9a62 mscorwks!Thread::DoAppropriateWait+0x40
1156ea10 79e78944 mscorwks!CLREvent::WaitEx+0xf7
1156ea24 79ed7b37 mscorwks!CLREvent::Wait+0x17
1156eab0 79ed7a9e mscorwks!AwareLock::EnterEpilog+0x8c
1156eacc 79f59024 mscorwks!AwareLock::Enter+0x61
1156eb34 79fc6352 mscorwks!AwareLock::Contention+0x199
1156ebd4 0ff10dd3 mscorwks!JITutil_MonContention+0xa3
1156ea4c 7d61d051 [GCFrame: 1156ea4c] 
1156eb88 7d61d051 [HelperMethodFrame: 1156eb88] System.Threading.Monitor.Enter(System.Object)
1156ebdc 0ff10dd3 DataLayer.GetFeaturedProducts()
1156ec18 0ff10c6c FeaturedProducts.Page_Load(System.Object, System.EventArgs)
1156ec58 66f12980 System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
1156ec68 6628efd2 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
1156ec78 6613cb04 System.Web.UI.Control.OnLoad(System.EventArgs)
1156ec88 6613cb50 System.Web.UI.Control.LoadRecursive()
1156ec9c 6614e12d System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
1156ee98 6614d8c3 System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
1156eed0 6614d80f System.Web.UI.Page.ProcessRequest()
1156ef08 6614d72f System.Web.UI.Page.ProcessRequestWithNoAssert(System.Web.HttpContext)
1156ef10 6614d6c2 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
1156ef24 0ff10205 ASP.featuredproducts_aspx.ProcessRequest(System.Web.HttpContext)
1156ef28 65fe6bfb System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
1156ef5c 65fe3f51 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
1156ef9c 65fe7733 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
1156efec 65fccbfe System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
1156f008 65fd19c5 System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
1156f03c 65fd16b2 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
1156f048 65fcfa6d System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
1156f258 79f047fd [ContextTransitionFrame: 1156f258] 
1156f28c 79f047fd [GCFrame: 1156f28c] 
1156f3e8 79f047fd [ComMethodFrame: 1156f3e8] 

So we can see from this stack (reading it from the bottom up just like in visual studio) that FeaturedProducts.Page_Load calls a method DataLayer.GetFeaturedProducts(). 
This in turn calls Monitor.Enter() which is done when trying to get a .net lock

The code for GetFeaturedProducts should look something like this

void GetFeaturedProducts(){
  ...
  lock(...){
    // do something
  }
  ...
}


and we are stuck on the lock(...) line because someone else has grabbed the lock

Troubleshoot the hang

1. Determine the ID of the thread owning the lock
!syncblk

Q: What thread owns the lock?

A: My !syncblk reports the following after reproing with tinyget

0:066> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
   20 001c6f74           85         1 0f4a0a70  1ea0  30   02f07964 System.Object
-----------------------------
Total           22
CCW             2
RCW             0
ComClassFactory 0
Free            5

The thread that owns the lock is the last column under Owning Thread Info (In this case thread 30), so this is the thread that is in the //do something portion
of the GetFeaturedProducts() above, or otherwise holds the lock that we are trying to enter
 

Q: How many threads are waiting for the lock?
Hint: MonitorHeld = 1 for each owner and 2 for each waiter.

A: MonitorHeld here is 85 which means that we have one owner and 84/2=42 waiters, i.e. 42 requests waiting for the lock, if i run

    .shell -ci "~* e !clrstack" FIND /C Monitor.Enter

to see how many callstacks are sitting in a Monitor.Enter it returns 42 which jives well with this.

The fact that 42 is the answer to Life, the Universe and Everything is pure coincidence:)

2. Pick one of the waiters (Hint: waiters will sit in AwareLock::Enter) and take a look at what it is doing.

~5s                          (move to thread 5, replace 5 with actual thread ID)
kb 2000                    (examine native stack)
!clrstack                    (examine .net stack)

Q: In which .net function is it waiting for the lock?

A: DataLayer.GetFeaturedProducts()

3. Determine what the owning thread is doing 

~5s                          (move to thread 5, replace 5 with actual thread ID)
kb 2000                    (examine native stack)
!clrstack                    (examine .net stack)

Q: Why is it blocking?

Thread 30 appears to be sitting in a Thread.Sleep() called from DataLayer.GetFeaturedProducts() so this would explain why we are blocking there (while holding the lock)

0:030> !clrstack
OS Thread Id: 0x1ea0 (30)
ESP       EIP     
1036ec88 7d61cca8 [HelperMethodFrame: 1036ec88] System.Threading.Thread.SleepInternal(Int32)
1036ecdc 0ff10ddd DataLayer.GetFeaturedProducts()
1036ed18 0ff10c6c FeaturedProducts.Page_Load(System.Object, System.EventArgs)
1036ed58 66f12980 System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
1036ed68 6628efd2 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
1036ed78 6613cb04 System.Web.UI.Control.OnLoad(System.EventArgs)
1036ed88 6613cb50 System.Web.UI.Control.LoadRecursive()
1036ed9c 6614e12d System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
1036ef98 6614d8c3 System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
1036efd0 6614d80f System.Web.UI.Page.ProcessRequest()
1036f008 6614d72f System.Web.UI.Page.ProcessRequestWithNoAssert(System.Web.HttpContext)
1036f010 6614d6c2 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
1036f024 0ff10205 ASP.featuredproducts_aspx.ProcessRequest(System.Web.HttpContext)
1036f028 65fe6bfb System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
1036f05c 65fe3f51 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
1036f09c 65fe7733 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
1036f0ec 65fccbfe System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
1036f108 65fd19c5 System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
1036f13c 65fd16b2 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
1036f148 65fcfa6d System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
1036f358 79f047fd [ContextTransitionFrame: 1036f358] 
1036f38c 79f047fd [GCFrame: 1036f38c] 
1036f4e8 79f047fd [ComMethodFrame: 1036f4e8] 
 

4. Examine the code for .NET method owning the lock to verify your theory.

    public DataTable GetFeaturedProducts()
    {
        lock (syncobj)
        {
            //faking a long running query to the database
            Thread.Sleep(5000);
            
             //populate a table with the featured products
             DataTable dt = new DataTable();
             DataRow dr;
             ...
        }
        ...
    }

As expected GetFeaturedProducts() takes a lock and then sits in a sleep for 5 seconds,  this is of course a place holder for a long running operation, but a sleep is easier to use to get a consistent repro in a demo.  The resolution here will of course depend largely on the application but it would include looking into how long this operation is supposed to be and finding out if threads are blocking more often than expected, and after review finding ways to make sure we lock around as small a portion of code as possible to avoid having lots of requests waiting for a single resource.

Hope you found the lab interesting, I will try to get the next lab (a crash scenario) out by the end of the week. 

Feel free to add your own comments if you think I missed something in my answers or if your answers differ from mine (other than thread ids, number of waiters etc.) or if you didn't like the lab format at all and think it would be better to have more or for that matter less details in the instructions...

Have fun,

Tess

  • PingBack from http://blogspace4you.info/?p=11538

  • Tess -

     Your .shell command needs quotes on XP...

    .shell -ci "~* e !clrstack" FIND /C "Monitor.Enter"

  • Sharepoint Two New Tools for SharePoint Development [Via: axshon ] WPF Visio 2007 to XAML Export Add-in...

  • Link Listing - February 6, 2008

  • Brilliant, Tess!

    I hope more of these labs are coming. They are a super way to get started with WinDbg.

    Thanks,

    Ole

  • Great stuff.

  • Hi all, I realize that you probably haven't all had a chance to look at the Crash lab yet, considering

  • Hi all, I realize that you probably haven't all had a chance to look at the Crash lab yet, considering

  • Hint: MonitorHeld = 1 for each owner and 2 for each waiter.

    A: MonitorHeld here is 85 which means that we have one owner and 84/2=42 waiters

    can you give more words to explain it , i can't understand how the number "one owner" and "84/2=42 waiters"

    thank you!

  • Since the number of monitors held in my case was 85

    Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner

      20 001c6f74           85         1 0f4a0a70  1ea0  30   02f07964 System.Object

    and the owner of the lock holds one monitor, that means that there are 84 monitors left for the waiters (threads waiting for the lock).  84/2 = 42, meaning we have 42 threads waiting for the lock

  • Whenever I refresh the 5 browsers, I get almost the same starttime for each of them (differences are in tenths of seconds) and all of them have an execution time of 5 seconds.

    I checked the registry and the registry settings of the reg file have been imported.

    I tried this at least 20 times and always get the same result.

    When I use tinyget, I get the "desired outcome"; the hang and continue the lab.

    For completeness: I am running the labs on a Virtual Machine: Win 2003 Server, dual CPU, 2GB of memory.

    Why do I get this different behaviour with the 5 browsers?

  • I honestly can't answer that question without looking at the machine, especially since you can repro with tinyget.  Only thing i can think of is if perhaps you are getting a cached response in the browser...  

    if the registry keys weren't set correctly you would have gotten different start times so that is definitely not the case as you mention.

    Sorry that i can't give you a better answer:(

  • TGIF, almost time for the weekend... but before you leave, here is lab 3. Todays debugging puzzle will

  • One of the Best exmaples I have seen yet on Windbg in action and practically explained !!!!

Page 1 of 4 (54 items) 1234
Leave a Comment
  • Please add 3 and 1 and type the answer here:
  • Post