If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips    by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

.NET Hang Debugging Walkthrough

.NET Hang Debugging Walkthrough

Rate This
  • Comments 35

 

I have talked about a number of different hang/performance issues in my posts.  This post is a generic debugging walkthrough for hangs.  It applies to all types of .NET processes and to some extent also to non-.net processes even though it has some .net specific items.

Define the hang/performance issue

The first step in troubleshooting a performance issue/hang of any kind, whether it is .NET or native is to find out what we are dealing with and what the expected resolution is.

Asking yourself a few standard questions may help you zone in on the problem without even attaching a debugger, and if you need to attach a debugger you will have a lot better understanding of what data you should collect and what you should be looking for.

If you call into support with a hang situation it is also very good to have the answers to these questions handy.

My Standard questions are (in no particular order):

  • What is the user experience? For example. When they browse to your asp.net application do they get timeouts? Do the calls ever return? If they return, do they always seem to return after around 45 seconds? -- This helps you get a clue about if you have a dead-lock or if you are waiting for a resource that is timing out etc.
  • What is the CPU usage? Low CPU problems usually indicate that we are waiting for a lock or an external resource (web service, database etc), while a high CPU usage indicates that we have an infinite loop, high CPU in garbage collection or that the server simply can't handle the load.
    If it is a high CPU issue, it is also interesting to note if we are constantly at 100% CPU in the affected process, or if we are running high but not necessarily 100%. And also, if any other processes on the box run at high CPU.
  • Can you reproduce the problem? If you can reproduce the problem reliably, make an effort to try to reproduce it on a test server so you don't have to put any additional pressure on the production server. In addition, if you can repro you are half-way home because we have a much smaller subset of the application to work with.
  • Is there a temporary resolution? For example. Does the problem go away if you do an IISReset? This may give an indication of where the problem lies.
  • What external processes are involved? Does the application call web services, do database access, call into a remoting server , access a main frame computer etc. And if so, what do we know about these processes? Do they seem to be blocking, have they stopped receiving requests etc.
  • What modules are affected? For example in an ASP.NET application, do all pages hang or only a few (if so, what is special about the few?). Do html/asp pages work? etc.
  • Do we have any interesting eventlog entries during the hang? Anything can be interesting, network entries, eventlog entries from other applications etc. There might be a clue out there and our job is to play Sherlock.
  • What else do we know about the hang? Things that are typically interesting is "It only happens during load", "Seems to happen more frequently on Mondays around 9:45 am", "When the process hangs memory also seems to increase rapidly".

If you have any type of information that seems even remotely related to the problem you should write it down in your log book.

Finally, if you are dealing with a performance issue rather than a full blown hang it might be worthwhile to take a few moments and think about what you consider acceptable performance to be, so you don't spend hours and hours on optimizing after you have reached acceptable levels.

Gathering dumps:

Once you have defined the problem a bit you can start collecting dumps and needless to say, it is very important that you get the dumps while the problem is occurring.

To take memory dumps of a hanging process you can use adplus that comes with the debugging tools for windows by running

adplus -hang -pn <process name.exe>

If it is possible, it is good practice to take two consecutive dumps during the same hang so that you can compare the dumps and determine if the process is moving forward at all, but one is enough in most cases.

During the time the dump is taken the process will momentarily stall. You can usually count 1 second per 10MB data that needs to be written, and with a full memory dump the size of the dump will typically be the same as the private bytes shown in perfmon for that process.

If you do take multiple dumps, make sure that the first one is completely written before you start the second dump, and preferably let it go10-20 seconds after the first dump is completely written before you start the second dump.

Debugging:

Now we can start with the fun part of the process, yeah you guessed it right, debugging:)

  1. Open the memory dump in windbg File/Open Crash Dump
  2. Set up your symbol path

    .symfix c:\mysymcache
    .reload

    To learn more about why symbols are important read this: http://blogs.msdn.com/tess/archive/2005/12/05/500068.aspx
  3. Load sos.dll

    .load clr10\sos (for 1.1) or
    .load <framework directory>\sos.dll (for 2.0)
  4. Check out all the .NET stacks and the native stacks

    ~* kb 2000 native stacks
    ~* e !clrstack .net stacks

If you are debugging an ASP.NET app you can read this post to see which threads you can ignore right off the bat.

So far so good... now comes the tricky part, understanding what it is you are seeing...

Low CPU hangs

A low CPU hang is typically caused by one of these:

  • Lock
  • WaitOne
  • Critical Section
  • Waiting for external resources

Lock

A lock is a way to synchronize access to a resource and the code will look something like this

lock(_myLock){
    //do some work
}

 

A thread that is waiting for a lock like the one in the sample above looks like this:

0:003> kb 200 
ChildEBP RetAddr Args to Child
0334f9ec 77f7f49f 77e74bd8 00000001 0334fa38 SharedUserData!SystemCallStub+0x4 
0334f9f0 77e74bd8 00000001 0334fa38 00000000 ntdll!NtWaitForMultipleObjects+0xc 
0334fa8c 79281971 00000001 00179f88 00000001 KERNEL32!WaitForMultipleObjectsEx+0x12c 
0334fabc 79282444 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWaitWorker+0xc1 
0334fb10 7929a8ed 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWait+0x46 
0334fb94 7924884b 00184d88 ffffffff 00000000 mscorwks!AwareLock::EnterEpilog+0x9d 
0334fbb0 792d2d1e 00cfd47c 00d109bc 00d109a4 mscorwks!AwareLock::Enter+0x69

If you have any stacks that are waiting in this state you should take a look at this blog post to see how to troubleshoot them further

WaitOne

WaitOne is a function called on a ManualResetEvent or an AutoResetEvent. It is used to wait for the return from an async call. For example when you call a web service, under the covers it will make an async call to the web service and the code that called the web service will sit in a waitone waiting for the web service call to return so that you can process the results.

The call stack for a thread waiting in a waitone looks like this:

61 Id: 1164.a44 Suspend: 1 Teb: 7ff3a000 Unfrozen
ChildEBP RetAddr Args to Child
0d58f430 7c822114 77e6711b 00000001 0d58f480 ntdll!KiFastSystemCallRet
0d58f434 77e6711b 00000001 0d58f480 00000000 ntdll!NtWaitForMultipleObjects+0xc
0d58f4dc 791e0b3b 00000001 0d58f5bc 00000001 kernel32!WaitForMultipleObjectsEx+0x11a
0d58f50c 791e0bdd 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWaitWorker+0xc1
0d58f560 791d5ba9 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWait+0x46
0d58f5a8 024840c8 0d58f5b4 00000000 000186a0 mscorsvr!WaitHandleNative::CorWaitOneNative+0x6f

If you have any threads waiting in this type of call, check out this blog post

Critical Section

A Critical Section is the native equivalent of a lock, and the call stack looks something like this:

0:021> kb
ChildEBP RetAddr  Args to Child              
0210e7cc 7c822124 7c83970f 000009f8 00000000 ntdll!KiFastSystemCallRet
0210e7d0 7c83970f 000009f8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0210e80c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c
0210e82c 4bfa4c16 027c0714 0277a5b4 4bf727af ntdll!RtlEnterCriticalSection+0xa8

To troubleshoot these, look at this blog post

External Resource

If neither of the above reasons for a low CPU hang applies, look for threads that are making DB calls or calls to other external resources.

High CPU Hangs

A high CPU hang is typically caused by one of these:

  • Infinite loop
  • High CPU in GC
  • Busy Server

High CPU hangs are a little bit harder to troubleshoot since it is often hard to figure out if it is a single thread that is using up all the CPU or if it is a lot of small actions that make up for the high CPU.

If the process is steadily at 100% it is typically an infinite loop.

To determine what threads are using up most of the CPU you can run .time and !runaway to get an idea of how long your process has been up, how much time it has spent on the CPU (user mode time) and how much user mode time each thread has been using to find out what threads are using the most.

For example in this case we have spent a total of 14 seconds in user mode, and out of that, thread 14 has used 4 seconds.

0:000> .time
Debug session time: Mon Oct 16 12:20:22.000 2006 (GMT+2)
System Uptime: 3 days 17:11:24.778
Process Uptime: 0 days 0:09:07.000
Kernel time: 0 days 0:00:26.000
User time: 0 days 0:00:14.000 


0:000> !runaway
User Mode Time
Thread Time
14:554 0 days 0:00:04.656
25:2ec 0 days 0:00:02.906
27:1264 0 days 0:00:02.359
17:13b4 0 days 0:00:01.625
16:12dc 0 days 0:00:01.375
... 

Check out !runaway in the windbg helpfile and play around with some of its options.

Infinite loop

If we are in an infinite loop, the thread with the loop will normally show up on top of !runaway. You can also identify it by running ~* e !clrstack and ~* kb to see if any of the threads seem to be in a looping construct (if you know the code).

High CPU in GC

If the GC threads show up as the top threads in !runaway. Take a look at the perfmon counter .NET CLR Memory/% Time in GC to see if the average is very high. If it is follow this post to determine why.

Busy Server

If all the threads are just spread out, busy with different things you may have some optimization work ahead of you. There is no easy way to debug this type of thing to determine where to optimize. Instead you should pull out some of the most commonly used features and run profilers on them to see how much time / CPU time they take and optimize the pieces that take the longest time.

Phew, That was a long post:) But hopefully it has given you some insight in how to debug hangs and performance issues.





  • merry x-mas to you too:)

  • Tess, great blogg, I have learnt so much, take a look at these !runaway results, would do you make of them:

    0:000> !runaway

    User Mode Time

     Thread       Time

     25:cd0       0 days 3:29:28.093

     10:d48       0 days 2:35:30.109

     29:1014      0 days 2:26:05.265

     26:334       0 days 1:33:31.906

     21:a54       0 days 1:27:28.906

     20:14e4      0 days 0:08:33.265

     13:fe8       0 days 0:06:55.796

     12:118c      0 days 0:05:40.515

     15:a28       0 days 0:05:23.875

     14:113c      0 days 0:05:17.796

     33:1344      0 days 0:01:31.093

     37:16b0      0 days 0:01:29.125

     34:384       0 days 0:01:28.750

     36:4e8       0 days 0:01:27.062

     35:874       0 days 0:01:21.281

     38:f9c       0 days 0:01:20.343

     40:15a8      0 days 0:01:02.734

     43:958       0 days 0:01:01.343

     42:122c      0 days 0:00:57.843

     45:f90       0 days 0:00:53.125

     44:c9c       0 days 0:00:52.984

     46:1048      0 days 0:00:31.468

    After looking at !clrstack for thread 25:

    0:025> !clrstack

    OS Thread Id: 0xcd0 (25)

    ESP       EIP    

    00d0e614 7c8285ec [HijackFrame: 00d0e614]

    00d0e654 7940e095 System.Globalization.CompareInfo.IndexOf(System.String, System.String)

    00d0e660 793a6261 System.String.IndexOf(System.String)

    00d0e66c 1cae730e ePosV2_Web.PrintingControl.set_dealer(ePosCore.Business.Dealer.Dealer)

    00d0e6f8 1b934f3e ePosV2_Web.BrandingControl.buttonPressed(toolbarButton)

    00d0e89c 1cae71aa ePosV2_Web.BrandingControl.imgPrint_Click(System.Object, System.EventArgs)

    00d0e8ac 6881bdc6 System.Web.UI.WebControls.Button.OnClick(System.EventArgs)

    00d0e8c0 6881bfbc System.Web.UI.WebControls.Button.RaisePostBackEvent(System.String)

    I know this contains a loop, would this be an infinite loop in my app as the performance is terrible?

  • Hi Steve,

    IndexOf works a bit like a regular expression so if you have a very long string you may get a lot of backtracks and very high CPU on the thread.    The loop is not infinite but for very large strings it may take several minutes at 100% CPU so if possible you should do what you can to minimize the length of the string you're doing indexof on

  • Hi there Tess. I'm trying to find out the reason for some bad CPU spikes.

    By debugging the dump and getting the crl stack for all threads I identify some processes that may be involved in the problem, they are image processing methods and the spiking "would make sense". Is it dangerous to assume that the problem could be in some of the identifiable methods of the dump taken when the cpu hit 100% ? Or the stack could be somehow unrelated ?

    Also, I wonder what it means to see over 20+ threadpool workers waiting for recycling when I call !threads in the troublesome dump

    E.g: XXXX   30       1be4d390   1019820 Enabled  00000000:00000000 01fd0cc8     0 Ukn (Threadpool Worker)

    Thanks for your help :: )

    Cordially, Agustin.

  • I have a C# app in .NET 4 running as a Console app started by SQL agent every minute.

    It works like a charm until of a suddent, it hangs in the Task Manager with CPU to 0%

    I tried the adplus but all it does is generating a Dump file in which there's nothing informative in there.

    By the way all the application does is reading SQL, if it finds something to work with, it starts the SAP transaction then it prints a document and it updates the table to say ok I did it and I'm finished

    This job is perform a lot of time and works like a charm.  I can tell you that this process is done a lot.  Now why 2 or 3 times a day, it hangs there which makes the SQL agent not capable of starting it again of course.

    I don't have a lot of idea on how to debug this.

    Any thought ?

Page 3 of 3 (35 items) 123
Leave a Comment
  • Please add 4 and 7 and type the answer here:
  • Post