I have talked about a number of different hang/performance issues in my posts. This post is a generic debugging walkthrough for hangs. It applies to all types of .NET processes and to some extent also to non-.net processes even though it has some .net specific items.
Define the hang/performance issue
The first step in troubleshooting a performance issue/hang of any kind, whether it is .NET or native is to find out what we are dealing with and what the expected resolution is.
Asking yourself a few standard questions may help you zone in on the problem without even attaching a debugger, and if you need to attach a debugger you will have a lot better understanding of what data you should collect and what you should be looking for.
If you call into support with a hang situation it is also very good to have the answers to these questions handy.
My Standard questions are (in no particular order):
If you have any type of information that seems even remotely related to the problem you should write it down in your log book.
Finally, if you are dealing with a performance issue rather than a full blown hang it might be worthwhile to take a few moments and think about what you consider acceptable performance to be, so you don't spend hours and hours on optimizing after you have reached acceptable levels.
Once you have defined the problem a bit you can start collecting dumps and needless to say, it is very important that you get the dumps while the problem is occurring.
To take memory dumps of a hanging process you can use adplus that comes with the debugging tools for windows by running
adplus -hang -pn <process name.exe>
adplus -hang -pn <process name.exe>
If it is possible, it is good practice to take two consecutive dumps during the same hang so that you can compare the dumps and determine if the process is moving forward at all, but one is enough in most cases.
During the time the dump is taken the process will momentarily stall. You can usually count 1 second per 10MB data that needs to be written, and with a full memory dump the size of the dump will typically be the same as the private bytes shown in perfmon for that process.
If you do take multiple dumps, make sure that the first one is completely written before you start the second dump, and preferably let it go10-20 seconds after the first dump is completely written before you start the second dump.
Now we can start with the fun part of the process, yeah you guessed it right, debugging:)
If you are debugging an ASP.NET app you can read this post to see which threads you can ignore right off the bat.
So far so good... now comes the tricky part, understanding what it is you are seeing...
Low CPU hangs
A low CPU hang is typically caused by one of these:
A lock is a way to synchronize access to a resource and the code will look something like this
//do some work
A thread that is waiting for a lock like the one in the sample above looks like this:
0:003> kb 200
ChildEBP RetAddr Args to Child
0334f9ec 77f7f49f 77e74bd8 00000001 0334fa38 SharedUserData!SystemCallStub+0x4
0334f9f0 77e74bd8 00000001 0334fa38 00000000 ntdll!NtWaitForMultipleObjects+0xc
0334fa8c 79281971 00000001 00179f88 00000001 KERNEL32!WaitForMultipleObjectsEx+0x12c
0334fabc 79282444 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWaitWorker+0xc1
0334fb10 7929a8ed 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWait+0x46
0334fb94 7924884b 00184d88 ffffffff 00000000 mscorwks!AwareLock::EnterEpilog+0x9d
0334fbb0 792d2d1e 00cfd47c 00d109bc 00d109a4 mscorwks!AwareLock::Enter+0x69
If you have any stacks that are waiting in this state you should take a look at this blog post to see how to troubleshoot them further
WaitOne is a function called on a ManualResetEvent or an AutoResetEvent. It is used to wait for the return from an async call. For example when you call a web service, under the covers it will make an async call to the web service and the code that called the web service will sit in a waitone waiting for the web service call to return so that you can process the results.
The call stack for a thread waiting in a waitone looks like this:
61 Id: 1164.a44 Suspend: 1 Teb: 7ff3a000 Unfrozen
ChildEBP RetAddr Args to Child
0d58f430 7c822114 77e6711b 00000001 0d58f480 ntdll!KiFastSystemCallRet
0d58f434 77e6711b 00000001 0d58f480 00000000 ntdll!NtWaitForMultipleObjects+0xc
0d58f4dc 791e0b3b 00000001 0d58f5bc 00000001 kernel32!WaitForMultipleObjectsEx+0x11a
0d58f50c 791e0bdd 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWaitWorker+0xc1
0d58f560 791d5ba9 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWait+0x46
0d58f5a8 024840c8 0d58f5b4 00000000 000186a0 mscorsvr!WaitHandleNative::CorWaitOneNative+0x6f
If you have any threads waiting in this type of call, check out this blog post
A Critical Section is the native equivalent of a lock, and the call stack looks something like this:
ChildEBP RetAddr Args to Child
0210e7cc 7c822124 7c83970f 000009f8 00000000 ntdll!KiFastSystemCallRet
0210e7d0 7c83970f 000009f8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0210e80c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c
0210e82c 4bfa4c16 027c0714 0277a5b4 4bf727af ntdll!RtlEnterCriticalSection+0xa8
To troubleshoot these, look at this blog post
If neither of the above reasons for a low CPU hang applies, look for threads that are making DB calls or calls to other external resources.
High CPU Hangs
A high CPU hang is typically caused by one of these:
High CPU hangs are a little bit harder to troubleshoot since it is often hard to figure out if it is a single thread that is using up all the CPU or if it is a lot of small actions that make up for the high CPU.
If the process is steadily at 100% it is typically an infinite loop.
To determine what threads are using up most of the CPU you can run .time and !runaway to get an idea of how long your process has been up, how much time it has spent on the CPU (user mode time) and how much user mode time each thread has been using to find out what threads are using the most.
For example in this case we have spent a total of 14 seconds in user mode, and out of that, thread 14 has used 4 seconds.
Debug session time: Mon Oct 16 12:20:22.000 2006 (GMT+2)
System Uptime: 3 days 17:11:24.778
Process Uptime: 0 days 0:09:07.000
Kernel time: 0 days 0:00:26.000
User time: 0 days 0:00:14.000
User Mode Time
14:554 0 days 0:00:04.656
25:2ec 0 days 0:00:02.906
27:1264 0 days 0:00:02.359
17:13b4 0 days 0:00:01.625
16:12dc 0 days 0:00:01.375
Check out !runaway in the windbg helpfile and play around with some of its options.
If we are in an infinite loop, the thread with the loop will normally show up on top of !runaway. You can also identify it by running ~* e !clrstack and ~* kb to see if any of the threads seem to be in a looping construct (if you know the code).
High CPU in GC
If the GC threads show up as the top threads in !runaway. Take a look at the perfmon counter .NET CLR Memory/% Time in GC to see if the average is very high. If it is follow this post to determine why.
If all the threads are just spread out, busy with different things you may have some optimization work ahead of you. There is no easy way to debug this type of thing to determine where to optimize. Instead you should pull out some of the most commonly used features and run profilers on them to see how much time / CPU time they take and optimize the pieces that take the longest time.
Phew, That was a long post:) But hopefully it has given you some insight in how to debug hangs and performance issues.
Can you point me to the download link for adplus. I tried to look for it in the debugging tools link that you provided but couldnt find it.
When you install the debugging tools for windows, adplus.vbs is located in the c:\program files\debugging tools for windows directory, or the directory you specify during the installation if you use custom installation.
Nice article, but sometimes hangs can be in a different form too. For example, I have a ASP.NET web application which logs out all the users all of a sudden, no spikes in CPU usage, no entries in eventlogs, cannot be replicated in QA. Can this kind of scenario be called a "hang" and how do we use the debugging tools in this scenario. The mass logouts happen to all the users at the same time.
Appreciate if you can commment on this scenario.
How come I cannot see my post that I posted yesterday?
If you have a web app that logs all users out it is probably a crash or a recycle of the worker process. Alternatively a recycle of the app domain.
The reason you couldn't see the comments is because I have to moderate comments because of some sites sending out spam comments:( I am sorry about this, and I normally publish the comments right away when I see them but yesterday I was delivering a workshop so I didn't log into email.
Thanks for you comments and time. So does IIS not log anything in event/system/security logs during app domain recycles/worker process? as we do not see anything anything in the eventviewer at the time of the logouts. Also if it is a crash, how do we debug this scenario and when is the right time to take a dump?
Additionally we are also investigating if it is because of blocking of some connections in the connection pool. Either case is there a clear and definite way to get answers from a dump.
It is a very interesting problem..
Have you tried checking out this post: http://blogs.msdn.com/tess/archive/2006/08/02/686373.aspx
I read all your articles so far..and yes I did read that article. We are logging all .NET counters related to performance/threads/locks in the db table. We do not see any app restarts/ worker process recycles in our tables. Also when mass logouts are happening, we did observe that they are happening for a particular spid..which is like all the users on the same connection in the connection pool..It is indeed a weird problem/...
I would suggest that you do a crash dump and look at the callstack. It may give you some hints. That is what I usually do when I don't know what happened. The disadvantage is that it would make the app run slower.
Great post! Please keep ‘em coming.
By the way, there is a discrepancy in .time output between dump file and AdPlus log:
When you create a dump you start off with creating a log file (if this is in crash mode this log file will be started when you first attach, and if it is hang mode when you start dumping), at this time .time is run and the time is recorded in the log file.
From there to when the dump is actually gathered some time will elapse, and this is the time diff betwee the log file and the dump.
For crash mode this can be several hours or days (depending on how long you are attached before you crash), and in hang mode it is about 1 second per 10 MB of process memory.
Sorry, I should have clarified it further. Actually, values between kernel time and user time are swapped. The output in AdPlus log is correct, but output from within Windbg while reading the same dump file is incorrect.
Kernel time: 0 days 0:16:43.531
User time: 0 days 0:00:28.531
Kernel time: 0 days 0:00:28.000
User time: 0 days 0:16:43.000
In this example, application really spent almost 17 minutes in privileged mode. Until a fix becomes available, I rely on .time output from AdPlus log.
Interesting... I have noticed the swapped times before too (in some windbg versions) and never knew why it was. (I just assumed a problem in windbg) but it is interesting that they are different in adplus and windbg because they both run the .time command.
At first when I saw your comment I thought it may be different in cdb.exe (which adplus uses) and windbg, but that doesn't make sense since they are essentially the same, and debugging a dump with cdb i can see that there is no difference, so either there is a difference live (which i dont think) or you got the dumps from someone with a different version of windbg/cdb where the times were swapped.
I will send off a note on this to the guys in the debugger team so they are aware. It has been this while for a few builds but I think noone has bothered mentioning it and just swapped the times in their heads.