If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips    by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

.NET Hang: Analyzing Debug Diag output

.NET Hang: Analyzing Debug Diag output

  • Comments 5

Earlier this week I got an email from a reader who had a hang in an application and sent in some Debug Diag logs.

In order for me to dig in deeper I would need to open a support case as I would have to ask follow up questions and look at the dumps etc. However, I wanted to show some interesting details from his report to show you how to move forward if you get an issue like that.

Analyzing the debug diag output

At the top of the report generated by debug diag for the dump, you will see an analysis summary section. 

In this section Debug Diag will report any issues that it found in the dump.  This is based on a set of common issues that debug diag looks for.

Note:  Debug Diag is so not looking for any managed (.net) issues in the standard hang/crash analysis scripts, so what you will see here is mostly if there are any native causes for the hang/crash,  such as locked critical sections, waiting on socket calls etc.

 
Analysis Summary
 
Type Description Recommendation
  Warning The following threads in DumpName.dmp are waiting on data to be returned from another server via WinSock.

The call to WinSock originated from 0x008fa1c3 and is destined for port 50001 at IP address 127.0.0.1


( 18 )

0.65% of threads blocked

Ensure that any remote server this application may be calling is functioning properly and there are no network issues between the two servers. If the problem continues, please contact the application vendor for further assistance

In this particular case we have one thread (thread 18) that is waiting to receive some data on a socket call to localhost (127.0.0.1) on port 50001.  This could be anything from a web service call, to a database call, to a call to any type of service on the local box.    You would have to look at the stack for thread 18 to get more information, or look at services listening on port 50001 in this case.

Drilling down to the data for thread 18 in the report we get

Thread 18 - System ID 10972
Entry point   mscorwks!Thread::intermediateThreadProc
Create time   1/7/2009 11:01:53 PM
Time spent in user mode   0 Days 00:12:01.546
Time spent in kernel mode   0 Days 00:00:01.937

This thread is waiting on data to be returned from another server via WinSock.

The call to WinSock originated from 0x008fa1c3 and is destined for port 50001 at IP address 127.0.0.1

Function   Source
ntdll!KiFastSystemCallRet
ntdll!NtWaitForSingleObject+c
mswsock!SockWaitForSingleObject+19d
mswsock!WSPRecv+203
ws2_32!recv+83
0x008fa1c3
System_ni+1bf523
System_ni+1bf523
System_ni+1bf401

Socket properties:

Source Port 3656
Destination IP 127.0.0.1
Destination Port    50001

From this we don’t get a whole lot of more information since the frames below the actual socket call are listed as System_ni… which means that it is .net code.     To get any more info about this you would have to open the dump in windbg,  load sos.dll (.loadby sos mscorwks), move to thread 18 (~18s) and run !clrstack to get the .net stack and see what is doing the socket call.

Other interesting points in the debug diag output

Apart from the thread making the socket call, there were also some other interesting callstacks in the debug diag output

Over 60 threads were sitting in the following callstack

Thread 59 - System ID 10048
Entry point   mscorwks!Thread::intermediateThreadProc
Create time   1/7/2009 11:55:17 PM
Time spent in user mode   0 Days 00:00:00.015
Time spent in kernel mode   0 Days 00:00:00.00

Function   Source
ntdll!KiFastSystemCallRet
ntdll!NtWaitForMultipleObjects+c
kernel32!WaitForMultipleObjectsEx+11a
mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+6f
mscorwks!Thread::DoAppropriateAptStateWait+3c
mscorwks!Thread::DoAppropriateWaitWorker+13c
mscorwks!Thread::DoAppropriateWait+40
mscorwks!CLREvent::WaitEx+f7
mscorwks!CLREvent::Wait+17
mscorwks!AwareLock::EnterEpilog+8c
mscorwks!AwareLock::Enter+61
mscorwks!AwareLock::Contention+199
mscorwks!JITutil_MonContention+a3
mscorlib_ni+3fc3b1
mscorlib_ni+3fc32b
mscorlib_ni+406d71

Again here we have the mscorlib_ni calls, which means that it there is actually .net code at the bottom of this stack, but since debug diag doesn't resolve .net calls since it is using native debugger APIs they show up as mscorlib_ni or System_ni calls.  

A little further up we see mscorwks!AwareLock::Enter, which means that this thread is waiting for a .net lock, (i.e. a lock(){}).  

So what we can tell from this is that we have a lot of contention in the process because all the threads are waiting for a lock and the thread holding the lock has probably held it for some time.

See Lab1 for more info about troubleshooting .net locks.

Chances are pretty big that they are all waiting for a lock held by thread 18, who is waiting for the socket call, but we really can’t tell for sure without looking at the stacks and the lock itself in windbg. 

In order to figure out what is going on here and stalling the process, the best way to go about it would be to open the dump in windbg and load sos.  Then run ~* e !clrstack to look at the .net stacks for all threads,  specifically for the socket thread and the threads waiting for the lock to understand what lock we are waiting for and why.

Then to find out the owner of the lock we can run !syncblk, and focus on the thread that owns the lock to figure out why we are stuck or what could be done to reduce the contention on the lock.

In summary

When you look at a .net hang or .net crash dump, and analyze it with debug diag, you need to be aware that debug diag currently mostly looks for common native issues, so it is often necessary to make a second pass with windbg to get to the bottom of the issue.

Debug diag will spit out the following info about the thread in the cases where it can’t properly resolve the callstack

“This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required.”

which is the case with .net stacks.  So if you see something like that, that is a clue that either it is a .net stack, and you should be looking at it in windbg, or it is a stack where there are custom components and the symbols for those were not available.

Laters,

Tess

  • PingBack from http://windows7news.com.au/2009/01/23/net-hang-analyzing-debug-diag-output/

  • Tess, you're pretty and wise, thank you for the article.

    Maybe you won't believe, but while I was reading through your articles in my Firefox browser, it entered a hang state and once I had DebugDiag installed on the machine, I generated a manual hang dump and I could absorb the content of your lesson in practice.

    I came across the notification

    "This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required."

    and now I know what it means. Also, I didn't know the limitation of DebugDiag regarding .NET threads, but now yes, thanks to you.

    I think you could post more on WinDbg, that would be highly appreciable.

    Thanks again!

    Marcos

    PSSE Microsoft Brazil (Generalist Support Engineer)

  • Thank you for submitting this cool story - Trackback from DotNetShoutout

  • Hi,

    I just wanted to share that after I downloaded DebugDiag 1.1 from the official MS site, I (as well as others obviously)  got an "overflow" script error in the Analyze step which basically made the report in Internet Explorer look weird, such as that I couldn't find any callstacks.

    Turns out that someone corrected a division-by-zero bug in an .asp file that fixes this issue: http://www.pretentiousname.com/debugdiag_fix/index.html

    Could this fix make it into the official version ?

    One other question: the report still shows weird numbers, such as negative bytes and leaked TBytes (Terabytes?) which I kind of wondering if there are other known calculation issues ?

    Comment: I run Vista SP2 (and Windows Server 2008 SP2).

    BR,

    Kristian Holmlund

    Senior Software Developer

    ABB Automation

  • Kristian, I'll let the devs of debug diag know about this so they can add it when possible

Page 1 of 1 (5 items)
Leave a Comment
  • Please add 8 and 2 and type the answer here:
  • Post