What Is In A RHS Dump File Created By Windows Error Reporting


Hello all, East here.  I want to show you how to use an RHS dump to find which thread was involved in a Windows Server 2008 R2 cluster RHS recovery deadlock.

 

First, note that Windows Server 2008 R2 creates two types of user-mode dumps:

1 - A heap dump (.hdmp): an extended mini-dump that contains additional data such as the process memory.

2 - A mini dump (.mdmp): contains mainly stack information, which is usually not enough to get you much further.

 

When a resource in a Windows Server 2008 R2 cluster does not respond to one of the main cluster queries in a timely fashion, RHS tries to recover that resource by restarting it in an isolated process. It also creates a WER report and a user-mode dump of the deadlock.

 

NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

 

When a deadlock is detected the cluster log will show:

"00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued."

 

You should find the WER report file and dumps located at the default location:

<SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue

 

Here is what a cluster-created Windows Error Report looks like:

Version=1

EventType=WSFC Resource Deadlock

EventTime=129497767591366327

ReportType=1

Consent=1

ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c

Response.type=4

Sig[0].Name=Resource Name

Sig[0].Value=New flexdisk

Sig[1].Name=Resource Type

Sig[1].Value=flexdisk

Sig[2].Name=Call Type

Sig[2].Value=ISALIVE

DynamicSig[1].Name=OS Version

DynamicSig[1].Value=6.1.7601.2.1.0.274.10

DynamicSig[2].Name=Locale ID

DynamicSig[2].Value=1033

File[0].CabName=memory.hdmp

File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file   

File[0].Flags=2097152

File[0].Type=3

File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp

File[1].CabName=minidump.mdmp

File[1].Path=WERC2CE.tmp.mdmp

File[1].Flags=2162690

File[1].Type=2

File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp

FriendlyEventName=WSFC Resource Deadlock

ConsentKey=WSFC Resource Host Monitor

AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS

AppPath=C:\Windows\Cluster\rhs.exe

ReportDescription=WSFC Resource Deadlock
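
The EventTime field in the report above looks like a Windows FILETIME value (a count of 100-nanosecond ticks since 1601-01-01 UTC). That interpretation is an assumption on my part, not something the report states, but decoding it that way gives a timestamp that agrees with the 16:12:39.136 deadlock entries in this article's cluster log. A minimal Python sketch of the conversion (the function name is just for illustration):

```python
from datetime import datetime, timedelta, timezone

# FILETIME epoch: 1601-01-01 00:00:00 UTC, counted in 100-nanosecond ticks.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_utc(ticks: int) -> datetime:
    """Convert a FILETIME tick count to an aware UTC datetime.

    Python datetimes stop at microseconds, so the last digit of the
    tick count (100 ns resolution) is dropped.
    """
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# EventTime value copied from the report shown in this article.
event_time = filetime_to_utc(129497767591366327)
print(event_time.isoformat())  # 2011-05-13T16:12:39.136632+00:00
```

This gives you a quick way to jump from a report file to the right area of the cluster log.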

 

Now you know what to look for in the cluster log, and you have an idea of which resource caused RHS to deadlock and what it was doing at the time:

 

The Resource Name = New flexdisk

The Resource Type = flexdisk

The type of call the resource was engaged in = ISALIVE
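
If you have many reports to go through, the three values above can be pulled out with a few lines of script. The following Python sketch assumes the report is plain "Key=Value" text as shown in this article; the function names are hypothetical, and the sample is a trimmed copy of the report above. On disk the report file may be stored as UTF-16, so you might need to pass an explicit encoding when reading it.

```python
import re

def parse_wer_report(text: str) -> dict:
    """Parse 'Key=Value' lines into a dict, skipping lines without '='."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def signature(fields: dict) -> dict:
    """Pair each Sig[n].Name with its Sig[n].Value."""
    sig = {}
    n = 0
    while f"Sig[{n}].Name" in fields:
        sig[fields[f"Sig[{n}].Name"]] = fields.get(f"Sig[{n}].Value", "")
        n += 1
    return sig

# Trimmed copy of the report shown in this article.
SAMPLE = r"""EventType=WSFC Resource Deadlock
Sig[0].Name=Resource Name
Sig[0].Value=New flexdisk
Sig[1].Name=Resource Type
Sig[1].Value=flexdisk
Sig[2].Name=Call Type
Sig[2].Value=ISALIVE
File[0].Path=WERC1F2.tmp.hdmp
File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp
File[1].Path=WERC2CE.tmp.mdmp
"""

fields = parse_wer_report(SAMPLE)
sig = signature(fields)
# Keep only File[n].Path entries (not File[n].Original.Path) that name a heap dump.
heap_dumps = [
    value for key, value in fields.items()
    if re.fullmatch(r"File\[\d+\]\.Path", key) and value.lower().endswith(".hdmp")
]
print(sig)
print(heap_dumps)
```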

 

You have all this information in the cluster log as well:

00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'. <-- RHS deadlock during an IsAlive check for the New flexdisk resource

00000dd4.00000aac::2011/05/13-16:12:39.136 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.

00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] About to send WER report.

000004dc.00000fb4::2011/05/13-16:12:39.136 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.

000004dc.00000fb4::2011/05/13-16:12:39.136 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.

00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued. <-- WER report and user-mode dump created

000004dc.00000fb4::2011/05/13-16:12:39.559 ERR   [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4

000004dc.00000fb4::2011/05/13-16:12:39.575 INFO  [RCM] Created monitor process 3280 / 0xcd0

00000cd0.00000200::2011/05/13-16:12:39.575 INFO  [RHS] Initializing.

000004dc.00000fb4::2011/05/13-16:12:39.591 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)

000004dc.00000fb4::2011/05/13-16:12:39.591 WARN  [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.
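
Each cluster log line above starts with the process ID and thread ID in hex, and those are the values you will match against the dump later. As a rough sketch, here is how the prefix could be split apart in Python. The line layout in the regular expression is inferred only from the lines in this article, and the names LogEntry and parse_log_line are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout, inferred from the sample lines in this article:
#   <pid hex>.<tid hex>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>\w+)\]\s+(?P<message>.*)$"
)

class LogEntry(NamedTuple):
    pid: int
    tid: int
    timestamp: datetime
    level: str
    component: str
    message: str

def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        pid=int(m["pid"], 16),
        tid=int(m["tid"], 16),
        timestamp=datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S.%f"),
        level=m["level"],
        component=m["component"],
        message=m["message"],
    )

entry = parse_log_line(
    "00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] "
    "RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'."
)
print(f"pid={entry.pid} (0x{entry.pid:x}) tid=0x{entry.tid:x} {entry.level} {entry.component}")
```

The decimal process ID this prints for 0xdd4 is 3540, the same pair that appears in the "Recovering monitor process 3540 / 0xdd4" line.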

 

Now, if you want to get adventurous and review the dump file to see the RHS call stacks and what the resource was doing, open the WER dump with the “.hdmp” extension.  Since we are looking at a process with multiple threads, we need to find two of them: the one showing the WER reporting calls and the one showing the RHS deadlock.

 

After opening the dump in your debugger, run the “~*k” command.  In the example below, thread 3 is the first one we are interested in. You can tell because its process and thread IDs match the cluster log above.  If you don’t have a cluster log, this is the thread with wer.dll on the stack.

0:000> ~*k

 

.  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f

00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74

00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1

00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d

00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

   1  Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f

00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184

00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18

00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

   2  Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa

00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d

00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

   3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

   4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

00000000`00e2f1a8 00000000`00000000 0x326aa0

 

<snipped to save space>
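
With dozens of threads, eyeballing the “~*k” output gets tedious. If you save that output to a text file, something like the following Python sketch can pick out the thread with wer.dll on the stack, or look a thread up by its ID. The regular expressions are based only on the 64-bit output format shown above, the function names are hypothetical, and the sample is a trimmed copy of the listing.

```python
import re

# Thread header, e.g. "   3  Id: dd4.aac Suspend: 0 Teb: ..."
THREAD_RE = re.compile(r"^[.#\s]*(\d+)\s+Id:\s+([0-9a-f]+)\.([0-9a-f]+)\s", re.I)
# Stack frame, e.g. "00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a"
FRAME_RE = re.compile(r"^[0-9a-f]{8}`[0-9a-f]{8}\s+[0-9a-f]{8}`[0-9a-f]{8}\s+(\S+)", re.I)

def parse_stacks(text):
    """Return {thread_number: {"tid": int, "frames": [call sites]}}."""
    threads = {}
    current = None
    for line in text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = {"tid": int(header.group(3), 16), "frames": []}
            threads[int(header.group(1))] = current
            continue
        frame = FRAME_RE.match(line.strip())
        if frame and current is not None:
            current["frames"].append(frame.group(1))
    return threads

def threads_with_module(threads, module):
    """List thread numbers that have at least one frame in the given module."""
    prefix = module.lower() + "!"
    return [n for n, t in threads.items()
            if any(f.lower().startswith(prefix) for f in t["frames"])]

# Trimmed copy of the ~*k listing shown in this article.
SAMPLE = """.  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa
00000000`001cf450 00000000`ff7458a1 rhs!wmain+0x4b1
   3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a
00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324
   4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab
00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9
"""

threads = parse_stacks(SAMPLE)
wer_threads = threads_with_module(threads, "wer")
print(wer_threads, hex(threads[wer_threads[0]]["tid"]))
```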

 

We will switch context to thread 3 with the “~3s” command, then run “kn” to show the stack with frame numbers.  Debug Tip: Since we already know that its thread ID is “aac”, we can also switch directly to that thread by ID with the following syntax:  “~~[aac]s”

0:003> kn

# Child-SP          RetAddr           Call Site

00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa

01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
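
The Debug Tip earlier relies on the thread ID from the cluster log prefix. As a small convenience, here is a Python sketch that turns a "PID.TID" prefix into the matching thread-switch command; the helper name is hypothetical.

```python
def thread_switch_command(log_prefix: str) -> str:
    """Turn a cluster log 'PID.TID' prefix into a '~~[tid]s' debugger command."""
    _pid_hex, tid_hex = log_prefix.split(".")
    tid = int(tid_hex, 16)  # strips leading zeros and rejects non-hex input
    return f"~~[{tid:x}]s"

command = thread_switch_command("00000dd4.00000aac")
print(command)  # ~~[aac]s
```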

 

The frame we care about is frame 0d. You can either switch to that frame's context ( /c ) or just display the registers ( /r ) for that frame:

0:003> .frame /r d

0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0

rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0

rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000

r8=00000000ffffffff  r9=00000000000004ff r10=0000000000000000

r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000

r14=0000000000000000 r15=000000004dcd5877

iopl=0         nv up ei pl zr na po nc

cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

rhs!RhsCall::DeadlockMonitor+0x324:

00000000`ff74b270 448b442430      mov     r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff

 

Here is where it gets tricky to know where to look. I know from experience that the RDI register above points to the information for this resource call. As a check that you are in the right area, the first entry in the dump of RDI will show the CallType.

 

Dump the memory that RDI points to in order to find the ID of the thread that is running the resource call:

0:003> dqs 000000000035f8b0

00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable' <-- CallType

00000000`0035f8b8  00000000`00000004

00000000`0035f8c0  00000000`001e7da0

00000000`0035f8c8  00000000`00000e48 <-- Thread ID of the resource in question

00000000`0035f8d0  00000000`00346920

00000000`0035f8d8  0000000c`00000001

00000000`0035f8e0  00000000`00000003

00000000`0035f8e8  00000000`4dcd5876

00000000`0035f8f0  00000000`048d2dd3

00000000`0035f8f8  00000000`00000100

00000000`0035f900  00000001`00000100

00000000`0035f908  00000000`00000000

00000000`0035f910  00430000`00000031

00000000`0035f918  8a0000a4`52e48988

00000000`0035f920  004c0041`00530049

00000000`0035f928  00450045`00560049
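
If you capture the dqs output as text, the thread ID can be read out of it with a short script. The Python sketch below is only an illustration: the 0x18 offset is simply where the thread ID sits in this one dump, not a documented structure layout, and it may well differ on other builds. The function name is hypothetical, and the author annotations have been left out of the sample.

```python
import re

# One dqs line: "<address>  <qword value> [optional symbol text]"
DQS_RE = re.compile(
    r"^([0-9a-f]{8})`([0-9a-f]{8})\s+([0-9a-f]{8})`([0-9a-f]{8})(?:\s+(.*))?$", re.I
)

def parse_dqs(text):
    """Map each dumped address to (qword value, trailing symbol text)."""
    memory = {}
    for line in text.splitlines():
        m = DQS_RE.match(line.strip())
        if m:  # lines the debugger could not read (shown as ?) are skipped
            address = int(m.group(1) + m.group(2), 16)
            value = int(m.group(3) + m.group(4), 16)
            memory[address] = (value, m.group(5) or "")
    return memory

# First rows of the dqs output shown in this article.
SAMPLE = """00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable'
00000000`0035f8b8  00000000`00000004
00000000`0035f8c0  00000000`001e7da0
00000000`0035f8c8  00000000`00000e48
"""

BASE = 0x35f8b0           # value of RDI in this dump
THREAD_ID_OFFSET = 0x18   # assumption: observed in this dump only

memory = parse_dqs(SAMPLE)
call_type = memory[BASE][1]
thread_id = memory[BASE + THREAD_ID_OFFSET][0]
print(call_type, f"tid={thread_id:x}")
```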

 

If we scan our previous “~*k” output, we find that this thread ID matches thread 4, whose stack is in the problem resource DLL, FlexRes:

   4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

Child-SP          RetAddr           Call Site

00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

00000000`00e2f1a8 00000000`00000000 0x326aa0

 

In this case I was using an in-house test app that simply went to sleep instead of responding to the IsAlive() check, which caused the ISALIVE call to time out.

 

Also, if you check the cluster log about 5 minutes before the RhsCall::DeadlockMonitor error, you will see the same resource and thread ID, showing what was going on prior to the deadlock:

00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.

000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.

00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.

00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.
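
Once you have the thread ID from the dump, you can filter the cluster log for everything that thread logged shortly before the deadlock. Here is a hedged Python sketch of that filter. The function name and the 6-minute window are my own choices for illustration; note that the entries above are slightly more than five minutes before the deadlock, so a strict 5-minute window would miss them.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"
STAMP_LENGTH = len("2011/05/13-16:07:37.270")

def thread_history(lines, pid_tid, deadlock_time, window):
    """Return the lines logged by pid_tid within `window` before deadlock_time."""
    prefix = pid_tid.lower() + "::"
    history = []
    for line in lines:
        line = line.strip()
        if not line.lower().startswith(prefix):
            continue
        stamp = line[len(prefix):len(prefix) + STAMP_LENGTH]
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # prefix matched but the timestamp did not parse
        if deadlock_time - window <= when <= deadlock_time:
            history.append(line)
    return history

# Lines copied from the cluster log excerpts in this article.
LOG = [
    "00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.",
    "000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.",
    "00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.",
    "00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.",
    "00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'.",
]

deadlock = datetime(2011, 5, 13, 16, 12, 39, 136000)
history = thread_history(LOG, "00000dd4.00000e48", deadlock, timedelta(minutes=6))
for line in history:
    print(line)
```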

 

In most cases the cluster log provides enough information to help you narrow down possible cluster resource issues.  If not, the user-mode dump may help you verify additional data, especially if you are debugging a third-party resource DLL that is deadlocking RHS.

Comments
  • Hello, when I run the dqs "rdi value" command, the 3rd and 4th columns of the output show ????? ?????. Any ideas? Thanks!

    [The debugger uses ???????? to indicate that memory is invalid.  Most likely you are dumping the wrong address.  Keep in mind that this article was written in 2011 and compilers will use registers differently when code is rebuilt.]

  • Adding link to new info at blogs.msdn.com/.../10388009.aspx
