How to Determine Which Resource is Causing the Cluster Resource Monitor to Crash – Possible Deadlock

How to Determine Which Resource is Causing the Cluster Resource Monitor to Crash – Possible Deadlock

  • Comments 1

Hello, my name is John Marlin, and I am a Support Escalation Engineer on the Microsoft Platform Cluster Services Support team.  I wanted to talk about the Windows 2003 Cluster Resource Monitor and with what happens when it crashes. In this blog I’ll show you how to look under the hood to determine why it crashed.

As a foundation for this article we need to understand the basics of Cluster Resource Monitor.  Below is taken from the Microsoft MSDN site describing the Cluster Resource Monitor.

 

A Resource Monitor provides a communication, monitoring, and processing layer between the Cluster service and one or more resources. Resource Monitors have the following characteristics:

·         A Resource Monitor always runs in a process separate from the Cluster service. If a resource fails, the Resource Monitor isolates the Cluster service from the effects. If the Cluster service fails, the Resource Monitor allows its resources to shut down gracefully.

·         To work with a resource, a Resource Monitor loads the resource DLL responsible for that resource type into its process.

·         When the Cluster service requests an operation on a resource, the Resource Monitor routes the request to the appropriate entry point function of the resource DLL responsible for the resource. The Resource Monitor performs default processing for some resource operations.

·         A Resource Monitor stores synchronized state data, allowing the Cluster service and resource DLLs to operate asynchronously, checking and updating resource status as needed.

·         A Resource Monitor periodically checks the operational status of all of its resources. For more information on this process, see Resource Failure.

 

By default, the Cluster service creates one Resource Monitor per node.

 

As the MSDN information states, everything currently running on the node is in the one Resource Monitor.  If the Resource Monitor crashes, the system will dump the Resource Monitor Process to a file called RESRCMON.DMP, and create a new instance of the process.  Because it must create a new one, all resources in the monitor are gone and need to be restarted.  When this occurs, you would see the following entry in the Windows System Event Log. 

 

Event ID:  1146

Source:  ClusSvc

Description:  The cluster resource monitor died unexpectedly, an attempt will be made to restart it

 

After this, you could also see other resource failures (Event ID: 1069) as well as disk related events such as Lost Delayed Writes, etc.  You would see the disk related events because the disk(s) would be considered down and since there is data in the cache of the HBA, it has nowhere to write it.  Hence, lost delayed writes exist until the disk is brought back online.  For our examples here, we will ignore these disk related events as we will focus on why the Resource Monitor crashed.

 

There are a couple reasons why a Resource Monitor would crash such as an Access Violation (0xC0000005) or a Deadlock (0xC0000194).  In a previous blog, we talked about the Access Violation (0xC0000005).  This blog will focus on deadlocks (0xC0000194) where a particular resource DLL was not responding properly.

 

Along with the above System Event (Event ID: 1146) where the Resource Monitor died, you will see 0xc00000494 entries in the Cluster Log file. 

 

NOTE:  The Cluster Log will convert times to Greenwich Mean Time (GMT), so you must ensure you do the proper GMT conversion of time to get to the location in the Cluster Log.

 

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] Exception. Code = 0xc0000194, Address = 0x77E4BEE7

00000a20.00002300::2008/02/06-20:10:25.762 ERR  [RM] Exception parameters: 0, d2fa2000, fffffff4, 7ffda000

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] CallStack:

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] Frame      Address

00000a20.00002300::2008/02/06-20:10:25.762 INFO [RM] GenerateMemoryDump: Start memory dump to file C:\WINNT\Cluster\resrcmon.dmp

 

Now that we see this entry in the log, we should take a look at the Resource Monitor dump to see what caused the failure.  The first thing to examine is the register states, specifically the ESP (stack pointer) value.

 

0:017> r

eax=00970000 ebx=000f0cd8 ecx=00000007 edx=7c8285ec esi=000f0cb0 edi=000f0d08

eip=7c8285ec esp=0227ee10 ebp=0227ee20 iopl=0         nv up ei pl zr na pe nc

cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

ntdll!KiFastSystemCallRet:

7c8285ec c3              ret

 

Starting at the stack pointer address 0227ee10, we use the dds command to dump the raw stack.  We are looking for the value on the stack just below the routine resrcmon!GenerateMemoryDump.  It will take several iterations of the dds command to finally get to the value because the call was made much earlier in the stack.

 

0:017> dds 0227ee10

0227ee10  00700053 xpsp2res+0xc0053

0227ee14  00630061

0227ee18  005f0065

*** pages removed ***

0227f650  028a0c90

0227f654  01ece948

0227f658  01e94518

0227f65c  0227f8ac                                           <<-- this address

0227f660  0100e638 ResrcMon!GenerateMemoryDump+0x180

0227f664  ffffffff

0227f668  00001258

0227f66c  00000018

 

Now that we have our value, we will use the kv= command with the value 0227f8ac to dump out the stack contents.

 

0:017> kv=0227f8ac

 

ChildEBP RetAddr  Args to Child             

0227eea0 7c826d2b 000801a8 0000066c 0227eff4 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0227ef10 7c826c9b 7c833c4e ffffffff 0227ef5c ntdll!ZwClose+0xc (FPO: [1,0,0])

0227efec 77e63f55 ffffffff 02a00000 00000008 ntdll!ZwAllocateVirtualMemory+0xc (FPO: [6,0,0])

0227f8ac 0100e989 0227fb78 01003024 00000000 kernel32!UnmapViewOfFile+0x14 (FPO: [Non-Fpo])

0227f8c4 01008b2c 0227fb78 01003024 0227fb78 ResrcMon!GenerateExceptionReport+0x7e (FPO: [Non-Fpo])

0227f8d8 76348d17 0227fb78 0227fb78 0227f8f8 ResrcMon!RmpExceptionFilter+0x14 (FPO: [Non-Fpo])

0227f8e8 7786d6d2 0227fb78 77ecb7c0 0227fb50 netshell!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])

0227f8f8 77e761b7 0227fb78 00000000 00000000 netman!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])

0227fb50 77e792a3 0227fb78 77e61ac1 0227fb80 kernel32!UnhandledExceptionFilter+0x12a (FPO: [Non-Fpo])

0227fb58 77e61ac1 0227fb80 00000000 0227fb80 kernel32!BaseThreadStart+0x4a (FPO: [SEH])

0227fb80 7c828752 0227ff3c 0227ffdc 0227fc5c kernel32!_except_handler3+0x61 (FPO: [Uses EBP] [3,0,7])

0227fba4 7c828723 0227ff3c 0227ffdc 0227fc5c ntdll!ExecuteHandler2+0x26

0227fc4c 7c82863c 0227e000 0227fc5c 00010007 ntdll!ExecuteHandler+0x24

0227ff2c 77e4bee7 0227ff3c 000a7ca0 c0000194 ntdll!RtlRaiseException+0x3d

0227ff8c 01007ddd c0000194 00000000 00000000 kernel32!RaiseException+0x53 (FPO: [Non-Fpo])

0227ffb8 77e64829 000009ec 00000000 00000000 ResrcMon!RmpTimerThread+0xa8 (FPO: [Non-Fpo])

0227ffec 00000000 01007d35 000a7ca0 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

Based on the stack above, we have an exception at address 0x227fb78 which we will use to set the failing context.

 

0:017> dc 0227fb78

0227fb78  0227ff3c 0227fc5c 0227fba4 7c828752  <.'.\.'...'.R..|    <<-- Exception and Context Records

0227fb88  0227ff3c 0227ffdc 0227fc5c 0227fc40  <.'...'.\.'.@.'.

0227fb98  0227ffdc 7c828766 0227ffdc 0227fc4c  ..'.f..|..'.L.'.

0227fba8  7c828723 0227ff3c 0227ffdc 0227fc5c  #..|<.'...'.\.'.

0227fbb8  0227fc40 77e61a60 00000000 0227ff3c  @.'.`..w....<.'.

0227fbc8  0227ffdc 7c8315c2 0227ff3c 0227ffdc  ..'....|<.'...'.

0227fbd8  0227fc5c 0227fc40 77e61a60 00000102  \.'.@.'.`..w....

0227fbe8  00000000 00000000 4f464e49 000a0d20  ........INFO ...

 

The first DWORD is the Exception Record (0x0227ff3c) and the second is the Context Record (0x0227fc5c) that holds our true stack where the problem occurred.   Based on the .exr output, we see that this is a possible deadlock.

 

0:017> .exr 0227ff3c

ExceptionAddress: 77e4bee7 (kernel32!RaiseException+0x00000053)

   ExceptionCode: c0000194

  ExceptionFlags: 00000000

NumberParameters: 0

 

0:017> !error c0000194

Error code: (NTSTATUS) 0xc0000194 (3221225876) - {EXCEPTION}  Possible deadlock condition.

 

So let’s set the context to get the thread that caused the Resource Monitor to crash.

 

0:017> .cxr 0227fc5c

eax=0227ff3c ebx=00000000 ecx=77e61d43 edx=7c8285ec esi=00000000 edi=00000102

eip=77e4bee7 esp=0227ff38 ebp=0227ff8c iopl=0         nv up ei pl zr na pe nc

cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

kernel32!RaiseException+0x53:

77e4bee7 5e              pop     esi

 

0:017> kv

  *** Stack trace for last set context - .thread/.cxr resets it

ChildEBP RetAddr  Args to Child             

0227ff8c 01007ddd c0000194 00000000 00000000 kernel32!RaiseException+0x53 (FPO: [Non-Fpo])

0227ffb8 77e64829 000009ec 00000000 00000000 ResrcMon!RmpTimerThread+0xa8 (FPO: [Non-Fpo])     <<--

0227ffec 00000000 01007d35 000a7ca0 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

Based on the names of the functions in the stack below, we are in the timer thread.  As a side note, with the 0xC0000194 Resource Dumps, we tend to be either in the Timer Thread itself or the Event List that holds the current resource.  In this instance, we are in the timer thread. 

 

Now let’s look at all the threads using the ~*kb command to find the Event List for this timer thread. I truncated the results showing thread 1 and thread 18.

 

0:001> kv

ChildEBP RetAddr  Args to Child             

0095fc5c 7c827d0b 77e61d1e 00000958 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0095fc60 77e61d1e 00000958 00000000 00000000 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])

0095fcd0 77e61c8d 00000958 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac (FPO: [Non-Fpo])

0095fce4 310079b8 00000958 ffffffff 010119e0 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])

WARNING: Stack unwind information not available. Following frames may be wrong.

0095fd04 31005464 00000958 000ab638 00000000 JohnHungApp!Startup+0x26de

0095fe28 0100a385 00000002 0095fe58 01006548 JohnHungApp!Startup+0x18a

0095fe34 01006548 000ab638 01c868a2 00088038 ResrcMon!Resmon_LooksAlive+0x14 (FPO: [Non-Fpo])

0095fe58 01006728 00000000 00000102 00000000 ResrcMon!RmpPollBucket+0x156 (FPO: [Non-Fpo])

0095fe78 010068c9 0008d908 00000000 00000000 ResrcMon!RmpPollList+0xa3 ß-- My Poll Event List

0095ffb8 77e64829 00000102 00000000 00000000 ResrcMon!RmpPollerThread+0x133 ß-My Timer thread

0095ffec 00000000 01006796 0008d908 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

0:018> kv

ChildEBP RetAddr  Args to Child             

022bfbd8 7c827d0b 77e61d1e 000003bc 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

022bfbdc 77e61d1e 000003bc 00000000 00000000 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])

022bfc4c 77e61c8d 000003bc ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac (FPO: [Non-Fpo])

022bfc60 10032719 000003bc ffffffff 00000000 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])

WARNING: Stack unwind information not available. Following frames may be wrong.

022bfc78 100fdb0e 000003bc ffffffff 100f7c6c JohnHungApp!VxWaitForEvent+0x22

022bfd68 100c66d8 ffffffff 00000001 00000000 JohnHungApp!getAllChildNodes+0x8d172

022bfde4 10012357 0026d838 01ece3f0 022bff44 JohnHungApp!getAllChildNodes+0x89914

022bffb8 77e64829 0095fd14 00000000 00000000 JohnHungApp!Startup+0x24a1 ------ My Timer Thread

022bffec 00000000 3100767d 0095fd14 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

Threads 1 and 18 are of most interest as they show JohnHungApp Resource DLL waiting while the PollerThread performs a LooksAlive check.  So this is a likely cause as it did not respond within the expected 80 minute window, causing the crash.  The thread waits on the resource for 80 minutes.

The next course of action is to review the Poll List to ensure that this is working with JohnHungApp, and to determine type of resource.  So let’s dump the DWORD values of my list (adding +0x114, which is the offset in the data structure that contains the string). Note: We cannot guarantee offsets in future versions of the component.

 

0:017> dc 0008d908+0x114 l1

0008da1c  0008d090                             ....                 <<-- My Resource

 

0:017> dc 0008d090

0008d090  63727352 00000001 0009ec20 000a7ca8  Rsrc.... ....|..

0008d0a0  000a15f0 0008d158 0008d038 000a7070  ....X...8...pp..     <<-- My Resource strings

0008d0b0  00001388 0000ea60 60180000 000a7b70  ....`......`p{..

0008d0c0  00000000 00000000 00000000 00000000  ................

0008d0d0  00000000 00000001 601c1d8d 601c3633  ...........`36.`

0008d0e0  601c34cf 601c3549 601c35b5 601c21c5  .4.`I5.`.5.`.!.`

0008d0f0  601c20e1 00000000 00000000 601c3712  . .`.........7.`

0008d100  601c222f 00000003 00000006 0000000c  /".`............

 

0:017> du 0x000a7070                         <<-- Resource Displayed in Cluster

000a7070  "Johns Hung Resource"

 

0:017> du 0x0008d038                         <<-- GUID in registry (HKLM\Cluster\Resources)

0008d038  "0502cab5-3e1f-47d4-b490-e5301be7"

0008d078  "2928"

 

0:017> du 0x0008d158                         <<-- Resource Type

0008d158  "Johns Hung App"

 

0:017> du 0x000a15f0                         <<-- Specific DLL being Used

000a15f0  "johnhungapp.dll"

 

So this confirms that Johnhungapp.dll is the resource with the problem.  At this point we know the name and GUID of the resource.  Remember, we put an 80 minute time out check on the resource.   Sometimes you can get a little more information from the Cluster Log to possible narrow down the root cause.  So we are back to the original entry of where the Resource Monitor dumped.

 

 

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] Exception. Code = 0xc0000194, Address = 0x77E4BEE7

00000a20.00002300::2008/02/06-20:10:25.762 ERR  [RM] Exception parameters: 0, d2fa2000, fffffff4, 7ffda000

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] Exception parameters: 0, d2fa2000, fffffff4, 7ffda000

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] CallStack:

00000a20.00002238::2008/02/06-20:10:25.762 ERR  [RM] Frame      Address

00000a20.00002300::2008/02/06-20:10:25.762 INFO [RM] GenerateMemoryDump: Start memory dump to file C:\WINNT\Cluster\resrcmon.dmp

 

If we traverse back 80 minutes, we see the following information in the log.

 

00001d38.00000568::2008/02/06-18:50:10.179 WARN [FM] FmpHandleResourceTransition: Resource Name = 0502cab5-3e1f-47d4-b490-e5301be72928 [Johns Hung Resource] old state=2 new state=4

00001d38.00000568::2008/02/06-18:50:10.179 INFO [FM] FmpPropagateResourceState: resource 0502cab5-3e1f-47d4-b490-e5301be72928 failed event.

00001d38.00000568::2008/02/06-18:50:10.179 INFO [FM] FmpHandleResourceFailure: taking resource 0502cab5-3e1f-47d4-b490-e5301be72928 and dependents offline

00001d38.00000568::2008/02/06-18:50:10.179 INFO [FM] TerminateResource: 0502cab5-3e1f-47d4-b490-e5301be72928 depends on dfc6b244-888c-43b4-ad3e-f1d26853c9a4. Terminating first

00001d38.00000568::2008/02/06-18:50:10.622 INFO [FM] RestartResourceTree, Restart resource 0502cab5-3e1f-47d4-b490-e5301be72928

00001d38.00000568::2008/02/06-18:50:10.959 INFO [FM] FmpPropagateResourceState: resource 0502cab5-3e1f-47d4-b490-e5301be72928 online event.

00001d38.00000568::2008/02/06-18:50:10.959 INFO [FM] FmpOnlineWaitingTree, Start resource 0502cab5-3e1f-47d4-b490-e5301be72928

00001d38.00000568::2008/02/06-18:50:10.959 INFO [FM] FmpRmOnlineResource: bringing resource 0502cab5-3e1f-47d4-b490-e5301be72928 (resid 1198584) online.

00001d38.00000568::2008/02/06-18:50:10.959 INFO [CP] CppResourceNotify for resource Johns Hung Resource

00001d38.00000568::2008/02/06-18:50:10.959 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources for resource 0502cab5-3e1f-47d4-b490-e5301be72928

 

And nothing else other than Cluster issuing a reprieve (wait) every three minutes (default timeout value under the resource) until we dumped the resource monitor.  So now we know the resource actually failed and was attempting to come back online.  Because it was not coming back online, it caused the Resource Monitor to crash.  Since this is a third party resource, you would need to engage the programmer or vendor of the resource to see why it had the issue it did.

 

 

Some Additional Information Regarding the Exception Address

In some cases the steps to determine the exception record may not be necessary. The debugger sometimes provides the exception record information when the dump is first opened. Additionally it could show the stack that we found above by simply entering the .ecxr command...

 

0:017> .ecxr

eax=0227ff3c ebx=00000000 ecx=77e61d43 edx=7c8285ec esi=00000000 edi=00000102

eip=77e4bee7 esp=0227ff38 ebp=0227ff8c iopl=0         nv up ei pl zr na pe nc

cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

kernel32!RaiseException+0x53:

77e4bee7 5e              pop     esi

 

0:017> kb

  *** Stack trace for last set context - .thread/.cxr resets it

ChildEBP RetAddr  Args to Child             

0227ff8c 01007ddd c0000194 00000000 00000000 kernel32!RaiseException+0x53

0227ffb8 77e64829 000009ec 00000000 00000000 ResrcMon!RmpTimerThread+0xa8

0227ffec 00000000 01007d35 000a7ca0 00000000 kernel32!BaseThreadStart+0x34

 

You could also get to the same information using the above original steps (dds 0227ff38) but stopping at the resrcmon!RmpExceptionFilter (Resource Monitor handles the exception) which has the exception as the first parameter.

 

0:017> dds 0227ee10

0227ee10  00700053 xpsp2res+0xc0053

0227ee14  00630061

0227ee18  005f0065

0227ee1c  00610043

0227ee20  00610074

*** pages removed ***

0227f650  028a0c90

0227f654  01ece948

0227f658  01e94518

0227f65c  0227f8ac                                        <<-- pointer to Exception address stack

0227f660  0100e638 ResrcMon!GenerateMemoryDump+0x180

0227f664  ffffffff

0227f668  00001258

0227f66c  00000018

*** pages removed ***

0227f874  0100c1fe ResrcMon!except_handler3

0227f878  01005528 ResrcMon!`string'+0xc

0227f87c  ffffffff

0227f880  0100d27b ResrcMon!ClRtlLogPrint+0x499

0227f884  0100e96c ResrcMon!GenerateExceptionReport+0x61

0227f888  00000001

*** pages removed ***

0227f8b8  01003024 ResrcMon!`string'

0227f8bc  00000000

0227f8c0  7786d687 netman!__CxxUnhandledExceptionFilter

0227f8c4  0227f8d8

0227f8c8  01008b2c ResrcMon!RmpExceptionFilter+0x14       <<-- Frame 4 in kv= 0227f8ac above

0227f8cc  0227fb78                                        <<-- Frame 3 in kv= 0227f8ac above

0227f8d0  01003024 ResrcMon!`string'

Leave a Comment
  • Please add 7 and 8 and type the answer here:
  • Post
  • i managing Exchange on MSCS windows 2003 64 bit with SP2...

    last night ID:1146 happened but

    there isn't any dmp file.

    it there any set up option to leave the dump file on c:windows\cluster\

     

     

    [The dump file creation is hard coded to write to only the Cluster folder, so this is something that cannot change. You would need to compare the System Event Log time and the time in the C:\WINDOWS\CLUSTER\CLUSTER.LOG file. Remember that the Cluster Log file is GMT time so you would need to do the conversion to get the time frame proper. You should see something like:
     
    00000a20.00002238::2008/02/06-20:10:25.762 ERR [RM] Exception. Code=0xc0000194, Address = 0x77E4BEE7 <------------------------line of interest
    .....
    .....
    00000a20.00002300::2008/02/06-20:10:25.762 INFO [RM]GenerateMemoryDump: Start memory dump to file
     
    C:\WINNT\Cluster\resrcmon.dmp The first thing to look for is what is the exception code and if the GenerateMemoryDump() call is made and if it says it is creating the dump file. If it is made and there is no dump file created, this could be an indication of other issues with the machine that could be related to RPC or other user mode processes. ]
Page 1 of 1 (1 items)