If Cluster is Running Fine, Don’t Dump Cluadmin on a Hang


Hello, my name is East, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team.  Here are some tips for those who want to review dumps from Cluster Administrator (Cluadmin.exe), the GUI counterpart of cluster.exe for managing your cluster environment. Historically there were more Cluadmin hangs in Windows 2000 Cluster server than in Windows 2003 Cluster server, so the volume of these dumps submitted to my group has dropped significantly. People have submitted complete memory dumps thinking the entire Cluster server was hung when it turned out to be only a Cluadmin hang. It’s important to determine what is really hung when troubleshooting these types of problems.

 

So what causes Cluadmin to hang? In most cases a resource is still working to come online, or a remote connection is misbehaving, and Cluadmin hangs waiting on it.  Cluadmin dumps are less helpful for revealing why a resource is misbehaving in your Cluster environment than dumps of the cluster service and resrcmon. In most cases the cluster log can help you narrow down the problematic resource that’s holding Cluadmin up. Later I will show how to use the debugger to find the resource that may have caused a Cluster Administrator hang.

 

To help narrow down the hang you should start by testing that the Cluster environment is healthy.

 

Let’s say at the time of the hang you were using Cluadmin to view properties of certain resources. At this point it would be useful to examine the status of the cluster by using the Cluster.exe tool. Here are some examples.

 

Using the cluster.exe command you can check the Groups status with the following command:

 

C:\>cluster group

Listing status for all available resource groups:

 

Group                    Node            Status

-------------------- --------------- --------------

Test1                   NODE1            Online

Cluster Group           NODE1            Online

 

You can check the Node status using this command:

 

C:\>cluster node

Listing status for all available nodes:

 

Node           Node ID          Status

-------------- ------- ---------------------

NODE1            1                 Up

NODE2            2                 Down

 

And you can check the Resource status with the resource command:

 

C:\>cluster resource

Listing status for all available resources:

 

Resource                    Group              Node       Status

-------------------- -------------------- --------------- ------

Disk Q:                 Cluster Group         NODE1       Online

tst1                    Test1                 NODE1       Online

Cluster IP Address      Cluster Group         NODE1       Online

Cluster Name            Cluster Group         NODE1       Online

MSDTC                   Cluster Group         NODE1       Online
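If you capture these listings to a text file, a short script can flag anything that is not healthy. The following is a minimal Python sketch, not part of cluster.exe. It assumes the layout shown above (a dashed rule under the header, columns separated by two or more spaces) and treats only Online and Up as healthy. The function names are made up for illustration.

```python
import re

# Statuses treated as healthy; this set is an assumption for the sketch.
HEALTHY = {"online", "up"}

def parse_status_table(text):
    """Return the data rows of a cluster.exe listing as lists of columns."""
    rows = []
    seen_rule = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The dashed rule separates the header from the data rows.
        if set(line) <= {"-", " "}:
            seen_rule = True
            continue
        if seen_rule:
            # Columns are separated by runs of two or more spaces, so
            # names such as "Cluster IP Address" stay in one piece.
            rows.append(re.split(r"\s{2,}", line))
    return rows

def unhealthy(rows):
    """Return the rows whose last column (Status) is not healthy."""
    return [row for row in rows if row[-1].lower() not in HEALTHY]

SAMPLE = """\
Listing status for all available nodes:

Node           Node ID          Status
-------------- ------- ---------------------
NODE1            1                 Up
NODE2            2                 Down
"""

if __name__ == "__main__":
    for row in unhealthy(parse_status_table(SAMPLE)):
        print(row)
```

The same two functions work on the group and resource listings because the Status column is always last.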

 

 

You can get more granular by using the /priv switch to list out the private properties like this:

 

C:\>cluster resource /priv

Listing private properties for all resources:

                <output not shown to save space>

 

Since hangs can occur during startup of Cluadmin I recommend the following article which discusses different ways to start Cluadmin.

 

280125  Cluster Administrator Switches for Connecting to a Cluster

http://support.microsoft.com/default.aspx?scid=kb;EN-US;280125

 

Now let’s take a look at a Cluadmin dump to illustrate what you would see during a hang. Most dumps we gather from customers are taken while Cluadmin is starting, so they contain limited information.  Here’s an example that shows the resource Cluadmin was working with when the process hung.  The thread below is making an RPC call to the local cluster service, and that call can be what leaves Cluadmin in the hung state:

 

0:000> kb

ChildEBP RetAddr  Args to Child             

0006eddc 7c59a0a2 00000154 00000001 0006edfc NTDLL!ZwWaitForSingleObject+0xb

0006ee04 77d7f41d 00000154 000007d0 00000001 KERNEL32!WaitForSingleObjectEx+0x71

0006ee28 77d5fadf 00000000 0006ee58 0006ee50 rpcrt4!DG_ReceivePacket+0xd3

0006ee68 77d5f998 0006f0c8 0006f07c 0006ee88 rpcrt4!DG_CCALL::ReceiveSinglePacket+0x5e

0006ee7c 77d3b96c 0006f07c 0006f274 77d6a686 rpcrt4!DG_CCALL::SendReceive+0xd6

0006ee88 77d6a686 0006f07c 00000009 7393270e rpcrt4!I_RpcSendReceive+0x2c

0006ee9c 77d93b64 0006f0c8 000a425c 0006eeac rpcrt4!NdrSendReceive+0x31

0006f274 73939b6d 73931778 739326c0 0006f28c rpcrt4!NdrClientCall2+0x512

0006f284 73934bd5 00091ed8 01000059 00000000 CLUSAPI!ApiResourceControl+0x14

0006f2c4 0102705f 000921e8 00000000 01000059 CLUSAPI!ClusterResourceControl+0xd0

0006f2f8 01027b00 000921e8 01000059 00000000 CLUADMIN!CClusPropList::ScGetResourceProperties+0x3e

0006f374 010129ae 00c988a0 00000080 00239770 CLUADMIN!CResource::ReadItem+0xb9

0006f3b4 0101141d 002397c8 00239770 00000001 CLUADMIN!CClusterDoc::InitResources+0x9a

0006f3ec 01010f45 00000001 00239770 00000000 CLUADMIN!CClusterDoc::CollectClusterItems+0xd5

0006f410 01010e11 002352d0 00234960 00239770 CLUADMIN!CClusterDoc::OnOpenDocumentWorker+0x80

0006f444 76fbaea3 002352d0 00234960 01044c28 CLUADMIN!CClusterDoc::OnOpenDocument+0xa2

0006f46c 0101039f 002352d0 00000001 01044c28 mfc42u!CMultiDocTemplate::OpenDocumentFile+0xb5

0006fc88 0100fff6 002352d0 002373a0 00000000 CLUADMIN!CClusterAdminApp::OpenDocumentFile+0xbf

0006fcc0 01021c9e 00000001 00000000 76fb2089 CLUADMIN!CClusterAdminApp::OnRestoreDesktop+0xc4

0006fccc 76fb2089 00000001 00000000 002373a0 CLUADMIN!CMainFrame::OnRestoreDesktop+0x17

 

The first parameter passed to CLUSAPI!ClusterResourceControl in this Windows 2000 dump reveals the resource in question.
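If you save the kb output to a text file, the value can be pulled out mechanically. This is a rough Python sketch, assuming the 32-bit kb layout shown above (ChildEBP, RetAddr, three "Args to Child" values, then the symbol). The helper name first_arg is hypothetical. Keep in mind that kb simply shows the first three DWORDs at each frame, so on optimized frames they are not guaranteed to be the real parameters.

```python
def first_arg(kb_output, symbol):
    """Return the first 'Args to Child' value for the frame named symbol."""
    for line in kb_output.splitlines():
        fields = line.split()
        # Expected fields: ChildEBP, RetAddr, arg1, arg2, arg3, symbol+offset
        if len(fields) < 6:
            continue
        name = fields[5].split("+")[0]  # drop the +0x... offset if present
        if name.lower() == symbol.lower():
            return int(fields[2], 16)
    return None

# Three frames copied from the Windows 2000 stack above.
SAMPLE = """\
0006f284 73934bd5 00091ed8 01000059 00000000 CLUSAPI!ApiResourceControl+0x14
0006f2c4 0102705f 000921e8 00000000 01000059 CLUSAPI!ClusterResourceControl+0xd0
0006f2f8 01027b00 000921e8 01000059 00000000 CLUADMIN!CClusPropList::ScGetResourceProperties+0x3e
"""

if __name__ == "__main__":
    arg = first_arg(SAMPLE, "CLUSAPI!ClusterResourceControl")
    print(f"{arg:08x}")
```

The matching is case-insensitive because the module name appears as CLUSAPI in one stack and could be cased differently in another.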

 

Here I’m dumping the raw data at offset 0x28 from that parameter, which yields the name of the resource, Joseph.

 

0:000> dc 000921e8+28

00092210  006F004A 00650073 00680070 00000000  J.o.s.e.p.h.....

00092220  00030007 000c0100 00091d40 00092780  ........@....'..

00092230  000923e8 000923e8 00092238 00092238  .#...#..8"..8"..

00092240  00000000 000805e8 00000000 00000000  ................

00092250  00020019 00000000 00070005 00080100  ................

00092260  00000000 00092378 00092b48 000925f0  ....x#..H+...%..

00092270  00000000 00000000 d3cc9553 00000000  ........S.......

00092280  00050005 00080100 000922b0 000922d8  ........."..."..

 

0:000> du 000921e8+28

00092210  "Joseph"
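To see why the DWORDs in the dc output spell the name, remember that the string is stored as little-endian UTF-16, so 006F004A is the bytes 4A 00 6F 00, which is "J" followed by "o". The Python sketch below walks through that decoding for one dc line. It assumes the four-DWORD-per-line layout shown above, and the function name is made up for illustration.

```python
import struct

def decode_dc_line(line):
    """Decode the four DWORDs of one `dc` line as UTF-16LE text.

    Returns the text up to the first NUL terminator.
    """
    fields = line.split()
    # fields[0] is the address; the next four fields are the DWORD values.
    dwords = [int(value, 16) for value in fields[1:5]]
    # Repack the DWORDs in little-endian order to recover the bytes in memory.
    raw = struct.pack("<4I", *dwords)
    text = raw.decode("utf-16-le", errors="replace")
    return text.split("\x00", 1)[0]

if __name__ == "__main__":
    line = "00092210  006F004A 00650073 00680070 00000000  J.o.s.e.p.h....."
    print(decode_dc_line(line))
```

This is exactly what the du command does for you, which is why du is the quicker way to read the name once you know the offset.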

 

 

On a Windows 2003 stack I again examined the first parameter to CLUSAPI!ClusterResourceControl...

 

lpBytesReturned = 0x0006f2a8

 

ChildEBP RetAddr  Args to Child             

0006f278 0102bb53 000a03c8 00000000 01000059 CLUSAPI!ClusterResourceControl

0006f2ac 0102ec09 000a03c8 01000059 00000000 cluadmin!CClusPropList::ScGetResourceProperties+0x78

0006f328 010142c7 00000000 0026d6b8 01078468 cluadmin!CResource::ReadItem+0xcb

0006f36c 010161dc 00268508 01078468 00000001 cluadmin!CClusterDoc::InitResources+0xa2

0006f3a8 01016647 01078468 00000000 01078a48 cluadmin!CClusterDoc::CollectClusterItems+0xe9

0006f3c8 010167a3 002667b8 00268508 01078468 cluadmin!CClusterDoc::OnOpenDocumentWorker+0x90

0006f400 7f062793 002667b8 00268508 002667b8 cluadmin!CClusterDoc::OnOpenDocument+0x9c

0006f428 01010d97 002667b8 00000001 00264da0 MFC42u!CMultiDocTemplate::OpenDocumentFile+0x103

0006fc50 01010a9b 002667b8 00264da0 00268680 cluadmin!CClusterAdminApp::OpenDocumentFile+0xdc

0006fc64 7f03babe 00099c48 00264da0 00264da0 cluadmin!CClusterAdminApp::OnClusterConnectionOpened+0x27

0006fce8 7f038edc 0000040e 00099c48 01006d70 MFC42u!CWnd::OnWndMsg+0x62e

0006fd10 7f03af27 0000040e 00099c48 00264da0 MFC42u!CWnd::WindowProc+0x2c

0006fd70 7f03b06e 00268680 00000000 0000040e MFC42u!AfxCallWndProc+0xa7

0006fd94 7f0e6d8d 000501c6 0000040e 00099c48 MFC42u!AfxWndProc+0x3e

0006fdc4 7739b6e3 000501c6 0000040e 00099c48 MFC42u!AfxWndProcBase+0x4d

0006fdf0 7739b874 7f0e6d40 000501c6 0000040e USER32!InternalCallWinProc+0x28

0006fe68 7739ba92 00000000 7f0e6d40 000501c6 USER32!UserCallWinProcCheckWow+0x151

0006fed0 7739bad0 01049b24 00000000 0006ff08 USER32!DispatchMessageWorker+0x327

0006fee0 7f073000 01049b24 01049b24 01049af0 USER32!DispatchMessageW+0xf

0006fef0 7f072dda 01049af0 01049af0 ffffffff MFC42u!CWinThread::PumpMessage+0x40

 

0:000> dc 000a03c8+b8

000a0480  00650052 006f0073 00720075 00650063  R.e.s.o.u.r.c.e.

000a0490  005c0073 00330031 00350034 00310030  s.\.1.3.4.5.0.1.

000a04a0  00370030 0031002d 00340065 002d0034  0.7.-.1.e.4.4.-.

000a04b0  00660034 00310035 0038002d 00370061  4.f.5.1.-.8.a.7.

000a04c0  002d0030 00610035 00300034 00630034  0.-.5.a.4.0.4.c.

000a04d0  00350032 00620031 00650036 abab0000  2.5.1.b.6.e.....

000a04e0  abababab feeeabab 00000000 00000000  ................

000a04f0  000f0007 001c0731 000a0530 fedcba98  ....1...0.......

 

0:000> du 000a03c8+b8

000a0480  "Resources\13450107-1e44-4f51-8a7"

000a04c0  "0-5a404c251b6e"
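Notice that du broke the string across two lines, so the GUID has to be stitched back together before you can search for it elsewhere. Here is a small Python sketch of that step. It assumes the du output format shown above (an address followed by a quoted string on each line), and the function names are hypothetical.

```python
import re

# Standard 8-4-4-4-12 hexadecimal GUID layout.
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")

def join_du(du_output):
    """Concatenate the quoted pieces of a multi-line `du` dump."""
    return "".join(re.findall(r'"([^"]*)"', du_output))

def resource_guid(du_output):
    """Return the first GUID found in the joined string, or None."""
    match = GUID.search(join_du(du_output))
    return match.group(0) if match else None

# Raw string so the backslash in the dump is kept literally.
SAMPLE = r'''
000a0480  "Resources\13450107-1e44-4f51-8a7"
000a04c0  "0-5a404c251b6e"
'''

if __name__ == "__main__":
    print(join_du(SAMPLE))
    print(resource_guid(SAMPLE))
```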

 

 

You will notice that on Windows 2003 this location holds the resource GUID rather than the resource name.

 

NOTE: You can speculate that the problem resource is the one in the bottom view of the Cluadmin GUI; however, that method can lead to the wrong conclusion because the next resource may actually be causing the problem. That resource may have hung the GUI before it was displayed in Cluadmin.

 

OK, now that I have dumped out the resource, what can I do with it? You can take the resource offline to see whether Cluster has trouble taking it offline or bringing it back online.  Additionally, the cluster log may show that this resource is having an issue. Your best bet from here is to review the cluster log for any additional information about the specific resource, or to get a Resource Monitor (Resrcmon.exe) dump to investigate further. I did not show all the threads of the Cluadmin.exe dump because the rest are normally RPC calls and irrelevant to this discussion. If the connection had been to another machine, we would take a network trace to look for additional problems.
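When reviewing the cluster log, a simple case-insensitive filter on the resource name and GUID is a reasonable first pass. The Python sketch below only does a substring search and makes no assumption about the log's line format. The script name and function name are hypothetical, and the log location (typically %SystemRoot%\Cluster\cluster.log on these versions) should be treated as an assumption to verify on your own nodes.

```python
def lines_mentioning(lines, *needles):
    """Return the lines that contain any of the needles, ignoring case."""
    wanted = [needle.lower() for needle in needles if needle]
    return [line.rstrip("\r\n") for line in lines
            if any(needle in line.lower() for needle in wanted)]

if __name__ == "__main__":
    import sys
    # Usage: python findres.py <path-to-cluster-log> <name-or-guid> [...]
    if len(sys.argv) >= 3:
        with open(sys.argv[1], errors="replace") as log:
            for hit in lines_mentioning(log, *sys.argv[2:]):
                print(hit)
```

Searching for both the name and the GUID is worthwhile because, as the two dumps above show, either one may be what identifies the resource.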

 
