Red alert! My Server is hung - what do I do?


So you have a dump from a hung server and you’re the first person on the scene. Your IT Manager is jumping up and down, the phone is ringing off the hook and people are hovering outside your cube.  It’s game time and the pressure is on!!!  Now what do you do? 

 

Well, take a deep breath, get a cup of coffee, and relax because I’m here to help you out!  Let me share what we typically do on our first pass through a hung server kernel debug.  This works for both live debugs and dumps.  These are steps you can take, and they will find problems!

 

Here’s something else to consider.  If the server is mission critical you will probably want to get a dump vs. a live debug so you can get the server back up and running.  This will take the pressure off because you can then do the debug offline, and if need be, send the dump to other people for review.

 

Before we get started let me state that the following data is completely fabricated and many of the process names and addresses in this output have been made up.  Do not question odd offsets or alignments.

 

I’m also assuming that you know how to:

1.       Collect a kernel dump: http://support.microsoft.com/kb/244139

2.       Set up the debugger: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

3.       Use the symbol server: http://support.microsoft.com/kb/311503

 

 

0)      Before I start these types of debugs I like to open a log file.

 

1: kd> .logopen H:\repro\hungserver.log

Opened log file 'H:\repro\hungserver.log'

 

 

1)      !vm - Look for memory usage.  Generally speaking you want to look at what the current pool or memory usage values are and compare them to the max available.

 

 

1: kd> !vm

 

 

*** Virtual Memory Usage ***

      Physical Memory:      982890 (   3931560 Kb)

      Page File: \??\P:\pagefile.sys

        Current:   3931560 Kb  Free Space:   3742548 Kb

        Minimum:   3931560 Kb  Maximum:      4193280 Kb

      Available Pages:      631300 (   2525200 Kb)

      ResAvail Pages:       888171 (   3552684 Kb)

      Locked IO Pages:         195 (       780 Kb)

      Free System PTEs:     202830 (    811324 Kb) < THIS IS OK

      Free NP PTEs:          32765 (    131060 Kb) < THIS IS OK

      Free Special NP:           0 (         0 Kb)

      Modified Pages:          241 (       964 Kb)

      Modified PF Pages:       241 (       964 Kb)

      NonPagedPool Usage:    11377 (     45508 Kb) < THIS IS OK

      NonPagedPool Max:      65536 (    262144 Kb) 

      PagedPool 0 Usage:      6398 (     25592 Kb)

      PagedPool 1 Usage:      2201 (      8804 Kb)

      PagedPool 2 Usage:      2216 (      8864 Kb)

      PagedPool 3 Usage:      2179 (      8716 Kb)

      PagedPool 4 Usage:      2199 (      8796 Kb)

      PagedPool Usage:       15193 (     60772 Kb) < THIS IS OK

      PagedPool Maximum:     67584 (    270336 Kb)

      Shared Commit:         24569 (     98276 Kb)

      Special Pool:              0 (         0 Kb)

      Shared Process:        12519 (     50076 Kb)

      PagedPool Commit:      15252 (     61008 Kb)

      Driver Commit:          2083 (      8332 Kb)

      Committed pages:      313611 (   1254444 Kb) < THIS IS OK

      Commit limit:        1925815 (   7703260 Kb)

 

Check to see if any apps are using tons of memory.  In this case I don’t see a problem.

 

      Total Private:        239673 (    958692 Kb)

         36b0 EXCEL.EXE        10775 (     43100 Kb) < THIS IS OK, etc

         2ee8 myapploc.exe     10288 (     41152 Kb)

         097c MySSrv.exe        7497 (     29988 Kb)

         0418 MyFun32.exe       6277 (     25108 Kb)

         0474 svchost.exe       6164 (     24656 Kb)

         1be8 ABCDEFGH.EXE      4984 (     19936 Kb)

         0480 IEXPLORE.EXE      4924 (     19696 Kb)

         09c4 ANOTHER.exe       4768 (     19072 Kb)

         19a4 HMMINTER.exe      4207 (     16828 Kb)

         1b30 ohboya.EXE        4146 (     16584 Kb)

         4558 aprocess.EXE      4138 (     16552 Kb)

         30e8 another.exe       3691 (     14764 Kb)

         0924 aservicec.exe     3508 (     14032 Kb)

         0854 RRXXc.exe         3400 (     13600 Kb)

         3458 MYWIN.EXE         3389 (     13556 Kb)

         0d90 FunService.exe    3298 (     13192 Kb)

         1180 CustomAp.exe      3221 (     12884 Kb)

         06ac XYZvrver.exe      2769 (     11076 Kb)

         2cdc ABCDEFGH.exe      2591 (     10364 Kb)

         02f4 lsass.exe         2567 (     10268 Kb)

         21b4 IEXPLORE.EXE      2516 (     10064 Kb)

         3420 Process.exe       2450 (      9800 Kb)

         4cd4 XYZXY.EXE         2305 (      9220 Kb)

         4a30 lookup.EXE        2244 (      8976 Kb)

         4360 Process.exe       2201 (      8804 Kb)

         0564 spoolsv.exe       2166 (      8664 Kb)

         2e5c XYZXYZEXE         2076 (      8304 Kb)

         02bc winlogon.exe      1964 (      7856 Kb)

         4e48 winlogon.exe      1958 (      7832 Kb)

         42bc ABCDEFGH.exe      1943 (      7772 Kb)

         0eb8 svchost.exe       1922 (      7688 Kb)

         3b98 Process.exe       1919 (      7676 Kb)

         4c1c IEXPLORE.EXE      1864 (      7456 Kb)

         17b8 winlogon.exe      1852 (      7408 Kb)

         3124 winlogon.exe      1849 (      7396 Kb)

         14b8 winlogon.exe      1847 (      7388 Kb)

         32cc winlogon.exe      1843 (      7372 Kb)

         1f84 winlogon.exe      1843 (      7372 Kb)

         2ebc winlogon.exe      1842 (      7368 Kb)

         1548 winlogon.exe      1840 (      7360 Kb)

         21c4 PROCESS213.EXE    1833 (      7332 Kb)

         3b58 MYWIN.EXE         1817 (      7268 Kb)

         4b3c winlogon.exe      1816 (      7264 Kb)

 

NOTE: If you see high pool values, you will want to issue !poolused 2 (sorted by nonpaged pool usage) and !poolused 4 (sorted by paged pool usage) so you can see which pool tags are consuming pool.  (We will write a dedicated blog on this topic later.)
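If you saved the session with .logopen, the comparison of current values against their maximums in step 1 can be done by a small script instead of by eye.  The following is only a sketch in Python: the function names, the choice of counter pairs and the 80% warning threshold are my own assumptions and not anything the debugger defines, and the sample text is a few lines copied from the (fabricated) !vm output above.

```python
import re

# Matches counter lines such as "NonPagedPool Usage:    11377 (     45508 Kb)".
VM_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z0-9 ]*?):\s+(?P<pages>\d+)\s+\(\s*(?P<kb>\d+)\s+Kb\)"
)

# (usage counter, limit counter) pairs as they are named in the !vm output.
PAIRS = [
    ("NonPagedPool Usage", "NonPagedPool Max"),
    ("PagedPool Usage", "PagedPool Maximum"),
    ("Committed pages", "Commit limit"),
]


def parse_vm(text):
    """Return {counter name: size in Kb} for every counter line found."""
    counters = {}
    for line in text.splitlines():
        m = VM_LINE.match(line)
        if m:
            counters[m.group("name")] = int(m.group("kb"))
    return counters


def usage_report(counters, warn_pct=80.0):
    """Return (name, percent of limit, over_threshold) per usage/limit pair.

    warn_pct is an arbitrary threshold chosen for this sketch.
    """
    report = []
    for used, limit in PAIRS:
        if used in counters and counters.get(limit):
            pct = 100.0 * counters[used] / counters[limit]
            report.append((used, round(pct, 1), pct >= warn_pct))
    return report


SAMPLE = """\
      NonPagedPool Usage:    11377 (     45508 Kb) < THIS IS OK
      NonPagedPool Max:      65536 (    262144 Kb)
      PagedPool 0 Usage:      6398 (     25592 Kb)
      PagedPool Usage:       15193 (     60772 Kb) < THIS IS OK
      PagedPool Maximum:     67584 (    270336 Kb)
      Committed pages:      313611 (   1254444 Kb) < THIS IS OK
      Commit limit:        1925815 (   7703260 Kb)
"""

if __name__ == "__main__":
    for name, pct, over in usage_report(parse_vm(SAMPLE)):
        flag = "  <-- investigate" if over else ""
        print(f"{name}: {pct}% of limit{flag}")
```

Anything the script flags is only a hint to go run !poolused; it does not replace reading the output yourself.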

 

 

2) !sysptes - See if one of the lists is low (less than 10)

 

 

1: kd> !sysptes

 

All of these are ok

 

System PTE Information

  Total System Ptes 224223

     SysPtes list of size 1 has 225 free

     SysPtes list of size 2 has 57 free

     SysPtes list of size 4 has 136 free

     SysPtes list of size 8 has 59 free

     SysPtes list of size 16 has 95 free

 

    starting PTE: c022b000

    ending PTE:   c03dff78

 

  free blocks: 652   total free: 202831    largest free block: 191973
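The "less than 10" check from step 2 is easy to automate against a saved log.  Here is a minimal Python sketch; the function name is hypothetical, the threshold of 10 comes from the rule of thumb above, and the sample is the fabricated !sysptes output from this post.

```python
import re

# Matches lines such as "SysPtes list of size 8 has 59 free".
LIST_LINE = re.compile(r"SysPtes list of size (\d+) has (\d+) free")


def low_sysptes_lists(text, threshold=10):
    """Return [(list size, free count)] for lists with fewer than `threshold` free."""
    low = []
    for size, free in LIST_LINE.findall(text):
        if int(free) < threshold:
            low.append((int(size), int(free)))
    return low


SAMPLE = """\
System PTE Information
  Total System Ptes 224223
     SysPtes list of size 1 has 225 free
     SysPtes list of size 2 has 57 free
     SysPtes list of size 4 has 136 free
     SysPtes list of size 8 has 59 free
     SysPtes list of size 16 has 95 free
"""

if __name__ == "__main__":
    low = low_sysptes_lists(SAMPLE)
    if low:
        for size, free in low:
            print(f"list of size {size} is low: {free} free")
    else:
        print("no SysPtes list is below the threshold")
```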

 

 

3) !defwrites - If the write throttles are engaged, the server is doing nothing other than writing to the disk.

 

 

1: kd> !defwrites

*** Cache Write Throttle Analysis ***

 

      CcTotalDirtyPages:                   187 (     748 Kb)

      CcDirtyPageThreshold:             130560 (  522240 Kb)

      MmAvailablePages:                 631300 ( 2525200 Kb)

      MmThrottleTop:                       450 (    1800 Kb)

      MmThrottleBottom:                     80 (     320 Kb)

      MmModifiedPageListHead.Total:        241 (     964 Kb)

 

Write throttles not engaged  < THIS IS OK. Good = NOT engaged.

 

 

4) !ready - Look for threads in the READY state to see if we're holding stuff up

 

 

1: kd> !ready

Processor 0: No threads in READY state  < THIS IS OK

Processor 1: No threads in READY state  < THIS IS OK

 

If we had threads in a ready state you would want to investigate what those threads were and what is running on the processor.

 

 

5) !pcr x; kv on each processor - If they aren't idle then we could be doing DPCs

 

 

1: kd> !pcr 0  < Dump the processor control region (KPCR) for CPU 0

KPCR for Processor 0 at ffdff000:

    Major 1 Minor 1

      NtTib.ExceptionList: ffffffff

          NtTib.StackBase: 00000000

         NtTib.StackLimit: 00000000

       NtTib.SubSystemTib: 80042000

            NtTib.Version: 012e7ace

        NtTib.UserPointer: 00000001

            NtTib.SelfTib: 00000000

 

                  SelfPcr: ffdff000

                     Prcb: ffdff120

                     Irql: 00000000

                      IRR: 00000000

                      IDR: ffffffff

            InterruptMode: 00000000

                      IDT: 8003f400

                      GDT: 8003f000

                      TSS: 80042000

 

            CurrentThread: 8056cd00

               NextThread: 00000000

               IdleThread: 8056cd00

 

                DpcQueue: < NO DPCs: Not much to look at then 

    

1: kd> !pcr 1  < Dump the processor control region (KPCR) for CPU 1

KPCR for Processor 1 at f773f000:

    Major 1 Minor 1

      NtTib.ExceptionList: f5ba1d30

          NtTib.StackBase: 00000000

         NtTib.StackLimit: 00000000

       NtTib.SubSystemTib: f773fef0

            NtTib.Version: 0121925d

        NtTib.UserPointer: 00000002

            NtTib.SelfTib: 7ffda000

 

                  SelfPcr: f773f000

                     Prcb: f773f120

                     Irql: 00000000

                      IRR: 00000000

                      IDR: ffffffff

            InterruptMode: 00000000

                      IDT: f77456e0

                      GDT: f77452e0

                      TSS: f773fef0

 

            CurrentThread: 8963cb90

               NextThread: 00000000

               IdleThread: f7741fa0

 

                DpcQueue: < NO DPCs: Not much to look at then

 

6) !locks - Look for deadlocks and contention

 

 

The following output is of interest.

The thread ID with the <*> next to it has exclusive access to the resource, and all the other threads are waiting on that thread to finish its work.  Typically you would !thread that owner thread (e.g., !thread 87bddda0) to see what it is doing.  If two threads each have exclusive access to a different resource, and each thread is in the other’s exclusive waiters list, you have a deadlock.  The following is an example of what a deadlock might look like: thread 87bddda0 owns resource 0x81234567 and is waiting on 0x83245678, while thread 87abc123 owns 0x83245678 and is waiting on 0x81234567.  In this case you would want to !thread each owner and evaluate the logic of the code in each stack that allowed the threads to get into this state.

 

1: kd> !locks

**** DUMP OF ALL RESOURCE OBJECTS ****

KD: Scanning for held locks......

 

Resource @ 0x8a50ee98    Shared 4 owning threads

     Threads: 896856d0-01<*> 89686778-01<*> 896862d0-01<*> 89685da0-01<*>

KD: Scanning for held locks............................................................

 

Resource @ 0x896da1bc    Exclusively owned

     Threads: 896e3b20-01<*>

KD: Scanning for held locks..

 

 

Resource @ 0x81234567    Shared 1 owning threads

    Contention Count = 15292

    NumberOfSharedWaiters = 1

    NumberOfExclusiveWaiters = 39

     Threads: 87bddda0-01<*> 806d2020-01 

 

 

     Threads Waiting On Exclusive Access:

              80ced020       80c036f8       80cdc7a0       80c438b0      

              80e6cda0       80f96987       8007fd60       8004dc10      

              80d7b020       80a2dd70       80b89620       80b58020      

              8036eda0       87abc123       80606da0       8056e890      

              802b3630       80cc7590       80d64020       80f7dda0      

              80129580       80b73da0       806d2578       80b505d8      

      

 

KD: Scanning for held locks................

 

Resource @ 0x83245678    Exclusively owned

    Contention Count = 4827

    NumberOfExclusiveWaiters = 35

     Threads: 87abc123-01<*>

     Threads Waiting On Exclusive Access:

              803e6aa0       80876020       80240020       80f56588      

              808174f0       80bd6b28       80c3c448       8046d6c8      

              801e8da0       80356518       80b4c978       8069e020      

              80cb9020       87bddda0       80c65020       86daaac0      

              80379020       80fe4020      
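With dozens of waiters per resource, spotting two owners in each other's waiter lists by eye gets tedious.  The check described in step 6 amounts to looking for a cycle in a "waits for" graph, and the Python sketch below does that against a saved !locks log.  Treat it as an illustration under assumptions: the function names are made up, it follows this post's reading that a thread marked <*> is an owner, and the sample is an abridged copy of the fabricated output above.

```python
import re

RESOURCE = re.compile(r"Resource @ (0x[0-9a-fA-F]+)")
OWNER = re.compile(r"([0-9a-fA-F]{8})-\d+<\*>")      # e.g. "87bddda0-01<*>"
WAITER_ROW = re.compile(r"(?:[0-9a-fA-F]{8}\s*)+")   # a row of bare thread addresses


def parse_locks(text):
    """Return {resource address: {"owners": set, "waiters": set}}."""
    resources, current, in_waiters = {}, None, False
    for raw in text.splitlines():
        line = raw.strip()
        m = RESOURCE.match(line)
        if m:
            current = resources.setdefault(
                m.group(1), {"owners": set(), "waiters": set()})
            in_waiters = False
        elif current is None or not line:
            continue
        elif line.startswith("Threads Waiting On"):
            in_waiters = True
        elif line.startswith("Threads:"):
            current["owners"].update(OWNER.findall(line))
        elif in_waiters and WAITER_ROW.fullmatch(line):
            current["waiters"].update(line.split())
        else:
            in_waiters = False  # e.g. "KD: Scanning for held locks..."
    return resources


def wait_for_graph(resources):
    """Map each waiting thread to the set of owner threads it is waiting on."""
    graph = {}
    for res in resources.values():
        for waiter in res["waiters"]:
            graph.setdefault(waiter, set()).update(res["owners"] - {waiter})
    return graph


def find_cycle(graph):
    """Return one cycle as a list of thread addresses, or None if there is none."""
    visiting, done = [], set()

    def visit(node):
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in sorted(graph.get(node, ())):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


SAMPLE = """\
Resource @ 0x81234567    Shared 1 owning threads
    Contention Count = 15292
    NumberOfExclusiveWaiters = 39
     Threads: 87bddda0-01<*> 806d2020-01

     Threads Waiting On Exclusive Access:
              80ced020       80c036f8       80cdc7a0       80c438b0
              8036eda0       87abc123       80606da0       8056e890

KD: Scanning for held locks................

Resource @ 0x83245678    Exclusively owned
    Contention Count = 4827
    NumberOfExclusiveWaiters = 35
     Threads: 87abc123-01<*>
     Threads Waiting On Exclusive Access:
              803e6aa0       80876020       80240020       80f56588
              80cb9020       87bddda0       80c65020       86daaac0
"""

if __name__ == "__main__":
    cycle = find_cycle(wait_for_graph(parse_locks(SAMPLE)))
    print(" -> ".join(cycle) if cycle else "no cycle found")
```

A reported cycle only tells you which owners to !thread first; you still have to read the stacks to understand how the code got there.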

 

 

 

8) !process 0 0 - Search for drwtsn32.  This would indicate that a process has crashed and is in the middle of being dumped by Dr. Watson.  This could cause a server hang.  Look at the PEB for drwtsn32 and get its command line to see what process is being dumped.  You should be able to do this by taking its process address (the value after PROCESS in the !process 0 0 output) and doing a .process ADDRESS;.reload;!peb

 

The following shows how to extract the command line for any process; the same steps work for Dr. Watson.

 

1: kd> .process 89f31020 

Implicit process is now 89f31020

1: kd> .reload

Loading Kernel Symbols

...........................................................................................................................................

Loading User Symbols

...............................

Loading unloaded module list

...............

1: kd> !peb

PEB at 7ffdf000

    InheritedAddressSpace:    No

    ReadImageFileExecOptions: Yes

    BeingDebugged:            No

    ImageBaseAddress:         01000000

    Ldr                       77fc23a0

    Ldr.Initialized:          Yes

    Ldr.InInitializationOrderModuleList: 00171ef8 . 00176c90

    Ldr.InLoadOrderModuleList:           00171e90 . 00176c80

    Ldr.InMemoryOrderModuleList:         00171e98 . 00176c88

            Base TimeStamp                     Module

         1000000 3e80245d Mar 24 05:41:49 2003 \??\P:\WINDOWS\system32\winlogon.exe

        77f40000 3e802494 Mar 25 05:42:44 2003 P:\WINDOWS\system32\ntdll.dll

        77e40000 44c60ec8 Jul 25 08:30:00 2006 P:\WINDOWS\system32\kernel32.dll

        77ba0000 3e802496 Mar 25 05:42:46 2003 P:\WINDOWS\system32\msvcrt.dll

        77da0000 3e802495 Mar 25 05:42:45 2003 P:\WINDOWS\system32\ADVAPI32.dll

        77c50000 40566fc9 Mar 15 23:08:57 2004 P:\WINDOWS\system32\RPCRT4.dll

        77d00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\USER32.dll

        77c00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\GDI32.dll

        75970000 3e8024a2 Mar 25 05:42:58 2003 P:\WINDOWS\system32\USERENV.dll

        75810000 3e8024a3 Mar 25 05:42:59 2003 P:\WINDOWS\system32\NDdeApi.dll

        761b0000 3e8024a0 Mar 25 05:42:56 2003 P:\WINDOWS\system32\CRYPT32.dll

       

    SubSystemData:     00000000

    ProcessHeap:       00070000

    ProcessParameters: 00020000

    WindowTitle:  '< Name not readable >'

    ImageFile:    '\??\P:\WINDOWS\system32\winlogon.exe'

    CommandLine:  'winlogon.exe' < HERE IS THE COMMAND LINE.. No args in this case

 

 

( output is truncated ... )

 

9) Look at the handle table size.  If it’s over 10000 you may have trouble.  If you do have a handle leak, refer to the TalkBack video “Understanding handle leaks and how to use !htrace to find them.”

 

 

1: kd> !process 0 0

 

**** NT ACTIVE PROCESS DUMP ****

PROCESS 8a613270  SessionId: none  Cid: 0004    Peb: 00000000  ParentCid: 0000

    DirBase: 0acc0000  ObjectTable: e1001d10  HandleCount: 2510.

    Image: System

 

PROCESS 8a294328  SessionId: none  Cid: 0274    Peb: 7ffdf000  ParentCid: 0004

    DirBase: ef1ac000  ObjectTable: e14ac1d0  HandleCount: 124.

    Image: smss.exe

 

PROCESS 8a103424  SessionId: 0  Cid: 02a4    Peb: 7ffdf000  ParentCid: 0274

    DirBase: ed804000  ObjectTable: e18caa68  HandleCount: 1171.

    Image: csrss.exe

 

PROCESS 8a104343  SessionId: 0  Cid: 02bc    Peb: 7ffdf000  ParentCid: 0274

    DirBase: ed539000  ObjectTable: e18c67b0  HandleCount: 498.

    Image: winlogon.exe

 

PROCESS 8a0f6634  SessionId: 0  Cid: 02e8    Peb: 7ffdf000  ParentCid: 02bc

    DirBase: ece72000  ObjectTable: e1668e40  HandleCount: 568.

    Image: services.exe

 

PROCESS 8a123423  SessionId: 0  Cid: 02f4    Peb: 7ffdf000  ParentCid: 02bc

    DirBase: ecd7a000  ObjectTable: e16684a0  HandleCount: 30000. < This is bad

    Image: lsass.exe

 

PROCESS 89f96453  SessionId: 0  Cid: 03e0    Peb: 7ffdf000  ParentCid: 02e8

    DirBase: eb99c000  ObjectTable: e16bb570  HandleCount: 500.

    Image: svchost.exe

 

PROCESS 8a0c6532  SessionId: 0  Cid: 042c    Peb: 7ffdf000  ParentCid: 02e8

    DirBase: eb6d7000  ObjectTable: e1731170  HandleCount: 156.

    Image: svchost.exe

 

PROCESS 8a0a8d88  SessionId: 0  Cid: 0460    Peb: 7ffdf000  ParentCid: 02e8

    DirBase: eb58f000  ObjectTable: e17372e8  HandleCount: 124.

    Image: svchost.exe

 

PROCESS 89f77678  SessionId: 0  Cid: 0474    Peb: 7ffdf000  ParentCid: 02e8

    DirBase: eb484000  ObjectTable: e17305b8  HandleCount: 1457.

    Image: svchost.exe
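On a busy terminal server the !process 0 0 listing can run to hundreds of entries, so it helps to let a script do the first pass for the two checks above: handle counts over 10000 and any drwtsn32 process.  The sketch below is Python, with hypothetical function names, run against an abridged copy of the fabricated listing from this post.

```python
import re

# "PROCESS 8a123423  SessionId: 0  Cid: 02f4    Peb: 7ffdf000  ParentCid: 02bc"
PROCESS = re.compile(r"^PROCESS\s+([0-9a-fA-F]+)\s.*?Cid:\s*([0-9a-fA-F]+)")
HANDLES = re.compile(r"HandleCount:\s*(\d+)")
IMAGE = re.compile(r"Image:\s*(\S+)")


def parse_processes(text):
    """Return one dict per PROCESS entry found in !process 0 0 output."""
    procs = []
    for raw in text.splitlines():
        line = raw.strip()
        m = PROCESS.match(line)
        if m:
            procs.append({"eprocess": m.group(1), "cid": m.group(2),
                          "handles": None, "image": None})
            continue
        if not procs:
            continue
        m = HANDLES.search(line)
        if m:
            procs[-1]["handles"] = int(m.group(1))
        m = IMAGE.search(line)
        if m:
            procs[-1]["image"] = m.group(1)
    return procs


def triage(procs, handle_limit=10000):
    """Return (processes over the handle limit, drwtsn32 processes)."""
    leaky = [p for p in procs
             if p["handles"] is not None and p["handles"] > handle_limit]
    watson = [p for p in procs
              if (p["image"] or "").lower().startswith("drwtsn32")]
    return leaky, watson


SAMPLE = """\
PROCESS 8a0f6634  SessionId: 0  Cid: 02e8    Peb: 7ffdf000  ParentCid: 02bc
    DirBase: ece72000  ObjectTable: e1668e40  HandleCount: 568.
    Image: services.exe

PROCESS 8a123423  SessionId: 0  Cid: 02f4    Peb: 7ffdf000  ParentCid: 02bc
    DirBase: ecd7a000  ObjectTable: e16684a0  HandleCount: 30000. < This is bad
    Image: lsass.exe

PROCESS 89f96453  SessionId: 0  Cid: 03e0    Peb: 7ffdf000  ParentCid: 02e8
    DirBase: eb99c000  ObjectTable: e16bb570  HandleCount: 500.
    Image: svchost.exe
"""

if __name__ == "__main__":
    leaky, watson = triage(parse_processes(SAMPLE))
    for p in leaky:
        print(f"{p['image']} (PROCESS {p['eprocess']}): {p['handles']} handles")
    for p in watson:
        print(f"{p['image']} found, try: .process {p['eprocess']}; .reload; !peb")
```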

 

9) !process 0 0 system - Find the System process, then dump its threads with flags 17 (!process 0 17 system) and check the worker threads.  Search for srv! to find the server service worker threads.  What are these threads doing?  Are they blocked on I/O or waiting for a resource?

 

10) !process 0 17 csrss.exe - Look for the 16 LPC server threads.

What are they doing? Are they blocked?

 

11) !stacks 2 - This will dump every call stack on the server.  You may need to go through and evaluate each one.  Look for critical sections, etc.

 

15) !qlocks - This will allow you to check the state of all the queued spin locks on the machine.  For further information on spin locks, refer to the Windows Internals book.

 

1: kd> !qlocks

Key: O = Owner, 1-n = Wait order, blank = not owned/waiting, C = Corrupt

 

                       Processor Number

    Lock Name         0  1    << Nothing to worry about here.

 

KE   - Dispatcher        

MM   - Expansion         

MM   - PFN               

MM   - System Space      

CC   - Vacb              

CC   - Master            

EX   - NonPagedPool      

IO   - Cancel            

EX   - WorkQueue         

IO   - Vpb                

IO   - Database          

IO   - Completion        

NTFS - Struct            

AFD  - WorkQueue         

CC   - Bcb               

MM   - NonPagedPool     

 

16) !process 0 17 winlogon.exe - Look for hung LPC calls.  If you find an LPC call going out of winlogon, you can follow the call with the !lpc debugger command.  This will allow you to see what the thread is doing in the other process.

 

 

If you have further questions on any of these commands, please refer to the debugger.chm file in the Windows debugger tools install.

 

Good luck and happy debugging.

 

“This debugger is mine, there are many like it but this one is mine!” Jeff Dailey
