Welcome to MSDN Blogs Sign in | Join | Help

Today I want to talk about tracking down leaking pool. Back with Server 2003 and before, leaking pool was a major issue because it was a limited resource. In Vista and beyond, it isn’t as much of an issue since pool is allocated dynamically, but it can still cause system performance issues if a component uses too much of it. Since all nonpaged pool allocations are locked in RAM, it leaves the system with a smaller amount of RAM for anything else.

I recently worked a case involving leaking nonpaged pool and it was easy to troubleshoot. I thought I would walk through the steps I used to narrow down the issue in this article.  So let's get started. Note - The following can be done on a live machine or with a kernel memory dump as well.

Running the !vm command, you can get statistics about the virtual memory usage on the system.

  

       Virtual Memory Usage

 

 

       Physical Memory:   655219   ( 2620876 Kb)

 

       Page File: \??\C:\pagefile.sys

          Current:   4190208Kb Free Space:   4115148Kb

          Minimum:   4190208Kb Maximum:      4190208Kb

 

       Available Pages:   303760   ( 1215040 Kb)

       ResAvail Pages:    571159   ( 2284636 Kb)

       Modified Pages:       677   (    2708 Kb)

       NonPagedPool Usage: 59613   (  238452 Kb)

       NonPagedPool Max:   69377   (  277508 Kb)

                                                          NonPaged Pool Usage :  85%

       ********** Excessive NonPaged Pool Usage *****

       PagedPool 0 Usage:  31684   (  126736 Kb)

       PagedPool 1 Usage:   1824   (    7296 Kb)

       PagedPool 2 Usage:   1821   (    7284 Kb)

       PagedPool 3 Usage:   1824   (    7296 Kb)

       PagedPool 4 Usage:   1827   (    7308 Kb)

       PagedPool Usage:    38980   (  155920 Kb)

       PagedPool Maximum:  86016   (  344064 Kb)

                                                          Paged Pool Usage :  45%

 

Hmm….looks like something is using a lot of nonpaged pool. To determine what is using the maximum amount of nonpaged pool, run the !poolused command.

0: kd>!poolused 2

   Sorting by NonPaged Pool Consumed

 

  Pool Used:

            NonPaged            Paged

 Tag    Allocs     Used    Allocs     Used

 Thre   326514 208968960         0       0

 MmCm      647  2830624         0        0

 LSwi        1  2293760         0        0

 Npfs     1681  1276672      1202   208320

 File     5863   945472         0        0

 Dmio      131   525760         0        0

 Irp      1158   508800         0        0

 

Adding a value of 2 to end of the command instructs the debugger to sort by nonpaged pool used. For more information on the command, please review the help file.

According to the output, the “Thre” tag is the highest user of nonpaged pool. To get a rough idea of all the modules that might be allocating pool using this tag, we can search all the modules to see if they contain the string. This can help narrow down the problem to a few drivers. To search all the modules, just use the for_each_module command.

      !for_each_module s -a @#Base @#End "Thre"

 

Look at David Butler’s blog post if you want to narrow it down even further by looking at each driver’s import table to see if they are importing the ExAllocatePool or ExAllocatePoolWithTag functions. Though, in this case I already know  the “Thre” tag is used by the kernel to allocate thread objects. Could the kernel be leaking thread objects? Most likely not, otherwise we would have seen a lot of customers reporting this issue. I am not ruling it out, but to start off, I am going to focus on the third party software installed on the machine.

Let’s dump out a few sample pool allocations, to see what exactly they are. I was sure there were going to be a ton of them, so I just hit the Break button once the output started flying by.

0: kd> !poolfind Thre 0

Scanning large pool allocation table for Tag: Thre

Searching NonPaged pool (81093000 : 89093000) for Tag: Thre

 

81098000 size:  280 previous size:    0  (Allocated) Thre (Protected)

81098380 size:  280 previous size:  100  (Allocated) Thre (Protected)

81098600 size:  280 previous size:  280  (Allocated) Thre (Protected)

81098880 size:  280 previous size:  280  (Allocated) Thre (Protected)

81098b00 size:  280 previous size:  280  (Allocated) Thre (Protected)

81098d80 size:  280 previous size:  280  (Allocated) Thre (Protected)

81099000 size:  280 previous size:    0  (Allocated) Thre (Protected)

810992c0 size:  280 previous size:   40  (Allocated) Thre (Protected)

810996e0 size:  280 previous size:  1a0  (Allocated) Thre (Protected)

81099960 size:  280 previous size:  280  (Allocated) Thre (Protected)

81099d80 size:  280 previous size:  1a0  (Allocated) Thre (Protected)

8109a000 size:  280 previous size:    0  (Allocated) Thre (Protected)

8109a380 size:  280 previous size:  100  (Allocated) Thre (Protected)

8109a600 size:  280 previous size:  280  (Allocated) Thre (Protected)

8109a880 size:  280 previous size:  280  (Allocated) Thre (Protected)

.

.

.

 

It would be interesting to see the thread stacks for the KTHREAD structures in these pools. To dump out the thread using the !thread command, I needed to figure out where the base of the KTHREAD structure started in the pool. To figure out the offset I dumped out all the threads in the System process, picked a random thread, and then calculated the thread’s offset from the start of it’s pool block. You can use any thread in the entire system because the offset will always be the same.

0: kd> !thread  8904a180

THREAD 8904a180  Cid 8.18  Teb: 00000000  Win32Thread: 00000000 WAIT: (WrEventPairLow) KernelMode Non-Alertable

    80478d40  Unknown

Not impersonating

Owning Process 890754a0

Wait Start TickCount    3831270       Elapsed Ticks: 209711

Context Switch Count    28238              

UserTime                  0:00:00.0000

KernelTime                0:00:00.0875

Start Address nt!ExpWorkerThread (0x80417a9a)

Stack Init f6444000 Current f6443d34 Base f6444000 Limit f6441000 Call 0

Priority 13 BasePriority 13 PriorityDecrement 0 DecrementCount 0

ChildEBP RetAddr  Args to Child

f6443d4c 8042ef5f bfec0068 80478d40 8904a180 nt!KiSwapThread+0x1b1

f6443d70 80417b0d 00000000 80442f00 00000000 nt!KeRemoveQueue+0x197

f6443da8 804578c2 00000000 00000000 00000000 nt!ExpWorkerThread+0x73

f6443ddc 8046c966 80417a9a 00000000 00000000 nt!PspSystemThreadStartup+0x54

00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

 

0: kd> !pool 8904a180 

 8904a080 size:   e0 previous size:   20  (Allocated)  Dire (Protected)

*8904a160 size:  280 previous size:   e0  (Allocated) *Thre (Protected)

 8904a3e0 size:  280 previous size:  280  (Allocated)  Thre (Protected)

 

0: kd> ?8904a180-8904a160

Evaluate expression: 32 = 00000020

 

So based on my calculations, the Kthread structure starts 32 bytes into the Thre pool.

Now I can use this offset to dump threads stacks from the pools returned by the poolfind command.  Some of the threads are going to be valid, but most of them are going to be doing something fishy. Basically I need to use a large sample to figure out what would be causing the Thre tag to leak. After a few tries, I was able to find a lot of threads which looked like this -

kd> !Thread 81098620 

THREAD 81098620Cid 8e8.a391c  Teb: 00000000  Win32Thread: 00000000 TERMINATED

Not impersonating

Owning Process 886f5020

Wait Start TickCount    1949088       Elapsed Ticks: 2091893

Context Switch Count    2              

UserTime                  0:00:00.0000

KernelTime                0:00:00.0000

Start Address 0x7c57b740

Win32 Start Address 0x20003474

Stack Init 0 Current bca4fc44 Base bca50000 Limit bca4d000 Call 0

Priority 16 BasePriority 8 PriorityDecrement 0 DecrementCount 0

 

 

kd> !PROCESS 886f5020 0

PROCESS 886f5020SessionId: 0  Cid: 08e8    Peb: 7ffdf000  ParentCid: 016c

    DirBase: 694bc000  ObjectTable: 88b3ec68  TableSize: 391189.

    Image: xxxxxx.EXE

 

Looks like the thread is terminated. But why is it still hanging around?

0: kd> !object 81098620

Object: 81098620  Type: (89075900) Thread

    ObjectHeader: 81098608

    HandleCount: 1  PointerCount: 1

 

Aha! Since the handle count and pointer count are not 0, the system has not freed this Thre allocation. Most likely something has an open handle to this object. Dumping out all the processes using the !process command, I found a particular process with 388,836 open handles. This is the culprit. Interesting enough, this is the same process which was listed in the thread’s process field.

       1  Token

       1  Desktop

       1  Section

       2  WindowStation

       3  Directory

       4  IoCompletion

       5  Mutant

       7  Port

       9  Semaphore

      15  File

      73  Event

   65110  Key

  323155  Thread

 

  388386  Handles in process

 

Running !handle 0 2 886f5020 and analyzing the output, I found a ton of thread handles. Most likely the process is leaking thread handles and preventing the system from cleaning up the thread allocations. This is resulting in high usage of nonpaged pool, which is causing the performance issues seen by the customer.

We contacted the software vendor responsible for the process, and it turned out to be a known bug in their software. An update was provided to the customer, and it fixed the issue immediately.

 

Share this post :

Hi All. Recently I had a Windows 2000 case where the machine wouldn’t shut down. After initiating the shutdown process, we saw the user get logged off, and on the console we watched the services shut down. The final “Windows is shutting down…” message was displayed on the screen, and then the screen would go blank. At that point we expected the machine to power down, but it continued running. Even after waiting 10 minutes the machine stayed powered up. The same behavior was exhibited after a restart as well. We tried the same thing in safe mode, but the machine still wouldn’t shut down.

We setup the machine for a full memory dump, and crashed the box once the screen went blank. The customer uploaded the dump, and I started digging.

The first thing I noticed was a bunch of processes still running. I expected csrss.exe, smss.exe, winlogon.exe and the System process , but to my surprise I found other processes such as svchost.exe and services.exe still lingering. Why were they still running?

 

0: kd> !process 0 0

**** NT ACTIVE PROCESS DUMP ****

PROCESS f95ed480  SessionId: 0  Cid: 0008    Peb: 00000000  ParentCid: 0000

    DirBase: 00030000  ObjectTable: f95edf68  TableSize:  62.

    Image: System

 

PROCESS f94de520  SessionId: 0  Cid: 00fc    Peb: 7ffff000  ParentCid: 0008

    DirBase: 2465c000  ObjectTable: f94def08  TableSize:  34.

    Image: SMSS.EXE

 

PROCESS f94b01c0  SessionId: 0  Cid: 0118    Peb: 7ffff000  ParentCid: 00fc

    DirBase: 26515000  ObjectTable: f94b29e8  TableSize: 103.

    Image: CSRSS.EXE

 

PROCESS f94a5d60  SessionId: 0  Cid: 0130    Peb: 7ffff000  ParentCid: 00fc

    DirBase: 2671a000  ObjectTable: f94b18a8  TableSize: 114.

    Image: WINLOGON.EXE

 

PROCESS f9495a60  SessionId: 0  Cid: 014c    Peb: 7ffff000  ParentCid: 0130

    DirBase: 26814000  ObjectTable: f94b2848  TableSize: 204.

    Image: SERVICES.EXE

 

PROCESS f9494860  SessionId: 0  Cid: 0158    Peb: bffdf000  ParentCid: 0130

    DirBase: 2681d000  ObjectTable: f9499608  TableSize: 247.

    Image: LSASS.EXE

 

PROCESS f9476500  SessionId: 0  Cid: 01e8    Peb: 7ffff000  ParentCid: 014c

    DirBase: 272a1000  ObjectTable: f94833a8  TableSize: 136.

    Image: svchost.exe

 

 

Let’s find out.

How is Shutdown Performed

At first I thought some service was stuck and not shutting down, causing the entire system to stall. I pulled out my copy of Windows Internals and thumbed through the Startup and Shutdown chapter. I found that to initiate a shutdown, Csrss sends a shutdown message to Winlogon. Winlogon then calls ExitWindowsEx, which in turn sends a message to each session’s Csrss. Inside each session, Csrss then sends messages to all the threads that own a Windows message loop to shut down. If the threads do not respond, Csrss waits for a timeout specified by HKCU\Control Panel\Desktop\HungAppTimeout. Once all the threads that own windows exit, Csrss then terminates the processes as well. If Csrss finds a console application, it sends it a message to shut down. It will wait for a timeout specified by HKCU\Control Panel\Desktop\WaitToKillAppTimeout, and then displays a message on the desktop.

If you want Windows to terminate the processes immediately after timeout period, you can set the HKCU\Control Panel\Desktop\AutoEndTasks to 1.

At this point all the processes in the interactive user process have been shut down. Csrss in session 0 then sends messages to the Service Control Manager(SCM) to shut down all the services. When a service is registered, it also specifies a wait hint. During shutdown, this hint is used by the SCM to wait for services to terminate. Prior to Vista, Csrss waits for the SCM to shut down, but if that does not happen by the timeout value in HKLM\SYSTEM\CurrentControlSet\Control\WaitToKillServiceTimeout, Csrss would just kill the SCM and proceed with the shutdown. Since this could lead to services not completing their shutdown routines, going forward in Vista and beyond Windows now implements preshutdown notification routines. Services that want to get these notifications can register themselves by using the SetServiceStatus API.

Once Csrss has finished its pass notifying system processes that the system is shutting down, Winlogon in session 0 finishes the shutdown process by calling the NtShutdownSystem. Note, many system processes such as Csrss, Smss, WinInit, Services, Lsass are still running when the machine actually powers down. They are ready to be shut down, but the processes still exist. NtShutdownSystem calls PoSetSystemPowerState to shut down all the drivers and the rest of the executive subsystems.

One of the things that PoSetSystemPowerState does is call the I/O manager to send notifications to all the drivers that have requested the shutdown notification. Once the drivers receive these notifications, they can then clean up and perform any special steps that might be needed for their devices. Other subsystems also perform clean up, and the page file is cleared by the Memory Manager if that option was turned on. The I/O manager is called a second time, and then the file system drivers are notified that the system is shutting down. The System actually shuts down in the power manager. The power manager tells the BIOS to turn the power off to the machine.

The Memory Dump Analysis

I started by looking at the Winlogon process in session 0 and found the main thread had called NtShutdownSystem as we expected. I always start with this thread when troubleshooting shutdown problems.

 

0: kd> !process f94a5d60 17

PROCESS f94a5d60  SessionId: 0  Cid: 0130    Peb: 7ffff000  ParentCid: 00fc

    DirBase: 2671a000  ObjectTable: f94b18a8  TableSize: 114.

    Image: WINLOGON.EXE

    VadRoot f93eb748 Clone 0 Private 515. Modified 396. Locked 0.

    DeviceMap f95c8448

    Token                             e246cdf0

    ElapsedTime                        0:34:42.0937

    UserTime                          0:00:00.0109

    KernelTime                        0:00:01.0015

    QuotaPoolUsage[PagedPool]         30112

    QuotaPoolUsage[NonPagedPool]      31264

    Working Set Sizes (now,min,max)  (1172, 50, 345) (4688KB, 200KB, 1380KB)

    PeakWorkingSetSize                1866

    VirtualSize                       29 Mb

    PeakVirtualSize                   31 Mb

    PageFaultCount                    3603

    MemoryPriority                    FOREGROUND

    BasePriority                      13

    CommitCharge                      888

 

        THREAD f94a5ae0  Cid 130.114  Teb: 7fffe000  Win32Thread: e002e328 WAIT: (Suspended) KernelMode Non-Alertable

            f93c2010  SynchronizationEvent

            f94a5bc8  NotificationTimer

        Not impersonating

        Owning Process f94a5d60

        Wait Start TickCount    138381        Elapsed Ticks: 842

        Context Switch Count    1914                   LargeStack

        UserTime                  0:00:00.0078

        KernelTime                0:00:00.0875

        Start Address winlogon!_setargv (0x0100ae2c)

        Stack Init f1d00000 Current f1cffa60 Base f1d00000 Limit f1cfb000 Call 0

        Priority 15 BasePriority 15 PriorityDecrement 0 DecrementCount 0

 

        ChildEBP RetAddr  Args to Child

        f1cffa78 dd42d893 dd0655c0 f93c2008 dd0654f0 nt!KiSwapThread+0x1b1

        f1cffaa0 dd490be9 f93c2010 00000005 00000000 nt!KeWaitForSingleObject+0x1a3

        f1cffae0 dd4908c5 f93c2000 00000001 00000001 nt!PopWaitForSystemPowerIrp+0xf2

        f1cffb04 dd4903db 00000000 f1cffbf4 f1cffc78 nt!PopSetDevicesSystemState+0xfb

        f1cffbe0 dd468389 00000005 00000004 c0000004 nt!NtSetSystemPowerState+0x329

        f1cffbe0 dd432197 00000005 00000004 c0000004 nt!_KiSystemService+0xc9

        f1cffc64 dd490122 00000005 00000004 c0000004 nt!ZwSetSystemPowerState+0xb

        f1cffd48 dd494db4 00000005 00000004 c0000004 nt!NtSetSystemPowerState+0x70

        f1cffd58 dd468389 00000001 00000000 00000000 nt!NtShutdownSystem+0x2e

        f1cffd58 77f88e07 00000001 00000000 00000000 nt!_KiSystemService+0xc9

        0006fe90 01014c3d 00000001 00000001 000767b8 ntdll!NtShutdownSystem+0xb

        0006fea8 01014f6c 00000060 0000000b 0000000b winlogon!ShutdownMachine+0x165

        0006ff04 0100e20f 000767b8 0000000b 5ffa0000 winlogon!Logoff+0x216

        0006ff20 01007e8c 000767b8 00000005 000735dc winlogon!MainLoop+0x1fb

        0006ff58 0100af70 00071fc8 00000000 000735dc winlogon!WinMain+0x37a

        0006fff4 00000000 7ffff000 000000c8 00000100 winlogon!WinMainCRTStartup+0x156

 

Looks like the I/O Manager has sent shutdown notifications to all the devices and this winlogon thread is waiting for it to complete. Dumping out the IRP, I found it had been completed, but for some reason the completion routine had never been called. That is why we are waiting forever. Digging further, it looked like someone had messed up the current IO stack location on the IRP, which resulted in the IRP completion routine never being called.

Let’s take a look at the IRP.

Based on the thread stack you can see Winlogon sent an IRP down to Plug and Play and is waiting for the action to complete. Since there is no obvious way to get a pointer to the IRP from the stack we’ll need to do some digging. We see that nt!PopWaitForSystemPowerIrp is waiting for some type of dispatch object to signal. According to MSDN the first parameter passed to KeWaitForSingleObject() is the object. Here I’m dumping the raw memory for the Object parameter.

0: kd> dc f93c2010

f93c2010  00040001 00000000 f94a5b4c f94a5b4c  ........L[J.L[J.

f93c2020  00000000 f94a5ae0 00000000 00000063  .....ZJ.....c...

f93c2030  00000028 00000005 dd47bd20 f93c203c  (....... .G.< <.

f93c2040  f93c203c f93c2044 f93c2044 f938fea8  < <.D <.D <...8.

f93c2050  00000000 00000000 01010000 00000100  ................

f93c2060  f946c860 f946c860 f93c23ac 00000000  `.F.`.F..#<.....

f93c2070  00000000 f93c2360 f93c2360 f93c207c  ....`#<.`#<.| <.

f93c2080  f93c207c f93c2084 f93c2084 f93c208c  | <.. <.. <.. <.

 

0: kd> !pool f93c2010 2

*f93c2000 size:  420 previous size:    0  (Allocated) *PDss

 

You may ask yourself, “are any of these values in the PDss pool pointers to IRP pool blocks?”  If you wanted to know that, you could iterate over each one of them passing them to !pool or !address.  That sounds tedious to do manually.  But thankfully, the debugger has some nice command tokens such as .foreach that make this a breeze.  If you want to pass each of the above values to !pool, you can perform the following in the debugger:

0: kd>  .foreach /pS 1 /ps 1 ( value { dp /c 1 f93c2000 L 420/@$ptrsize } ) { .if( value != 0 ) {.printf "**** %p ****\n", ${value}; !pool ${value} 0x2} }

 

**** 21018000 ****

21018000: Unable to get contents of pool block

**** 73734450 ****

73734000: Unable to get contents of pool block

**** 00000002 ****

00000000: Unable to get contents of pool block

**** 00000006 ****

00000000: Unable to get contents of pool block

.

. (I truncated this output for the blog. There were many more entries)

.

**** f946c860 ****

**** f946c860 ****

*f946c800 size:  1e0 previous size:   40  (Allocated) *Irp

**** f946c860 ****

*f946c800 size:  1e0 previous size:   40  (Allocated) *Irp

**** f93c23ac ****

 

**** f93c207c ****

*f93c2000 size:  420 previous size:    0  (Allocated) *PDss

 

Bingo! There is a pointer to an IRP block in our PDss pool. Let’s take a look at f946c800. I’m dumping it’s pool.

 

        0: kd> !pool f946c800

 f946c000 size:   80 previous size:    0  (Allocated)  MmCa

 f946c080 size:   40 previous size:   80  (Free)       ....

 f946c0c0 size:   40 previous size:   40  (Allocated)  Ntfn

 f946c100 size:   a0 previous size:   40  (Allocated)  File (Protected)

 f946c1a0 size:   40 previous size:   a0  (Allocated)  Ntfr.

.

.

.

*f946c800 size:  1e0 previous size:   40  (Allocated) *Irp

 

Here you see the IRP tag in this raw memory output. The first two DWORDs contain the POOL_HEADER.

 

0: kd> dc f946c800

f946c800  0f018002 20707249 01d80006 00000000  ....Irp ........

f946c810  00000000 00000000 f946c818 f946c818  ..........F...F.

f946c820  c00000bb 00000000 0b0a0000 04000000  ................

f946c830  00000000 00000000 00000000 00000000  ................

f946c840  00000000 00000000 00000000 00000000  ................

f946c850  00000000 00000000 00000000 00000000  ................

f946c860  f93c2060 f93c2060 f946c9e0 00000000  ` <.` <...F.....

f946c870  00000000 00000000 00000000 00000000  ................

 

Since the pool header is 8 bytes long we need to add 8 to f946c800 to skip over the header and get to the actual IRP.

 

0: kd> ? f946c800 + 8

Evaluate expression: -112801784 = f946c808

 

0: kd> !irp f946c808

Irp is active with 10 stacks 11 is current (= 0xf946c9e0)

 No Mdl Thread 00000000:  Irp is completed. 

     cmd  flg cl Device   File     Completion-Context

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                    Args: 00000000 00000000 00000000 00000000

 [ 16, 2]   0 e1 f9533ba0 00000000 dd491028-f93c2354 Success Error Cancel pending

              \Driver\Diskperf    nt!PopCompleteSystemPowerIrp

                    Args: 00000000 00000000 00000006 00000005

 

The Major function is 16 (IRP_MJ_POWER) which synchs up with the power management activity we saw in the stack. You can find all the major function codes for IRPs in the standard header files such as WDM.H or NTDDK.H available with the WDK. There is an excellent chance this is the IRP Winlogon sent down to Plug and Play. Now that we have the IRP in question with the IO Stack locations we see that the pending status is set.

 

                     Args: 00000000 00000000 00000000 00000000

 [ 16, 2]   0 e1 f9533ba0 00000000 dd491028-f93c2354 Success Error Cancel pending

              \Driver\Diskperf    nt!PopCompleteSystemPowerIrp

                     Args: 00000000 00000000 00000006 00000005

 

 

It appears to be stuck in \Driver\Diskperf which is the device stack for the device associated with Diskperf.  At this point I determined it was most likely due to one of the devices in it’s devstack.

0: kd> !devstack f9533ba0

  !DevObj   !DrvObj            !DevExt   ObjectName

  f94d5660  \FileSystem\Foo      f94d5718 

  f95338a0  \Driver\VSP          f9533958 

> f9533ba0  \Driver\Diskperf     f9533c58 

  f9539150  \Driver\Ftdisk       f9539208  HarddiskVolume1

!DevNode f953aba8 :

  DeviceInst is "STORAGE\Volume\1&30a96598&0&Signature80Offset7E00Length2730C00"

  ServiceName is "DiskPerf"

 

Doing a bing search for the drivers in the list above (and shutdown), I found a file system filter driver known to cause this issue in the past. Updating to the latest version of the driver fixed the issue, and the machine would shutdown normally.

BTW - For more information on how Windows performs a shutdown check out this past article from Ntdebugging - http://blogs.msdn.com/ntdebugging/archive/2007/06/09/how-windows-shuts-down.aspx

 

Share this post :

 

Here's one from the Rube Goldberg debug collection!

The dripping sarcasm is because I'm about to show the reeeeally long way to figure out what's eating MmCm, skip down to the end if you are in a time crunch.

 

Otherwise, do resist the temptation to skip ahead as some of the techniques can be used in other debugs.

 

-Tate

 

 

What is this Pool Tag?

 

MmCm is a Non Paged Pool (NPP) tag allocated via a kernel mode call (a driver or the OS itself…applications cannot allocate this memory directly) to either MmAllocateContiguousMemory  or MmAllocateContiguousMemorySpecifyCache.

 

Great comments from msdn…

 

MmAllocateContiguousMemory can be called to allocate a contiguous block of physical memory for a long-term internal buffer, usually from the DriverEntry routine…

 

A device driver that must use contiguous memory should allocate only what it needs during driver initialization because physical memory is likely to become fragmented as the operating system runs.

 

Warning  If you use the MmAllocateContiguousMemory routine on computers with large amounts of memory, the operating system's performance might severely degrade when the system tries to create a contiguous chunk of memory. This degradation is reduced for Windows Server 2008. Memory that MmAllocateContiguousMemory allocates is uninitialized...

 

 

First, how much is normal?

 

As the comments indicate it is contingent mostly on the driver's needs, done at initialization time.  We have seen consumption in the single MB range up to tens of MB consumed depending on what drivers are loaded.

 

 

What drivers are the most likely consumers of this memory and why would the amount vary?

 

The most likely consumers on any machine would be the following types of drivers. Note that each will likely allocate a multiple of some amount per instance for each physical adapter exposed to Windows. (i.e. A single NIC driver used for 3 NICs in a system will likely allocate 3 times the MmCm vs. a single NIC.)

 

Storage Adapters for your SAN...SCSI Controller, Fibre Channel Controller, etc.

Video Adapters

Network Adapters

 

These drivers need to pre-allocate memory to service transfers and as mentioned need to do this right at boot time since memory can become fragmented later.  In other words, this allocation pattern attempts to "bake-in" a known set of scratch space if you will for the adapter to function properly given the features you wish to use.

 

 

So what's the problem?

 

Not everyone plays nice and the machine can run out of memory!  This is downright painful on x86 Exchange servers and their limited NPP ceiling.

 

Recall from an earlier post and your likely experience, that running out of such memory can cause hangs or general system instability as the machine runs out of NPP.  The scary part is that this usually happens unexpectedly under incrementally higher than normal user demand. i.e when you most need the server it fails.  Not coincidently most often we see older  x86 machines, especially /3GB configured Exchange servers, have this memory consumption problem. I often wonder just how many machines are out there just waiting for a liiiiitle more user load and are hovering at the brink of NPP exhaustion…

 

<Ok, queue the sad music for the heart breaking x86 Exchange server example>

 

This server has the standard 128MB maximum for NPP because of /3GB

 

0: kd> !vm

 

*** Virtual Memory Usage ***

Physical Memory:      851418 (   3405672 Kb)

Page File: \??\C:\pagefile.sys

  Current:   4193280 Kb  Free Space:   4038364 Kb

  Minimum:   4193280 Kb  Maximum:      4193280 Kb

Available Pages:      639032 (   2556128 Kb)

ResAvail Pages:       796753 (   3187012 Kb)

Locked IO Pages:         272 (      1088 Kb)

Free System PTEs:      27083 (    108332 Kb)

Free NP PTEs:           5778 (     23112 Kb)

Free Special NP:           0 (         0 Kb)

Modified Pages:         4138 (     16552 Kb)

Modified PF Pages:      4116 (     16464 Kb)

NonPagedPool Usage:    25151 (    100604 Kb)

NonPagedPool Max:      32026 (    128104 Kb)

PagedPool 0 Usage:     11497 (     45988 Kb)

PagedPool 1 Usage:      1645 (      6580 Kb)

PagedPool 2 Usage:      1667 (      6668 Kb)

PagedPool 3 Usage:      1662 (      6648 Kb)

PagedPool 4 Usage:      1679 (      6716 Kb)

PagedPool Usage:       18150 (     72600 Kb)

PagedPool Maximum:     63488 (    253952 Kb)

Session Commit:          586 (      2344 Kb)

Shared Commit:          4720 (     18880 Kb)

Special Pool:              0 (         0 Kb)

Shared Process:         7517 (     30068 Kb)

PagedPool Commit:      18214 (     72856 Kb)

Driver Commit:          8779 (     35116 Kb)

Committed pages:      237424 (    949696 Kb)

Commit limit:        1867524 (   7470096 Kb)

 

Given 100Megs used, 28MB of that is MmCm

 

0: kd> !poolused /t 10 2

   Sorting by  NonPaged Pool Consumed

 

  Pool Used:

            NonPaged            Paged

Tag    Allocs     Used    Allocs     Used

MmCm     3210 28779488         0        0        Calls made to MmAllocateContiguousMemory , Binary: nt!mm

NDpp     1013  4076960         0        0        packet pool , Binary: ndis.sys

MPIO   121985  3457752         0        0        UNKNOWN pooltag 'MPIO', please update pooltag.txt

elxs        6  3299344         0        0        UNKNOWN pooltag 'elxs', please update pooltag.txt

BCM0       24  3057232         0        0        UNKNOWN pooltag 'BCM0', please update pooltag.txt

LSwi        1  2654208         0        0        initial work context

 RaME        3  2572288         0        0        RiAllocateMiniportDeviceExtension , Binary: storport.sys

ElxA        5  2360208         3      160        UNKNOWN pooltag 'ElxA', please update pooltag.txt

Io        223  2319712       123     5136        general IO allocations , Binary: nt!io

TPLA      512  2097152         0        0        UNKNOWN pooltag 'TPLA', please update pooltag.txt

TCPt       40  1662448         0        0        TCP/IP network protocol , Binary: TCP

Mdl      7225  1435160         0        0        Io, Mdls

 Pool        3  1134592         0        0        Pool tables, etc.

 Devi      602  1124400         0        0        Device objects

 RcpI        1  1048576         0        0        Internal memory mgr initial heap block , Binary: sacdrv.sys

brcm       38  1032528         0        0        UNKNOWN pooltag 'brcm', please update pooltag.txt

TOTAL      219941 75555240     70157 56589336

 

That's not a lot right?  Well, it may be more than you like.  Depending on load cycles on this server it could be too much, you'd have to know a bit more history of the maximum amount of NPP demanded at peak usage, etc. to know how close to the edge this server may be.  Let's go with "what's using the ~28MB?" as the question to answer...

 

 

Who uses this memory?

 

Here's where it gets interesting.  A useful heuristic here is to group the sizes of the allocations in Excel.

You get this data from a !poolfind MmCm

 

0: kd> !poolfind MmCm

 

Scanning large pool allocation table for Tag: MmCm (f9a67000 : f9b67000)

 

*f7fc7000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

*f8298000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

*faeae000 :large page allocation, Tag  is MmCm, size  is 0x4000 bytes

*faeb2000 :large page allocation, Tag  is MmCm, size  is 0xd000 bytes

*fae84000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

*fae95000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

*fae23000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

*fae24000 :large page allocation, Tag  is MmCm, size  is 0x3000 bytes

*fae2d000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

.

.

.

fdf2e000 size:  f18 previous size:    0  (Allocated) MmCm

fdf2f000 size:  f18 previous size:    0  (Allocated) MmCm

fdf30000 size:  f18 previous size:    0  (Allocated) MmCm

fdf31000 size:  f18 previous size:    0  (Allocated) MmCm

fdf32000 size:  f18 previous size:    0  (Allocated) MmCm

fdf33000 size:  f18 previous size:    0  (Allocated) MmCm

fdf34000 size:  f18 previous size:    0  (Allocated) MmCm

fdf35000 size:  f18 previous size:    0  (Allocated) MmCm

fdf36000 size:  f18 previous size:    0  (Allocated) MmCm

fdf37000 size:  f18 previous size:    0  (Allocated) MmCm

fdf38000 size:  f18 previous size:    0  (Allocated) MmCm

fdf39000 size:  f18 previous size:    0  (Allocated) MmCm

fdf3a000 size:  f18 previous size:    0  (Allocated) MmCm

fdf3b000 size:  f18 previous size:    0  (Allocated) MmCm

fdf3c000 size:  f18 previous size:    0  (Allocated) MmCm

 

Searching NonPaged pool (fe000000 : ffb7e000) for Tag: MmCm

 

 

When I have to do this, usually some variation of following works.

 

.logopen

!poolfind

.logclose

 

Open the txt file up in notepad, clean it up a bit (remove the frees and extra lines)

Open the txt in Excel and use the old standby Text Import Wizard via file open

I usually also save the large pool allocs to one txt file and the regular to another and then cut paste to combine them in Excel.

 

There's probably an easier way to do this import, but this works well enough.

image

You can really go crazy getting the data all pretty but don't bother as even something as ugly as this is useful because a quick sort on Column B shows the allocation size pattern.  Note the Text Import Wizard allows you to skip columns here, so you can import only the address column and the size column.  i.e. faeae000’s column and the 0x4000 column in the above.

 

image

Via Excel I just generated another table on the fly to keep track of our totals here.  Note, I’m just counting the number of each size of alloc by visual inspection.  i.e. There are 9 allocations of 0x1000 size in my list.

 

Allocation Sizes in Bytes (h)  

Number of Allocations

Total MmCm in Bytes

0xf18 bytes 

2014

7,782,096

0x1000 bytes

9

36,864

0x2000 bytes

1017

8,331,264

0xf000 bytes

23

1,413,120

0x11000 bytes

92

6,406,144

0x156000 bytes

2

2,801,664

 

Ahhh…..In total we've got 26 of 28 Megs sampled here, well representing the MmCm usage minus some insignificant allocs and rounding, great.

 

(I love the new Win7 calc.exe)

clip_image003

So how does this help me again?  Here's the crucial step.  Look at samples of these allocations (which is why I kept the address of the alloc in the Excel import).  There are very often telltale tags or strings inside the alloc that give you a strong indication if not proof positive who allocated this memory.

 

Here's a common allocation example:

 

Dump several samples of the f18 sized allocations…

 

0: kd> dc fdf18000 fdf18000+f18

fdf18000  0be30000 6d436d4d fd1df008 00000000  ....MmCm........<--the pooltag, okay…knew that...

fdf18010  1f2e3d4c 00000000 00000000 00000000  L=..............<-------hum, this 1f2e3d4c seems to be in every one of these puppies...

fdf18020  00000000 00000000 00000000 00000000  ................

fdf18030  00000000 00000000 00000000 00000000  ................

fdf18040  00000000 00000000 00000000 00000000  ................

 

Remember we learned that these are predominantly allocated on boot?  Turns out that Storport actually allocates 1000 of these(per adapter) based on the extension size of the adapter and guess what it does, it puts a handy little tag in here defined as 1f2e3d4c.    Here's the proof.

 

storport!RaInitializeRaidResources+0x47:

f5d6aeb3 8bd7            mov     edx,edi

f5d6aeb5 8bce            mov     ecx,esi

f5d6aeb7 ff1510d1d6f5    call    dword ptr [storport!_imp_InterlockedPushEntrySList (f5d6d110)]

f5d6aebd ff4508          inc     dword ptr [ebp+8]

f5d6aec0 8b4508          mov     eax,dword ptr [ebp+8]

f5d6aec3 3b450c          cmp     eax,dword ptr [ebp+0Ch]

f5d6aec6 c747084c3d2e1f  mov     dword ptr [edi+8],1F2E3D4Ch<------------------ah ha!

f5d6aecd 72be            jb      storport!RaInitializeRaidResources+0x21 (f5d6ae8d)

 

 

Cool! So now I've accounted for 7.8 Megs of the 28, and I know at this point that since I've got ~2000 allocs I've got at least two storage adapters on this machine.

Actually, If you are suspicious about the roughly 1000 0x2000 sized allocations and you think it's another adapter with a larger extension size, you'd be right!

 

0: kd> dc fcc20000

fcc20000  fcc22000 00000000 1f2e3d4c 00000000  . ......L=......<-------again!

fcc20010  f7b7e000 00000000 00000000 00000000  ................

fcc20020  00000000 00000000 00000000 00000000  ................

fcc20030  00000000 00000000 00000000 00000000  ................

fcc20040  00000000 00000000 00000000 00000000  ................

fcc20050  00000000 00000000 00000000 00000000  ................

fcc20060  00000000 00000000 00000000 00000000  ................

fcc20070  00000000 00000000 00000000 00000000  ................

 

 

Now, I've accounted for likely all my storage adapters (7,782,096 + 8,331,264  =  16,113,360) which are between the three of them consuming 16 of 28 Megs!

(I say three because of the ~3000 allocations in total, with 1000 per adapter as stated.  So two adapters of size 0xf18 and one of size 0x2000)

Sure enough checking msinfo32 on this machine shows three adapters.

 

So what about the rest of the odd ball sized allocations and 12 Megs?  Especially the 92 0x11000 and the two whopping 0x156000 bytes ones?

 

First , the two whopper 0x156000 byte allocs.  Here's where you are limited usually only by your own creativity and patience...( this particular sample is courtesy of a fellow persistent debugger,  Pushkar)

 

//Dump the beginning of one…

 

0: kd> dc fdb24000

fdb24000  00036c40 fdb24080 fdc43100 08b24080  @l...@...1...@..

fdb24010  00000000 08c43100 00000000 00000001  .....1..........

fdb24020  00000000 00000000 00000000 00000000  ................

fdb24030  00000000 00000000 00000000 00000000  ................

fdb24040  00000000 00000000 00000000 00000000  ................

fdb24050  00000000 00000000 00000000 00000000  ................

fdb24060  00000000 00000000 00000000 00000000  ................

fdb24070  00000000 00000000 00000000 00000000  ................

 

On the line below with the highlight we are looking for interesting pointers and !pool'ing them to figure out who's associated with this alloc.

 

0: kd> dc

fdb24080  00000003 00000940 08b24100 00000000  ....@....A......

fdb24090  fdef0080 00000000 08b241d8 00000000  .........A......<-----This is the line

fdb240a0  00000080 08b25080 00000000 00000080  .....P..........

fdb240b0  08b35080 00000000 00000080 08b29080  .P..............

fdb240c0  00000000 00000080 08b37080 00000000  .........p......

fdb240d0  00000080 08b2d080 00000000 00000080  ................

fdb240e0  08b39080 00000000 00000080 08b31080  ................

fdb240f0  00000000 00000080 08b3b080 00000000  ................

 

//Oh, it happens to be a Device object...

 

0: kd> !pool fdef0080

Pool page fdef0080 region is Nonpaged pool

*fdec1000 : large page allocation, Tag is Dev., size is 0x33000 bytes

Owning component : Unknown (update pooltag.txt)

 

//Dump it out via devobj, adding the object offset and poolheader length…this one looks like it's going to belong to my Emulex adapters

 

0: kd> !devobj fdec1000+38

Device object (fdec1038) is for:

ElxPlus*** ERROR: Module load completed but symbols could not be loaded for elxplus.sys

\Driver\elxplus DriverObject fdef41c0

Current Irp 00000000 RefCount 1 Type 0000002a Flags 00000048

DevExt fdec10f0 DevObjExt fdef3ca0

ExtensionFlags (0000000000) 

AttachedTo (Lower) fdf68ba0 \Driver\PnpManager

Device queue is not busy.

 

On to the 0x11000 sized, first pass is to just dc out the memory like before and look for interesting tags or strings…At the end of Network card adapter driver allocations via their calls into the Ndis layer for instance, you can flag them by the END of the allocation may have an ND** string.  I don't see any of those here but it "looks like" the 0x11000 sized allocations have network related data and strings as a common factor throughout, so I'm guessing these are associated with one of the four network adapters on this machine.  Turns out if you sample more of these they may have pointers back to other interesting pool allocations that can flag the network miniport adapter, etc.

 

You may be thinking, hum…I get the storport one but these last two sound like mere correlation?  Absolutely.  However, given the lack of randomness sampling multiple allocations it's often strong enough to at least inform your hypothesis and test, quickly.

 

 

Yes, there is an easier way!

 

If this memory is allocated at driver initialization and most drivers initialize at system boot time, it stands to reason that most of this memory will be consumed on boot.  Cool!  Then it also stands to reason we can use our friend Poolmon.exe to quickly check how much is being used shortly after boot and do some quick testing by excluding some of the usual suspect adapters.  First, you could just trust me and check for ancillary storage, NIC, and enhanced video adapters and disable/remove these first as a low risk test.

 

In any case, the testing sequence is:

 

1.      Promptly record the amount of MmCm after a reboot via Poolmon.exe

(Sort by Non Paged Pool (toggle through by hitting 'P') and then descending by bytes (same, but 'B')

2.      Disable the ancillary test adapter(s)

3.      Reboot, check poolmon.exe again

4.      What's the decrease?

 

5.      Perhaps repeat testing with updated drivers or disabled features if you must have the additional hardware in place (hoping they use less MmCm).

 

I hope this post saved you time by being wary of configuring more adapters than absolutely necessary, especially /3GB x86 Exchange servers and if you have to reactively triage this a quick and dirty way to determine the primary consumers of this memory.

 

Enjoy!

 

-Tate

 

P.S. If you can't scale down the number or features of let's say one of the suspect adapters, do check with your driver vendor to see if there is a way to throttle back the consumption.  For instance, when we shipped the Scalable Networking Pack feature, we often saw higher consumption in the NIC drivers supporting this functionality, later driver releases from hardware vendors reportedly used much less.

 

 

 

 

Bonus:

 

Here's another handy debug trick to isolate who can call these two Memory manager functions and narrow down the driver population...

 

//Find the address of MmAllocateContiguousMemorySpecifyCache

0: kd> x nt!MmAllocateContiguousMemorySpecifyCache

e080efac nt!MmAllocateContiguousMemorySpecifyCache = <no type information>

 

//Use the handy !for_each_module to scan each module for the address (e080efac )

0: kd> !for_each_module ".echo ${@#ModuleName} ;s -d ${@#ModuleName} L?${@#Size} e080efac"

 

//Hits will look like this, with the matching address listed…

ATMFD

nt

e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

hal

e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

RDPWD

.

.

.

ati2mtag

f5575234  e080efac e083b69c e08454c9 e082f651  .........T..Q...

f5836084  e080efac e08329f9 e0839c13 e08e2d78  .....)......x-..

.

.

.

ql2300

f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

storport

f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

elxstor

f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

SCSIPORT

f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

 

 

//Example match in the import table for ati2mtag...

0: kd> dps f5575234

f5575234  e080efac nt!MmAllocateContiguousMemorySpecifyCache

f5575238  e083b69c nt!ZwQueryInformationProcess

f557523c  e08454c9 nt!PsGetCurrentThreadProcessId

 

 

//Repeat for the other call...

0: kd> x nt!MmAllocateContiguousMemory

e080be42 nt!MmAllocateContiguousMemory = <no type information>

 

 

 

 

Share this post :

 

 

 

 

It's Venkatesh with a WMI nugget. While troubleshooting or debugging WMI issues you may come across WMI queries wherein you don’t know which provider implemented the WMI class used in the query. You may want to know the binary and the product that implemented the provider so you can contact the vendor or upgrade the binary to see if it resolves the issue you are investigating.

To help make the investigation easier, Windows Vista and all post operating systems have the cool new 'Analytic and Debug log' for WMI tracing. To get similar tracing in Windows 2003 and earlier versions, WMI has a “Verbose Logging” option enabled through the WMI control (WmiMgmt.msc). Check out this blog for detailed steps to enable the WMI Analytic and Debug -

http://blogs.msdn.com/wmi/archive/2009/05/27/is-wmiprvse-a-real-villain.aspx

After enabling WMI logging you will see events like the ones pictured below. There may be a particular WMI query relevant to the issue you're troubleshooting requiring you to know the provider binary for the WMI class in the query. Apart from the class name (InstProvSamp) in the query you need the WMI namespace to which this class belongs to such as '\\.\root\default' as shown below.

image

Fig 1: Event ID 1 in the Windows Vista’s Event Viewer\Windows Logs\Applications and Services Logs\Microsoft\Windows\WMI-Activity\Trace log

Now we have the WMI Namespace and the class name so let’s find the provider for it.

1. Get the provider name from the Class Name and Namespace:

Every WMI class has a qualifier “provider” containing the name of the provider to which it belongs. Using wbemtest.exe connect to the namespace we mentioned above(\\.\root\default). Then get the class information for our class, InstProvSamp using the “Open Class…” button. After clicking "OK" the Object editor for InstrProvSamp displays the class information for InstProvSamp including the name of the provider it belongs to (CIM_STRING qualifier “provider” as shown below).

image

image

2. Get the class identifier (CLSID) of the provider from the provider name:

For every WMI provider WMI maintains an instance of system class __Win32Provider which contains information about the physical implementation of the provider like CLSID which contains the GUID that represents the class identifier (CLSID) of the provider COM object. Now query the repository to get the provider information for “InstProvSamp” provider as I did below.

image

image 

image

3. Get the provider binary from the provider’s CLSID:

During registration every COM object adds an entry under HKLM\SOFTWARE\Classes\CLSID. Using regedit.exe you can browse to provider’s COM registration information using the CLSID obtained from Step 2 as shown below.

image

This should help you connect the dots while debugging your WMI issue. Let me know if you have additional questions!

Share this post :

Introduction and Overview

In this article we will explore a practical use for ETW tracing, and discover what ETW (Event Tracing for Windows) tracing is available for a popular Windows user-mode component, Internet Explorer. In my previous article ETW Introduction and Overview, we covered what ETW tracing is and how it could be used.

The goal in this exercise is to learn about ETW tracing in general, how to self-discover what tracing is available in a component, and some ways you can leverage the tracing to self-troubleshoot issues. ETW logging is essentially allowing Microsoft code to speak for itself. It tells you what code ran, what that code did, and any errors produced. Also ETW logs can be used along with a more traditional toolset for troubleshooting that class of issue such as the SysInternals tools, Network Monitor, etc.

For this article, we are using the RC Build of Windows 7. The concepts and examples should be very similar and work fine for Windows Vista.

Getting Started “Exploring”

First we start off by opening Internet Explorer 8 whose home page is set to the NTDebugging Blog - http://blogs.msdn.com/ntdebugging

clip_image001

In order to find what Internet Explorer is logging with ETW, we list all of the registered ETW providers on a box, which includes all of the installed code on the machine (the component doesn’t need to be actively running).

We start off by running “logman query providers” and look through the list for relevant hits. However, the results below return more than 400+ hits on Vista, and 600+ on Win7. You will probably quickly see though that searching through this large list of providers might not always be best way to go about finding which providers Internet Explorer is logging with.

clip_image003

Using this list we can also filter or browse for our component. This may work just fine, however, one issue with this approach is that sometimes you may not know the particular naming convention for a process. Also, many software components use shared dlll’s to do a variety of the under the hood work. As such, you may not know what each of those pieces of software are, or how they are named.

Here we try running a built-in cmd line tool logman ‘query providers’ and searching for “Internet”, which didn’t turn up anything on this particular search. It looks like the naming convention for Internet Explorer might be a bit different than our first search.

clip_image005

While we could continue to browse the large list of providers, I have decided to use another useful method to find which providers IE uses, which is to filter all providers that a certain ProcessId uses. We start out by getting the ProcessId of Internet Explorer using two methods shown here, Task Manager and the command line tasklist filtering on iexplore.exe.

clip_image007

clip_image009

Now that we know Internet Explorer is running with process id 6200, we can do another query with logman to find out the ETW providers.

Internet Explorer ETW Providers

‘Logman query providers –pid 6200’ is used to list all of the user-mode ETW providers that Internet Explorer uses and the associated GUIDs (GUIDs are the Globally Unique ids that enable tracing for a component).

In looking at the screenshot below, notice that a wide variety of providers are shown, and at first glance, you may wonder how some relate to Internet Explorer. As you investigate further though, all these components enable certain features within Internet Explorer, but even with tracing enabled, may not log anything unless you specifically use the part of Internet Explorer which runs that code at run-time.

It is also important to note that Internet Explorer will also use a lot of kernel services to eventually do its work, such as NDIS. There is ETW tracing for these kernel components, but would not show up under a specific process as they are used by all processes at the kernel level. Here we see a specific component of interest that we want to follow – Microsoft-Windows-WinINet. A quick Bing search of WinINet turns up this on MSDN, which sounds relevant.

Extracted from MSDN: The Microsoft Windows Internet (WinINet) application programming interface (API) enables applications to access standard Internet protocols, such as FTP and HTTP. For ease of use, WinINet abstracts these protocols into a high-level interface.

So Microsoft-Windows-WinINet looks like a good component to enable tracing in.

clip_image011

From using SysInternals Process Explorer you can also confirm and that WININET.dll is loaded into the address space of iexplore.exe as shown below.

clip_image013

Now that we know the ETW Provider name and GUID, we will launch Computer Mgmt and turn on tracing for that component. Note that not every Provider can be decoded this way (due to architectural and security reasons), but many can, and at the very least, the log can be enabled and provided to Microsoft Support to fully decode.

Launch Computer Management by typing ‘compmgmt.msc’ into the start search box or right clicking Manage on Computer Management.

clip_image014

Enabling ETW Logging

Once in Computer Management, Navigate to and click on Event Viewer -> Applications and Services Logs. Once there, make sure View -> Show Analytic and Debug Logs is enabled as shown below, which will provide a much greater set of logs to look at. Most logs are under Event Viewer -> Applications and Services Logs -> Microsoft -> Windows, where we will spend the majority of our time. There are generally four channels that can show up under each provider. Channels are targeted at different roles. Admin and Operational channels target IT professionals and Administrators and are enabled by default, while Analytic and Debug Channels are more in depth, and not usually enabled by default.

The logs that show up here are XML Manifest based tracing called Windows Events. ETW tracing splits up the collection and decoding of traces into two separate steps for performance and security reasons. In the manifest based tracing first available with Vista, most components and events are defined in a XML manifest compiled along with the binary, and which are defined in a resource file language dll. Most user-mode resource dlls show up under C:\Windows\System32\en-US (for English US). The good part about manifest based tracing is that many logs are self-discoverable and customer decodable!

clip_image016

Next we will browse to: Event Viewer -> Applications and Services Logs -> Microsoft -> Windows -> WinINet, right click on the Analytic channel, and choose ‘Enable Log’. Remember that we had to enable this log because Analytic and Debug logs are not enabled by default. Enabling the channel will automatically enable the ETW tracing for that component and logs will start being decoded similar to the Event Log.

clip_image017

Capture and view WinINet logging for our scenario

After enabling the log, I am going to close Internet Explorer which is running code that is now logging using ETW and navigate to the website http://blogs.msdn.com/ntdebugging.

clip_image001[1]

Now we can look at the result of our work and tracing, which decodes the tracing for us. Below, we are on the WININET_REQUEST_HEADER category, showing the GET request. Note that the tracing here is showing us proxy interaction, DNS requests and responses, cookies, TCP requests/responses, as well as HTTP requests and responses.

Now, one might be able to get the same information using a network capturing program such as NetMon or WireShark, but the nice thing about the ETW tracing here is that is

a) In box without having to install additional tools

b) The aggregate view of what the WinINet component is seen from its point of view (composed of DNS, TCP, Proxy, Cookies, etc).

This is a good example of using this tracing in combination with other troubleshooting tools (if required), which may or may not be useful depending on the scenario.

clip_image019

Dumping the trace log out using other methods:

You can use the Event Viewer GUI, or if you prefer to view the log in a text editor (or spreadsheet). You can dump the log to text or CSV format using a couple of different methods.

The first method is using the Windows Event Log GUI to export or save the event log

clip_image020

Or you can output the entire log to screen in human readable text format:

clip_image022

We can also output to the screen filtering for any messages with error level using an XPath Query. We can start right away if you know the XPath Syntax for your query, or you can use the GUI to help build the query for you. Once you choose ‘Filter Current Log’ you can filter by Event Level, or a variety of other criteria.

clip_image024

The XML view will show you the raw XPATH syntax needed to use wevtutil to query for only Error events.

clip_image025

Now we actually run the XPath query using ‘wevtutil query-events Microsoft-Windows-WinINet/Analytic /f:Text "/q:*[System[(Level=2)]]"’

clip_image027

If you prefer a more complex format containing processors, threads, etc, then you can output to a csv log file. This is useful for sorting, filtering, comparing, and doing more complex analysis on the log file entries.

clip_image029

Here we use ‘tracerpt c:\windows\system32\winevet\logs\Microsoft-Windows-WinINet%4Analytic.etl –of CSV –o c:\temp\Microsoft-Windows-WinINet%4Analytic.csv’ to output to a CSV.

clip_image031

Finally, you can use PowerShell v2 to dump out the log. The PowerShell scenario is interesting, because you can write powerful scripts around the event logs.

clip_image033

Look at ‘get-help get-winevent –examples” for many more really cool examples about how to list and filter these logs:

clip_image035

Behind the Scenes

We also need to explore behind the scenes of what is happening when you enable the analytic channel of the WinInet provider. Part of the definition of that channel is to enable an ETW Trace Session with certain pre-defined Keywords (Flags) and Levels. Keywords usually specify functional sub-components (HTTP, COOKIES, CONNECTION, etc) while Levels control the level of detail (Error, Informational, Verbose) events are logged. This is useful to fine-tune logging, but the channels make it easy not to have to mess with these unless required.

Under Performance -> Data Collector Sets -> Event Trace Sessions we see a new session created for us and auto-named “EventLog-Microsoft-Windows-WinINet-Analytic”. This ETW logger was automatically created for us when we enabled the WinInet Analytic channel. You can also manually create the logger, which we may cover in another blog post. If you open up the properties for this session, you'll notice our provider listed as well as all the Keywords (Any) and Level specified. The screenshot also show which Keywords/Flags or categories a provider supports.

clip_image037

Once you know a provider name, you can also query for it on the command line, which is useful to get all the Keywords(Flags) and Levels a provider supports, along with the processes that are using that provider.

clip_image039

For the curious, we wanted to finish this post and show you how to find what tracing is available, and its format. Here we again use the wevtutil to get full publishing event and message information.

‘wevtutil get-publisher Microsoft-Windows-WinInet /getevents /getmessage’

clip_image041

This can be combined with another version of the same utility searching for a certain string or log – e.g. “wevtutil gp Microsoft-Windows-WinINet /ge /gm | findstr /i httpopen”

clip_image043

To wrap-up, here is the overall architecture of ETW covered in a previous blog post, along with the specifics of the scenario we covered. In our scenario the MMC Event Viewer console has two roles – enabling the tracing and decoding the real-time delivery of events.

clip_image045

Where to go from here?

This was meant as a practical introduction to using and self-discovering how to use the ETW tracing that is available right now in newer Windows releases (Vista/2008 and above). There are many, many more components that use tracing – 400+ in Vista and 600+ in Win7, which you can start exploring.

For these specific examples, it is worth noting that as far as the networking aspect is concerned, similar information might be obtained from using a packet capture utility such as NetMon. What is interesting about NetMon in Win7 is that it actually uses ETW under the hood to do most of its tracing. Here is an interesting post about Network Tracing in Windows 7 from the Netmon team.

The use of ETW under the hood of many tools is a general trend in Windows and Microsoft software. You might notice different diagnostics and tracing tools in Windows using ETW under the hood. These tools can add value to the raw ETW tracing such as further parsing, filtering, and rich views into the data. You can mix and match these tools along with the tracing for powerful views into your Windows box or server.

It’s also important to note that ETW tracing isn’t just for Microsoft software. Since any software running on Windows uses many services provided by the OS, there is a variety of ETW logging that is available for all software. For example, try running ‘logman query providers –pid PROCESSID_OF_FIREFOX’ on FireFox! You should find a great deal of logging similar to Internet Explorer.

In future posts, we hope to explore other ways of using ETW tracing, and touch on some of the kernel level tracing available.

 

 

Share this post :

Pushlocks were a new locking primitive first introduced in Windows Server 2003 and are primarily used in place of spinlocks to protect key kernel data structures. Unfortunately, Pushlocks are not documented in the WDK, and are not available for public use; however, a few internal drivers do use them, so you might see them while debugging a machine. Also, I was digging around on MSDN for Pushlocks, and I found that the Filter Manager does expose certain APIs to use Pushlocks, so you are in luck if you are developing a filter driver!

 

Gate objects

 

Pushlocks are built on primitive gate objects, which are defined by KGATE structures. The gate object is a highly optimized version of the basic event object. By not having both the notification and synchronization versions of the basic event object, and by being the exclusive object that a thread can wait on, the code for acquiring and releasing a gate is heavily optimized. Gates even have their own dispatcher lock instead of acquiring the entire dispatcher database.

 

Unlike spinlocks, which must be acquired exclusively for all operations on a data structure, pushlocks can be shared by multiple “readers” and need only be acquired exclusively when a thread needs to modify the data structure.

 

Operation

 

When a thread acquires a normal pushlock, the pushlock code marks the pushlock as owned, if it is not owned already. If someone owns the pushlock exclusively, or the thread wants to own the pushlock exclusively while someone else has it in shared mode, the thread allocates a wait block on its stack, initializes a gate object in the wait block, and then add the wait block to the wait list associated with the pushlock. When the thread holding the pushlock finally releases it, it wakes the next waiter by signaling the event in the waiters wait block.

 

When debugging a machine, there is no easy way to figure out the current owner of the pushlock, apart from doing code review. By looking at the waitlist, you can always figure out the threads trying to get access to it, but since the gate does not keep track of the owner like a regular mutex, it is much harder to find the current owner.

For more details on the operation and structure of a pushlock, please review the Pushlocks section in Windows Internals book, under the System Mechanisms Chapter.

 

Advantages of using a PushLock

 

If a pushlock is held by one or more readers, threads that want to modify the data structure are queued for exclusive access. This queuing mechanism provides some of the same benefits of queued spinlocks—for example, FIFO ordering, elimination of race conditions, and reduced cache thrashing when more than one thread is waiting for the pushlock.

 

Another advantage of using a pushlock is the size. A regular resource object is 56 bytes, however a pushlock is the size of a pointer. Apart from a small memory footprint, this helps especially in the non-contended case, since pushlocks do not require lengthy operations to perform acquisition or release. Because the pushlock fits in one “machine word”, the CPU can use atomic operations to compare and exchange the old lock with the new one.

 

Pushlocks are also self-optimizing in the sense that the list of threads waiting on a pushlock will be periodically rearranged to provide fairer behavior when the pushlock is released.

 

Cache Aware Pushlocks

 

A cache-aware pushlock adds to the basic pushlock by allocating a normal pushlock for each processor in the system and associating it with the cache-aware pushlock. When a thread wants to acquire a cache-aware pushlock for shared access, it simply acquires the pushlock on that processor; however if it needs to acquire the lock for exclusive access, it has to acquire the pushlocks for each processor in exclusive mode.

 

What does a Pushlock look like?

 

3: kd> !thread 8c9764c0

THREAD 8c9764c0  Cid 2410.1be4  Teb: 7ff9f000 Win32Thread: e5c6f298 GATEWAIT

Stack Init b386b000 Current b386a978 Base b386b000 Limit b3867000 Call 0

ChildEBP RetAddr  Args to Child             

b386a990 80833485 8c9764c0 8c9764e4 00000003 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])

b386a9bc 8082ffe0 b06a6a03 e11e0b18 b386aa54 nt!KiSwapThread+0x2e5 (FPO: [Non-Fpo]) (CONV: fastcall)

b386a9e4 8087d722 00000000 e11e0b08 e11e0b18 nt!KeWaitForGate+0x152 (FPO: [Non-Fpo]) (CONV: fastcall)

e11e0b18 00000000 0c050204 7346744e e37b2808 nt!ExfAcquirePushLockExclusive+0x112 (FPO: [Non-Fpo]) (CONV: fastcall)

 

Above is a snipped output from a dump that I was recently looking at. From the stack, you can see the ExfAcquirePushLockExclusive call trying to acquire the pushlock, which then calls KEWaitForGate. In this case, the lock was already acquired, so this thread allocated a wait block on its stack, and then added itself to the waitlist.

Also, the stack is broken due to the fastcall, therefore the debugger cannot display it entirely. So we can manually reconstruct the stack by passing parameters to the kb command.

k[b|p|P|v] = BasePtr StackPtr InstructionPtr

 

To get the arguments, we first dump the stack manually using the dps command with the current esp.

3: kd> dps b386a978 l50

b386a978  b386ad40

b386a97c  00000000

b386a980  8088dafe nt!KiSwapContext+0x26

b386a984  b386a9bc

b386a988  b386aa00

b386a98c  f773f120

b386a990  8c9764c0

b386a994  80833485 nt!KiSwapThread+0x2e5

b386a998  8c9764c0

b386a99c  8c9764e4

b386a9a0  00000003

b386a9a4  8c9764c0

b386a9a8  00000003

b386a9ac  00000002

b386a9b0  00000002

b386a9b4  f773fa7c

b386a9b8  008c0030

b386a9bc  b386a9e4

b386a9c0  8082ffe0 nt!KeWaitForGate+0x152

b386a9c4  b06a6a03

b386a9c8  e11e0b18

b386a9cc  b386aa54

b386a9d0  00000000

b386a9d4  8c976504

b386a9d8  00000000

b386a9dc  0000001c

b386a9e0  00000000

b386a9e4  b386aa40

b386a9e8  8087d722 nt!ExfAcquirePushLockExclusive+0x112

b386a9ec  00000000

b386a9f0  e11e0b08

b386a9f4  e11e0b18

b386a9f8  b386aa40

b386a9fc  8096e9a9 nt!SeOpenObjectAuditAlarm+0x1cf

b386aa00  00040007

b386aa04  00000000

b386aa08  8c976568

b386aa0c  8c976568

b386aa10  b06a6a00

b386aa14  b4ee0a00

b386aa18  b127cc10

b386aa1c  00000000

b386aa20  00000001

b386aa24  80a60456 hal!KfLowerIrql+0x62

b386aa28  b386ac04

b386aa2c  8d117800

b386aa30  00000000

b386aa34  00000000

b386aa38  b386aa20

b386aa3c  01943080

b386aa40  b386aa64

b386aa44  808b7a14 nt!CmpCheckRecursionAndRecordThreadInfo+0x2a

 

From the output above, we can see the stack. To reconstruct the stack, we can get the ebp, esp, and eip from the stack for the ExfAcquirePushLockExclusive frame, and pass it to the kb command. Voila!

 

3: kd> kb = b386aa40 b386a9e4 8087d722

ChildEBP RetAddr  Args to Child             

b386aa40 808b7a14 b386ac04 e11e0b18 e11e0b18 nt!ExfAcquirePushLockExclusive+0x112

b386aa64 808b7b09 e11e0b18 b386aa80 e101bf40 nt!CmpCheckRecursionAndRecordThreadInfo+0x2a

b386aaa4 808da118 0000001c b386ab58 00000001 nt!CmpCallCallBacks+0x6b

b386ab90 80937942 e101bf40 00000000 89f13648 nt!CmpParseKey+0xd4

b386ac10 80933a76 00000000 b386ac50 00000040 nt!ObpLookupObjectName+0x5b0

b386ac64 808bb471 00000000 8e930480 00000d01 nt!ObOpenObjectByName+0xea

b386ad50 808897bc 0243eba0 00020019 0243eb68 nt!NtOpenKey+0x1ad

b386ad50 7c8285ec 0243eba0 00020019 0243eb68 nt!KiFastCallEntry+0xfc

WARNING: Frame IP not in any known module. Following frames may be wrong.

0243eba4 00000000 00000000 00000000 00000000 0x7c8285ec

 

 

Share this post :

This isn’t our typical debugging type article however I found the information very useful for the developer community.

Windows 7 is slated to launch in the next few weeks. If your applications aren’t quite ready for Windows 7 and having you’re having issues we may be able to help you out!

Last Monday we launched a new pilot program in our Advisory Services space.  Advisory Services is a consultative support option that provides support beyond standard break-fix issues.  The new program involves remote, phone-based support for issues such as product migration, code review or new program development.  The service is intended for Developers and IT Professionals for shorter engagements that don’t require traditional onsite consulting or sustained account management services available via other Microsoft support options.

For the Application Compatibility engagements, we’ll start off with some basic scoping questions such as whether the application is 16-, 32-, or 64-bit.  Is it a client-server application?  What compatibility issues are you experiencing?  Slow Performance?  Hang or Crash?  Installation problems?  The support engineers will be using tools such as the Application Compatibility Toolkit, the the Standard User Analyzer Wizard, and the Setup Analysis Tool.

The KB Article referenced below has more details about the program and how to engage Microsoft.  So, if you’re working on a pesky Windows Vista or Windows 7 Application Compatibility issue, give us a call – we can help!

 

Additional Resources:

Cheers,

Ronsto

 

Share this post :

Introduction:

 

Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing; making it much more practical and useful.

ETW is useful from a variety of scenarios, including:

 -User & Admin: Being in control of your system and knowing what is going on.
 -User & Admin: Troubleshooting performance, hardware and OS issues.
 -Enthusiast: Learning further about the OS and the low level guts of the OS.
 -Software Developer/ISV/OEM: Investigating issues with your software's interaction with Microsoft OS & technologies
 -Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including kernel, driver subsystems, up to the user stack.

ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.

A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.

Here is graphical overview of the ETW infrastructure that covers how provider(s) log to high performance in memory buffers, which can be kept in memory in a circular buffer, or written to disk in a sequential or circular fashion.

 

image

Some uses of ETW:

 

Today on Windows Vista, Windows 7, and sparingly on earlier OSes; ETW is used by Microsoft Development and Microsoft Support, as well as others; to help troubleshoot issues, find root-cause of bugs, analyze performance; and a large variety of other tasks.

As it stands today, ETW has two major thrusts:

1. Exposing time-based event data similar in practicality to plain old text based logged (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.

First off, shown below is Windows Event Viewer enabling viewing of analytic and debug logs (which show you many more logs than the default view):

image

A variety of logs can be viewed graphically using the Event Viewer or via the built-in command line OS utility – wevtutil. There are a quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.

Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.

ETW is used to monitor for network events from the kernel which are then added to Process Monitor, a utility from Microsoft Sysinternals. You should also see tools such netsh and NetMon being able to capture and decode ETW traces.

 

image

There are many other methods and utilities to view the ETW tracing providers available, which may be found in the OS, via the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools, such as Office Communication System Logging.

The second major thrust of ETW is exposing performance data.

2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).

Performance data can be used in a hybrid fashion both for troubleshooting and performance reasons. Take for example a built in OS tool that uses ETW tracing to provide rich data - Resource Monitor.

Resource Monitor is available from the Windows Task Manager - Performance Tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself; and thus providing a practical exposure of ETW technology for Administrators to troubleshoot their systems. This goes beyond the data that Task Manager provides, which often tends to be based on performance counters which you cannot drill down into further.

image

image

Resource Monitor is useful from both a performance standpoint, as well as troubleshooting. For example, you can view top cpu users, top disk users (why your disk is thrashing), or search for a locked file handle that is open.

For a deeper drill-down into performance data XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).

Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.

image

There have been several posts on the NTDebugging Blog regarding XPerf, other blog posts, as well as the Windows Performance Analysis Dev Center.

Where to go from here:

 

In Windows 7 and Windows Server 2008 R2, ETW tracing has even been further improved with many more providers including 600+ inbox providers’ registered on the box. This number increases when add-on features or products are installed (such as Hyper-V).

The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.

 

 

Share this post :

Our Global Escalation Services team at Microsoft hosted the first annual Global Engineering Conference on February 25th & 26th in 2009. This year’s theme was “Building the Engineering Community” with a focus on increased knowledge sharing, collaboration, and best practices throughout the engineering community. Several vendors attended the conference including our good friends at Citrix. In this video Citrix Escalation engineers Nicholas Vasile, Dmitry Vostokov, and Kapil Ramlal share tools they created to take advantage of the ETW tracing infrastructure in Windows, debugging scripts, and best practices for engaging issues that require multi-vendor support.

Video Link 

Video Table of Contents-

Who is Citrix?

Pattern-Drive Memory Dump Analysis (Presenter: Dmitry Vostokov)

Citrix & ETW: A case study with CDFControl (Presenter: Kapil Ramlal)

Working with Citrix Technical Support (Presenter:Nicholas Vasile)

Citrix’s Public Symbol Server

 

Citrix Engineers:

image

image

image


Feel free to send feedback!

-Ronsto

My name is Nischay Anikar from the Escalation Engineer team in Global Escalation Services. In today’s post I’ll present a weird problem I worked through with a client. When we started to work on the problem, we found the following:

  • Ping to the box worked.
  • Keyboard was responding.
  • Shares on the system were accessible remotely.
  • Could not Remote Desktop into the box.
  • Existing sessions were responsive to some extent – no new processes were getting created.
  • No new processes/application could be launched. Sometimes application would get launched but after waiting for a long time.
  • Remote computer management would work, but not all snap-ins would work. (Event logs showed up, but disk management did not respond).

At this time, perfmon was collected and nothing in it indicated any kind of resource contention. This was certainly not the case of any process/thread pegging the CPU. The above observations told us this was not a hard hang, as the system was responsive at DPC level. Rather system was even responding to certain extent at passive level – remember, shares were accessible. SMB requests are processed basically by the worker threads created by SRV.SYS.

Remote management snap-ins and remote registry responsiveness showed that RPC was working fine. However some snap-ins like disk management were behaving inconsistently.

This is the stage at which we had the Kernel dump of the system and we started our normal analysis trying to find the root of the problem.

From the dump there were no blocked threads on locks (ERESOURCE, !locks), no memory pressure (perfmon confirmed it too - !vm 1), no CPUs stuck in spinlocks, no DPCs pending(!dpcs), no ready threads pending to execute(!ready), and no alarming LPC wait chain among threads leading to system hang. These are some of the common causes that could lead to system hang. None of these were seen in the dump.

Then I said, enough of running behind the debugger commands to look for known problems, they didn’t yield me anything useful up front. When the dump was given we were told that they attempted to launch notepad from explorer (Start->Run->Notepad) which never launched. If we start chasing from this point we are likely going to hit the root of the problem or at least get some leads. With this in mind, when we dumped out the explorer threads we saw one of the threads that was indeed trying to launch notepad (there was one more in the same state but trying to launch some other application).

THREAD 892ef4a0  Cid 0d2c.0ea8  Teb: 7ffd8000 Win32Thread: e108e6c0 WAIT: (Unknown) KernelMode Non-Alertable

    f573bc2c  NotificationEvent

    892ef518  NotificationTimer

Not impersonating

DeviceMap                 e12bf190

Owning Process            892027f0       Image:         explorer.exe

Wait Start TickCount      40848          Ticks: 8313 (0:00:02:09.890)

Context Switch Count      284                 LargeStack

UserTime                  00:00:00.000

KernelTime                00:00:00.078

Win32 Start Address ntdll!RtlpWorkerThread (0x7c839f2b)

Start Address kernel32!BaseThreadStartThunk (0x77e617ec)

Stack Init f573c000 Current f573bb8c Base f573c000 Limit f5736000 Call 0

Priority 14 BasePriority 8 PriorityDecrement 0

ChildEBP RetAddr  Args to Child

f573bba4 8082ffd7 892ef4a0 892ef548 00000100 nt!KiSwapContext+0x25 (FPO: [Uses EBP] [0,0,4])

f573bbbc 808287d4 895c6548 80a560c6 00000000 nt!KiSwapThread+0x83 (FPO: [Non-Fpo])

f573bc00 80810135 f573bc2c 00000000 00000000 nt!KeWaitForSingleObject+0x2e0 (FPO: [Non-Fpo])

f573bc48 80842608 005c6548 00000000 00000000 nt!CcWaitForUninitializeCacheMap+0xa5 (FPO: [Non-Fpo])

f573bcd0 8091f8e7 f573bd20 000f001f 00000000 nt!MmCreateSection+0x1fc (FPO: [Non-Fpo])

f573bd40 80883938 0190d51c 000f001f 00000000 nt!NtCreateSection+0x12f (FPO: [Non-Fpo])

f573bd40 7c82860c 0190d51c 000f001f 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f573bd64)

0190d174 7c826ed9 77e6cc9a 0190d51c 000f001f ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0190d178 77e6cc9a 0190d51c 000f001f 00000000 ntdll!NtCreateSection+0xc (FPO: [7,0,0])

0190d994 77e424b0 00000000 001394f4 0013725c kernel32!CreateProcessInternalW+0x99c (FPO: [Non-Fpo])

0190d9cc 7c916750 001394f4 0013725c 00000000 kernel32!CreateProcessW+0x2c (FPO: [Non-Fpo])

0190e450 7c916b45 00030064 00000000 00139904 SHELL32!_SHCreateProcess+0x387 (FPO: [Non-Fpo])

0190e4a4 7c91617b 00136008 0190e4c4 7c915a76 SHELL32!CShellExecute::_DoExecCommand+0xb4 (FPO: [Non-Fpo])

0190e4b0 7c915a76 00000001 00000009 00136008 SHELL32!CShellExecute::_TryInvokeApplication+0x49 (FPO: [Non-Fpo])

0190e4c4 7c91599f 00000000 00000009 0190e500 SHELL32!CShellExecute::ExecuteNormal+0xb1 (FPO: [Non-Fpo])

0190e4d8 7c915933 0190e500 00000000 00000009 SHELL32!ShellExecuteNormal+0x30 (FPO: [Non-Fpo])

0190e4f4 7c9a3416 0190e500 0000003c 04000b00 SHELL32!ShellExecuteExW+0x8d (FPO: [Non-Fpo])

0190e954 7c9e3f92 00030064 0190e988 0190f828 SHELL32!ShellExecCmdLine+0x143 (FPO: [Non-Fpo])

0190ee20 7c9e4517 0190eea8 7c9e43f6 0190ee5c SHELL32!CRunDlg::OKPushed+0x179 (FPO: [Non-Fpo])

0190ee30 7739b6e3 000400b4 00000111 00000001 SHELL32!RunDlgProc+0x121 (FPO: [Non-Fpo])

0190ee5c 77395f82 7c9e43f6 000400b4 00000111 USER32!InternalCallWinProc+0x28

0190eed8 77395e22 00095fb4 7c9e43f6 000400b4 USER32!UserCallDlgProcCheckWow+0x147 (FPO: [Non-Fpo])

0190ef20 77395ffa 00000000 00000111 00000001 USER32!DefDlgProcWorker+0xa8 (FPO: [Non-Fpo])

0190ef3c 7739b6e3 000400b4 00000111 00000001 USER32!DefDlgProcW+0x22 (FPO: [Non-Fpo])

0190ef68 7739b874 77395fd8 000400b4 00000111 USER32!InternalCallWinProc+0x28

0190efe0 7739bfce 00095fb4 77395fd8 000400b4 USER32!UserCallWinProcCheckWow+0x151 (FPO: [Non-Fpo])

0190f010 7739bf74 77395fd8 000400b4 00000111 USER32!CallWindowProcAorW+0x98 (FPO: [Non-Fpo])

0190f030 77431848 77395fd8 000400b4 00000111 USER32!CallWindowProcW+0x1b (FPO: [Non-Fpo])

0190f04c 77431b9b 000400b4 00000111 00000001 comctl32!CallOriginalWndProc+0x1a (FPO: [Non-Fpo])

0190f0a8 77431d5d 001060a8 000400b4 00000111 comctl32!CallNextSubclassProc+0x3c (FPO: [Non-Fpo])

0190f0cc 75ed2f80 000400b4 00000111 00000001 comctl32!DefSubclassProc+0x46 (FPO: [Non-Fpo])

0190f0f0 77431b9b 000400b4 00000111 00000001 BROWSEUI!CAutoComplete::s_ParentWndProc+0xec (FPO: [Non-Fpo])

0190f14c 77431dc0 001060a8 000400b4 00000111 comctl32!CallNextSubclassProc+0x3c (FPO: [Non-Fpo])

0190f1a0 7739b6e3 000400b4 00000111 00000001 comctl32!MasterSubclassProc+0x54 (FPO: [Non-Fpo])

0190f1cc 7739b874 77431d6c 000400b4 00000111 USER32!InternalCallWinProc+0x28

0190f244 7739c2d3 00095fb4 77431d6c 000400b4 USER32!UserCallWinProcCheckWow+0x151 (FPO: [Non-Fpo])

0190f280 7739c337 004f8a78 004f5df0 00000001 USER32!SendMessageWorker+0x4bd (FPO: [Non-Fpo])

0190f2a0 77386cea 000400b4 00000111 00000001 USER32!SendMessageW+0x7f (FPO: [Non-Fpo])

0190f2d0 77396199 000400b4 00503dc8 00030064 USER32!IsDialogMessageW+0x41c (FPO: [Non-Fpo])

0190f30c 7738965e 000400b4 00030064 00000001 USER32!DialogBox2+0x144 (FPO: [Non-Fpo])

 

This is the first parameter to CreateProcess, and this thread coincided with the notepad launch from explorer.

 

kd> du 001394f4

001394f4  "C:\WINDOWS\system32\notepad.exe"

 

This thread has been waiting longer than 2 minutes. Looking at what this thread is doing, we see that it’s waiting for a Cache Map to be uninitialized (tear down of the existing references on this cache map) as part of creating the Image section during process creation.

Examining the state of threads in the whole box, you see there are a few more threads in different processes that are waiting on the CreateProcess while creating an Image section and waiting to un-initialize the cache map.

kd> !thread 891910a8

THREAD 891910a8  Cid 0180.0184  Teb: 7ffdf000 Win32Thread: e1442bb8 WAIT: (Unknown) KernelMode Non-Alertable

    f6d44c2c  NotificationEvent

    89191120  NotificationTimer

IRP List:

    894f0298: (0006,0094) Flags: 00000800  Mdl: 00000000

Impersonation token:  e105d028 (Level Impersonation)

DeviceMap                 e12bf190

Owning Process            89138708       Image:         winlogon.exe

Wait Start TickCount      48380          Ticks: 781 (0:00:00:12.203)

Context Switch Count      1617                 LargeStack

UserTime                  00:00:00.156

KernelTime                00:00:00.468

Start Address winlogon!__report_gsfailure (0x0103e1b0)

Stack Init f6d45000 Current f6d44b8c Base f6d45000 Limit f6d40000 Call 0

Priority 15 BasePriority 15 PriorityDecrement 0

ChildEBP RetAddr  Args to Child

f6d44ba4 8082ffd7 891910a8 89191150 00000100 nt!KiSwapContext+0x25 (FPO: [Uses EBP] [0,0,4])

f6d44bbc 808287d4 896b20e8 80a560c6 00000000 nt!KiSwapThread+0x83 (FPO: [Non-Fpo])

f6d44c00 80810135 f6d44c2c 00000000 00000000 nt!KeWaitForSingleObject+0x2e0 (FPO: [Non-Fpo])

f6d44c48 80842608 006b20e8 00000000 00000000 nt!CcWaitForUninitializeCacheMap+0xa5 (FPO: [Non-Fpo])

f6d44cd0 8091f8e7 f6d44d20 000f001f 00000000 nt!MmCreateSection+0x1fc (FPO: [Non-Fpo])

f6d44d40 80883938 0006eedc 000f001f 00000000 nt!NtCreateSection+0x12f (FPO: [Non-Fpo])

f6d44d40 7c82860c 0006eedc 000f001f 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f6d44d64)

0006eb34 7c826ed9 77e6cc9a 0006eedc 000f001f ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0006eb38 77e6cc9a 0006eedc 000f001f 00000000 ntdll!NtCreateSection+0xc (FPO: [7,0,0])

0006f354 7d1ec670 00000818 00000000 0006fadc kernel32!CreateProcessInternalW+0x99c (FPO: [Non-Fpo])

0006f3a0 75842db7 00000818 00000000 0006fadc ADVAPI32!CreateProcessAsUserW+0x108 (FPO: [Non-Fpo])

0006f424 75842f3a 0008c260 0006f8d4 0008c208 MSGINA!ExecApplication+0x8e (FPO: [Non-Fpo])

0006f884 0103be76 0008c208 0006f8d4 00710000 MSGINA!WlxStartApplication+0xbb (FPO: [Non-Fpo])

0006f8a8 01036d59 0007a868 0006f8d4 00008001 winlogon!StartApplication+0x40 (FPO: [Non-Fpo])

0006faf8 01036fa4 0007a868 00000001 0007a868 winlogon!HandleLoggedOn+0x203 (FPO: [Non-Fpo])

0006fb14 0103b24d 00050020 00000659 00000001 winlogon!LoggedonDlgProc+0x8b (FPO: [Non-Fpo])

0006fb38 7739b6e3 00050020 00000659 00000001 winlogon!RootDlgProc+0x6e (FPO: [Non-Fpo])

0006fb64 77395f82 0103b1df 00050020 00000659 USER32!InternalCallWinProc+0x28

0006fbe0 77395e22 0008fca4 0103b1df 00050020 USER32!UserCallDlgProcCheckWow+0x147 (FPO: [Non-Fpo])

0006fc28 77395ffa 00000000 00000659 00000001 USER32!DefDlgProcWorker+0xa8 (FPO: [Non-Fpo])

0006fc44 7739b6e3 00050020 00000659 00000001 USER32!DefDlgProcW+0x22 (FPO: [Non-Fpo])

0006fc70 7739b874 77395fd8 00050020 00000659 USER32!InternalCallWinProc+0x28

0006fce8 7739ba92 0008fca4 77395fd8 00050020 USER32!UserCallWinProcCheckWow+0x151 (FPO: [Non-Fpo])

0006fd50 7739bad0 0006fda0 00000000 0006fd84 USER32!DispatchMessageWorker+0x327 (FPO: [Non-Fpo])

0006fd60 77395d78 0006fda0 00000000 004f2cd0 USER32!DispatchMessageW+0xf (FPO: [Non-Fpo])

0006fd84 77396199 00050020 004f2cd0 00000000 USER32!IsDialogMessageW+0x56b (FPO: [Non-Fpo])

0006fdc0 7738965e 00050020 00000000 00000010 USER32!DialogBox2+0x144 (FPO: [Non-Fpo])

0006fde8 773896a0 01000000 0107cbc8 00000000 USER32!InternalDialogBox+0xd0 (FPO: [Non-Fpo])

0006fe08 773896e8 01000000 0107cbc8 00000000 USER32!DialogBoxIndirectParamAorW+0x37 (FPO: [Non-Fpo])

0006fe2c 0103de0a 01000000 00000578 00000000 USER32!DialogBoxParamW+0x3f (FPO: [Non-Fpo])

0006fe50 0102d838 01000000 00000578 00000000 winlogon!Fusion_DialogBoxParam+0x24 (FPO: [Non-Fpo])

0006fe8c 0103b6e0 0007a868 01000000 00000578 winlogon!TimeoutDialogBoxParam+0x28 (FPO: [Non-Fpo])

0006fec4 0103746e 0007a868 01000000 00000578 winlogon!WlxDialogBoxParam+0x80 (FPO: [Non-Fpo])

0006fee4 01038042 0007a868 77e62f9d 77e42014 winlogon!BlockWaitForUserAction+0x3a (FPO: [Non-Fpo])

0006ff08 01031b33 0007a868 ffffffff 00000004 winlogon!MainLoop+0x42d (FPO: [Non-Fpo])

0006ff50 0103e33b 0007a868 00000000 000724e4 winlogon!WUNotify+0x515 (FPO: [Non-Fpo])

0006fff4 00000000 7ffd7000 000000c8 000001c9 winlogon!__report_gsfailure+0x267 (FPO: [Non-Fpo])

 

 

kd> !thread 88a1c3a0

THREAD 88a1c3a0  Cid 01b0.072c  Teb: 7ff9d000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Non-Alertable

    f5ea7c2c  NotificationEvent

    88a1c418  NotificationTimer

Not impersonating

DeviceMap                 e1000128

Owning Process            8911fd88       Image:         services.exe

Wait Start TickCount      32679          Ticks: 16482 (0:00:04:17.531)

Context Switch Count      2043

UserTime                  00:00:00.015

KernelTime                00:00:00.140

Win32 Start Address 0x0000a1d5

LPC Server thread working on message Id a1d5

Start Address kernel32!BaseThreadStartThunk (0x77e617ec)

Stack Init f5ea8000 Current f5ea7b8c Base f5ea8000 Limit f5ea5000 Call 0

Priority 10 BasePriority 9 PriorityDecrement 0

ChildEBP RetAddr  Args to Child

f5ea7ba4 8082ffd7 88a1c3a0 88a1c448 00000100 nt!KiSwapContext+0x25 (FPO: [Uses EBP] [0,0,4])

f5ea7bbc 808287d4 895c83f0 80a560c6 00000000 nt!KiSwapThread+0x83 (FPO: [Non-Fpo])

f5ea7c00 80810135 f5ea7c2c 00000000 00000000 nt!KeWaitForSingleObject+0x2e0 (FPO: [Non-Fpo])

f5ea7c48 80842608 005c83f0 00000000 00000000 nt!CcWaitForUninitializeCacheMap+0xa5 (FPO: [Non-Fpo])

f5ea7cd0 8091f8e7 f5ea7d20 000f001f 00000000 nt!MmCreateSection+0x1fc (FPO: [Non-Fpo])

f5ea7d40 80883938 0359f270 000f001f 00000000 nt!NtCreateSection+0x12f (FPO: [Non-Fpo])

f5ea7d40 7c82860c 0359f270 000f001f 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f5ea7d64)

0359eec8 7c826ed9 77e6cc9a 0359f270 000f001f ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0359eecc 77e6cc9a 0359f270 000f001f 00000000 ntdll!NtCreateSection+0xc (FPO: [7,0,0])

0359f6e8 77e424b0 00000000 00000000 000b5100 kernel32!CreateProcessInternalW+0x99c (FPO: [Non-Fpo])

0359f720 0100928b 00000000 000b5100 00000000 kernel32!CreateProcessW+0x2c (FPO: [Non-Fpo])

0359f80c 01008a4c 0064a8b0 000b5100 0359f844 services!ScLogonAndStartImage+0x28b (FPO: [Non-Fpo])

0359f84c 010069b1 0064a8b0 00000000 00000000 services!ScStartService+0x1c6 (FPO: [Non-Fpo])

0359f87c 01005e57 0064a8b0 00000000 00000000 services!ScStartMarkedServices+0x9c (FPO: [Non-Fpo])

0359f8b4 01005ee0 0064a8b0 00000000 00000000 services!ScStartServiceAndDependencies+0x1f1 (FPO: [Non-Fpo])

0359f8d8 77c80193 000a0180 00000000 00000000 services!RStartServiceW+0x8c (FPO: [Non-Fpo])

0359f8f8 77ce33e1 01005e78 0359fae0 00000003 RPCRT4!Invoke+0x30

0359fcf8 77ce35c4 00000000 00000000 000abe9c RPCRT4!NdrStubCall2+0x299 (FPO: [Non-Fpo])

0359fd14 77c7ff7a 000abe9c 000a06d0 000abe9c RPCRT4!NdrServerCall2+0x19 (FPO: [Non-Fpo])

0359fd48 77c8042d 010024ef 000abe9c 0359fdec RPCRT4!DispatchToStubInCNoAvrf+0x38 (FPO: [Non-Fpo])

0359fd9c 77c80353 00000013 00000000 0101c148 RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x11f (FPO: [Non-Fpo])

0359fdc0 77c811dc 000abe9c 00000000 0101c148 RPCRT4!RPC_INTERFACE::DispatchToStub+0xa3 (FPO: [Non-Fpo])

0359fdfc 77c812f0 000abc30 0009ff08 000d5c58 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x42c (FPO: [Non-Fpo])

0359fe20 77c88678 0009ff40 0359fe38 000abc30 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x127 (FPO: [Non-Fpo])

0359ff84 77c88792 0359ffac 77c8872d 0009ff08 RPCRT4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430 (FPO: [Non-Fpo])

0359ff8c 77c8872d 0009ff08 00000000 00000000 RPCRT4!RecvLotsaCallsWrapper+0xd (FPO: [Non-Fpo])

0359ffac 77c7b110 0009e2b8 0359ffec 77e6482f RPCRT4!BaseCachedThreadRoutine+0x9d (FPO: [Non-Fpo])

0359ffb8 77e6482f 000d4d78 00000000 00000000 RPCRT4!ThreadStartRoutine+0x1b (FPO: [Non-Fpo])

0359ffec 00000000 77c7b0f5 000d4d78 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

These threads stuck in Cache Manager while attempting to launch a process, can potentially lead to the symptoms that were described to us. Let’s try to prove it.

While we will not go into the details of Cache Manager mechanics (Refer to Cache Manager, Chapter 11 in Windows Internals), a quick note on how these threads will be unblocked is needed for the sake of this problem. When image sections are created if there is any existing shared cache map associated, we wait for any references on the shared cache map for this image section to drop to zero. The thread waiting on the cache map to be un-initialized will get signaled when the reference drops to zero on the shared cache map. The code that signals the un-initialization executes in the context of Cache Manager Worker and is queued onto a System Worker thread. Looking at so many threads, all waiting for Cache Manager Worker thread to signal the cleanup of the section, it appears that either-

  • The Cache Manager Worker kicked off but never reached a point to signal these blocked threads.
  • Cache Manager Worker has not had a chance to run yet.

The Cache Manager globals below indicate the maximum number of CC worker that can be active or queued at any time, and current active count. The counts below indicate we are already at the peak. The “nt!CcNumberActiveWorkerThreads” counter indicates the number of threads that already have work to do, but not necessarily currently executing Cache manager worker.

kd> x nt!CcNumberActiveWorkerThreads

80896144 nt!CcNumberActiveWorkerThreads = <no type information>

kd> dd 80896144 l1

80896144  00000008              <<This indicates the work items queued that will/or have engaged worker

kd> x nt!CcNumberWorkerThreads

80896140 nt!CcNumberWorkerThreads = <no type information>

kd> dd 80896140 l1

80896140  00000008              <<This is the counter for Max Cc worker threads

kd> x nt!CcWorkerThread

8081211e nt!CcWorkerThread = <no type information>

 

So what are these work queue items that are being executed?

If the first condition is true then we should find these worker (nt!CcWorkerThread) executing on top of a system worker thread. Yes we did search the stacks of all the threads in the dump, but we weren’t fortunate enough to find any System Worker Threads executing the Cache Manager Worker.

Only other possibility is these Cache Manager Worker threads never got a chance to run, likely system has no System Worker Threads idle enough to pick these Cache Manager work. So how do we prove/disprove this? (We could have started dumping out the System Worker Queues and its associated threads) We take a quicker approach - !exqueue. This command displays information and state of system worker queue and work items queued in each of its worker queue.

Let’s dump out the state of the System Worker Queue/Threads.

kd> !exqueue

Dumping ExWorkerQueue: 808A76C0

 

**** Critical WorkQueue( current = 0 maximum = 1 )

THREAD 898f9b40  Cid 0004.0010  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f98d0  Cid 0004.0014  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f8020  Cid 0004.0018  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f8db0  Cid 0004.001c  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f8b40  Cid 0004.0020  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f88d0  Cid 0004.0024  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f8660  Cid 0004.0028  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f83f0  Cid 0004.002c  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f7020  Cid 0004.0030  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f7db0  Cid 0004.0034  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 89652868  Cid 0004.0ed0  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 895faa40  Cid 0004.0ed4  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 891fb9b8  Cid 0004.0ed8  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 89129db0  Cid 0004.0edc  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 892c4780  Cid 0004.0ee0  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 8961b6a0  Cid 0004.0ee4  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 8917a730  Cid 0004.0ee8  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 88a31b10  Cid 0004.0eec  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 895eacb0  Cid 0004.0ef0  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 891d7db0  Cid 0004.0ef8  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 89667b08  Cid 0004.0f14  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 8920a490  Cid 0004.0f48  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 892f3cb0  Cid 0004.0fa8  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 8962bdb0  Cid 0004.0fb0  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 89661350  Cid 0004.0fb8  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 8918adb0  Cid 0004.0fbc  Teb: 00000000 Win32Thread: 00000000 WAIT

 

<Pending Work Items list for this queue>

 

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898f51e0

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898f9670

PENDING: WorkerRoutine nt!IopProcessWorkItem (808e419a) Parameter 891f8648

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898fa1d8

PENDING: WorkerRoutine Ntfs!NtfsCheckpointAllVolumes (f7135a57) Parameter 00000000

PENDING: WorkerRoutine srv!SrvResourceAllocThread (f5edfa28) Parameter 00000000

PENDING: WorkerRoutine nt!IopProcessWorkItem (808e419a) Parameter 89308f00

PENDING: WorkerRoutine nt!ObpProcessRemoveObjectQueue (8092b70e) Parameter 00000000

PENDING: WorkerRoutine srv!SrvResourceThread (f5ee026d) Parameter 00000000

PENDING: WorkerRoutine netbt!NTExecuteWorker (f67cdbc2) Parameter f67eb6bc

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89191008

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8965d1e8

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 895edea0

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 892b8be8

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 895e11e8

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89607210

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 896634a8

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8915dce0

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89221110

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8922a968

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898f7278

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 8998cd38

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898f9688

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 898f8298

PENDING: WorkerRoutine nt!CcWorkerThread (8081211e) Parameter 8998c030

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 891fe578

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 891817c0

PENDING: WorkerRoutine Ntfs!NtfsCheckUsnTimeOut (f71489b8) Parameter 00000000

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89648fd0

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89207618

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 895fc7d0

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89268950

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8921e008

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 88acbe98

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 89685e98

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8921ae60

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 896521a0

PENDING: WorkerRoutine termdd!_IcaDelayedWorker (f767d29a) Parameter 8920ab68

 

**** Delayed WorkQueue( current = 0 maximum = 1 )

THREAD 898f7b40  Cid 0004.0038  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f78d0  Cid 0004.003c  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f7660  Cid 0004.0040  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f73f0  Cid 0004.0044  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f6020  Cid 0004.0048  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f6db0  Cid 0004.004c  Teb: 00000000 Win32Thread: 00000000 WAIT

THREAD 898f6b40  Cid 0004.0050  Teb: 00000000 Win32Thread: 00000000 WAIT

 

**** HyperCritical WorkQueue( current = 0 maximum = 1 )

THREAD 898f68d0  Cid 0004.0054  Teb: 00000000 Win32Thread: 00000000 WAIT

 

This command examines the state of the System Work queue and associated System Worker threads. It’s telling us that there are three set of queues prioritized as hyper-Critical, Critical and Delayed-Worker queues. While Delayed-worker and Hyper-Critical queues are empty, the Critical Worker queue has enough pending items to keep it busy. This is not good. On an ideal case we expect all the work to be processed immediately and almost no work pending in the queue.

Before we move on, let’s take a step back and see why we came here. We were chasing down the cache manager workers and we came here to find if there is any Cache Manager work pending in the worker queue to be picked up. Indeed yes, we can see all of the 8 (“nt!CcNumberActiveWorkerThreads”) still pending. This answers the puzzle as far as threads that were blocked at Cache Manager’s shared cache map un-initialization. And “!exqueue” did come to our rescue here.

It’s always like this! You get an answer to one question, but at the same time the next question is readied for you, i.e. why are these work items still pending and not being processed?

For this we need a little bit of background on how System Worker Threads work. Several system components and drivers may need to execute the code at PASSIVE LEVEL and in a thread context. For this they could always create new threads and use them to execute the code they want. Other option is to rely on the pre-created threads by the system called “System worker Threads” and get relieved from the burden of thread management itself. Based on the priority of the work, work is queued to any of the three queues (Critical, Hyper-Critical, and Delayed-Worker). By default there will be certain number of worker threads (Refer to Chapter 3, System Mechanisms - System Worker Threads – in Windows Internals) created for each of these queues and they will wait on the respective queues for any new work to come in, pick the work and get back to wait on the queue after the completion of the work.

At a certain point it could so happen that all these pre-created threads would be executing some work, and may get blocked on another work item to complete. But as there are no idle worker threads to pick up this work, it would sit in the pending queue, resulting in blockage of all the work to be done by these set of system worker threads.

The Operating System tries to address this kind of a problem to some extent by running deadlock detection algorithm in a timely manner. When this code runs and the system sees that the pending work items are increasing (that is to say that no work items are being picked up, or work is coming in a higher rate than the existing number of threads could handle), it may decide to create additional worker threads to help with the pending work items load. These threads are special worker threads called “Dynamic Worker Threads”. These threads exist as long as there is enough work to be done. However they terminate on being idle for a certain amount of time, so the system doesn’t tie up resources for unused worker threads. Even if these dynamic worker threads get blocked, the system cannot keep creating the additional dynamic worker threads forever, as this will lead to the system filling up with worker threads and all getting blocked.

The sole intention of Dynamic worker threads is to try to help any immediate additional load or help system recover from deadlock among existing worker threads. However a couple of dynamic threads should suffice this need if it’s indeed a transient state. But if this is not a transient state and there is a real software problem then System should have to stop creating these dynamic threads at some point. This will eventually lead to hung Worker threads with work items getting just queued.

This dynamic thread count is limited to 16 for the Critical worker queue, and System will not create any more as soon we reach this limit.

(Refer to Chapter 3, System Mechanisms - System Worker Threads – in Windows Internals and/or Documentation in DDK/WDK).

With this knowledge on System Worker threads, the next step ahead is to determine what the existing Worker threads in the Critical Worker queue are doing which is preventing them from picking up our work items. Below is one of those threads, waiting on a Notification event as part of processing the work from WorkerDrv.SYS. Checking what every single thread in the Critical Worker queue is doing, we see they all are waiting in WorkerDrv.SYS driver (All these threads may not be occupied by the same driver always, but could be a similar deadlock among different drivers).

kd> !thread 898f9b40

THREAD 898f9b40  Cid 0004.0010  Teb: 00000000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Alertable

    f78aed5c  NotificationEvent

Not impersonating

DeviceMap                 e1000128

Owning Process            898fa648       Image:         System

Wait Start TickCount      28506          Ticks: 20655 (0:00:05:22.734)

Context Switch Count      2

UserTime                  00:00:00.000

KernelTime                00:00:00.000

Start Address nt!ExpWorkerThread (0x8087acfe)

Stack Init f78af000 Current f78aecc4 Base f78af000 Limit f78ac000 Call 0

Priority 13 BasePriority 13 PriorityDecrement 0

ChildEBP RetAddr  Args to Child

f78aecdc 8082ffd7 898f9b40 898f9be8 00000400 nt!KiSwapContext+0x25 (FPO: [Uses EBP] [0,0,4])

f78aecf4 808287d4 891eac00 00000000 895b3268 nt!KiSwapThread+0x83 (FPO: [Non-Fpo])

f78aed38 f77b30fe f78aed5c 00000000 00000000 nt!KeWaitForSingleObject+0x2e0 (FPO: [Non-Fpo])

WARNING: Stack unwind information not available. Following frames may be wrong.

f78aed6c 808e41ad 88a80c08 f78ced5c 808a76c0 WorkerDrv+0x40fe

f78aed80 8087ade9 895b3268 00000000 898f9b40 nt!IopProcessWorkItem+0x13 (FPO: [Non-Fpo])

f78aedac 809418f4 895b3268 00000000 00000000 nt!ExpWorkerThread+0xeb (FPO: [Non-Fpo])

f78aeddc 80887f7a 8087acfe 00000000 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [Non-Fpo])

00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

 

And yes, we know who the culprit is. This Driver has utilized all of the default System Critical Worker threads and the additional Dynamic threads for this queue.

kd> x nt!ExWorkerQueue

808a76c0 nt!ExWorkerQueue = <no type information>

 

kd> dt nt!_EX_WORK_QUEUE 808a76c0 .

   +0x000 WorkerQueue      :

      +0x000 Header           : _DISPATCHER_HEADER

      +0x010 EntryListHead    : _LIST_ENTRY [ 0x898f51e0 - 0x8920ab70 ]<<---Pending (QueueDepthLastPass)

      +0x018 CurrentCount     : 0

      +0x01c MaximumCount     : 1

      +0x020 ThreadListHead   : _LIST_ENTRY [ 0x898f9c48 - 0x8918aeb8 ]<<---Threads attached to this queue

   +0x028 DynamicThreadCount : 0x10 <<--------------------------------------Count of additional threads created as per deadlock detection

   +0x02c WorkItemsProcessed : 0x10f3

   +0x030 WorkItemsProcessedLastPass : 0x10f3

   +0x034 QueueDepthLastPass : 0x26 <<Count of pending work items

   +0x038 Info             :

      +0x000 QueueDisabled    : 0y0

      +0x000 MakeThreadsAsNecessary : 0y1

      +0x000 WaitMode         : 0y0

      +0x000 WorkerCount      : 0y00000000000000000000000011010 (0x1a)

      +0x000 QueueWorkerInfo  : 210

 

Looking at the pending work items we know what kind of impact this deadlock could have on the system. Any operation that is dependent on this set of worker threads will surely be impacted, and over a period of time you expect the system to crawl and slowly could possibly reach a dead end with components having direct/indirect dependency on this component of the system. We see Termdd, NTFS, and Cache manager work items in the pending queue which explains RDP not working, new processes not getting launched and so on.

 

Closure:

At the point when we know this driver has consumed all the Critical Worker threads, the quickest way to get the system up and running is to disable this driver. And I could work on fixing our WorkerDrv.SYS so that this driver understands the importance of System Worker threads and doesn’t flood the worker queue with work items that will block for a long time or with work items that are dependent on other work items, leading to this situation.

 

To summarize, we started with a problem description of crawling/almost hung system (a few components were indeed responding). We found why application launch was being blocked, which lead us to cache manager threads. Chasing down cache manager threads, we ended up with System Worker Threads. Then to my driver WorkerDrv.SYS which never understood the importance of System Worker threads, and used them too freely.

 

Hope you enjoyed reading this post and could use some of it in case you encounter a problem of a similar kind.

 

 

Share this post :

Debuggers - This is the list of the upcoming articles “in the works”.

The Story of a Hung Box
ETW_BlogPost_ETW Introduction and Overview
ETW_BlogPost_Exploring and Decoding ETW...
ETW_BlogPost_MethodsOfTracing
Got Stack?   No. We ran out and kv won’t tell me why!
Part 2: Got Stack?   No. We ran out and kv won’t tell me why!
WMI: Finding provider binary

Also we’re planning to blog about the cool features in Windows 7 and 2008 R2 with debugging in mind. Feel free to send your feedback to help us tweak the blog.

Cheers,

Ron Stock

Debuggers,

Here’s another short video of the type of triage possible with Xperf, this time Bob shows us how to track down a slow I/O issue. Since I/O Manager is instrumented at IoCallDriver and IoCompleteRequest we can often use this information to diagnose a Slow I/O or SAN issue. The steps to enable the tracing are in the video description and here again below.

Thanks,

Tate

Here’s a link to download the WMV  - http://ntdebugging.members.winisp.net/videos/SlowIOv2.wmv

Video Description: (http://www.youtube.com/watch?v=0h-i_EfzadA)

Here's how you can use the Windows Performance Toolkit aka Xperf to investigate slow I/O issues on Windows 2008, Vista, and higher versions. Put the following into a batch file (modifying the -f to point the etl file generation elsewhere than the slow storage) and it will start a circular trace which can be stopped when the slow I/O occurs.

@echo off

xperf -on PROC_THREAD+LOADER+CSWITCH+FILENAME+FILE_IO+FILE_IO_INIT+DRIVERS -f kernel.etl -stackwalk CSwitch+DiskReadInit+DiskWriteInit+DiskFlushInit+FileCreate+FileCleanup+FileClose+FileRead+FileWrite -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

echo Reproduce the problem now, hit any key to finish, collect Merged Etl file

pause

echo merging etl...

xperf -d XperfSlowIOcir.etl

@echo on

 

Share this post :

My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today’s blog will be a quick walkthrough of the analysis of a bugcheck 0xF4 and how I determined that the action plan going forward should consist of enabling pool tagging on this system.

 

I began my review with !analyze –v. From the output I can see that a process required for the system to function properly unexpectedly exited or was terminated. The goal of this debugging session will be to determine what failed and why.

 

0: kd> !analyze -v

*******************************************************************************

*                                                                             *

*                        Bugcheck Analysis                                    *

*                                                                             *

*******************************************************************************

 

CRITICAL_OBJECT_TERMINATION (f4)

A process or thread crucial to system operation has unexpectedly exited or been

terminated.

Several processes and threads are necessary for the operation of the

system; when they are terminated (for any reason), the system can no

longer function.

Arguments:

Arg1: 00000003, Process                                A value of 0x3 in this parameter indicated that it was a process that terminated, not a thread

Arg2: 8a03ada0, Terminating object        This value is a pointer to the _EPROCESS object that terminated

Arg3: 8a03af14, Process image file name              Process Name

Arg4: 805d1204, Explanatory message (ascii)         text message about the problem

 

We shall begin by dumping out all the parameters of the bugcheck. Let’s dump out the "Terminating Object" below

0: kd> !object 8a03ada0

Object: 8a03ada0  Type: (8a490900) Process

    ObjectHeader: 8a03ad88 (old version)

    HandleCount: 3  PointerCount: 228

 

First, let’s dump out the process image file name from the bugcheck parameter 3 above.

0: kd> dc 8a03af14

8a03af14  73727363 78652e73 00000065 00000000  csrss.exe

 

 

0: kd> dt _EPROCESS 8a03ada0 imageFileName

CSRSRV!_EPROCESS

   +0x174 ImageFileName : [16]  "csrss.exe"

 

Notice that if we add the base of the _EPROCESS object (8a03ada0- Parameter 2)to the offset of the imageFileName field (+0x174) we get parameter 3. The imageFileName field.

0: kd> ? 8a03ada0+0x174

Evaluate expression: -1979470060 = 8a03af14

 

0: kd> dc 8a03af14

8a03af14  73727363 78652e73 00000065 00000000  csrss.exe.......

 

Let’s dump out the ASCII message from parameter number 4

0: kd> dc 805d1204

805d1204  6d726554 74616e69 20676e69 74697263  Terminating crit

805d1214  6c616369 6f727020 73736563 25783020  ical process

 

Let’s review the debugger help file for more information. We can see that this bugcheck occurs when a critical process or thread terminates. “Several processes and threads are necessary for the operation of the system. When they are terminated for any reason, the system can no longer function. “

0: kd> .hh bug check 0xf4

 

 

 

 

 

 

Next, we need to determine why this process terminated. !analyze –v also provided us with an exception record which provides us with an error code:

 

PROCESS_NAME:  csrss.exe

 

EXCEPTION_RECORD:  9a85e9d8 -- (.exr 0xffffffff9a85e9d8)

ExceptionAddress: 7c92c375 (ntdll!RtlFindMessage+0x0000007c)

   ExceptionCode: c0000006 (In-page I/O error)

  ExceptionFlags: 00000000

NumberParameters: 3

   Parameter[0]: 00000000

   Parameter[1]: 7c99c3d8

   Parameter[2]: c000009a

Inpage operation failed at 7c99c3d8, due to I/O error c000009a

 

EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

 

Since we have an error code, let’s investigate that error code. We can quickly perform this operation from within the debugger using the !error command

 

0: kd> !error c000009a

Error code: (NTSTATUS) 0xc000009a (3221225626) - Insufficient system resources exist to complete the API.

 

0: kd > .hh !error

 

 

Let’s check the output of the !vm command

 

0: kd> !vm 2

 

*** Virtual Memory Usage ***

                Physical Memory:      760875 (   3043500 Kb)

                Page File: \??\C:\pagefile.sys

                  Current:   4190208 Kb  Free Space:   4156380 Kb

                  Minimum:   4190208 Kb  Maximum:      4190208 Kb

                Available Pages:      579241 (   2316964 Kb)

                ResAvail Pages:       673481 (   2693924 Kb)

                Locked IO Pages:          69 (       276 Kb)

                Free System PTEs:     115226 (    460904 Kb)

                Free NP PTEs:              0 (         0 Kb)

                Free Special NP:           0 (         0 Kb)

                Modified Pages:          221 (       884 Kb)

                Modified PF Pages:       219 (       876 Kb)

                NonPagedPool Usage:    65534 (    262136 Kb)

                NonPagedPool Max:      65536 (    262144 Kb)

                ********** Excessive NonPaged Pool Usage *****

                PagedPool 0 Usage:     24167 (     96668 Kb)

                PagedPool 1 Usage:       967 (      3868 Kb)

                PagedPool 2 Usage:       967 (      3868 Kb)

                PagedPool 3 Usage:       984 (      3936 Kb)

                PagedPool 4 Usage:       977 (      3908 Kb)

                PagedPool Usage:       28062 (    112248 Kb)

                PagedPool Maximum:     92160 (    368640 Kb)

 

                ********** 2075 pool allocations have failed **********

 

                Session Commit:         1562 (      6248 Kb)

                Shared Commit:          2526 (     10104 Kb)

                Special Pool:              0 (         0 Kb)

                Shared Process:         4821 (     19284 Kb)

                PagedPool Commit:      28062 (    112248 Kb)

                Driver Commit:          5138 (     20552 Kb)

                Committed pages:      153449 (    613796 Kb)

                Commit limit:        1767229 (   7068916 Kb)

 

0: kd> !poolused

unable to get PoolTrackTable - pool tagging is disabled, enable it to use this command

Use gflags.exe and check the box that says "Enable pool tagging".

 

The output above has informed us that pool tagging is disabled. Let’s demonstrate how you can verify that it is disabled:

 

0: kd> dd nt!NtGlobalFlag L1

805597ec  00000000

 

0: kd> !gflag

Current NtGlobalFlag contents: 0x00000000

 

Let’s explore the debugging help file entry on the !poolused command

 

0: kd > .hh !poolused

 

 

 

Reading the text above, we are informed that “Pool tagging is permanently enabled on Windows Server 2003 and later versions of Windows. On Windows XP and earlier versions of Windows, you must enable pool tagging by using Gflags.”

 

 

Using the vertarget command, I can see that this system was running Windows XP.

0: kd> vertarget

Windows XP Kernel Version 2600 (Service Pack 2) MP (2 procs) Free x86 compatible

 

 

 

0: kd > .hh !gflag

 

 

 

By reviewing the help file entry for the !gflag extension, I was able to determine that if pooltagging was set, the following bit would have been set:

 

0x400 "ptg"         Enable pool tagging.

 

0: kd> .formats 0x400

Evaluate expression:

….

  Binary:  00000000 00000000 00000100 00000000                0x00000400

 

Gflags is included in the Debugging Tools for Windows. The screenshot below is from a Windows 7 system. Notice that Pool Tagging is enabled permanently as described above.

 

 

 

Summary: This system bugchecked when the critical process csrss.exe failed an I/O operation due to insufficient non-paged pool. For an action plan, we recommended the use of gflags to enable pool tagging in order to obtain more information about pool consumption.

 

 

Share this post :

Hi - my name is Naresh and I am a Sr. Escalation Engineer on the Microsoft GES platforms team. Today I'm discussing a simple, yet powerful GUI tool used to configure a Windows system locally or remotely for a memory dump. The name of the tool is DumpConfigurator.hta and it can be accessed from CodePlex.  Check out the following Microsoft KB article which references the use of the tool.

969028  How to generate a kernel or a complete memory dump file in Windows Server 2008

http://support.microsoft.com/default.aspx?scid=kb;EN-US;969028

 

The tool can be used with all currently supported versions of the Windows Operating System. Once you download it, launch it with Administrator privileges to get the following UI: 

GUI

The GUI is self-explanatory and all the settings can be edited and saved by clicking Save Settings. The system will have to be rebooted for the settings to take effect.
NOTE: Read the Warranty Disclaimer for the tool before use:)

 

Share this post :

Hello - This is Omer and I recently came across a case where the customer reported that they could not reboot into safe mode using their custom image. Whenever they booted into safe mode, the machine would get to the logon screen, wait for 5 seconds and then reboot regardless of any user input. Nothing was being logged in the event logs either, so it was very strange.

At first it looked like the machine was going through a power cycle, since the shutdown was so quick (we would not see the usual shutdown messages like “Shutting down Services”, etc.). I thought maybe there was some issue with the hardware, but the customer reported that they had the same issue on every machine, regardless of the hardware vendor.

To figure this out, I attached a kernel debugger to the machine, and broke in to make sure the connection was good. I then let the machine go, and it got to the logon screen. Sure enough, after 5 seconds the machine rebooted. I thought that I would run into some kind of exception, and the debugger would break, however nothing of the sort happened. The only message that I got was that the following

Shutdown occurred at (Fri Jun 26 17:27:12.714 2009 (GMT-7))...unloading all symbol tables.

Very strange! The OS disconnected the debugger gracefully. I did a quick source code review and found that one of the places that we disconnect the debugger was in the system shutdown path. Maybe the OS was shutting down gracefully, but since it happened so fast, it looked like a power cycle. To test my theory, I put a breakpoint on nt!NtShutdownSystem to see if it was being called, and find the caller as well. Rebooted the machine, and let it rip.

nt!NtShutdownSystem()

nt!KiSystemServiceCopyEnd()+0x13

ntdll!ZwShutdownSystem(void)+0xa

services!ScRevertToLastKnownGood()+0x1af

services!ScStartMarkedServices()+0x154

services!ScStartServiceAndDependencies()+0x43d

services!ScAutoStartServices()+0x225

services!SvcctrlMain()+0xa75

services!main()+0x31

services!__mainCRTStartup()+0x13d

kernel32!BaseThreadInitThunk()+0xd

ntdll!RtlUserThreadStart()+0x1d

 

Voila! Services.exe is shutting down the system. Probably some service is not starting, which is then somehow causing the machine to shutdown. From the stack, I was able to figure out which service was not starting. Based on the service record, it was some third party remote assistance service.

But, how could this non-critical service not starting successfully, cause the Service Control Manager to reboot the machine? And what is that stack frame about reverting to last known good (services!ScRevertToLastKnownGood()+0x1af) doing on the stack?

Looking at the service record, I found that the SCM returned an error code 0x43c. This can be translated to ERROR_NOT_SAFEBOOT_SERVICE(This service cannot be started in Safe Mode). Also, the ErrorControl value for this service value was set to 0x2, which meant that if the service was not started successfully, the system needs to revert to the last known good configuration and reboot. However if the system was already using last known good, then it should just continue the boot process and log the error.

Error                      Control                 Meaning

Level

0x3                         (Critical)               Fail the attempted system startup.

                                                                If the startup is not using the

                                                                LastKnownGood control set, switch to

                                                                LastKnownGood. If the startup attempt

                                                                is using LastKnownGood, run a bug-check

                                                                routine.

0x2                         (Severe)               If the startup is not using the

                                                                LastKnownGood control set, switch to

                                                                LastKnownGood. If the startup attempt

                                                                is using LastKnownGood, continue on

                                                                in case of error.

0x1                         (Normal)             If the driver fails to load or initialize,

                                                                startup should proceed, but display a

                                                                warning.

0x0                         (Ignore)                               If the driver fails to load or initialize,

                                                                start up proceeds. No warning is displayed.

 

Because the service’s ErrorControl value is set to 0x2, the machine would revert to the last known good configuration and silently reboot. I booted the machine normally, and changed the ErrorControl value in the registry.

I also had to change the value in the other ControlSets, since they were identical to the current control set. This also explains why the machine kept rebooting every time, the value in the Last Known Good Configuration was also set incorrectly. L

I rebooted the machine and was able to boot into safe mode normally. Hence, the mystery of the silent reboots was solved.

Share this post :

More Posts Next page »
 
Page view tracker