• Ntdebugging Blog

    Understanding Pool Corruption Part 1 – Buffer Overflows


    Before we can discuss pool corruption we must understand what pool is.  Pool is kernel mode memory used as a storage space for drivers.  Pool is organized in a similar way to how you might use a notepad when taking notes from a lecture or a book.  Some notes may be 1 line, others may be many lines.  Many different notes are on the same page.


    Memory is also organized into pages, typically a page of memory is 4KB.  The Windows memory manager breaks up this 4KB page into smaller blocks.  One block may be as small as 8 bytes or possibly much larger.  Each of these blocks exists side by side with other blocks.


    The !pool command can be used to see the pool blocks stored in a page.


    kd> !pool fffffa8003f42000

    Pool page fffffa8003f42000 region is Nonpaged pool

    *fffffa8003f42000 size:  410 previous size:    0  (Free)      *Irp

                Pooltag Irp  : Io, IRP packets

    fffffa8003f42410 size:   40 previous size:  410  (Allocated)  MmSe

    fffffa8003f42450 size:  150 previous size:   40  (Allocated)  File

    fffffa8003f425a0 size:   80 previous size:  150  (Allocated)  Even

    fffffa8003f42620 size:   c0 previous size:   80  (Allocated)  EtwR

    fffffa8003f426e0 size:   d0 previous size:   c0  (Allocated)  CcBc

    fffffa8003f427b0 size:   d0 previous size:   d0  (Allocated)  CcBc

    fffffa8003f42880 size:   20 previous size:   d0  (Free)       Free

    fffffa8003f428a0 size:   d0 previous size:   20  (Allocated)  Wait

    fffffa8003f42970 size:   80 previous size:   d0  (Allocated)  CM44

    fffffa8003f429f0 size:   80 previous size:   80  (Allocated)  Even

    fffffa8003f42a70 size:   80 previous size:   80  (Allocated)  Even

    fffffa8003f42af0 size:   d0 previous size:   80  (Allocated)  Wait

    fffffa8003f42bc0 size:   80 previous size:   d0  (Allocated)  CM44

    fffffa8003f42c40 size:   d0 previous size:   80  (Allocated)  Wait

    fffffa8003f42d10 size:  230 previous size:   d0  (Allocated)  ALPC

    fffffa8003f42f40 size:   c0 previous size:  230  (Allocated)  EtwR


    Because many pool allocations are stored in the same page, it is critical that every driver only use the space they have allocated.  If DriverA uses more space than it allocated they will write into the next driver’s space (DriverB) and corrupt DriverB’s data.  This overwrite into the next driver’s space is called a buffer overflow.  Later either the memory manager or DriverB will attempt to use this corrupted memory and will encounter unexpected information.  This unexpected information typically results in a blue screen.


    The NotMyFault application from Sysinternals has an option to force a buffer overflow.  This can be used to demonstrate pool corruption.  Choosing the “Buffer overflow” option and clicking “Crash” will cause a buffer overflow in pool.  The system may not immediately blue screen after clicking the Crash button.  The system will remain stable until something attempts to use the corrupted memory.  Using the system will often eventually result in a blue screen.




    Often pool corruption appears as a stop 0x19 BAD_POOL_HEADER or stop 0xC2 BAD_POOL_CALLER.  These stop codes make it easy to determine that pool corruption is involved in the crash.  However, the results of accessing unexpected memory can vary widely, as a result pool corruption can result in many different types of bugchecks.


    As with any blue screen dump analysis the best place to start is with !analyze -v.  This command will display the stop code and parameters, and do some basic interpretation of the crash.


    kd> !analyze -v


    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *




    An exception happened while executing a system service routine.


    Arg1: 00000000c0000005, Exception code that caused the bugcheck

    Arg2: fffff8009267244a, Address of the instruction which caused the bugcheck

    Arg3: fffff88004763560, Address of the context record for the exception that caused the bugcheck

    Arg4: 0000000000000000, zero.


    In my example the bugcheck was a stop 0x3B SYSTEM_SERVICE_EXCEPTION.  The first parameter of this stop code is c0000005, which is a status code for an access violation.  An access violation is an attempt to access invalid memory (this error is not related to permissions).  Status codes can be looked up in the WDK header ntstatus.h.


    The !analyze -v command also provides a helpful shortcut to get into the context of the failure.


    CONTEXT:  fffff88004763560 -- (.cxr 0xfffff88004763560;r)


    Running this command shows us the registers at the time of the crash.


    kd> .cxr 0xfffff88004763560

    rax=4f4f4f4f4f4f4f4f rbx=fffff80092690460 rcx=fffff800926fbc60

    rdx=0000000000000000 rsi=0000000000001000 rdi=0000000000000000

    rip=fffff8009267244a rsp=fffff88004763f60 rbp=fffff8009268fb40

    r8=fffffa8001a1b820  r9=0000000000000001 r10=fffff800926fbc60

    r11=0000000000000011 r12=0000000000000000 r13=fffff8009268fb48

    r14=0000000000000012 r15=000000006374504d

    iopl=0         nv up ei pl nz na po nc

    cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206


    fffff800`9267244a 4c8b4808        mov     r9,qword ptr [rax+8] ds:002b:4f4f4f4f`4f4f4f57=????????????????


    From the above output we can see that the crash occurred in ExAllocatePoolWithTag, which is a good indication that the crash is due to pool corruption.  Often an engineer looking at a dump will stop at this point and conclude that a crash was caused by corruption, however we can go further.


    The instruction that we failed on was dereferencing rax+8.  The rax register contains 4f4f4f4f4f4f4f4f, which does not fit with the canonical form required for pointers on x64 systems.  This tells us that the system crashed because the data in rax is expected to be a pointer but it is not one.


    To determine why rax does not contain the expected data we must examine the instructions prior to where the failure occurred.


    kd> ub .

    nt!KzAcquireQueuedSpinLock [inlined in nt!ExAllocatePoolWithTag+0x421]:

    fffff800`92672429 488d542440      lea     rdx,[rsp+40h]

    fffff800`9267242e 49875500        xchg    rdx,qword ptr [r13]

    fffff800`92672432 4885d2          test    rdx,rdx

    fffff800`92672435 0f85c3030000    jne     nt!ExAllocatePoolWithTag+0x7ec (fffff800`926727fe)

    fffff800`9267243b 48391b          cmp     qword ptr [rbx],rbx

    fffff800`9267243e 0f8464060000    je      nt!ExAllocatePoolWithTag+0xa94 (fffff800`92672aa8)

    fffff800`92672444 4c8b03          mov     r8,qword ptr [rbx]

    fffff800`92672447 498b00          mov     rax,qword ptr [r8]


    The assembly shows that rax originated from the data pointed to by r8.  The .cxr command we ran earlier shows that r8 is fffffa8001a1b820.  If we examine the data at fffffa8001a1b820 we see that it matches the contents of rax, which confirms this memory is the source of the unexpected data in rax.


    kd> dq fffffa8001a1b820 l1

    fffffa80`01a1b820  4f4f4f4f`4f4f4f4f


    To determine if this unexpected data is caused by pool corruption we can use the !pool command.


    kd> !pool fffffa8001a1b820

    Pool page fffffa8001a1b820 region is Nonpaged pool

     fffffa8001a1b000 size:  810 previous size:    0  (Allocated)  None


    fffffa8001a1b810 doesn't look like a valid small pool allocation, checking to see

    if the entire page is actually part of a large page allocation...


    fffffa8001a1b810 is not a valid large pool allocation, checking large session pool...

    fffffa8001a1b810 is freed (or corrupt) pool

    Bad previous allocation size @fffffa8001a1b810, last size was 81



    *** An error (or corruption) in the pool was detected;

    *** Attempting to diagnose the problem.


    *** Use !poolval fffffa8001a1b000 for more details.



    Pool page [ fffffa8001a1b000 ] is __inVALID.


    Analyzing linked list...

    [ fffffa8001a1b000 --> fffffa8001a1b010 (size = 0x10 bytes)]: Corrupt region



    Scanning for single bit errors...


    None found


    The above output does not look like the !pool command we used earlier.   This output shows corruption to the pool header which prevented the command from walking the chain of allocations.


    The above output shows that there is an allocation at fffffa8001a1b000 of size 810.  If we look at this memory we should see a pool header.  Instead what we see is a pattern of 4f4f4f4f`4f4f4f4f.


    kd> dq fffffa8001a1b000 + 810

    fffffa80`01a1b810  4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f

    fffffa80`01a1b820  4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f

    fffffa80`01a1b830  4f4f4f4f`4f4f4f4f 00574f4c`46524556

    fffffa80`01a1b840  00000000`00000000 00000000`00000000

    fffffa80`01a1b850  00000000`00000000 00000000`00000000

    fffffa80`01a1b860  00000000`00000000 00000000`00000000

    fffffa80`01a1b870  00000000`00000000 00000000`00000000

    fffffa80`01a1b880  00000000`00000000 00000000`00000000


    At this point we can be confident that the system crashed because of pool corruption.


    Because the corruption occurred in the past, and a dump is a snapshot of the current state of the system, there is no concrete evidence to indicate how the memory came to be corrupted.  It is possible the driver that allocated the pool block immediately preceding the corruption is the one that wrote to the wrong location and caused this corruption.  This pool block is marked with the tag “None”, we can search for this tag in memory to determine which drivers use it.


    kd> !for_each_module s -a @#Base @#End "None"

    fffff800`92411bc2  4e 6f 6e 65 e9 45 04 26-00 90 90 90 90 90 90 90  None.E.&........

    kd> u fffff800`92411bc2-1


    fffff800`92411bc1 b84e6f6e65      mov     eax,656E6F4Eh

    fffff800`92411bc6 e945042600      jmp     nt!ExAllocatePoolWithTag (fffff800`92672010)

    fffff800`92411bcb 90              nop


    The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component. Pooltag.txt is installed with Debugging Tools for Windows (in the triage folder) and with the Windows WDK (in \tools\other\platform\poolmon).  Pooltag.txt shows the following for this tag:


    None - <unknown>    - call to ExAllocatePool


    Unfortunately what we find is that this tag is used when a driver calls ExAllocatePool, which does not specify a tag.  This does not allow us to determine what driver allocated the block prior to the corruption.  Even if we could tie the tag back to a driver it may not be sufficient to conclude that the driver using this tag is the one that corrupted the memory.


    The next step should be to enable special pool and hope to catch the corruptor in the act.  We will discuss special pool in our next article.

  • Ntdebugging Blog

    Using Netmon to figure out the source of high CPU in WMIprvse.exe


    Recently I was working with a customer who had a file server experiencing high CPU in the WMIprvse.exe process. We received multiple user dumps and noted that someone or something was running the same query again and again. We needed to figure out what was running the query in a tight loop, causing the high CPU.


    Figure 1 - Task Manager on FileServer

    Before we get into the exact troubleshooting steps, let me provide some background on WMI.  Winmgmt is the WMI service within a SVCHOST process running under the LocalSystem account.  In all cases, the WMI service automatically starts when the first management application or script requests connection to a WMI namespace. For more information, see Starting and Stopping the WMI Service. To avoid stopping all the services when a provider fails, each provider is loaded into a separate host process named "Wmiprvse.exe". This also allows each instance of Wmiprvse to run under a different account with varying security. For more details you can look at the MSDN documentation on WMI.


    I dumped out all the services in the various svchost.exe processes. You can do this from a command prompt by running the tasklist /svc command. In my instance, I found that the WinMgmt service was running in svchost, PID  452 (PID number will vary). Someone had to be making RPC calls to this svchost.exe process to run the WMI queries. It could be some local process on the machine; it could even be a process on a remote machine.


    At this point I requested user dumps of PID 452 from the customer.  This would allow me to determine who was making the RPC calls to svchost.exe to run the WMI queries.  While the customer was uploading the dumps, we decided to get a Network Monitor trace to see if the RPC calls were coming over the network.


    Immediately, I could see a lot of RPC traffic to the svchost.exe process(PID=452).



    Figure 2 - Network Monitor Output from the FileServer. Notice the Source and destination ports and IP addresses. IP addresses are hidden by the aliases

    Looking at the RPC payload, I could see the text of the WMI query. You can see this in the Hex Details Pane. The query that was running in a loop was Select * from Win32_Process. Looks like I found the source of the WMI queries.


    At this point, we got the source IP for the RPC packets. We logged into the machine, and brought up the Task Manager.



    Figure 3 - Task Manager on Remote Machine(Machine1)

    Immediately we saw that there was some script running inside a Wscript.exe process.  At this point I was pretty sure that this script was the culprit. The customer was not sure what this was, and was not comfortable terminating the process.  To prove my suspicion, I had him open a command prompt and run the following command, netstat –ano.



    Figure 4 - Netstat output from Remote Machine


    From the output in Fig. 4, I could see a TCP connection created by PID 3532 (wscript.exe). Looking at the local and foreign addresses from the above output, they matched up exactly to what we were seeing in the Network Monitor trace.


    In the above case, we already had our suspicions on the wscript.exe process; however, sometimes it might not be that easy. In that case, we could have used the netstat output to look at all connections to the file server ( If there were multiple connections, then we can also narrow it down by the port number. Based on that, we could have found the PID responsible.


    The customer called me later in the day, and told me that they had recently updated their scripts. One of their scripts had a bug which was running WMI scripts in a tight loop. Fixing the script caused the problem to go away.


    Had the query being coming from a local process, I would have had to debug the svchost.exe process, and figure out who was making the WMI calls. However, since we could see the traffic on netmon, we didn’t need to use the debugger. Interesting way to get to the root of the problem without using a debugger!



    Share this post :
  • Ntdebugging Blog

    DNS Slow to Start?


    Hi my name is Jason. I’m an escalation engineer on the Microsoft critical problem resolutions platform team. We recently dealt with an interesting issue that I would like to share with you and show how we went about discovering the cause. 

    The customer stated that when his server (Windows Server 2003 SP1) boots up the DNS Server service does not become availabe/responsive for ~5 minutes. Customer also stated he has other DC's in this domain that are the same model server (same specs too) that do not have the problem. So I ask, "What do you mean not responsive?". During this 5 minute period, the administrator cannot use the DNS MMC to administer the server, and DNS lookups to this server fail with a timeout error. The customer can: Ping the server, connect to file shares, etc.


    So my initial question was "What is DNS doing for 5 minutes?". In order to determine where dns.exe was hung up or spinning his wheels, I replaced dns.exe with the checked build.


    1.     Downloaded the checked build of SP1

    2.     Extracted and expanded the checked build of dns.exe (What is a checked build?)

    3.     Worked through the process of replacing the free build of DNS.exe (Booting into safe mode in order to bypass WFP (Windows file protection))

    4.     Created the dnsdebug file (a text file with no extension under %windir%\system32\dns). The checked build of DNS looks for this file, and reads the contents to determine the level of output. 

    5.     In the file I set the debug flags for DNS_DEBUG_INIT (0x00000010 ) & DNS_DEBUG_INIT2 (0x01000000) and set the output to go to the debugger (0x00000002), net result (0x01000012). The image below shows contents of the file.

    6.     Instead of using the debugger I decided to use DebugView from Sysinternals to view the output. I could have attached a debugger to DNS.exe to see the output, but in this case the DebugView was sufficient.


    With all this in place we rebooted. We launched DebugView and did a "net start dns". Here is a screen scrape of what the output looked like:



    I have highlighted the debug flag I have set.

    From this output we were able to determine that the Server was going through the following process:

    1.     DNS gets started by Service Control Manger (a.k.a. SCM) and then DNS communicates back to SCM (across the NtControlPipe)

    2.     DNS loads the zones from the registry (HKLM\Software\Microsoft\Windows NT\Current Version\DNS Server\Zones

    3.     DNS pulls this zones from the Active Directory (as this is an AD integrated zone)


    During these phases the ability to respond to administrative requests and name lookup requests is blocked (this is by design). So based on this I needed to figure out where we were spending the most of our 5 minutes.

    1.     For the SCM portion, I used Filemon monitoring only Named Pipes for dns.exe and services.exe.

    2.     For the registry portion, I used Regmon with a filter for dns.exe

    3.     For the AD portion, I used perfmon with the NTDS>LDAP Searches/Sec counter

    With these running we were able to isolate the delay to the point where DNS was pulling the zones from AD. Here is a scrape of me running this test on my VM:



    You can see (top left) the service communicating back to SCM across the NtControlPipe. Lets take a moment here and have a short discussion how a service starting works.

    1. SCM goes to a registry value that will be used as a seed value to build a named pipe. (HKLM\System\CurrentControlSet\Control\ServiceCurrent\(default) = <a number>
    2. T
    he result is incremented by 1 and then returned back to SCM
    3. SCM then creates a named pipe using NtControlPipe as the prefix followed by the numerical value from the registry. So when starting the first service, the pipe name would be NtControlPipe1 (To get a peek at these pipes download PipeList from Sysinternals)
    The SCM then starts the service process
    The service then goes to the same registry value and uses this seed to know which named pipe he will need to talk to in order to convey progress back to SCM.

    We can then see (top right) the service pulling in the dns zone information from the registry. Finally we see (bottom) the small spike in LDAP searches/sec (highlighted) where the service was pulling the zone information from the AD. Until the pull of the zones from the registry and AD is complete, the DNS service does not respond to MMC administration or DNS requests.

    The majority of the 5 minute delta fell into the last bucket (the LDAP queries to pull the zone data). So begins our next discussion, What causes LDAP to be slow?

    So LDAP lives in LSASS.EXE. There are two LDAP interfaces in AD. The entire directory is available of TCP port 389, and then the subset of AD that is published (a.k.a. Global Catalog) over TCP port 3268. The DNS services is pulling the DNS zones from the "Entire Directory" interface.

    Lsass memory usage on domain controllers has two components:

    • Data structures, which are like those in other processes (i.e. threads, heaps, and stacks)
    • Database buffer cache, which consists of database pages and index pages for the directory ß This is where a "In-RAM" copy of AD database would be


    In Windows Server 2003, there is no limit to how large the database buffer cache can grow (outside of the limits of the 2GB of virtual address space that can be modified via the /3GB switch). This helps explain the behavior of lsass.exe normally being the largest value in the "Mem Usage" category in task manager on a DC.

    Q. What can affect the amount of Database buffer cache that LSASS is maintaining "In-RAM?
    A. Memory pressure. This could be caused by configuration (pagefile settings, etc.) or another application(s) using and paging a lot of memory on the machine.

    lsass.exe on the affected machine was using 300mb in the "Mem Usage" column in task manager. The machine that was not affected was closer to 450mb. This means there was 150mb more data in the Database buffer cache, and due to this, the process is able to respond a lot faster to LDAP requests since he can pull from the RAM cache more and this limits the amount of time spent pulling this data into the cache from the DIT file on the disk.

    The solution is to tune the pagefile settings to match the better performing server.

    By simply increasing the pagefile from ~1 gig to ~3gig the DNS Server can be available within ~30 seconds

    So what are we (Microsoft) doing about this?

    Excerpt from the "Book of Longhorn"

    What new DNS functionality is provided by this feature in Windows Server "Longhorn"?

    The DNS Server service in Windows Server "Longhorn" includes a number of new and enhanced features compared to the DNS Server service that was available in the Microsoft Windows NT® Server, Windows 2000 Server, and Windows Server® 2003 operating systems. The following sections describe these features.


    Background zone loading

    Very large organizations with extremely large zones that store their DNS data in AD DS sometimes discover that restarting a DNS server can take an hour or more while the DNS data is retrieved from the directory service. The result is that the DNS server is effectively unavailable to service client requests for the entire time that it takes to load AD DS-based zones.

    A DNS server running Windows Server "Longhorn" now loads zone data from AD DS in the background while it restarts so that it can respond to requests for data from other zones. When the DNS server starts, it:

    • Enumerates all zones to be loaded.
    • Loads root hints from files or AD DS storage.
    • Loads all file-backed zones, that is, zones that are stored in files rather than in AD DS.
    • Begins responding to queries and remote procedure calls (RPCs).
    • Spawns one or more threads to load the zones that are stored in AD DS.

    Because the task of loading zones is performed by separate threads, the DNS server is able to respond to queries while zone loading is in progress. If a DNS client requests data for a host in a zone that has already been loaded, the DNS server responds with the dat (or, if appropriate, a negative response) as expected. If the request is for a node that has not yet been loaded into memory, the DNS server reads the node's data from AD DS and updates the node's record list accordingly.

    Why is this functionality important?

    The DNS server can use background zone loading to begin responding to queries almost immediately when it restarts, instead of waiting until its zones are fully loaded. The DNS server can respond to queries for the nodes that it has loaded or that can be retrieved from AD DS. This functionality also provides another advantage when zone data is stored in AD DS rather than in a file: AD DS can be accessed asynchronously and immediately when a query is received, while file-based zone data can be accessed only through a sequential read of the file.

    I hope you find this information helpful.

  • Ntdebugging Blog

    How Windows Shuts Down


    Hi my name is Bryan,  I'm a escalation engineer on the Microsoft CPR platforms team.   A common problem scenario involves shutting down Windows. When troubleshooting problems during shut down we have to look at the Winlogon.exe process which can be tricky and must be done correctly.


    Troubleshooting Techniques


    Common Settings


    Settings that would affect how Windows shuts down are mostly in HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon, and are documented thoroughly at http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/12314.mspx. The setting to really pay attention is:


    DisableCAD (REG_DWORD). When set to 1 the GINA will bypass the Ctrl-Alt-Del dialog and go straight to the logon prompt. The system cannot be shutdown when the logon prompt at the console is displayed.


    Another common setting that is sometimes needed for security reasons is in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.


    ClearPageFileAtShutdown (REG_DWORD). When set to 1 the memory manager will clear the paging file on shutdown. If the paging file is large this can significantly the time it takes for the system to shut down.


    Winlogon Event Notification


    Prior to Windows Vista/2008 Server, Winlogon Notification Packages can be used to reliably monitor the state of Winlogon. These are simple export DLLs that are registered with entry points for each event monitored by the package. Winlogon explicitly calls these entry points when a specific event occurs. Each package handles the notification within the context of the Winlogon process. Keep in mind that a bug in a notification package can cause problems in Winlogon and render the system inaccessible. You can find more information about Winlogon Notification Packages at http://msdn2.microsoft.com/en-us/library/aa380545.aspx.


    Starting with Windows Vista support for Winlogon Notification Packages has been pulled. Most of this functionality is still provided through SCM event notifications. See the following link for more information about this.




    You can find information about how to write a service here.




    However these notifications are performed asynchronously so the exact moment of each state within Winlogon cannot be reliably known.


    Winlogon Logging


    Winlogon logging is obtained by applying the checked build of Winlogon and setting the following registry values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon.


    DebugFlags (REG_SZ). This is a comma-separated list of debug flags to capture. The list of flags is Error, Warning, Trace, Init, Timeout, Sas, State, MPR, CoolSwitch, Profile, DebugLsa, DebugSpm, DebugMpr, DebugGo, Migrate, DebugServices, Setup, SC, Notify, and Job.


    LogFile (REG_SZ). This is the path to the log file where the events should be written. If this value is missing then the events are written to the debug port.


    These options can also be placed in the win.ini file. This option and other debug logging options for Winlogon can be found in these articles.




    Checked builds of Windows binaries are available through MSDN subscriptions. The checked binary applied to the system must match the service pack level of the system.


    When looking at the Winlogon logs there are three check points to identify. They are the three ExitWindowsEx calls made during shutdown. These log entries were made during a shutdown initiated with “shutdown –r –f –t 0”.


    328.372> Winlogon-Trace: Doing remote-initiated (Administrator) Reboot=true, Force=true

    328.372> Winlogon-Trace: Starting user thread for Logoff, flags = 3807

    328.764> Winlogon-Trace: Doing immediate shutdown, LastGinaRet = b, Flags = 3804

    328.764> Winlogon-Trace: Calling ExitWindowsEx(0x3804, 0)




    328.1528> Winlogon-Trace: ExitWindowsEx called to shut down COM processes




    328.332> Winlogon-Trace: Starting shutdown

    328.332> Winlogon-Trace: Starting user thread for Logoff, flags = 7




    328.284> Winlogon-Trace: Calling ExitWindowsEx(0x7, 0)


    If shutdown stalls then based on the log entries you should be able to identify which part of shutdown it is stuck in.




    In order to debug Winlogon you will need to first install the debugging tools for Windows (http://www.microsoft.com/whdc/devtools/debugging/default.mspx). It contains information about how to set up the debugger for Winlogon debugging. If the debugger is not set up correctly on Winlogon the system can easily get in an unusable state and the system will need to be reinstalled.


    You should debug Winlogon whenever a problem manifests itself on the Winlogon desktop where the SAS and logon prompt dialogs are displayed. When debugging Winlogon you should start by looking at the main process thread. If shutdown (or any other Winlogon activity) is hung that thread will show what happened to prevent shutdown from completing. Also look at these Winlogon flags which provide some indication of whether or not the main thread is even processing a shutdown.


    0105fe8c winlogon!g_fReadyForShutdown

    0105fdf8 winlogon!ShutdownHasBegun

    01062b3c winlogon!ShutdownInProgress

    01062b30 winlogon!ShutdownTime


    I will point out how these variables are modified.


    Shutdown Sequence of Events


    If you are debugging a shutdown issue you first need to understand the sequence of events that take place during shutdown. This will help isolate what activity may be failing.


    RPC Call


    An application, local or remote, tells Windows to shut down by calling the Win32 API InitiateSystemShutdownEx (http://msdn2.microsoft.com/en-us/library/aa376874.aspx). This API creates an RPC call over the named pipe InitShutdown. The remote connect and RPC call looks like this in a network trace.


    11:08:40.025                 CLIENT         SERVER        SMB              SMB: C; Nt Create Andx, FileName = \InitShutdown

    11:08:40.027                 CLIENT         SERVER        MSRPC        MSRPC: c/o Request: unknown Call=0x1 Opnum=0x2 Context=0x0 Hint=0x20


    A local connect would not need to go over the network, but it uses the same mechanism to make this call.


    The server side of this RPC call is handled by the function BaseInitiateShutdownEx in the Winlogon.exe process. This RPC worker performs the following actions.


    1.       Checks the privilege level of the client. If the client privilege test fails then the error status code is returned. This will typically be ERROR_ACCESS_DENIED (0x5).

    2.       Parses the shutdown command sent by the RPC client. If the command is malformed then the status error code of ERROR_INVALID_PARAMETER (0x57) is returned.

    3.       Checks the Winlogon flags ShutdownInProgress, ShutdownHasBegun, g_fReadyForShutdown, and the current state of the terminal desktop to see if we can shutdown. Winlogon cannot begin the shutdown if a shutdown is in progress or if it is not ready. Winlogon also will not start a shutdown if the force flag was not specified in the command and the desktop is locked. If the ShutdownInProgress or ShutdownHasBegun flag is set then the return value is ERROR_SHUTDOWN_IN_PROGRESS (0x45B). If the g_fReadyForShutdown is not set then the return value is ERROR_NOT_READY (0x15).


    Debugger output:

    dd winlogon!g_fReadyForShutdown l 1

    0105fe8c  00000001


    dd winlogon!ShutdownInProgress l 1

    01062b3c  00000000


    dd winlogon!ShutdownHasBegun l 1

    0105fdf8  00000000


    4.       Winlogon initializes the shutdown data including the global variable ShutdownTime. If this variable has been set then we know we’ve gotten this far.


    Debugger Output:

    dq winlogon!ShutdownTime l 1

    01062b30  01c7a859`baee0060


    .formats 01c7a859`baee0060

    Evaluate expression:

      Time:    Wed Jun  6 12:42:54.506 2007 (GMT-4)


    5.       If Winlogon has determined that it can proceed then the ShutdownInProgress flag is set. Subsequent shutdown requests from this point will fail in step 3 when it checks this flag.

    6.       Generates the shutdown audit event if auditing was enabled. Note that shutdown may fail at some point after this even though the audit log is generated.

    7.       Creates another thread to continue working on the shutdown. At this point the RPC worker thread returns to the caller.


    To this point there has been no visible indication on the server being shutdown. If the RPC worker fails for some reason then the client application will get an indication of the failure. If the RPC worker successfully hands the shutdown request to the next thread then the client application will get ERROR_SUCCESS (0) as the return code. The client application will get no indication of a failure after that point.


    Worker Thread


    The RPC worker thread hands control to the LogoffThreadProc thread. If the delay specified in the shutdown command was 0 then the ShutdownInProgress flag is cleared and the ShutdownHasBegun is set. In the Winlogon log you will see a line that starts with “Doing immediate shutdown”. Otherwise you will not see any visible indication at this point that a shutdown is occurring.


    If the delay specified in the shutdown command was greater than 0 then the countdown shutdown dialog is displayed.




    When that dialog completes the AbortShutdown flag is checked. This flag would get set as a result of a call to AbortSystemShutdown (http://msdn2.microsoft.com/en-us/library/aa376630.aspx). If it is set then the shutdown is aborted. Otherwise the ShutdownInProgress flag is cleared, the ShutdownHasBegun flag is set and if the shutdown was initiated by a system process then we shut the system down immediately at this point with no further clean up.


    After these flags are updated the ExitWindowsInProgress flag is set and we call the Win32 API ExitWindowsEx (http://msdn2.microsoft.com/en-us/library/ms893047.aspx). In the Winlogon log you will see a line that starts with “Calling ExitWindowsEx”.


    Debugger Output:

    dd winlogon!ExitWindowsInProgress l 1

    0105fd84  00000001


    The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE. CSRSS synchronously sends a WM_QUERYENDSESSION message to all Windows applications. When an application gets this message it indicates that shutdown can continue and CSRSS then sends the WM_ENDSESSION message. After that the process is terminated. If the application indicates that it cannot be terminated then CSRSS stops processing any further applications and waits for the interactive user to close the application. The ExitWindowsEx call will fail with error ERROR_OPERATION_ABORTED (0x3E3) and the Winlogon flags are reset so that a new shutdown request can be processed.


    An application that prevents shutdown from proceeding in this manner can be seen visual since it will be the foreground window on the desktop. To confirm which application returned did this you will need to live debug CSRSS. A return code of 3 from either winsrv!ConsoleClientShutdown or winsrv!UserClientShutdown will indicate the application did this.


    Debugger Output:

    0:002> pc

    eax=00000000 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564de esp=0052fe40 ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246


    75a564de ff5740 call dword ptr [edi+0x40]{winsrv!UserClientShutdown (75a9db1f)} ds:0023:00164640=75a9db1f


    ; Step past the call.

    0:002> p

    eax=00000003 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564e1 esp=0052fe4c ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246


    75a564e1 8bf8             mov     edi,eax


    ; The first parameter is a structure that contains the process ID.

    75a564d5 ff75f4           push    dword ptr [ebp-0xc]

    75a564d8 ff750c           push    dword ptr [ebp+0xc]

    75a564db ff75f8           push    dword ptr [ebp-0x8]

    75a564de ff5740           call    dword ptr [edi+0x40]


    ; Get the pointer to this structure.

    0:002> dd ebp-8 l 1

    dd ebp-8 l 1

    0052fe60  0018a530


    ; The first DWORD is the process ID. The second DWORD is the thread ID.

    0:002> dd 0018a530 l 1

    dd 0018a530

    0018a530  0000066c


    ; Break into kernel mode so we can look at all the processes.

    0:002> .breakin


    Break instruction exception - code 80000003 (first chance)


    8081db0e cc              int     3


    ; Get the process object with that process ID.

    kd> !process 0000066c 0

    Searching for Process with Cid == 66c

    PROCESS ff62a638  SessionId: 0  Cid: 066c    Peb: 7ffdf000  ParentCid: 0108

        DirBase: 0390d000  ObjectTable: e1658e38  HandleCount:  51.

        Image: test.exe


    Console (text-based) applications are asynchronously sent a separate CTRL_SHUTDOWN_EVENT notification. This means that ExitWindowsEx will proceed no matter how the application chooses to handle this notification.


    Services.exe (and all Windows services) is a console application which receives this notification from CSRSS. Services.exe registers a control handler ScShutdownNotificationRoutine which calls ScShutdownAllServices on shutdown. This function traverses through all Windows services that are not stopped or stopping and that accept the shutdown notification and sends them the SERVICE_CONTROL_SHUTDOWN notification. Each service has 20 seconds by default to shutdown. However a service may request more time by calling SetServiceStatus with a wait hint and updated check point. It can do this so long as it continues to respond within the current timeout period. However since Services.exe received this as an asynchronous message from CSRSS.EXE it will not prevent the system from shutting down.


    After ExitWindowsEx returns control to Winlogon the ExitWindowsInProgress flag is cleared and LogoffThreadProc exits.




    The state of Winlogon is controlled by the main process thread in a function called MainLoop. As LogoffThreadProc sets the ShutdownHasBegun flag and calls ExitWindowsEx the MainLoop function picks up on this change of state and begins executing its shutdown code. Since MainLoop is responsible for interaction with the user this is the first place where the user will get visible confirmation from Winlogon that the system is shutting down.


    When MainLoop sees that the Winlogon state has changed to shutting down it takes the following actions.


    1.       Signal the shell that we are shutting down. This causes the Explorer shell to disappear.

    2.       Checks to see if there are any updates to the user’s profile.

    3.       Send out the logoff notification event.

    4.       Delete network connections.

    5.       Play the logoff sound.

    6.       Play the system exit sound.

    7.       Creates a KillComProcesses thread. This calls ExitWindowsEx and will wait up to 15 minutes for this to complete.

    8.       Save and unload the user’s profile.

    9.       Delete RAS connections.

    10.    Send out the shutdown notification event.

    11.    Stop Windows file protection.

    12.    Creates another LogoffThreadProc thread which again calls ExitWindowsEx.

    13.    Call the shutdown function in the GINA. This displays the Windows is shutting down dialog.

    14.    Wait for any remaining system processes to complete. If we are stuck here then we would need to look at System, smss.exe, or csrss.exe. One of those will be stuck in some action.

    15.    Shut down the system.


    Winlogon uses the native API NtShutdownSystem in step 15. If there is a hang after this point then you will see the main thread in Winlogon stuck in this call and the real problem is likely the result of a device driver in the System process. This call will never return.

  • Ntdebugging Blog

    How Windows Starts Up (Part the second)


    Howdy folks, David here again with part two of How Windows Starts Up.  Today we’ll be covering the Boot Loader Phase.  Let’s take a moment to recap where we are at this point.  So far, the computer has completed POST, executed the MBR code, located the active partition, executed the Boot Sector code in that partition, found NTLDR and loaded it into memory.  So what happens next?  During this phase, we’ll be discussing the following files located in the root of your system volume:  NTLDR, ntdetect.com and boot.ini.


    The Boot Loader Phase begins when NTLDR executes:


    1.    NTLDR switches the processor from real mode to 32-bit flat memory mode.

    How does it do this you ask?  NTLDR contains two parts – a 16-bit component which handles switching the processor to Protected Mode and a 32-bit component which handles everything else. 


    When NTLDR first begins execution, the processor is in an x86-operating mode called Real-Mode.  In Real-Mode, no virtual to physical translation of memory addresses occurs, which means that all memory addresses use the 16-bit Segment:Offset format, and only the first 1MB of the computer's physical memory is accessible.  However, we don’t have to concern ourselves with this too much because the first action NTLDR takes is to switch the CPU to Protected-Mode. 


    *Trivia Alert*  before we enter Protected-Mode, we enable the A20 Line – you IBM guys know what I’m talking about… 


    Still no virtual-to-physical translation occurs at this point in the boot process, but a full 32 bits of memory addressing becomes accessible, and now NTLDR can access up to 4GB of physical memory. 


    2.    NTLDR creates enough page tables to make memory below 16MB accessible with paging turned on.  This number has been increased with Windows Server 2003. 


    3.    NTLDR enables paging.  “Protected-Mode with Paging Enabled” is the mode in which Windows executes during normal operation.


    4.    NTLDR then starts the FAT and NTFS mini file system drivers contained within its own code.  Unlike the Boot Sector's file system code, NTLDR's code is sophisticated enough to read subdirectories.


    Note:   All disk access from this point until the Kernel Phase begins uses BIOS Interrupt 13h, Extended Interrupt 13h or NTBOOTDD.SYS


    5.    NTLDR locates and parses Hiberfil.sys to determine if it contains a valid hibernator file.  If a valid hibernator file is found, then that file is loaded into memory and execution continues from the point the computer was placed into hibernation.


    Hibernation files are interesting.  The file either contains a full memory image, or it contains a pointer (ARC path) to the boot partition of the last operating system that entered hibernation.  That boot partition will contain another Hiberfil.sys file, which contains a full memory image of the hibernating operating system.  If the Hiberfil.sys file is a pointer, the file will contain an ARC path using one of the following syntaxes:


                             - linkmulti()disk()rdisk()partition()

                             - linksignature()disk()rdisk()partition()

                             - linkscsi()disk()rdisk()partition()


    6.    If Hiberfil.sys is not found, NTLDR parses the Boot.ini file.  If the Boot.ini contains a reference to more than one operating system or boot option, then the boot loader screen is displayed. 


    This is a sample of a default Boot.ini file from a Windows 2000 Server-based computer:


    [boot loader]



    [operating systems]

    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows 2000 Server" /fastdetect


    Let’s talk a bit about what’s in the Boot.ini file.  Boot.ini contains a list of ARC paths.  Ok, quick – five bonus points to the first person who can tell me what ‘ARC’ stands for…  come on…  Yes, you there in the red shirt… correct!  It stands for ‘Advanced RISC Computing’; now beam down with the captain.  The rest of you come with me…


    ARC paths may use the “Multi”, “SCSI” or “Signature” syntax.


    The “Multi” syntax instructs Windows to rely on the system BIOS to load system files.  This means that NTLDR will be using INT13 BIOS calls to find and load NTOSKRNL and any other files on the boot partition.


    The “SCSI” syntax tells Windows to load a SCSI device driver to access the boot partition.  A copy of the driver must be renamed to NTBOOTDD.SYS and placed in the root of the system partition.


    The “Signature” syntax is equivalent to the SCSI syntax, but is used instead to support the Plug and Play architecture in Windows 2000 or later.  Because Windows 2000 is a PnP OS, the SCSI controller number instance might vary each time you start Windows 2000, especially if you add new SCSI controller hardware after setup.


    The Signature syntax instructs NTLDR to locate the drive with a disk signature that matches the value in the parentheses, regardless of which SCSI controller number that the drive is connected to.  Keep in mind that the Signature syntax also requires the presence of an Ntbootdd.sys file in the root of the system partition.  If you recall from the previous post, the disk signature value is located in the MBR.



    signature(8b467c12)disk(1)rdisk(0)partition(2)\winnt="Windows 2000"


          Signature() syntax is used only if one of the following conditions exists:


    ·         The System BIOS or controller hosting the boot partition on which Windows is installed, does not support INT13 Extensions, or has the INT13 Extensions option disabled, and the partition on which you installed Windows is either larger than ~7.8 gigabytes (GB) in size, or the ending cylinder number for that partition is higher than 1024.

    ·         The drive on which you installed Windows is connected to a SCSI controller with SCSI BIOS disabled, so INT13 BIOS calls cannot be used during the boot process.


                            227704 Windows May Use Signature() Syntax in the Boot.ini File



    7.    If Boot.ini is not found, NTLDR attempts to continue the boot process by assuming that Windows was installed in the default location on the same partition (c:\windows).  If the c:\windows directory is not found, NTLDR displays this error message:  Windows could not start because the following file is missing or corrupt: \winnt root\system32\ntoskrnl.exe


    8.    If the boot loader screen is displayed, the user must select an operating system to boot.

    a.    If the user chooses to run an operating system other than Windows 2003/XP such as DOS or Windows 95/98, NTLDR loads and executes the Bootsect.dos file located in the root of the boot drive.

    (1)  If the Bootsect.dos file is found, NTLDR loads it into RAM and then passes control to it.

    (2)  Bootsect.dos then loads IO.sys, MSDOS.sys, Command.com, etc., as necessary and proceeds to boot the downlevel operating system.

    (3)  If the file is not found in the root, NTLDR displays:

    I/O Error accessing boot sector file multi(0)disk(0)rdisk(0)partition(1): \bootsect.dos

    b.    If the user selects a Windows NT-based operating system, NTLDR executes Ntdetect.com.

    (1)  If the ARC path points to a hard drive or partition that is not on the system, the following is displayed:

    Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.

    (2)  The above message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file.  The following error message may also be displayed when there are problems with the Signature () value:

    Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.


    9.    Ntdetect.com initializes and detects all supported hardware that it can find.  (Troubleshooting tip:  There is an Ntdetect.chk on the Windows Resource Kit CD in the Debug.cab file, which can be renamed and used in place of Ntdetect.com to view what Ntdetect.com is detecting during startup.  You know you want to – go ahead, I’ll wait…)


    Ntdetect.com detects the following components:


    a)    Computer ID

    b)    Bus/adapter type

    c)    Video adapter

    d)    Keyboard

    e)    Communications ports

    f)     Floppy disks

    g)    Mouse or other pointing devices

    h)    Parallel ports


    10.  Ntdetect.com passes its list of hardware components to NTLDR.


    11.  If the /SOS switch was used in the Boot.ini for the OS that is loading, NTLDR will display the files it loads into memory from this point on.  Keep in mind – none of these files are being initialized.  They are just being loaded into RAM at this point.  I can’t tell you how many times someone has incorrectly concluded that there is a problem with MUP.SYS because it’s the last file displayed on this screen before their computer hangs.  Poor MUP – always getting a bad rap.  This list can be useful, however; it can help you identify potential malware that may be loading.


    12.  NTLDR loads Ntoskrnl.exe and Hal.dll into memory.  Again, *loading*, not initializing.


    13.  NTLDR loads the System registry hive into memory.  If Windows cannot load the registry hive, it may be necessary to restore the System hive from the ERD or by using the Windows Recovery Console to copy the files from the Windows\Repair or Windows\Repair\RegBack folder.


    14.  NTLDR reads the HKLM\System\Select registry key.

    a.    If the Last Known Good option was not selected, NTLDR looks for the “Default” value in the SELECT key to select which ControlSet to use.  (0x1 = ControlSet001, 0x2 = ContolSet002, 0x3 = ControlSet003) 

    b.    If the Last Known Good option was selected, NTLDR looks for the “LastKnownGood” value in the SELECT key to select the ControlSet to use.


    15.  NTLDR sets the “Current” value in the HKLM\System\Select key to the ControlSet being used.


    Note:  At this point, NTLDR has determined which ControlSet it needs to use, but we don’t actually create the “CurrentControlSet” symbolic link until the kernel initializes (more on that in our next installment).  Therefore, for the purposes of this discussion, I’ll use *ActiveControlSet* to refer to whichever registry ControlSet has been identified as the one NTLDR needs to use.


    16.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Codepage for the “Acp”, “Oemcp”, and “Oemhal” values.  These entries are pointers to the table of codepage information stored in the same key.  The table has the filenames that correspond to the codepage numbers.  NTLDR loads these files into memory.  The defaults are C_1252.nls, C_437.nls, and Vgaoem.fon.


    A code page is a character set specific to an operating system or hardware platform.  It maps character codes to individual characters.  Different code pages include different special characters, typically customized for a language or a group of languages.  The system uses code pages to translate keyboard input into character values for non-Unicode based applications, and to translate character values into characters for non-Unicode based output displays.


    17.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Language for the “Default” value.  This entry is a pointer to the table of language information stored in the same key.  NTLDR loads this file into memory.  The default is L_intl.nls.


    18.  NTLDR scans HKLM\System\*ActiveControlSet*\Services for device drivers with a “Start” value of 0x0.  It finds their “Group” and “Tag” values also.


    19.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\ServiceGroupOrder for the order that service groups should be loaded.


    20.  NTLDR sorts the drivers by “Group” and then by their “Tag” value.  Each service or driver has a particular group to which it belongs.  All drivers within a specific group are ordered by tag. 


    21.  NTLDR loads the 0x0 device drivers into memory.  It does not initialize or start these drivers.


    22.  The board is set, and the pieces are moving.  NTLDR executes Ntoskrnl.exe.


    Troubleshooting the Boot Loader Phase


    Okay, now that we’ve seen what is supposed to happen, what could possibly go wrong?  Well, several things…


    First, all file-system activity during this phase uses BIOS Int13, Extended Int13 or Ntbootdd.sys. 


    1.    If your computer uses Extended Int 13, verify that Ext. Int13 is enabled in your system’s BIOS or SCSI BIOS as appropriate.


    2.    If you are using a SCSI driver, verify that you are using the correct SCSI driver (Ntbootdd.sys).  If you are using "SCSI()" syntax in the Boot.ini file, copy the correct device driver for the SCSI controller to the root and then rename it to Ntbootdd.sys.  This is the driver that is stored in the Winnt.sbs\System32\Drivers folder (for example, the Aic78xx.sys file).  If you use "MULTI()" syntax in the Boot.ini file, you do not need to do this.  Use the Windows Recovery Console or a Windows NT boot floppy to complete this step.


                301680 HOW TO: Create a Boot Disk for an NTFS or FAT Partition



    307654 How to install and use the Recovery Console in Windows XP




    Significant Errors during this phase:


    ·         Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.


    This error indicates that the ARC path is pointing to a hard drive or partition that is not on the system.  This message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file. 


    ·         Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.


    This error message may be displayed when there are problems with the Signature () ARC path syntax.


    Ntoskrnl.exe missing or corrupt.  This error could be the result of an incorrect ARC path in the Boot.ini file.  Use the information in the following knowledge base articles to correct the Boot.ini information.


                102873 BOOT.INI and ARC Path Naming Conventions and Usage



                155222 How to Determine the ARC Path



                130921 Creating an FT Boot Disk With Scsi( ) and Multi( ) Identifiers



    It is also possible to use the Windows XP Recovery Console to create a new Boot.ini for a Windows NT, Windows 2000 or .NET installation.  Please see the following knowledge base article for a description of this process:


                291980 A Discussion About the Bootcfg Command and Its Uses



    Specify the /SOS switch in the Boot.ini file and monitor the output to the screen.  If the system stops responding before anything is output to the screen, then the problem may be with hardware detection.  Replace Ntdetect.com with Ntdetect.chk to view the hardware detected by the system during this phase.


                103659 Setup Hangs While Inspecting Hardware; How to Use Ntdetect.com



                320040 HOW TO: Use Ntdetect.chk to Identify Hardware Problems




    Incorrect HAL

    If after specifying the /SOS option, you see that an incorrect version of the HAL.DLL is loading, use the following KB articles to replace the HAL or troubleshoot other HAL-related issues:


                314477 "Hardware Configuration Problem" Err Msg Starting Windows



                237556 Troubleshooting Windows 2000 Hardware Abstraction Layer Issues



    330184  "Invalid Boot.ini" or "Windows could not start" error messages




    Unable to load System hive

    Several problems could happen here.  First, consider that the amount of memory we’re dealing with is only 16MB.  We have to squeeze NTLDR, the System Registry Hive, the kernel, the HAL, all the boot drivers and the PFN database into a relatively small area.  So, with Windows Server 2003, we’ve given ourselves a bit more room. 


                302594 The System hive memory limitation is improved in Windows Server 2003



    If Windows is unable to load the System registry hive into memory, then it may be necessary to restore the System hive from the ERD or use the Windows Recovery Console to copy the files from the Winnt\Repair or Winnt\Repair\RegBack folder whichever version is most recent.


                307545 How to Recover from a Corrupted Registry




    Table of Startup Type Values


    0x0 = Boot

    Loaded into RAM by the Boot loader (NTLDR) during Boot Loader Phase, then initialized by the Kernel (NTOSKRNL) during Kernel Phase

    0x1 = System

    Loaded and initialized by the Kernel during Kernel Phase

    0x2 = Auto

    Loaded or started automatically at system startup

    0x3 = Manual

    Driver is manually started by the user or another process

    0x4 = Disabled

    Driver is not to be started (There are exceptions, such as NTFS)




    Well folks, we certainly have covered a lot of ground so far.  Any questions?  Yes, you there in the front…  Ah, why have we not discussed any blue screens?  Very good question.  We haven’t seen any BSOD’s up to this point in the boot sequence because the kernel isn’t running yet and STOP errors are generated by the kernel.  No kernel, no blue screens. 


    That’s all the time we have this week boys & girls.  Join us next time for How Windows Starts Up, The Kernel Phase.


  • Ntdebugging Blog

    How it Works: DLL Injection





    Hi everyone, this is Bob again.  I recently worked on an issue where the interaction of two threads in Winlogon led to a bugcheck.  One thread was a Winlogon thread initializing GDI.  The interesting thing about this scenario is how the other thread ended up in this process.




    What was the thread doing?


    Below is the user half of the thread stack.  The thread attempted to load a DLL.  


    ChildEBP RetAddr  Args to Child

    0058eaec 773901ad 773901d9 0058eafc 00240022 ntdll!KiFastSystemCallRet

    0058eb0c 775d96f3 775d1808 00000000 77e6f032 USER32!NtUserRegisterWindowMessage+0xc

    0058ed24 775e4755 00000000 00000001 7c837512 comctl32!InitGlobalMetrics+0x44

    0058ed3c 775e426a 00000031 0058ed68 7763490c comctl32!_ProcessAttach+0x98

    0058ed48 7763490c 775d0000 00000001 00000000 comctl32!DllMain+0x21

    0058ed68 7c81a352 775d0000 00000001 00000000 comctl32!_DllMainCRTStartup+0x52

    0058ed88 7c833465 776348ba 775d0000 00000001 ntdll!LdrpCallInitRoutine+0x14

    0058ee90 7c834311 00000000 00000000 7c8e2e58 ntdll!LdrpRunInitializeRoutines+0x367

    0058f124 7c834065 00000000 00080e98 0058f3ec ntdll!LdrpLoadDll+0x3cd

    0058f3a0 77e41bf3 00080e98 0058f3ec 0058f3cc ntdll!LdrLoadDll+0x198

    0058f408 77e5c70b 7c8e2e58 00000000 00000000 kernel32!LoadLibraryExW+0x1b2

    0058f41c 7c92a6a1 7c8e2e58 00000000 7c8e2e58 kernel32!LoadLibraryW+0x11

    0058f454 7c92a65f 7c8e2e58 7c8d0000 7c9297b6 SHELL32!SHFusionLoadLibrary+0x2a

    0058f460 7c9297b6 00000020 00000008 0058f6a8 SHELL32!DelayLoadCC+0x15

    0058f694 7c929728 0058f6a8 0000007c 00000001 SHELL32!SHFusionInitializeIDCC+0x92

    0058f8b4 7c92966f 7c8d0000 0000007c 00000001 SHELL32!SHFusionInitializeFromModuleID+0x3a

    0058f8c8 7c92962c 7c8d0000 00000001 0058f8f8 SHELL32!_ProcessAttach+0x34

    0058f8d8 7c92bb63 7c8d0000 00000001 00000000 SHELL32!DllMain+0x27

    0058f8f8 7c81a352 7c8d0000 00000001 00000000 SHELL32!_DllMainCRTStartup+0x52

    0058f918 7c833465 7c92bb1b 7c8d0000 00000001 ntdll!LdrpCallInitRoutine+0x14

    0058fa20 7c834311 00000000 00000000 00000004 ntdll!LdrpRunInitializeRoutines+0x367    ß This function is

          loading and calling init

          functions of dependent DLL’s

    0058fcb4 7c834065 00000000 00080760 0058ff7c ntdll!LdrpLoadDll+0x3cd

    0058ff30 77e41bf3 00080760 0058ff7c 0058ff5c ntdll!LdrLoadDll+0x198        ß 0058ff5c is the Unicode string

       pointer to DLL name

    0058ff98 77e5c70b 00570254 00000000 00000000 kernel32!LoadLibraryExW+0x1b2

    0058ffac 0057017e 00570254 00000000 00200008 kernel32!LoadLibraryW+0x11

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0058fff4 00000000 00570228 00905a4d 00000003 0x57017e



    The DLL being loaded depends on other DLLs.  These DLLs are loaded and initialized when the first DLL is loaded.  So if DLL ‘A’ has a call to DLL ‘B’, then DLL ‘B’ is loaded by the loader when DLL ‘A’ is loaded.



    What is so unusual about this thread?

    If you examine the IP for that start address taken from !thread and where it is calling LoadLibraryW, the IP is not in a range of any loaded module.


    1: kd> !thread

    THREAD 86edd020  Cid 7884.7528  Teb: 7ffdc000 Win32Thread: bc1adb48 RUNNING on processor 1

    Not impersonating

    DeviceMap                 e10018c0

    Owning Process            87c51d88       Image:         winlogon.exe

    Wait Start TickCount      2567944        Ticks: 0

    Context Switch Count      4                 LargeStack

    UserTime                  00:00:00.015

    KernelTime                00:00:00.000

    Start Address 0x00570000   ß Starting address.  This is not in any module displayed by “!peb”


    The !PEB extension will display the loaded module list and address range for the process.  It is not shown here because of space.  However this address is not in any loaded module.


    Let’s look at the function:


    00570000 55              push    ebp

    00570001 8bec            mov     ebp,esp

    00570003 83ec3c          sub     esp,3Ch

    00570006 8365e800        and     dword ptr [ebp-18h],0

    0057000a 8365ec00        and     dword ptr [ebp-14h],0

    0057000e 8365f800        and     dword ptr [ebp-8],0

    00570012 8365dc00        and     dword ptr [ebp-24h],0

    00570016 8365f000        and     dword ptr [ebp-10h],0


    1: kd> u

    0057001a 8365e000        and     dword ptr [ebp-20h],0

    0057001e 8365f400        and     dword ptr [ebp-0Ch],0

    00570022 6a01            push    1

    00570024 8b4508          mov     eax,dword ptr [ebp+8]        ß The 1st argument is a pointer to a list of functions.

    00570027 ff5004          call    dword ptr [eax+4]

    0057002a 8945fc          mov     dword ptr [ebp-4],eax

    0057002d 8b4508          mov     eax,dword ptr [ebp+8]        ß Function block.

    00570030 ff5010          call    dword ptr [eax+10h]


    1: kd> u

    00570033 8945e4          mov     dword ptr [ebp-1Ch],eax

    00570036 837de400        cmp     dword ptr [ebp-1Ch],0

    0057003a 0f84c0010000    je      00570200


    The first argument is a block of functions.  That is what was passed as the initial parameter.  What functions were passed?


    1: kd> dds 570228 l 5

    00570228  77e5c6fa kernel32!LoadLibraryW

    0057022c  77e6c2dc kernel32!SetErrorMode

    00570230  77e70531 kernel32!GetCurrentDirectoryW

    00570234  77e70d67 kernel32!SetCurrentDirectoryW

    00570238  77e63ec7 kernel32!GetProcessHeap


    These functions are standard kernel32 calls.  So, the question is why is it doing that?



    What is this thread doing?

    Based on the fact that the IP is not in any module, the IP is page aligned, and the thread was passed function addresses as it initial parameter, it looks like this thread was “injected” into this process. 



    How was this thread injected?

    Another process, (I do not know which one) allocated a block of memory in the Winlogon process via VirtualAllocEx.  This function takes a process handle as input and it can be a different process.  Then code can be moved into the process via WriteProcessMemory.  Then a thread can be created using the starting address of the memory address returned from VirtualAllocEx.



    Are we done?

    Nope – remember the block of addresses?  This is needed because the module was not loaded by the loader.  So the functions did not get resolved by the linker.  So the addresses of functions outside of the code moved are unknown.  Since in Windows Server 2003 the functions of certain DLLs stay at the same address, they can be passed to another process.  Vista and beyond does not do this, so this method will not work.




    Hope you found this interesting.  More on DLL injection can be found here: http://www.codeproject.com/dll/DLL_Injection_tutorial.asp and here: http://en.wikipedia.org/wiki/DLL_injection.


    AppInit_DLLs is another method of getting a DLL to load (in all process that rely on user32.dll), documented in KB 197571.  This can lead to problems though, as Raymond Chen discussed here.


  • Ntdebugging Blog

    Tracking down MmSt paged pool usage



    A trend that I’ve noticed recently are cases involving paged pool depletion with high MmSt tag usage that remains after trying KB304101 (PoolUsageMaximum). These pool allocations are used by the memory manager for section object prototype PTEs. There are generally only two options when this happens: 1) upgrade to a 64-bit platform, or 2) reduce the size of the volumes. But we may want to know what mapped files are using this memory. Here is how it can be done. Start with !memusage.


    5: kd> !memusage

     loading PFN database

    loading (100% complete)

    Compiling memory usage data (99% Complete).

                 Zeroed:  19073 ( 76292 kb)

                   Free:      0 (     0 kb)

                Standby: 1468824 (5875296 kb)

               Modified:    368 (  1472 kb)

        ModifiedNoWrite:   1927 (  7708 kb)

           Active/Valid: 605772 (2423088 kb)

             Transition:      0 (     0 kb)

                    Bad:      0 (     0 kb)

                Unknown:      0 (     0 kb)

                  TOTAL: 2095964 (8383856 kb)

      Building kernel map

      Finished building kernel map

    Scanning PFN database - (100% complete)


    Following this you will see the list of mapped files and their control areas.


      Usage Summary (in Kb):

    Control Valid Standby Dirty Shared Locked PageTables  name

    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )


    The control area is the address at the far left and has a Segment field that contains the total number of PTEs.


    5: kd> dt 8c62a638 _CONTROL_AREA Segment->TotalNumberOfPtes


       +0x000 Segment                    :

          +0x004 TotalNumberOfPtes          : 0x1e8b00


    The MmSt allocations contain these PTEs so all we need to do is multiply this by the size of a PTE to get the total size of the MmSt allocations for this control area. Note that there may be multiple allocations for this control area, but this number will reflect the total size all these allocations.


    5: kd> ?? 0x1e8b00 * sizeof(nt!_MMPTE)

    unsigned int 0xf45800


    So now we know the MmSt size in bytes for a single control area, or mapped file. What if we would like to see the totals for all mapped files from the !memusage output? First, place the !memusage output in a text file and remove all header information. You will also need to remove all tail information including the page file and process summaries. Every line should look like these.


    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )

    8b06ac18   516      0     0     0     0     0    No Name for File


    We want to include the “No Name for File” entries since those are valid mapped files even though the name could not be located. Next strip out everything but the control area address. You can use Excel or any other tool that allows you to select and delete columns in a text file. Now we have a file with a single column of all the control areas on the system. The following debugger command script can be used to process this file.


    $$ countptes.txt script

    r $t2 = 0;


    $$ Replace the memusage.txt file name with your file name.

    .foreach /f (ca "memusage.txt") {

        r $t1 = @@c++(((nt!_CONTROL_AREA *)(0x${ca}))->Segment->TotalNumberOfPtes);

        .printf "Control Area %p : %d\n", ${ca}, @$t1;

        r $t2 = @$t2 + @$t1;



    .printf "Total PTEs : %d\n", @$t2;

    .printf "MmSt size  : %d bytes\n", (@$t2 * @@c++(sizeof(nt!_MMPTE)));


    The following command will execute the script.


    5: kd> $$><countptes.txt


    This will show the number of PTEs for each control area, followed by a summary.


    Total PTEs : 62790244

    MmSt size  : 502321952 bytes


    A common high user of MmSt allocations is $Mft. The cache manager will hold the MmSt allocations for these file system metadata files at a cost of up to 4 files per PTE. This technique can be used to determine how much $Mft is using MmSt pool memory by first using findstr at a command prompt to isolate just those values from the !memusage output.


    C:\Projects>findstr /c:"$Mft" memusage.txt >mftusage.txt


    After stripping out the control area addresses with Excel and running the command script you’ll have the size of the MmSt allocations for just the $Mft files. If this is consuming most of the MmSt bytes then you are limited to the options mentioned at the beginning of this article. There may be other options if something else is the primary user but it will likely involve reducing some heavy load on the system.



  • Ntdebugging Blog

    How to track down High CPU in User Mode Applications - A live debug!



    Written by Jeff Dailey.


    Hello NTDebuggers,  I’d like to talk about a common issue we deal with on a regular basis.   We are often tasked with finding what functions are using CPU within a user mode process / application.  Typically a user will find an application that is using more CPU then they expect it to, and this can affect overall system performance and responsiveness.   


    For this exercise I’ve written some contrived sample code called EATCPU.  It’s included at the bottom of the blog post.    The following image of task manager shows EATCPU consuming 41% CPU time.   A customer or user may tell you this does not “normally” happen.  It’s always good to ask what “normal” is.   In this case we will say normal is ~10%.


    The best case scenario is a live debug on the process that is running at high CPU levels.  If you’re lucky enough to have a customer / user that will let you do a remote debug, and the problem reproduces on demand, you can take the following action.


    You need to install the debugging tools for Windows, and set your symbol path.  If at all possible acquire the symbols for the application you are debugging.  We’ll assume that you are the expert that supports said program.  If it’s written in house, get the symbols from the developer.  If it’s from a third party, that vendor may be willing to provide you with symbols for their product.  Microsoft has most of the symbols for our products available on our symbol server on the internet (!sympath srv*DownstreamStore*http://msdl.microsoft.com/download/symbols.)


    The next thing is to attach windbg.exe to the process in question.

    From the debuggers directory, type TLIST, this will list your process.  Get the process id and then run WinDBG.EXE –p PROCESSID, or if your debugging a program like EATCPU, you can run WINDBG C:\program\EATCPU.EXE.


    After the debugger is attached or the process is started in the debugger, reproduce the problem.


    Microsoft (R) Windows Debugger  Version 6.8.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.



    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.


    CommandLine: eatcpu.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 0041a000   eatcpu.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76780000 76890000   C:\Windows\syswow64\kernel32.dll

    ModLoad: 62bb0000 62cd1000   C:\Windows\WinSxS\x86_microsoft.vc80.debugcrt_1fc8b3b9a1e18e3b_8.0.50727.762_none_24c8a196583ff03b\MSVCR80D.dll

    ModLoad: 75cb0000 75d5a000   C:\Windows\syswow64\msvcrt.dll

    (1090.164): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=712b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    779c0004 cc              int     3

    0:000> g

    (1090.11d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    779c0004 cc              int     3

    0:007> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:007> g

    (1090.17d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    779c0004 cc              int     3


    Once the problem has started run the !runaway command.


    0:007> !runaway

     User Mode Time

      Thread       Time

       2:c04       0 days 0:01:08.827  ß Note this thread, thread 2:c04 is using more CPU than any other.

       7:17d4      0 days 0:00:00.000  ß Note the other threads are using very little if any CPU.

       6:1a4c      0 days 0:00:00.000

       5:d20       0 days 0:00:00.000

       4:157c      0 days 0:00:00.000

       3:1b28      0 days 0:00:00.000

       1:1134      0 days 0:00:00.000

       0:164       0 days 0:00:00.000


    0:007> ~2s  ß Using the thread number 2, set the thread context with the ~Ns command.

    *** WARNING: Unable to verify checksum for eatcpu.exe

    eax=cccccccc ebx=00b93c48 ecx=0000002b edx=00b937a8 esi=00000000 edi=00d9fcf0

    eip=0041169c esp=00d9fcd0 ebp=00d9fd9c iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202


    0041169c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fcf0=cccccccc

    0:002> k    ß Dump the call stack using k.


    If you look at the following call stack, the applications code in this thread starts where you see EATCPU, the code before that is C  Runtime code for begin thread. I want to trace all the code that is running under being thread.  The assumption here is that I’ll find something looping and eating CPU.  To do this I will use the WT command.   However first I need to specify a beginning address for WT to start trace at.


    Let's use UF to Unassemble the Function that started our code by taking the return address of  eatcpu!myThreadFunction.


    ChildEBP RetAddr 

    00d9fd9c 00411657 eatcpu!checkSomething+0x1c [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]

    00d9fe74 004115a8 eatcpu!trySomething+0x27 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 eatcpu!myThreadFunction+0x38 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c MSVCR80D!_beginthread+0x221

    00d9ffa0 768019f1 MSVCR80D!_beginthread+0x1bc

    00d9ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    00d9ffec 00000000 ntdll!_RtlUserThreadStart+0x23 [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> uf 004115a8 ß This command will unassemble the function at this address beginning to end.  


    0:007> uf 004115a8

    eatcpu!myThreadFunction [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 30]:

       30 00411570 55              push    ebp

       30 00411571 8bec            mov     ebp,esp

       30 00411573 81eccc000000    sub     esp,0CCh

       30 00411579 53              push    ebx

       30 0041157a 56              push    esi

       30 0041157b 57              push    edi

       30 0041157c 8dbd34ffffff    lea     edi,[ebp-0CCh]

       30 00411582 b933000000      mov     ecx,33h

       30 00411587 b8cccccccc      mov     eax,0CCCCCCCCh

       30 0041158c f3ab            rep stos dword ptr es:[edi]

       31 0041158e 8b4508          mov     eax,dword ptr [ebp+8]

       31 00411591 8945f8          mov     dword ptr [ebp-8],eax


    eatcpu!myThreadFunction+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 33]:

       33 00411594 b801000000      mov     eax,1

       33 00411599 85c0            test    eax,eax

       33 0041159b 7410            je      eatcpu!myThreadFunction+0x3d (004115ad)


    eatcpu!myThreadFunction+0x2d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]:

       35 0041159d 8b45f8          mov     eax,dword ptr [ebp-8]

       35 004115a0 8b08            mov     ecx,dword ptr [eax]

       35 004115a2 51              push    ecx

       35 004115a3 e880faffff      call    eatcpu!ILT+35(?trySomethingYAHHZ) (00411028)

       35 004115a8 83c404          add     esp,4

       36 004115ab ebe7            jmp     eatcpu!myThreadFunction+0x24 (00411594)


    eatcpu!myThreadFunction+0x3d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 37]:

       37 004115ad 5f              pop     edi

       37 004115ae 5e              pop     esi

       37 004115af 5b              pop     ebx

       37 004115b0 81c4cc000000    add     esp,0CCh

       37 004115b6 3bec            cmp     ebp,esp

       37 004115b8 e8a1fbffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       37 004115bd 8be5            mov     esp,ebp

       37 004115bf 5d              pop     ebp

       37 004115c0 c3              ret


    0:002> wt -or 00411570 ß We will use WT to Watch Trace this code.  I’ve selected the starting address of the myThreadFunction function.  I’ve also specified –or to print the return value of each function.  Wt produces very visual output.  It allows you to quickly identify patterns in the way the code executes that would be much more difficult just using the T (TRACE) command. 


        8     0 [  0] ntdll!RtlSetLastWin32Error eax = 0

    >> No match on ret

        8     0 [  0] ntdll!RtlSetLastWin32Error

        2     0 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

        9     3 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       12     6 [  0] eatcpu!checkSomething eax = 0

    >> No match on ret

       12     6 [  0] eatcpu!checkSomething

        7     0 [  0] eatcpu!trySomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       10     3 [  0] eatcpu!trySomething eax = 0

    >> No match on ret

       10     3 [  0] eatcpu!trySomething

        9     0 [  0] eatcpu!myThreadFunction  ß I see a pattern,  a loop.   Beginning of loop.

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       18   176 [  0] eatcpu!myThreadFunction  ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError  ß Always look for what might be going wrong! Last error can give you a clue.  We are setting last error at the low level of the loop

     16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0  ß Also note checkSomething is returning ZERO, this might indicate a problem.  You need to look at the code or assembler.

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       27   352 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       36   528 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

    12930 instructions were executed in 12929 events (0 from other threads)


    Function Name                               Invocations MinInst MaxInst AvgInst

    eatcpu!ILT+180(?checkSomethingYAHHZ)                 69       1       1       1

    eatcpu!ILT+345(__RTC_CheckEsp)                      210       1       1       1

    eatcpu!ILT+35(?trySomethingYAHHZ)                    70       1       1       1

    eatcpu!_RTC_CheckEsp                                210       2       2       2

    eatcpu!checkSomething                                70      60      74      73

    eatcpu!myThreadFunction                               1     630     630     630

    eatcpu!trySomething                                  71      10      70      68

    kernel32!SetLastError                                70       5       5       5

    ntdll!RtlSetLastWin32Error                           70      16      16      16


    0 system calls were executed


    eax=cccccccc ebx=00b93c48 ecx=00000002 edx=00b937a8 esi=00000000 edi=00d9fe6c

    eip=0041164c esp=00d9fda8 ebp=00d9fe74 iopl=0         nv up ei pl nz na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206


    0041164c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fe6c=cccccccc


    0:002> !gle ß Now that we have broken in let's check and see what the last error is using !GLE (Get Last Error)  This dumps out the last error from the TEB.

    LastErrorValue: (Win32) 0x57 (87) - The parameter is incorrect.

    LastStatusValue: (NTSTATUS) 0xc000000d - An invalid parameter was passed to a service or function.


    0:007> bp kernel32!SetLastError  ß Lets set a breakpoint on last error to examine what is going on in the function calling it.

    0:007> g

    Breakpoint 1 hit

    eax=cccccccc ebx=00b93c48 ecx=00000000 edx=00b937a8 esi=00d9fcd0 edi=00d9fd9c

    eip=767913dd esp=00d9fcc8 ebp=00d9fd9c iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    767913dd 8bff            mov     edi,edi

    0:002> kv ß Get the call stack

    ChildEBP RetAddr 

    0:002> kv

    ChildEBP RetAddr  Args to Child             

    00d9fcc4 004116cb 00000057 00d9fe74 00000000 kernel32!SetLastError (FPO: [Non-Fpo])  ß 0x57 Invalid parameter error,  Why?

    00d9fd9c 00411657 00000000 00d9ff58 00000000 eatcpu!checkSomething+0x4b (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]

    00d9fe74 004115a8 00000000 00000000 00000000 eatcpu!trySomething+0x27 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 0017ff34 4f9f12e9 00000000 eatcpu!myThreadFunction+0x38 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c 00b937a8 00d9ffac 768019f1 MSVCR80D!_beginthread+0x221 (FPO: [Non-Fpo])

    00d9ffa0 768019f1 00b937a8 00d9ffec 77a2d109 MSVCR80D!_beginthread+0x1bc (FPO: [Non-Fpo])

    00d9ffac 77a2d109 00b93c48 00d926a6 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    00d9ffec 00000000 62bb4520 00b93c48 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo]) (CONV: stdcall) [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> !error 00000057 ß double check, using !error, this will decode the error into a human readable string.

    Error code: (Win32) 0x57 (87) - The parameter is incorrect.


    0:002> uf checkSomething  ß lets disassemble the function calling SetLastError.

    eatcpu!checkSomething [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]:

       49 00411680 55              push    ebp

       49 00411681 8bec            mov     ebp,esp

       49 00411683 81ecc0000000    sub     esp,0C0h

       49 00411689 53              push    ebx

       49 0041168a 56              push    esi

       49 0041168b 57              push    edi

       49 0041168c 8dbd40ffffff    lea     edi,[ebp-0C0h]

       49 00411692 b930000000      mov     ecx,30h

       49 00411697 b8cccccccc      mov     eax,0CCCCCCCCh

       49 0041169c f3ab            rep stos dword ptr es:[edi]

       50 0041169e 837d0800        cmp     dword ptr [ebp+8],0 ß Check what our first parameter is on the stack, EBP+8 remember  PLUS == Parameters.  Note looking at the stack it’s 00000000     

       50 004116a2 741d            je      eatcpu!checkSomething+0x41 (004116c1ß if parameter 1 ones 0 we are going to jump to this addres, else we execute the following code. (WE JUMP) eatcpu!checkSomething+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 52]:

       52 004116a4 8bf4            mov     esi,esp    ß The green would have been the good code path, non errant.

       52 004116a6 68fa000000      push    0FAh

       52 004116ab ff15a8814100    call    dword ptr [eatcpu!_imp__Sleep (004181a8)] ß Note we sleep or do some work other then eat CPU here if we are passed non ZERO

       52 004116b1 3bf4            cmp     esi,esp

       52 004116b3 e8a6faffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       53 004116b8 b801000000      mov     eax,1 ß We are setting EAX to 1, this means we have succeded

       53 004116bd eb15            jmp     eatcpu!checkSomething+0x54 (004116d4ß Now we jump to the clean up code for the fucntionn


    eatcpu!checkSomething+0x41 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]:

       57 004116c1 8bf4            mov     esi,esp  ß This appears to be a failure case.  We did not get an expected parameter so we report an error and return zero.

       57 004116c3 6a57            push    57h ß Pusing error 0x57 on the stack, invalid parameter.

       57 004116c5 ff15a4814100    call    dword ptr [eatcpu!_imp__SetLastError (004181a4)] ß Our call to setlasterror

       57 004116cb 3bf4            cmp     esi,esp

       57 004116cd e88cfaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       58 004116d2 33c0            xor     eax,eax  ß XORing eax with eax will make EAX Zero.  This is an error condition.


    eatcpu!checkSomething+0x54 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 60]:

       60 004116d4 5f              pop     edi

       60 004116d5 5e              pop     esi

       60 004116d6 5b              pop     ebx

       60 004116d7 81c4c0000000    add     esp,0C0h

       60 004116dd 3bec            cmp     ebp,esp

       60 004116df e87afaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       60 004116e4 8be5            mov     esp,ebp

       60 004116e6 5d              pop     ebp

       60 004116e7 c3              ret


    The key thing that should be observed by this scenario is that when dealing with a high CPU condition there is often some problem at the lower level of some loop condition that prevents the proper execution of code from happening.  If you’re lucky the problem is reported by some error facility in the OS or the application.  In either case you can use the above technique for isolation.


    The following is the sample code for EATCPU.     


    // eatcpu.cpp : Defines the entry point for the console application.



    #include "stdafx.h"

    #include <windows.h>

    #include <process.h>


    void myThreadFunction(void *);

    int doSomething(int);

    int trySomething(int);

    int checkSomething(int);


    int _tmain(int argc, _TCHAR* argv[])


          int truevalue = 1;

          int falsevalue = 0;

          // let's start some threads.  These should represent worker threads in a process.  Some of them will do “valid work” one will fail to do so.

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&falsevalue );

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);



          return 0;



    void myThreadFunction(void *value)


          int *localvalue = (int *)value;







    int doSomething(int value)


          return trySomething(value);



    int trySomething(int value)


          return checkSomething(value);



    int checkSomething(int value)



          { // Make sure we have have some valid input parameter. 

            //We will pretend we are doing work, this could be anything , file I/O etc.


                return TRUE;



          { // This is an error condition, this function expects an NON Zero parameter and will report ZERO as an invalid parameter.



                return FALSE;






  • Ntdebugging Blog

    Critical Device Database TIP


    On a fairly regular basis, Bob Golding, our resident GES storage guru, sends out debugging tips to our group. We thought our blog readers would find value in these tips so we’re posting one here. Let us know what you think of the nugget.




    Hi everyone, Bob here.  Today I thought we’d have a little discussion about the Critical Device Database (CDDB) in the registry and an issue that can be caused when a device is not contained there.  This database stores configuration data for new devices that must be installed and started before the Windows components that normally install devices have been started.  The idea behind the critical device database is to allow the installation of devices without using user mode plug-and-play manager.   If a device is determined to be new, this database is queried to see if the information needed to install it are present. 

    What was the issue?

    A customer was getting a Stop 0x7B (Inaccessible_Boot_Device) after they installed a BIOS update.  Further investigation via the debugger using the !devnode command showed the following issue with a few devices:

             DevNode 0x8631f008 for PDO 0x8631f8e0

               InstancePath is "PCI\VEN_15B3&DEV_5A44&SUBSYS_5A4415B3&REV_A1\6&38f4f1f2&0&00080310"

               State = DeviceNodeInitialized (0x302)

               Previous State = DeviceNodeUninitialized (0x301)

               Problem = CM_PROB_NOT_CONFIGURED

    The above device is a bridge, and according to the definition of CM_PROB_NOT_CONFIGURED, it does not have a ConfigFlags registry entry.  I saw that this same problem occurred with a few bridges.  If the bridge cannot be enumerated, devices on the bridge will not be discovered either.  The Instance ID is used to look up the particulars such as driver name and class in the HKLM\System\CurrentControlSet\ENUM key in the registry.  What happened here is the lookup failed and the system thought it was a new device.  Since based on the device class this device was needed for boot, a Stop 0x7B occurred.

    What caused the issue?

    When the BIOS was updated the Instance ID included the version number of the bridge.  The update increased the version number, so the Instance ID was changed.

           DevNode 0x8635ca40 for PDO 0x8634c4a8

              InstancePath is "PCI\VEN_8086&DEV_1A38&SUBSYS_02DD1014&REV_B1\3&11583659&0&40"

              State = DeviceNodeInitialized (0x302)

              Previous State = DeviceNodeUninitialized (0x301)

              Problem = CM_PROB_NOT_CONFIGURED

    Looking at the registry data we could see:


    So a revision change caused the issue.

    What stops this from happening?

    Certain devices in the system are required for boot.  The device class determines if the device is in the boot device family.  If so, the hardware ID is written to the CDDB in the registry, so that if the device is determined to be new, it can be found there during boot.

    Below is an example of a bridge entry.  The contents of the pci#ven_8086&dev_244e contain the driver and the class.  This is enough to get the device started for boot.  The user mode PNP manager will complete the installation.


    How come this bridge was not there?

    When the INF was run, the device class was set to “unknown” class, so the OS did not know to write the information in the CDDB.  

    What was done to correct the problem?

    The BIOS update was temporarily reverted, and then the correct install package was found with the correct INF that has the bridges defined as a system device.  The device was re-installed (pre-update) so the device could be written properly in the CDDB, then the BIOS update was reapplied without a problem.



  • Ntdebugging Blog

    How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump


    Prior to Windows Vista and Windows Server 2008 we had to keep a large page file on the system drive (typically drive C:) in order to properly capture system memory dumps for troubleshooting.  This presented problems as systems with very large amounts of RAM became more common, resulting in requirements for very large amounts of free space on the C: drive, or requiring that system visible memory be artificially limited for troubleshooting purposes.  This is no longer a requirement thanks to the Dedicated Dump File feature, which is available for use in Windows Vista and later operating systems.


    What is it?

    The dedicated dump file is basically a page file that is reserved for use only by the system crash dump routines.  It is not used for paging virtual memory.  Like a page file, the system process keeps an open handle to the dedicated dump file, which prevents it from being deleted.  When you manually initiate a memory dump, or the system crashes on its own, the data is written into the dedicated dump file instead of the page file on the system drive.  This feature is enabled by setting the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DedicatedDumpFile

    Type:  REG_SZ

    Value: A dedicated dump file together with a full path, such as D:\dedicateddumpfile.sys

    Where can it be stored?

    The dedicated dump file can be stored on any local volume that can support a page file.  This volume must be considered a fixed disk device and must not be hosting Windows clustered storage.  For example, this means you can store this file on a second partition of your server’s primary hard disk, or on a SAN volume that is mounted early in the boot process. 

    Fun fact: Unlike a page file, you can actually direct the dedicated dump file to a sub-folder, even one that is actually a volume mount point.  However, the sub-folder must pre-exist during boot, and there is generally little reason to actually do this.


    How much space will it use?

    You can automatically allow the system to determine the size of the dedicated dump file, or you can manually set the size by using the DumpFileSize registry value. 


    System determined size:

    If you choose to let the system determine the size, then regardless of the dump file type set via the CrashDumpEnabled value, the dedicated dump file will be roughly the size of system visible RAM plus 256 MB.  The extra 256 MB allows room for secondary dump data that may be written by driver bug check callback routines.  In order to write this data, the drivers would have to first call KeRegisterBugCheckReasonCallback() to register the routines that would execute when the system bug checks.   This is currently not very common due to the limitations imposed by the high IRQL at which bug check callbacks run.  Allowing the system to determine the size is a good idea, especially when disk space is not of huge concern.  Also, the dedicated dump file size will be adjusted automatically at boot if additional RAM is installed or removed later.  This includes artificial memory reduction implemented via the BCDEdit option, truncatememory.  This eliminates the need to manually resize your page file to support proper collection of a memory dump.


    Manual size specified via the registry:

    The following registry value will allow you to manually set the dedicated dump file size in megabytes:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFileSize

    Type:  REG_DWORD

    Value: The dump file size in megabytes



    One good reason one might have to set this value manually is in the case of a very large memory system from which only a kernel memory dump is desired.  In this case, a dedicated dump file the size of RAM+256 MB is not required.  Generally, even on 64-bit systems, kernel memory dumps are no larger than a few gigabytes, although they have the potential to be much larger.  In nearly all cases reserving one third the size of RAM will be more than sufficient to capture kernel memory.  If you have plenty of room on the drive and want to play it safe, you can simply allow the system to determine the size for you by setting the DumpFileSize value to zero, or by not creating the value at all.

    Although you may set the DumpFileSize to a value larger than needed, or the system chooses a size larger than the type of dump selected, the resultant file size after a bug check will only be as large as the data that is stored in it.  This is most commonly the size of physical RAM + room for the dump header information.  If you choose to set the DumpFileSize value manually, then you should also set the IgnorePagefileSize value under the CrashControl key to 1.  This will help ensure a dump file is created.


    What about the final dump file?

    After a bug check, the data will be written to the dedicated dump file on disk.  However, the final memory dump location will be based on the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFile

    Type:  REG_EXPAND_SZ

    Value: The target dump file name together with the full path, such as D:\memory.dmp



    The system will extract the dump data from the dedicated dump file and save the resultant file to the location and file name specified for the DumpFile value.  The default for this value is %SYSTEMROOT%\MEMORY.DMP.  Since we are using the dedicated dump file to overcome space limitations of the system drive, we will likely want to also set the DumpFile value to point to an alternate location than the default. 

    If you set this value to point to the same volume as the dedicated dump file, it will save WERFAULT.EXE the step of performing an additional copy of the temporary dump file after reboot.  Instead the file will simply be renamed on the target volume.   If there is not enough room for both the dedicated dump file and the memory dump after the bug check, then the dedicated dump file will not be recreated, and only the memory dump will remain.  If this happens and the system bug checks again, only the original dump will remain because there will be no dedicated dump to write the second bug check to until space is freed up on the drive and the system is rebooted.


    DedicatedDumpFile Usage Example:

    The following example will set the dedicated dump file and resultant dump to be stored on the root if drive D:


    Windows Registry Editor Version 5.00








    (Note: The lines above can be saved as a .REG file and imported into the registry)


    The screenshot below demonstrates a system configured to capture a complete 4 GB memory dump to a manually sized dedicated dump file, and will store the final MEMORY.DMP on the D: drive.


    Important information:

    949052 - Kernel memory dump files may not be generated on Windows Server 2008-based and Windows Vista SP1 or later based computers when physical memory is larger than the size of the page file


    957517 - A dedicated complete memory dump file may not be successfully generated if the volume that stores the dedicated dump file has insufficient free space

    950858 - Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory


Page 5 of 24 (240 items) «34567»