Desktop heap is probably not something that you spend a lot of time thinking about, which is a good thing. However, from time to time you may run into an issue that is caused by desktop heap exhaustion, and then it helps to know about this resource. Let me state up front that things have changed significantly in Vista around kernel address space, and much of what I’m talking about today does not apply to Vista.
Laying the groundwork: Session Space
To understand desktop heap, you first need to understand session space. Windows 2000, Windows XP, and Windows Server 2003 have a limited, but configurable, area of memory in kernel mode known as session space. A session represents a single user’s logon environment. Every process belongs to a session. On a Windows 2000 machine without Terminal Services installed, there is only a single session, and session space does not exist. On Windows XP and Windows Server 2003, session space always exists. The range of addresses known as session space is a virtual address range. This address range is mapped to the pages assigned to the current session. In this manner, all processes within a given session map session space to the same pages, but processes in another session map session space to a different set of pages.
Session space is divided into four areas: session image space, session structure, session view space, and session paged pool. Session image space loads a session-private copy of Win32k.sys modified data, a single global copy of win32k.sys code and unmodified data, and maps various other session drivers like video drivers, TS remote protocol driver, etc. The session structure holds various memory management (MM) control structures including the session working set list (WSL) information for the session. Session paged pool allows session specific paged pool allocations. Windows XP uses regular paged pool, since the number of remote desktop connections is limited. On the other hand, Windows Server 2003 makes allocations from session paged pool instead of regular paged pool if Terminal Services (application server mode) is installed. Session view space contains mapped views for the session, including desktop heap.
Session Space layout:
Session Image Space: win32k.sys, session drivers
Session Structure: MM structures and session WSL
Session View Space: session mapped views, including desktop heap
Session Paged Pool
Sessions, Window Stations, and Desktops
You’ve probably already guessed that desktop heap has something to do with desktops. Let’s take a minute to discuss desktops and how they relate to sessions and window stations. All Win32 processes require a desktop object under which to run. A desktop has a logical display surface and contains windows, menus, and hooks. Every desktop belongs to a window station. A window station is an object that contains a clipboard, a set of global atoms and a group of desktop objects. Only one window station per session is permitted to interact with the user. This window station is named "Winsta0." Every window station belongs to a session. Session 0 is the session where services run and typically represents the console (pre-Vista). Any other sessions (Session 1, Session 2, etc) are typically remote desktops / terminal server sessions, or sessions attached to the console via Fast User Switching. So to summarize, sessions contain one or more window stations, and window stations contain one or more desktops.
You can picture the relationship described above as a tree. Below is an example of this desktop tree on a typical system:
- Session 0
| ---- WinSta0 (interactive window station)
| | |
| | ---- Default (desktop)
| | ---- Disconnect (desktop)
| | ---- Winlogon (desktop)
| ---- Service-0x0-3e7$ (non-interactive window station)
| | ---- Default (desktop)
| ---- Service-0x0-3e4$ (non-interactive window station)
| ---- SAWinSta (non-interactive window station)
| | ---- SADesktop (desktop)
- Session 1
- Session 2
---- WinSta0 (interactive window station)
---- Default (desktop)
---- Disconnect (desktop)
---- Winlogon (desktop)
In the above tree, the full path to the SADesktop (as an example) can be represented as “Session 0\SAWinSta\SADesktop”.
Desktop Heap – what is it, what is it used for?
Every desktop object has a single desktop heap associated with it. The desktop heap stores certain user interface objects, such as windows, menus, and hooks. When an application requires a user interface object, functions within user32.dll are called to allocate those objects. If an application does not depend on user32.dll, it does not consume desktop heap. Let’s walk through a simple example of how an application can use desktop heap.
1. An application needs to create a window, so it calls CreateWindowEx in user32.dll.
2. User32.dll makes a system call into kernel mode and ends up in win32k.sys.
3. Win32k.sys allocates the window object from desktop heap
4. A handle to the window (an HWND) is returned to caller
5. The application and other processes in the same session can refer to the window object by its HWND value
Where things go wrong
Normally this “just works”, and neither the user nor the application developer need to worry about desktop heap usage. However, there are two primary scenarios in which failures related to desktop heap can occur:
So how do you know if you are running into these problems? Processes failing to start with a STATUS_DLL_INIT_FAILED (0xC0000142) error in user32.dll is a common symptom. Since user32.dll needs desktop heap to function, failure to initialize user32.dll upon process startup can be an indication of desktop heap exhaustion. Another symptom you may observe is a failure to create new windows. Depending on the application, any such failure may be handled in different ways. Note that if you are experiencing problem number one above, the symptoms would usually only exist in one session. If you are seeing problem two, then the symptoms would be limited to processes that use the particular desktop heap that is exhausted.
Diagnosing the problem
So how can you know for sure that desktop heap exhaustion is your problem? This can be approached in a variety of ways, but I’m going to discuss the simplest method for now. Dheapmon is a command line tool that will dump out the desktop heap usage for all the desktops in a given session. See our first blog post for a list of tool download locations. Once you have dheapmon installed, be sure to run it from the session where you think you are running out of desktop heap. For instance, if you have problems with services failing to start, then you’ll need to run dheapmon from session 0, not a terminal server session.
Dheapmon output looks something like this:
Desktop Heap Information Monitor Tool (Version 7.0.2727.0)
Copyright (c) 2003-2004 Microsoft Corp.
Session ID: 0 Total Desktop: ( 5824 KB - 8 desktops)
WinStation\Desktop Heap Size(KB) Used Rate(%)
WinSta0\Default 3072 5.7
WinSta0\Disconnect 64 4.0
WinSta0\Winlogon 128 8.7
Service-0x0-3e7$\Default 512 15.1
Service-0x0-3e4$\Default 512 5.1
Service-0x0-3e5$\Default 512 1.1
SAWinSta\SADesktop 512 0.4
__X78B95_89_IW\__A8D9S1_42_ID 512 0.4
As you can see in the example above, each desktop heap size is specified, as is the percentage of usage. If any one of the desktop heaps becomes too full, allocations within that desktop will fail. If the cumulative heap size of all the desktops approaches the total size of session view space, then new desktops cannot be created within that session. Both of the failure scenarios described above depend on two factors: the total size of session view space, and the size of each desktop heap allocation. Both of these sizes are configurable.
Configuring the size of Session View Space
Session view space size is configurable using the SessionViewSize registry value. This is a REG_DWORD and the size is specified in megabytes. Note that the values listed below are specific to 32-bit x86 systems not booted with /3GB. A reboot is required for this change to take effect. The value should be specified under:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Size if no registry value configured
Default registry value
Windows 2000 *
Windows Server 2003
* Settings for Windows 2000 are with Terminal Services enabled and hotfix 318942 installed. Without the Terminal Services installed, session space does not exist, and desktop heap allocations are made from a fixed 48 MB region for system mapped views. Without hotfix 318942 installed, the size of session view space is fixed at 20 MB.
The sum of the sizes of session view space and session paged pool has a theoretical maximum of slightly under 500 MB for 32-bit operating systems. The maximum varies based on RAM and various other registry values. In practice the maximum value is around 450 MB for most configurations. When the above values are increased, it will result in the virtual address space reduction of any combination of nonpaged pool, system PTEs, system cache, or paged pool.
Configuring the size of individual desktop heaps
Configuring the size of the individual desktop heaps is bit more complex. Speaking in terms of desktop heap size, there are three possibilities:
· The desktop belongs to an interactive window station and is a “Disconnect” or “Winlogon” desktop, so its heap size is fixed at 64KB or 128 KB, respectively (for 32-bit x86)
· The desktop heap belongs to an interactive window station, and is not one of the above desktops. This desktop’s heap size is configurable.
· The desktop heap belongs to a non-interactive window station. This desktop’s heap size is also configurable.
The size of each desktop heap allocation is controlled by the following registry value:
The default data for this registry value will look something like the following (all on one line):
SharedSection=1024,3072,512 Windows=On SubSystemType=Windows
The numeric values following "SharedSection=" control how desktop heap is allocated. These SharedSection values are specified in kilobytes.
The first SharedSection value (1024) is the shared heap size common to all desktops. This memory is not a desktop heap allocation, and the value should not be modified to address desktop heap problems.
The second SharedSection value (3072) is the size of the desktop heap for each desktop that is associated with an interactive window station, with the exception of the “Disconnect” and “Winlogon” desktops.
The third SharedSection value (512) is the size of the desktop heap for each desktop that is associated with a "non-interactive" window station. If this value is not present, the size of the desktop heap for non-interactive window stations will be same as the size specified for interactive window stations (the second SharedSection value).
Consider the two desktop heap exhaustion scenarios described above. If the first scenario is encountered (session view space is exhausted), and most of the desktop heaps are non-interactive, then the third SharedSection can be decreased in an effort to allow more (smaller) non-interactive desktop heaps to be created. Of course, this may not be an option if the processes using the non-interactive heaps require a full 512 KB. If the second scenario is encountered (a single desktop heap allocation is full), then the second or third SharedSection value can be increased to allow each desktop heap to be larger than 3072 or 512 KB. A potential problem with this is that fewer total desktop heaps can be created.
What are all these window stations and desktops in Session 0 anyway?
Now that we know how to tweak the sizes of session view space and the various desktops, it is worth talking about why you have so many window stations and desktops, particularly in session 0. First off, you’ll find that every WinSta0 (interactive window station) has at least 3 desktops, and each of these desktops uses various amounts of desktop heap. I’ve alluded to this previously, but to recap, the three desktops for each interactive window stations are:
· Default desktop - desktop heap size is configurable as described below
· Disconnect desktop - desktop heap size is 64k on 32-bit systems
· Winlogon desktop - desktop heap size is 128k on 32-bit systems
Note that there can potentially be more desktops in WinSta0 as well, since any process can call CreateDesktop and create new desktops.
Let’s move on to the desktops associated with non-interactive window stations: these are usually related to a service. The system creates a window station in which service processes that run under the LocalSystem account are started. This window station is named service-0x0-3e7$. It is named for the LUID for the LocalSystem account, and contains a single desktop that is named Default. However, service processes that run as LocalSystem interactive start in Winsta0 so that they can interact with the user in Session 0 (but still run in the LocalSystem context).
Any service process that starts under an explicit user or service account has a window station and desktop created for it by service control manager, unless a window station for its LUID already exists. These window stations are non-interactive window stations. The window station name is based on the LUID, which is unique for every logon. If an entity (other than System) logs on multiple times, a new window station is created for each logon. An example window station name is “service-0x0-22e1$”.
A common desktop heap issue occurs on systems that have a very large number of services. This can be a large number of unique services, or one (poorly designed, IMHO) service that installs itself multiple times. If the services all run under the LocalSystem account, then the desktop heap for Session 0\Service-0x0-3e7$\Default may become exhausted. If the services all run under another user account which logs on multiples times, each time acquiring a new LUID, there will be a new desktop heap created for every instance of the service, and session view space will eventually become exhausted.
Given what you now know about how service processes use window stations and desktops, you can use this knowledge to avoid desktop heap issues. For instance, if you are running out of desktop heap for the Session 0\Service-0x0-3e7$\Default desktop, you may be able to move some of the services to a new window station and desktop by changing the user account that the service runs under.
I hope you found this post interesting and useful for solving those desktop heap issues! If you have questions are comments, please let us know.
- Matthew Justice
[Update: 7/5/2007 - Desktop Heap, part 2 has been posted]
[Update: 9/13/2007 - Talkback video: Desktop Heap has been posted]
[Update: 3/20/2008 - The default interactive desktop heap size has been increased on 32-bit Vista SP1]
Hi! My name is Tate. I’m an Escalation Engineer on the Microsoft Critical Problem Resolution Platforms Team. I wanted to share one of the most common errors we troubleshoot here on the CPR team, its root cause being pool consumption, and the methods by which we can remedy it quickly!
This issue is commonly misdiagnosed, however, 90% of the time it is actually quite possible to determine the resolution quickly without any serious effort at all!
First, what do these events really mean?
Event ID 2020 Event Type: Error Event Source: Srv Event Category: None Event ID: 2020 Description: The server was unable to allocate from the system paged pool because the pool was empty.
Event ID 2019 Event Type: Error Event Source: Srv Event Category: None Event ID: 2019 Description: The server was unable to allocate from the system NonPaged pool because the pool was empty.
This is our friend the Server Service reporting that when it was trying to satisfy a request, it was not able to find enough free memory of the respective type of pool. 2020 indicates Paged Pool and 2019, NonPaged Pool. This doesn’t mean that the Server Service (srv.sys) is broken or the root cause of the problem, more often rather it is the first component to see the resource problem and report it to the Event Log. Thus, there could be (and usually are) a few more symptoms of pool exhaustion on the system such as hangs, or out of resource errors reported by drivers or applications, or all of the above!
What is Pool?
First, Pool is not the amount of RAM on the system, it is however a segment of the virtual memory or address space that Windows reserves on boot. These pools are finite considering address space itself is finite. So, because 32bit(x86) machines can address 2^32==4Gigs, Windows uses (by default) 2GB for applications and 2GB for kernel. Of the 2GB for kernel there are other things we must fit in our 2GB such as Page Table Entries (PTEs) and as such the maximum amount of Paged Pool for 32bit(x86) of ~460MB puts this in perspective in terms of our realistic limits per processor architecture. As this implies, 64bit(x64&ia64) machines have less of a problem here due to their larger address space but there are still limits and thus no free lunch.
*For more about determining current pool limits see the common question post “Why am I out of Paged Pool at ~200MB…” at the end of this post.
*For more info about pools: About Memory Management > Memory Pools
*This has changed a bit for Vista, see Dynamic Kernel Address space
What are these pools used for?
These pools are used by either the kernel directly, indirectly by its support of various structures due to application requests on the system (CreateFile for example), or drivers installed on the system for their memory allocations made via the kernel pool allocation functions.
Literally, NonPaged means that this memory when allocated will not be paged to disk and thus resident at all times, which is an important feature for drivers. Paged conversely, can be, well… paged out to disk. In the end though, all this memory is allocated through a common set of functions, most common is ExAllocatePoolWithTag.
Ok, so what is using it/abusing it? (our goal right!?)
Now that we know that the culprit is Windows or a component shipping with Windows, a driver, or an application requesting lots of things that the kernel has to create on its behalf, how can we find out which?
There are really four basic methods that are typically used (listing in order of increasing difficulty)
1.) Find By Handle Count
Handle Count? Yes, considering that we know that an application can request something of the OS that it must then in turn create and provide a reference to…this is typically represented by a handle, and thus charged to the process’ total handle count!
The quickest way by far if the machine is not completely hung is to check this via Task Manager. Ctrl+Shift+Esc…Processes Tab…View…Select Columns…Handle Count. Sort on Handles column now and check to see if there is a significantly large one there (this information is also obtainable via Perfmon.exe, Process Explorer, Handle.exe, etc.).
What’s large? Well, typically we should raise an eyebrow at anything over 5,000 or so. Now that’s not to say that over this amount is inherently bad, just know that there is no free lunch and that a handle to something usually means that on the other end there is a corresponding object stored in NonPaged or Paged Pool which takes up memory.
So for example let’s say we have a process that has 100,000 handles, mybadapp.exe. What do we do next?
Well, if it’s a service we could stop it (which releases the handles) or if an application running interactively, try to shut it down and look to see how much total Kernel Memory (Paged or NonPaged depending on which one we are short of) we get back. If we were at 400MB of Paged Pool (Look at Performance Tab…Kernel Memory…Paged) and after stopping mybadapp.exe with its 100,000 handles are now at a reasonable 100MB, well there’s our bad guy and following up with the owner or further investigating (Process Explorer from sysinternals or the Windows debugger for example) what type of handles are being consumed would be the next step.
For essential yet legacy applications, which there is no hope of replacing or obtaining support, we may consider setting up a performance monitor alert on the handle count when it hits a couple thousand or so (Performance Object: Process, Counter: Handle Count) and taking action to restart the bad service. This is a less than elegant solution for sure but it could keep the one rotten apple from spoiling the bunch by hanging/crashing the machine!
2.) By Pooltag (as read by poolmon.exe)
Okay, so no handle count gone wild? No problem.
For Windows 2003 and later machines, a feature is enabled by default that allows tracking of the pool consumer via something called a pooltag. For previous OS’s we will need to use a utility such as gflags.exe to Enable Pool Tagging (which requires a reboot unfortunately). This is usually just a 3-4 character string or more technically “a character literal of up to four characters delimited by single quotation marks” that the caller of the kernel api to allocate the pool will provide as its 3rd parameter. (see ExAllocatePoolWithTag)
The tool that we use to get the information about what pooltag is using the most is poolmon.exe. Launch this from a cmd prompt, hit B to sort by bytes descending and P to sort the list by the type (Paged, NonPaged, or Both) and we have a live view into what’s going on in the system. Look specifically at the Tag Name and its respective Byte Total column for the guilty party! Get Poolmon.exe Here or More info about poolmon.exe usage.
The cool thing is that we have most of the OS utilized pooltags already documented so we have an idea if there is a match for one of the Windows components in pooltag.txt. So if we see MmSt as the top tag for instance consuming far and away the largest amount, we can look at pooltag.txt and know that it’s the memory manager and also using that tag in a search engine query we might get the more popular KB304101 which may resolve the issue!
We will find pooltag.txt in the ...\Debugging Tools for Windows\triage folder when the debugging tools are installed.
Oh no, what if it’s not in the list? No problem…
We might be able to find its owner by using one of the following techniques:
• For 32-bit versions of Windows, use poolmon /c to create a local tag file that lists each tag value assigned by drivers on the local machine (%SystemRoot%\System32\Drivers\*.sys). The default name of this file is Localtag.txt.
Really all versions---->• For Windows 2000 and Windows NT 4.0, use Search to find files that contain a specific pool tag, as described in KB298102, How to Find Pool Tags That Are Used By Third-Party Drivers.
3.) Using Driver Verifier
Using driver verifier is a more advanced approach to this problem. Driver Verifier provides a whole suite of options targeted mainly at the driver developer to run what amounts to quality control checks before shipping their driver.
However, should pooltag identification be a problem, there is a facility here in Pool Tracking that does the heavy lifting in that it will do the matching of Pool consumer directly to driver!
Be careful however, the only option we will likely want to check is Pool Tracking as the other settings are potentially costly enough that if our installed driver set is not perfect on the machine we could get into an un-bootable situation with constant bluescreens notifying that xyz driver is doing abc bad thing and some follow up suggestions.
In summary, Driver Verifier is a powerful tool at our disposal but use with care only after the easier methods do not resolve our pool problems.
4.) Via Debug (live and postmortem)
As mentioned earlier the api being used here to allocate this pool memory is usually ExAllocatePoolWithTag. If we have a kernel debugger setup we can set a break point here to brute force debug who our caller is….but that’s not usually how we do it, can you say, “extended downtime?” There are other creative live debug methods with are a bit more advanced that we may post later…
Usually, debugging this problem involves a post mortem memory.dmp taken from a hung server or a machine that has experienced Event ID: 2020 or Event ID 2019 or is no longer responsive to client requests, hung, or often both. We can gather this dump via the Ctrl+Scroll Lock method see KB244139 , even while the machine is “hung” and seemingly unresponsive to the keyboard or Ctrl+Alt+Del !
When loading the memory.dmp via windbg.exe or kd.exe we can quickly get a feel for the state of the machine with the following commands.
Debugger output Example 1.1 (the !vm command)
2: kd> !vm
*** Virtual Memory Usage *** Physical Memory: 262012 ( 1048048 Kb) Page File: \??\C:\pagefile.sys Current: 1054720Kb Free Space: 706752Kb Minimum: 1054720Kb Maximum: 1054720Kb Page File: \??\E:\pagefile.sys Current: 2490368Kb Free Space: 2137172Kb Minimum: 2490368Kb Maximum: 2560000Kb Available Pages: 63440 ( 253760 Kb) ResAvail Pages: 194301 ( 777204 Kb) Modified Pages: 761 ( 3044 Kb) NonPaged Pool Usage: 52461 ( 209844 Kb)<<NOTE! Value is near NonPaged Max NonPaged Pool Max: 54278 ( 217112 Kb) ********** Excessive NonPaged Pool Usage *****
Note how the NonPaged Pool Usage value is near the NonPaged Pool Max value. This tells us that we are basically out of NonPaged Pool.
Here we can use the !poolused command to give the same information that poolmon.exe would have but in the dump….
Debugger output Example 1.2 (!poolused 2)
Note the 2 value passed to !poolused orders pool consumers by NonPaged
2: kd> !poolused 2 Sorting by NonPaged Pool Consumed
Pool Used: NonPaged Paged Tag Allocs Used Allocs Used Thre 120145 76892800 0 0 File 187113 29946176 0 0 AfdE 89683 25828704 0 0 TCPT 41888 18765824 0 0 AfdC 90964 17465088 0 0
We now see the “Thre” tag at the top of the list, the largest consumer of NonPaged Pool, let’s go look it up in pooltag.txt….
Thre - nt!ps - Thread objects
Note, the nt before the ! means that this is NT or the kernel’s tag for Thread objects.
So from our earlier discussion if we have a bunch of thread objects, I probably have an application on the system with a ton of handles and or a ton of Threads so it should be easy to find!
Via the debugger we can find this out easily via the !process 0 0 command which will show the TableSize (Handle Count) of over 90,000!
Debugger output Example 1.3 (the !process command continued)
Note the two zeros after !process separated by a space gives a list of all running processes on the system.
PROCESS 884e6520 SessionId: 0 Cid: 01a0 Peb: 7ffdf000 ParentCid: 0124DirBase: 110f6000 ObjectTable: 88584448 TableSize: 90472Image: mybadapp.exe
We can dig further here into looking at the threads…
Debugger output Example 1.4 (the !process command continued)
0: kd> !PROCESS 884e6520 4PROCESS 884e6520 SessionId: 0 Cid: 01a0 Peb: 7ffdf000 ParentCid: 0124DirBase: 110f6000 ObjectTable: 88584448 TableSize: 90472.Image: mybadapp.exe
THREAD 884d8560 Cid 1a0.19c Teb: 7ffde000 Win32Thread: a208f648 WAIT THREAD 88447560 Cid 1a0.1b0 Teb: 7ffdd000 Win32Thread: 00000000 WAIT THREAD 88396560 Cid 1a0.1b4 Teb: 7ffdc000 Win32Thread: 00000000 WAIT THREAD 88361560 Cid 1a0.1bc Teb: 7ffda000 Win32Thread: 00000000 WAIT THREAD 88335560 Cid 1a0.1c0 Teb: 7ffd9000 Win32Thread: 00000000 WAIT THREAD 88340560 Cid 1a0.1c4 Teb: 7ffd8000 Win32Thread: 00000000 WAIT
And the list goes on…
We can examine the thread via !thread 88340560 from here and so on…
So in this rudimentary example the offender is clear in mybadapp.exe in its abundance of threads and one could dig further to determine what type of thread or functions are being executed and follow up with the owner of this executable for more detail, or take a look at the code if the application is yours!
Why am I out of Paged Pool at ~200MB when we say that the limit is around 460MB?
This is because the memory manager at boot decided that given the current amount of RAM on the system and other memory manager settings such as /3GB, etc. that our max is X amount vs. the maximum. There are two ways to see the maximum’s on a system.
1.) Process Explorer using its Task Management. View…System Information…Kernel Memory section.
Note that we have to specify a valid path to dbghelp.dll and Symbols path via Options…Configure Symbols.
c:\<path to debugging tools for windows>\dbghelp.dll
2.)The debugger (live or via a memory.dmp by doing a !vm)
*NonPaged pool size is not configurable other than the /3GB boot.ini switch which lowers NonPaged Pool’s maximum.
128MB with the /3GB switch, 256MB without
Conversely, Paged Pool size is often able to be raised to around its maximum manually via the PagedPoolSize registry setting which we can find for example in KB304101.
So what is this Pool Paged Bytes counter I see in Perfmon for the Process Object?
This is when the allocation is charged to a process via ExAllocatePoolWithQuotaTag. Typically, we will see ExAlloatePoolWithTag used and thus this counter is less effective…but hey…don’t pass up free information in Perfmon so be on the lookout for this easy win.
“Who's Using the Pool?” from Driver Fundamentals > Tips: What Every Driver Writer Needs to Know
Poolmon Remarks: http://technet2.microsoft.com/WindowsServer/en/library/85b0ba3b-936e-49f0-b1f2-8c8cb4637b0f1033.mspx
I hope you have enjoyed this post and hopefully it will get you going in the right direction next time you see one of these events or hit a pool consumption issue!
Cache is used to reduce the performance impact when accessing data that resides on slower storage media. Without it your PC would crawl along and become nearly unusable. If data or code pages for a file reside on the hard disk, it can take the system 10 milliseconds to access the page. If that same page resides in physical RAM, it can take the system 10 nanoseconds to access the page. Access to physical RAM is about 1 million times faster than to a hard drive. It would be great if we could load up all the contents of the hard drive into RAM, but that scenario is cost prohibitive and dangerous. Hard disk space is far less costly and is non-volatile (the data is persistent even when disconnected from a power source).
Since we are limited with how much RAM we can stick in a box, we have to make the most of it. We have to share this crucial physical resource with all running processes, the kernel and the file system cache. You can read more about how this works here:
The file system cache resides in kernel address space. It is used to buffer access to the much slower hard drive. The file system cache will map and unmap sections of files based on access patterns, application requests and I/O demand. The file system cache operates like a process working set. You can monitor the size of your file system cache's working set using the Memory\System Cache Resident Bytes performance monitor counter. This value will only show you the system cache's current working set. Once a page is removed from the cache's working set it is placed on the standby list. You should consider the standby pages from the cache manager as a part of your file cache. You can also consider these standby pages to be available pages. This is what the pre-Vista Task Manager does. Most of what you see as available pages is probably standby pages for the system cache. Once again, you can read more about this in "The Memory Shell Game" post.
Too Much Cache is a Bad Thing
The memory manager works on a demand based algorithm. Physical pages are given to where the current demand is. If the demand isn't satisfied, the memory manager will start pulling pages from other areas, scrub them and send them to help meet the growing demand. Just like any process, the system file cache can consume physical memory if there is sufficient demand.
Having a lot of cache is generally not a bad thing, but if it is at the expense of other processes it can be detrimental to system performance. There are two different ways this can occur - read and write I/O.
Excessive Cached Write I/O
Applications and services can dump lots of write I/O to files through the system file cache. The system cache's working set will grow as it buffers this write I/O. System threads will start flushing these dirty pages to disk. Typically the disk can't keep up with the I/O speed of an application, so the writes get buffered into the system cache. At a certain point the cache manager will reach a dirty page threshold and start to throttle I/O into the cache manager. It does this to prevent applications from overtaking physical RAM with write I/O. There are however, some isolated scenarios where this throttle doesn't work as well as we would expect. This could be due to bad applications or drivers or not having enough memory. Fortunately, we can tune the amount of dirty pages allowed before the system starts throttling cached write I/O. This is handled by the SystemCacheDirtyPageThreshold registry value as described in Knowledge Base article 920739: http://support.microsoft.com/default.aspx?scid=kb;EN-US;920739
Excessive Cached Read I/O
While the SystemCacheDirtyPageThreshold registry value can tune the number of write/dirty pages in physical memory, it does not affect the number of read pages in the system cache. If an application or driver opens many files and actively reads from them continuously through the cache manager, then the memory manger will move more physical pages to the cache manager. If this demand continues to grow, the cache manager can grow to consume physical memory and other process (with less memory demand) will get paged out to disk. This read I/O demand may be legitimate or may be due to poor application scalability. The memory manager doesn't know if the demand is due to bad behavior or not, so pages are moved simply because there is demand for it. On a 32 bit system, the file system cache working set is essentially limited to 1 GB. This is the maximum size that we blocked off in the kernel for the system cache working set. Since most systems have more than 1 GB of physical RAM today, having the system cache working set consume physical RAM with read I/O is less likely.
This scenario; however, is more prevalent on 64 bit systems. With the increase in pointer length, the kernel's address space is greatly expanded. The system cache's working set limit can and typically does exceed how much memory is installed in the system. It is much easier for applications and drivers to load up the system cache with read I/O. If the demand is sustained, the system cache's working set can grow to consume physical memory. This will push out other process and kernel resources out to the page file and can be very detrimental to system performance.
Fortunately we can also tune the server for this scenario. We have added two APIs to query and set the system file cache size - GetSystemFileCacheSize() and SetSystemFileCacheSize(). We chose to implement this tuning option via API calls to allow setting the cache working set size dynamically. I’ve uploaded the source code and compiled binaries for a sample application that calls these APIs. The source code can be compiled using the Windows DDK, or you can use the included binaries. The 32 bit version is limited to setting the cache working set to a maximum of 4 GB. The 64 bit version does not have this limitation. The sample code and included binaries are completely unsupported. It is just a quick and dirty implementation with little error handling.
This information applies specifically to Windows 2000, Windows XP, and Windows Server 2003. I will blog separately on the boot changes in Windows Vista.
Additional information about this topic may be found in Chapter 5 of Microsoft Windows Internals by Russinovich and Solomon and also on TechNet: http://technet.microsoft.com/en-us/library/bb457123.aspx
The key to understanding and troubleshooting system startup issues is to accurately ascertain the point of failure. To facilitate this determination, I have divided the boot process into the following four phases.
2. Boot Loader
Over the next few weeks, I’ll be describing each of these phases in detail and providing appropriate guidance for troubleshooting relevant issues for each phase.
The Initial Phase of boot is divided into the Power-On Self Test (POST) and Initial Disk Access.
Power-On Self Test
POST activities are fully implemented by the computer’s BIOS and vary by manufacturer. Please refer to the technical documentation provided with your hardware for details. However, regardless of manufacturer, certain generic actions are performed to ensure stable voltage, check RAM, enable interrupts for system usage, initialize the video adapter, scan for peripheral cards and perform a memory test if necessary. Depending on manufacturer and configuration, a single beep usually indicates a successful POST.
Troubleshooting the POST
ü Make sure you have the latest BIOS and firmware updates for the hardware installed in the system.
ü Replace the CMOS battery if it has failed.
ü Investigate recently added hardware (RAM, Video cards, SCSI adapters, etc.)
ü Remove recently added RAM modules.
ü Remove all adapter cards, then replace individually, ensuring they are properly seated.
ü Move adapter cards to other slots on the motherboard.
ü If the computer still will not complete POST, contact your manufacturer.
Initial Disk Access
Depending on the boot device order specified in your computer’s BIOS, your computer may attempt to boot from a CD-ROM, Network card, Floppy disk, USB device or a hard disk. For the purposes of this document, we’ll assume that we’re booting to a hard disk since that is the most common scenario.
Before we discuss the sequence of events that occur during this phase of startup, we need to understand a little bit about the layout of the boot disk. The structure of the hard disk can be visualized this way: (Obviously, these data areas are not to scale)
Hard disks are divided into Cylinders, Heads and Sectors. A sector is the smallest physical storage unit on a disk and is almost always 512 bytes in size. For more information about the physical structure of a hard disk, please refer to the following Resource Kit chapter: http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/resguide/diskover.mspx
There are two disk sectors critical to starting the computer that we’ll be discussing in detail:
· Master Boot Record (MBR)
· Boot Sector
The MBR is always located at Sector 1 of Cylinder 0, Head 0 of each physical disk. The Boot Sector resides at Sector 1 of each partition. These sectors contain both executable code and the data required to run the code.
Please note that there is some ambiguity involved in sector numbering. Cylinder/Head/Sector (CHS) notation begins numbering at C0/H0/S1. However, Absolute Sector numbering begins numbering at zero. Absolute Sector numbering is often used in disk editing utilities such as DskProbe. These differences are discussed in the following knowledge base article:
Q97819 Ambiguous References to Sector One and Sector Zero
Now that we have that straight, when does this information get written to disk? The MBR is created when the disk is partitioned. The Boot Sector is created when you format a volume. The MBR contains a small amount of executable code called the Master Boot Code, the Disk Signature and the partition table for the disk. At the end of the MBR is a 2-byte structure called a Signature Word or End of Sector marker, which should always be set to 0x55AA. A Signature Word also marks the end of an Extended Boot Record (EBR) and the Boot Sector.
The Disk Signature, a unique number at offset 0x1B8, identifies the disk to the operating system. Windows 2000 and higher operating systems use the disk signature as an index to store and retrieve information about the disk in the registry subkey HKLM\System\MountedDevices.
The Partition Table is a 64-byte data structure within the MBR used to identify the type and location of partitions on a hard disk. Each partition table entry is 16 bytes long (four entries max). Each entry starts at a predetermined offset from the beginning of the sector as follows:
Partition 1 0x1BE (446)
Partition 2 0x1CE (462)
Partition 3 0x1DE (478)
Partition 4 0x1EE (494)
The following is a partial example of a sample MBR showing three partition table entries in-use and one empty:
000001B0: 80 01
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00 00 00
000001D0: 81 0A 07 FE FF FF 8A F5 – 7F 00 3D 26 9C 00 00 00
000001E0: C1 FF 05 FE FF FF C7 1B – 1C 01 D6 96 92 00 00 00
000001F0: 00 00 00 00 00 00 00 00 – 00 00 00 00 00 00
Let’s take a look at each of the fields of a partition table entry individually. For each of these explanations, I’ll use the first partition table entry above and highlight the relevant section. Keep in mind that these values are little-endian.
00=Do Not Use for Booting
80=Active partition (Use for Booting)
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
Only the first 6 bits are used. The upper 2 bits of this byte are used by the Starting Cylinder field.
Uses 1 byte + 2 bits from the Starting Sector field to make up the cylinder value. The Starting Cylinder is a 10-bit number with a maximum value of 1023.
Defines the volume type. 0x07=NTFS
Other Possible System ID Values:
FAT12 primary partition or logical drive (fewer than 32,680 sectors in the volume)
FAT16 partition or logical drive (32,680–65,535 sectors or 16 MB–33 MB)
BIGDOS FAT16 partition or logical drive (33 MB–4 GB)
Installable File System (NTFS partition or logical drive)
FAT32 partition or logical drive
FAT32 partition or logical drive using BIOS INT 13h extensions
BIGDOS FAT16 partition or logical drive using BIOS INT 13h extensions
Extended partition using BIOS INT 13h extensions
Dynamic disk volume
Legacy FT FAT16 disk *
Legacy FT NTFS disk *
Legacy FT volume formatted with FAT32 *
Legacy FT volume using BIOS INT 13h extensions formatted with FAT32 *
Partition types denoted with an asterisk (*) indicate that they are also used to designate non-FT configurations such as striped and spanned volumes.
Ending Head (0xFE=254 decimal)
As with the Starting Sector, it only uses the first 6 bits of the byte. The upper 2 bits are used by the Ending Cylinder field.
Uses 1 byte in addition to the upper 2 bits from the Ending Sector field to make up the cylinder value. The Ending Cylinder is a 10-bit number with a maximum value of 1023.
The offset from the beginning of the disk to the beginning of the volume, counting by sectors.
0x0000003F = 63
The total number of sectors in the volume.
0x007FF54B = 8,385,867 Sectors = 4GB
Are you with me so far? Good! Now, Cylinder/Sector encoding can be a bit tricky, so let’s take a closer look.
Cylinder - Sector Encoding
Cylinder bits 1-8
9 & 10
As you can see, the Sector value occupies the lower 6 bits of the word and the Cylinder occupies the upper 10 bits of the word. In our example, the starting values for Cylinder and Sector are 01 00. Values in the MBR use reverse-byte ordering, which is also referred to as ‘little-endian’ notation. Therefore, we swap the bytes and find that our starting values are Cyl 0, Sec 1.
Our ending values are more interesting: BF 09. First, we swap the bytes and obtain a hex value of 0x09BF. This value in binary notation is 100110111111. The following table illustrates how we derive the correct partition table values from these bytes:
Example: BF 09
10 Cylinder value bits 1-8
The 6 low bits are all set to 1, therefore our Sector value is 111111 or 63. You can see above how the bits are arranged for the Cylinder value. The value above is 1000001001 (521). Since both Cylinder and Head values begin numbering at zero, we have a total of 522 Cylinders and 255 Heads represented here. This gives us an ending CHS value of: 522 x 255 x 63 = 8,385,930 sectors.
Subtracting the starting CHS address (Cylinder 0, Head 1, Sector 1) (63) gives us the total size of this partition: 8,385,867 sectors or 4GB. We can verify this number by comparing it to the Total Sectors represented in the partition table: 4B F5 7F 00. Applying reverse-byte ordering gives us 00 7F F5 4B which equals 8,385,867 sectors.
So, now that we have an understanding of what is contained within the structures on the disk, let’s look at the sequence of events that occur. Remember, this is just after POST has successfully completed.
1. The motherboard ROM BIOS attempts to access the first boot device specified in the BIOS. (This is typically user configurable and can be edited using the BIOS configuration utility.)
2. The ROM BIOS reads Cylinder 0, Head 0, and Sector 1 of the first boot device.
3. The ROM BIOS loads that sector into memory and tests it.
a. For a floppy drive, the first sector is a FAT Boot Sector.
b. For a hard drive, the first sector is the Master Boot Record (MBR).
4. When booting to the hard drive, the ROM BIOS looks at the last two signature bytes of Sector 1 and verifies they are equal to 55AA.
a. If the signature bytes do not equal 55AA the system assumes that the MBR is corrupt or the hard drive has never been partitioned. This invokes BIOS Interrupt 18, which displays an error that is BIOS-vendor specific such as “Operating System not found”.
b. If the BIOS finds that the last two bytes are 55AA, the MBR program executes.
5. The MBR searches the partition table for a boot indicator byte 0x80 indicating an active partition.
a. If no active partition is found, BIOS Interrupt 18 is invoked and a BIOS error is displayed such as “Operating System not found”.
b. If any boot indicator in the MBR has a value other than 0x80 or 0x00, or if more than one boot indicator indicates an active partition (0x80), the system stops and displays “Invalid partition table”.
c. If an active partition is found, that partition’s Boot Sector is loaded and tested.
6. The MBR loads the active partition’s Boot Sector and tests it for 55AA.
a. If the Boot Sector cannot be read after five retries, the system halts and displays “Error loading operating system”.
b. If the Boot Sector can be read but is missing its 55AA marker, “Missing operating system” is displayed and the system halts.
c. The bootstrap area is made up of a total of 16 sectors (0-15). If sector 0 for the bootstrap code is valid, but there is corruption in sectors 1-15, you may get a black screen with no error. In this case, it may be possible to transplant sectors 1-15 (not Sector 0) from another NTFS partition using DskProbe.
d. If the Boot Sector is loaded and it passes the 55AA test, the MBR passes control over to the active partition’s Boot Sector.
7. The active partition’s Boot Sector begins executing and looks for NTLDR. The contents of the Boot Sector are entirely dependent on the format of the partition. For example, if the boot partition is a FAT partition, Windows writes code to the Boot Sector that understands the FAT file system. However, if the partition is NTFS, Windows writes NTFS-capable code. The role of the Boot Sector code is to give Windows just enough information about the structure and format of a logical drive to enable it to read the NTLDR file from the root directory. The errors at this point of the boot process are file system specific. The following are possible errors with FAT and NTFS Boot Sectors.
a. FAT: In all cases it displays “press any key to restart” after either of the following errors.
(1) If NTLDR is not found “NTLDR is missing” is displayed
(2) If NTLDR is on a bad sector, “Disk Error” displays.
b. NTFS: In all cases it displays “Press CTRL+ALT+DEL to restart” after any of the following errors.
(1) If NTLDR is on a bad sector, “A disk read error occurred” is displayed. This message may also be displayed on Windows 2000 or higher systems if Extended Int13 calls are required, but have been disabled in the CMOS or SCSI BIOS. This behavior may also be seen if an FRS (File Record Segment) cannot be loaded or if any NTFS Metadata information is corrupt.
(2) If NTLDR is not found, “NTLDR is missing” displays.
(3) If NTLDR is compressed, “NTLDR is compressed” displays.
8. Once NTLDR is found, it is loaded into memory and executed. At this point the Boot Loader phase begins.
Troubleshooting the Initial Phase (Initial Disk Access)
If you are unable to boot Windows and the failure is occurring very early in the boot process, begin by creating a Windows boot floppy and using it to attempt to boot the system.
Q119467 How to Create a Bootable Disk for an NTFS or FAT Partition
If you are unable to boot to Windows NT with a floppy, skip to the section titled “Unable to boot using floppy”.
If you can successfully start the computer from the boot disk, the problem is limited to the Master Boot Record, the Boot Sector, or one or more of the following files: NTLDR, NTDetect.com, or Boot.ini. After Windows is running, you should immediately back up all data before you attempt to fix the Boot Sector or Master Boot Record.
1. Run a current virus scanning program to verify that no virus is present.
2. Select the method of troubleshooting based upon the error generated.
· Operating System not found
Typical BIOS-specific error indicating no 55AA signature on the MBR. This must be repaired manually using a disk editor such as DskProbe.
Warning: Running FixMBR or FDISK /MBR in this case will clear the partition table.
· Invalid Partition Table
May be repaired using DiskProbe – check for multiple 0x80 boot indicators.
· Error Loading Operating System
This error indicates an invalid Boot Sector. Use the “Repairing the Boot Sector” section below.
· Missing Operating System
This error indicates that the 55AA signature is missing from the Boot Sector. This may be replaced using DskProbe or by using the “Repairing the Boot Sector” section below.
· NTLDR is missing
Use the “Repairing the NTLDR file” section below.
· Disk error (FAT only)
Indicates that NTLDR is located on a bad sector.
· A disk read error occurred (NTFS only)
Indicates that NTLDR is on a bad sector. Use the “Repairing the NTLDR file” section below.
· NTLDR is compressed
Indicates that NTLDR is compressed. Uncompress or replace NTLDR. Note: This error is rare.
· Computer boots to a black screen with blinking cursor
Could be an issue with the MBR or the active partition’s boot code (Q228734). Use the appropriate section below.
Repairing the MBR
Use the FIXMBR command from the Windows Recovery Console to re-write the boot code located in the first 446 bytes of the MBR, but to leave the partition table intact.
326215 How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start
WARNING: DO NOT use this method if you are receiving “Operating System not found” or other BIOS-specific error. FIXMBR will zero out the partition table if the 55AA signature is missing from the MBR. The data will be unrecoverable.
Repairing the Boot Sector
Use the FIXBOOT command from the Windows Recovery Console to write a new Boot Sector to the system partition.
If you are not able to repair the Boot Sector by using this method, you may repair it manually using the following knowledge base article:
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Repairing the NTLDR file
Use the Recovery Console to copy NTLDR from \i386 directory of the Windows CD-ROM to the root of the hard drive.
Unable to boot using floppy
If you are unable to start the computer with a boot floppy and you are not receiving an error message, then the problem is most likely with your partition tables. If invalid partitions are present and you are unable to start your computer with a boot disk, formatting the disk, reinstalling Windows and restoring from backup may be your only option. It may be possible to recreate the partition table using DskProbe if a backup of the partition information is available. It also may be possible to manually reconstruct the partition table however this is outside the scope of this blog post. We may cover this later.
If you do not have a current backup of the data on the computer, as a last resort a data recovery service may be able to recover some of your information.
Helpful knowledge base articles:
Q272395 Error Message: Boot Record Signature AA55 Not Found
Q155892 Windows NT Boot Problem: Kernel File Is Missing From the Disk
Q228004 Changing Active Partition Can Make Your System Unbootable
Q155053 Black Screen on Boot
Q314503 Computer Hangs with a Black Screen When You Start Windows
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Okay boys & girls, that’s it for now. Next time, I’ll continue with the boot sequence and cover the Boot Loader phase. This is where things really get interesting… J
I recently came across a very interesting profiling tool that is available in Vista SP1 and Server 08 called the Windows Performance Analyzer. You can use this tool to profile and diagnose different kinds of symptoms that the machine is experiencing. This tool is built on top off the Event Tracing for Windows (ETW) infrastructure. It uses the ETW providers to record kernel events and then display them in a graphical format.
Performance Analyzer provides many different graphical views of trace data including:
Download the latest version of the Windows Performance Tools Kit, and install it on your machine. (http://www.microsoft.com/whdc/system/sysperf/perftools.mspx : Windows Performance Tools Kit, v.4.1.1 (QFE)) You will need to find the toolkit that corresponds to your processor architecture. Currently there are 3 versions available i.e. X86, IA64, X64.
After installation you should be able to see 2 new tools. The first one is Xperf, which is a command line tool that is used to capture the trace. The second is called XperfView, which graphically interprets the trace that has been collected by Xperf.
You will need to run the Xperf and XperfView from an elevated command prompt for all functionality.
For many tasks all you need for effective analysis is a kernel trace. For this example, we'll use the –on DiagEasy parameter to enable several kernel events including: image loading; disk I/O; process and thread events; hard faults; deferred procedure calls; interrupts; context switches; and, and performance counters. From an elevated command prompt launch xperf –on DiagEasy.
This starts the kernel logger in sequential mode to the default file "\kernel.etl"; uses a default buffer size of 64K, with a minimum of 64 and a maximum of 320 buffers.
To stop a trace, type xperf –d <filename>.etl at the command line. This will stop the trace and output the file.
There are 2 ways to view the trace. From an Elevated Command prompt, launch xperf <filename>.etl, or launch the XperfView tool and open the file manually. When you open the trace file, you should see something similar like this.
NOTE - While you need to run xperf from an elevated command prompt in order to record a trace you do not need an elevated command prompt in order to *analyze* a trace.
Using the Chart Selector tab, you can select all the graphs that you want to look at. To drill down in each chart, you can select the Summary table. For instance, in the CPU Sampling chart, the summary table gets you the summary of the processes that were running, with information like the amount of CPU time, CPU %, stacks (if the stacks were collected in the trace, see below). When looking at the Summary table for the Disk I/O chart, you can see which processes were writing files (the filename too!) to disk, as well as how much time it took.
You also have the ability to zoom in on a selected area. Another really cool feature is the ability to overlay multiple graphs on one frame. This way you can correlate different pieces of data together very easily.
Also, you select which counter instances you want to see in each specific chart. On the top right corner of each chart is a drop down box from where you can select the counter instances. For instance on the Disk I/O chart, you can select Disk 0, Disk 1, or a combination as well.
You can also view detailed information about the system that the trace was taken on. Click on the Trace menu item, and select System Configuration.
In the first sample Xperf command we ran, xperf –on DiagEasy. I am sure many of you were wondering what DiagEasy means. DiagEasy is a group of kernel events that are predefined by the Windows Performance Toolkit. This group includes Process, Threads, Kernel and User Image Load/Unload, Disk I/O, DPCs and Context Switch events.
When we used the xperf –on DiagEasy command, we did not specify an individual provider, so we enabled the kernel events for all the ETW providers on the system. If you want to enable events for a specific provider, you can the following format xperf -on: (GUID|KnownProviderName)[:Flags[:Level]]. For more information about ETW providers, Kernel Flags and Groups, you can run the xperf –help providers command.
One of the most powerful features in Performance Analyzer is the ability to visualize stacks. It's important to note that this requires no special instrumentation in the code – only that you have symbols for the binary components you are interested in analyzing.
When the trace is setup to collect the stacks, Performance Analyzer will display call stack summary information for the events that had stack walking enabled. Here is an example that takes a trace (with stack tracing enabled) of the entire system while running a "find string" utility.. We can use the Stack Tracing feature of Xperf to record a stack when certain events happen, or take sample at regular intervals over time. See xperf –help stackwalk output for more info.
Below, we will use the Stack Tracking feature of Xperf to take stack samples at regular intervals. With this output, we will be able to determine where the CPU is spending most of its time within a process.
xperf -on latency -stackwalk Profile
xperf -on latency -stackwalk Profile
Latency is the kernel group to enable certain events, including the profile event which records the CPUs' activity every millisecond. The "-stackwalk Profile" flag tells Xperf to record stack walks on every profile event, which makes the profile information much more useful. In other words, in order to get profile information with stack walks you need to turn on the profile event, and turn on stack walking for that event.
Note that decoding of stacks requires that symbol decoding be configured. However stacks can be recorded without symbols, and can even be viewed without symbols, although they are much less useful without symbols. I only mention this in the event you're trying to record a trace of a problematic machine with little time to mess around with _NT_SYMBOL_PATH.
To get a trace with the stack information, do the following:
Click on the selector tab to bring up the column chooser list. Then select "Process name", "Process", "Stack", "Weight" and "%Weight". These are the most useful columns when looking at stacks from the sample profile event. You should get a view similar to this.
At this point I need to mention a few of the restrictions with stack walking coupled with when and how it works.
· Xperf stack walking is not available on XP
· On Vista stack walking is available for x86, and is available for x64 as of Vista SP1.
· On Windows 7 stack walking is available.
· Stack walking on x64 is complicated. You have to set DisablePagingExecutive in the registry, as documented here:
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD –f
I recently came across a case where the customer was complaining that DPC processing was taking up too much CPU time. We ran Xperf on the machine and drilled down into the DPC activity on the machine.
From the Xperf graph, I was able to confirm that the customer was actually seeing high DPC usage. I selected the Summary for this chart, and got the list of drivers that were actually taking up CPU time.
Right off the bat, I could identify the driver that had a lot of DPC activity. I also noted that the average duration for each DPC from that driver was taking 788 microseconds. This is way too high. Each DPC should be taking a maximum of 100 microseconds.
Performance.Analyzer.QuickStart.xps – This is shipped with the performance toolkit.
From an elevated command prompt, launch xperf -help
In our previous article we discussed how to identify a pool leak using perfmon. Although it may be interesting to know that you have a pool leak, most customers are interested in identifying the cause of the leak so that it can be corrected. In this article we will begin the process of identifying what kernel mode driver is leaking pool, and possibly identify why.
Often when we are collecting data for a poor performance scenario there are two pieces of data that we collect. Perfmon log data is one, as we discussed in our previous article. The other piece of data is poolmon logs. The memory manager tracks pool usage according to the tag associated with the pool allocations, using a technique called pool tagging. Poolmon gathers this data and displays it in an easy to use format. Poolmon can also be configured to dump data to a log, and in some scenarios it is beneficial to schedule poolmon to periodically collect such logs. There are several available techniques to schedule poolmon, however that is beyond the scope of this article.
Poolmon has shipped with many different packages over the years; it is currently available with the Windows Driver Kit. If you install the WDK to the default folder, poolmon will be in “C:\Program Files (x86)\Windows Kits\8.0\Tools\x64”. Poolmon does not have dependencies on other modules in this folder; you can copy it to your other computers when you need to investigate pool usage.
How does pool tagging work? When a driver allocates pool it calls the ExAllocatePoolWithTag API. This API accepts a tag - a four-letter string - that will be used to label the allocation. It is up to a driver developer to choose this tag. Ideally each developer will choose a tag that is unique to their driver and use a different tag for each code path which calls ExAllocatePoolWithTag. Because each tag should be unique to each driver, if we can identify the tag whose usage corresponds with the leak we can then begin to identify the driver which is leaking the memory. The tag may also give the driver developer clues as to why the memory is being leaked, if they use a unique tag for each code path.
To view the pool usage associated with each tag run “poolmon -b” from a command prompt. This will sort by the number of bytes associated with each tag. If you are tracking pool usage over a period of time, you can log the data to a file with “poolmon -b -n poolmonlog1.txt”, replacing 1 with increasing numbers to obtain a series of logs. Once you have a series of logs you may be able to view usage increasing for a specific tag, in a corresponding fashion to what you see in perfmon.
When analyzing poolmon the important data is at the top. Typically the tag with the largest usage in bytes is the cause of the leak.
In the above data we can see that the tag with the most pool usage is “Leak”. Now that we know what tag is leaking we need to identify what driver is using this tag. Techniques for associating a leak with a tag vary, but findstr is often effective. Most drivers are located in c:\windows\system32\drivers, so that is a good starting point when looking for the driver. If you don’t find a result in that folder, go up a folder and try again, repeating until you get to the root of the drive.
C:\>findstr /s Leak *.sys
·∟ §£♂ Θ─☺ A╗☻ E☼"├Θ╡☺ Hï♣╔♂ ╞ $Θª☺ Hï♣:Hc┴ ┴ê\♦@ë
└δ_Aï ╞♣@∟ ☺ë♣▓← Aï@♦ë♣¼← δCAâ∙♦u╓AïAìI♦A╕Leak;┴☼B┴3╔ï╨ §
In the above output we can see that “Leak” is used in myfault.sys. If we hadn’t forced this leak with notmyfault, the next step in troubleshooting would be an internet search for the tag and the driver. Often such a search will allow you to identify a specific fault within the driver and a solution.
Don’t panic if findstr doesn’t find your tag, or if you find the tag but it is not unique to one driver. In future articles we will cover additional techniques for associating drivers with tags, and for associating allocations with specific code within a driver.
Matthew here again – I want to provide some follow-up information on desktop heap. In the first post I didn’t discuss the size of desktop heap related memory ranges on 64-bit Windows, 3GB, or Vista. So without further ado, here are the relevant sizes on various platforms...
Windows XP (32-bit)
· 48 MB = SessionViewSize (default registry value, set for XP Professional, x86)
· 20 MB = SessionViewSize (if no registry value is defined)
· 3072 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 512 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
· 128 KB = Winlogon desktop heap size
· 64 KB = Disconnect desktop heap size
Windows Server 2003 (32-bit)
· 48 MB = SessionViewSize (default registry value)
· 20 MB = SessionViewSize (if no registry value is defined; this is the default for Terminal Servers)
Windows Server 2003 booted with 3GB (32-bit)
· 20 MB = SessionViewSize (registry value has no effect)
You may also see reduced heap sizes when running 3GB. During the initialization of the window manager, an attempt is made to reserve enough session view space to accommodate the expected number of desktops heaps for a given session. If the heap sizes specified in the SharedSection registry value have been increased, the attempt to reserve session view space may fail. When this happens, the window manager falls back to a pair of “safe” sizes for desktop heaps (512KB for interactive, 128KB for non-interactive) and tries to reserve session space again, using these smaller numbers. This ensures that even if the registry values are too large for the 20MB session view space, the system will still be able to boot.
Windows Server 2003 (64-bit)
· 104 MB = SessionViewSize (if no registry value is defined; which is the default)
· 20 MB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 768 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
· 192 KB = Winlogon desktop heap size
· 96 KB = Disconnect desktop heap size
Windows Vista RTM (32-bit)
· Session View space is now a dynamic kernel address range. The SessionViewSize registry value is no longer used.
· 3072 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 512 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
Windows Vista (64-bit) and Windows Server 2008 (64-bit)
Windows Vista SP1 (32-bit) and Windows Server 2008 (32-bit)
· 12288 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 64 KB = Disconnect desktop heap size
Windows Vista introduced a new public API function: CreateDesktopEx, which allows the caller to specify the size of desktop heap.
Additionally, GetUserObjectInformation now includes a new flag for retrieving the desktop heap size (UOI_HEAPSIZE).
[Update: 3/20/2008 - Added Vista SP1 and Server 2008 info. More information can be found here.]
Greetings fellow debuggers, today I will be blogging about Event ID 129 messages. These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source. Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out, the HBA driver’s name is used in the error because it is the miniport associated with storport.
Below is an example 129 event:
Event Type: Warning
Event Source: <HBA_Name>
Event Category: None
Event ID: 129
Time: 1:15:45 AM
Reset to device, \Device\RaidPort1, was issued.
So what does this mean? Let’s discuss the Windows I/O stack architecture to answer this.
Windows I/O uses a layered architecture where device drivers are on a “device stack.” In a basic model, the top of the stack is the file system. Next is the volume manager, followed by the disk driver. At the bottom of the device stack are the port and miniport drivers. When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume. The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver. When the request reaches the disk driver it will create a Command Descriptor Block (CDB) that will be sent to the SCSI device. The disk driver imbeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB). This SRB is sent to the port driver as part of the I/O request packet (IRP).
The port driver does most of the request processing. There are different port drivers depending on the architecture. For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices. Some of the responsibilities for a port driver are:
The port driver interfaces with a driver called the “miniport”. The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN. The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN. When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.
When a request is sent to the miniport, STORPORT will put the request in a pending queue. When the request is completed, it is removed from this queue. While requests are in the pending queue they are timed.
The timing mechanism is simple. There is one timer per logical unit and it is initialized to -1. When the first request is sent to the miniport the timer is set to the timeout value in the SRB. The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue. Some vendors will tune this value to best match their hardware, we do not advise changing this value without guidance from your storage vendor.
The timer is decremented once per second. When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue. So, as long as requests complete the timer will never go to zero. If the timer does go to zero, it means the device has stopped responding. That is when the STORPORT driver logs the Event ID 129 error. STORPORT then has to take corrective action by trying to reset the unit. When the unit is reset, all outstanding requests are completed with an error and they are retried. When the pending queue empties, the timer is set to -1 which is its initial value.
Each SRB has a timer value set. As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.
The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request. Dropped requests can be caused by faulty routers or other hardware problems on the SAN.
I have never seen software cause an Event ID 129 error. If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.
Excessive cached read I/O is a growing problem. For over one year we have been working on this problem with several companies. You can read more about it in the original blog post:
On 32 bit systems, the kernel could address at most 2GB of virtual memory. This address range is shared and divided up for the many resources that the system needs; one of which is the System File Cache's working set. On 32 bit systems the theoretical limit is almost 1GB for the cache’s working set; however, when a page is removed from the working set it will end up on the standby page list. Therefore the system can cache more than the 1 GB limit if there is available memory. The working set; however, is just limited to what can be allocated within the Kernel's 2GB virtual address range. Since most modern systems have more than 1 GB of physical RAM, the System File Cache's working set's size on a 32 bit system typically isn't a problem.
With 64 bit systems, the kernel virtual address space is very large and is typically larger than physical RAM on most systems. On these systems the System File Cache's working set can be very large and is typically about equal to the size of physical RAM. If applications or file sharing performs a lot of sustained cached read I/O, the System File Cache's working set can grow to take over all of physical RAM. If this happens, then process working sets are paged out and everyone starts fighting for physical pages and performance suffers.
The only way to mitigate this problem is to use the provided APIs of GetSystemFileCacheSize() and SetSystemFileCacheSize(). The previous blog post "Too Much Cache" contains sample code and a compiled utility that can be used to manually set the System File Cache's working set size.
The provided APIs, while offering one mitigation strategy, has a couple of limitations:
1) There is no conflict resolution between multiple applications. If you have two applications trying to set the System File Cache's working set size, the last one to call SetSystemFileCacheSize() will win. There is no centralized control of the System File Cache's working set size.
2) There is no guidance on what to set the System File Cache's working set size to. There is no one size fits all solution. A high cache working set size is good for file servers, but bad for large memory application and a low working set size could hurt everyone's I/O performance. It is essentially up to 3rd party developers or IT administrators to determine what is best for their server and often times, the limits are determined by a best guesstimate backed by some testing.
We fully understand that while we provide one way to mitigate this problem, the solution is not ideal. We spent a considerable amount of time reviewing and testing other options. The problem is that there are so many varied scenarios on how users and applications rely on the System File Cache. Some strategies worked well for the majority of usage scenarios, but ended up negatively impacting others. We could not release any code change that would knowingly hurt several applications.
We also investigated changing some memory manager architecture and algorithms to address these issues with a more elegant solution; however the necessary code changes are too extensive. We are experimenting with these changes in Windows 7 and there is no way that we could back port them to the current operating systems. If we did, we would be changing the underlying infrastructure that everyone has been accustomed to. Such a change would require stress tests of all applications that run on Windows. The test matrix and the chance of regression are far too large.
So that brings us back to the only provided solution - use the provided APIs. While this isn't an ideal solution, it does work, but with the limitations mentioned above. In order to help address these limitations, I've updated the SetCache utility to the Microsoft Windows Dynamic Cache Service. While this service does not completely address the limitations above, it does provide some additional relief.
The Microsoft Windows Dynamic Cache Service uses the provided APIs and centralizes the management of the System File Cache's working set size. With this service, you can define a list of processes that you want to prioritize over the System File Cache by monitoring the working set sizes of your defined processes and back off the System File Cache's working set size accordingly. It is always running in the background monitoring and dynamically adjusting the System File Cache's working set size. The service provides you with many options such as adding additional slack space for each process' working set or to back off during a low memory event.
Please note that this service is experimental and includes sample source code and a compiled binary. Anyone is free to re-use this code in their own solution. Please note that you may experience some performance side effects while using this service as it cannot possibly address all usage scenarios. There may be some edge usage scenarios that are negatively impacted. The service only attempts to improve the situation given the current limitations. Please report any bugs or observations here to this blog post. While we may not be able to fix every usage problem, we will try to offer a best effort support.
Side Effects may include:
Cache page churn - If the System File Cache's working set is too low and there is sustained cached read I/O, the memory manager may not be able to properly age pages. When forced to remove some pages in order to make room for new cache pages, the memory manager may inadvertently remove the wrong pages. This could result in cached page churn and decreased disk performance for all applications.
Version 1.0.0 - Initial Release
NOTE: The memory management algorithms in Windows 7 and Windows Server 2008 R2 operating systems were updated to address many file caching problems found in previous versions of Windows. There are only certain unique situations when you need to implement the Dynamic Cache service on computers that are running Windows 7 or Windows Server 2008 R2. For more information on how to determine if you are experiencing this issue and how to resolve it, please see the More Information section of Microsoft Knowledge Base article 976618 - You experience performance issues in applications and services when the system file cache consumes most of the physical RAM.
Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk. This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.
What does this error mean?
The PNP manager does what is called enumerations. An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices. The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.
These enumerations can happen for a number of reasons. For example, hardware can request an enumeration when it detects a change in configuration. Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.
When an enumeration request is received, the bus driver will rescan the bus for all devices. It will issue commands to the existing devices as though it was looking for new ones. If these commands fail on an existing unit, the driver will mark the device as “missing”. When the device is marked “missing”, it will not be reported back to PNP in the inventory. When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.
Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable. An example of a non-removable disk is a hard drive on a SCSI or IDE bus. An example of a removable disk is a USB thumb drive.
Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication. The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.
Why does this error happen?
These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem. The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running. An administrator that sees these errors needs to verify the heath of the disk subsystem.
Event ID 157 Example: