• Ntdebugging Blog

    Interpreting Event 153 Errors


    Hello my name is Bob Golding and I would like to share with you a new event that you may see in the system event log.  Event ID 153 is an error associated with the storage subsystem. This event was new in Windows 8 and Windows Server 2012 and was added to Windows 7 and Windows Server 2008 R2 starting with hot fix KB2819485.


    An event 153 is similar to an event 129.  An event 129 is logged when the storport driver times out a request to the disk; I described event 129 messages in a previous article.  The difference between a 153 and a 129 is that a 129 is logged when storport times out a request, a 153 is logged when the storport miniport driver times out a request.  The miniport driver may also be referred to as an adapter driver or HBA driver, this driver is typically written the hardware vendor.


    Because the miniport driver has a better knowledge of the request execution environment, some miniport drivers time the request themselves instead of letting storport handle request timing.  This is because the miniport driver can abort the individual request and return an error rather than storport resetting the drive after a timeout.  Resetting the drive is disruptive to the I/O subsystem and may not be necessary if only one request has timed out.  The error returned from the miniport driver is bubbled up to the class driver who can log an event 153 and retry the request.


    Below is an example event 153:


    Event 153 Example


    This error means that a request failed and was retried by the class driver.  In the past no message would be logged in this situation because storport did not timeout the request.  The lack of messages resulted in confusion when troubleshooting disk errors because timeouts would occur but there would be no evidence of the error.


    The details section of the event the log record will present what error caused the retry and whether the request was a read or write. Below is the details output:


    Event 153 Details


    In the example above at byte offset 29 is the SCSI status, at offset 30 is the SRB status that caused the retry, and at offset 31 is the SCSI command that is being retried.  In this case the SCSI status was 00 (SCSISTAT_GOOD), the SRB status was 09 (SRB_STATUS_TIMEOUT), and the command was 28 (SCSIOP_READ). 


    The most common SCSI commands are:

    SCSIOP_READ - 0x28



    The most common SRB statuses are below:





    A complete list of SCSI operations and statuses can be found in scsi.h in the WDK.  A list of SRB statuses can be found in srb.h.


    The timeout errors (SRB_STATUS_TIMEOUT and SRB_STATUS_COMMAND_TIMEOUT) indicate a request timed out in the adapter. In other words a request was sent to the drive and there was no response within the timeout period.  The bus reset error (SRB_STATUS_BUS_RESET) indicates that the device was reset and that the request is being retried due to the reset since all outstanding requests are aborted when a drive receives a reset.


    A system administrator who encounters event 153 errors should investigate the health of the computer’s disk subsystem.  Although an occasional timeout may be part of the normal operation of a system, the frequent need to retry requests indicates a performance issue with the storage that should be corrected.

  • Ntdebugging Blog

    NTFS Misreports Free Space?


    I have recently seen a number of issues where customers called in to report a significant difference between the “Size on disk” for the root of a volume, and the reported amount of “Used space” in the volume properties.  While considering this, I noticed that my own C: drive had a mysterious 20GB difference.


    Below is the story of how I found the answer.


    Before we begin, there are two methods used for calculating disk usage…


    Method 1 – Volume Bitmap Analysis

    The % used and %free indication shown below with the pie chart is based on volume bitmap analysis.  The hidden $Bitmap:$Data:”” stream is read from the volume via the FSCTL_GET_VOLUME_BITMAP IO control code, and the ones and zeroes are counted.  A binary “1” represents an allocated cluster, while a “0” indicates a free cluster.  The total number of set bits indicates the total number of allocated clusters on the volume.  This is the fastest and most reliable method of calculating allocated and free space.


    Method 2 – File Scanning with FindFirstFileEx  / FindNextFile

    Navigating to the “C:\” folder, we select all files/folders, then right click, and then select properties.  A dialog will appear as the shell enumerates all files that it can find in the root and subfolders.  The “Size” and “Size on disk” fields will eventually display the total size of all primary data streams for all files that are found by Explorer.  In my case, this method only finds 80.25 GB of 102 GB (leaving a 21.75 GB difference).


    Why such a big difference?


    The short answer is - “If you can’t see it with Explorer, it doesn’t get added to the total size.”


    How to find the hidden disk usage…


    Step 1:


    First, check to see if your files are all actually visible with the current user account and Explorer settings.  Uncheck the “Hide protected operating system files” and select “Show hidden files and folders”.  When this is completed, look at the size on disk again. 

    Does it match the pie chart within a few percent?  If not, continue to Step 2.


    Step 2:

    Run CHKDSK and save the output.  The summary information at the end is what we want to see.

    Based on CHKDSK output, we can calculate the total metadata usage by adding up the following…






    .05 GB

    Space used by 22677 indexes.


    0 GB

    Space used by $Badclus file.


    1.29 GB

    Space used by $MFT.


    .06 GB

    Space used by $Loffile.


    1.4 GB

     Metadata Total


    In this example, metadata usage accounts for only 1.4 GB.  If the metadata number were a high percentage of the usage, then we need to take closer a look at how the volume is being used instead of simply looking for hidden files.  High metadata usage will be the subject for part 2 of this blog.


    We can see from the CHKDSK output that the total space occupied by our user files is 106,095,080 KB (101.18 GB).  This is a large percentage of the total usage, so we should be looking at the user files to see why we can’t see these files with Explorer.


    Step 3:

    Do you have permissions to see all files/folders the volume? 


    a.     If this is a system drive, look through the “\Documents and Settings”, or “\users” folder and see if you can browse all of the profile folders.  If not, you may need contact the owner(s) to check their folder size and see what they have saved in their user folder.  If they are like me, they will have 5 years worth of memory dumps, MP3’s, VHD’s, PDF’s, etc.

    b.    Try “CACLS.EXE x:\ /T”, or just browse through the folders on the drive looking for an “Access Denied” error.  Once this happens, give yourself permissions to that part of the subtree (if permitted by your administrative privileges) or have someone who has access to that folder enumerate the contents for you.  Check the number of files and the size on disk for all files in that folder and add it to the total.

    c.     Continue this process until you have a total of all files in all folders.

    d.    Does the total make sense?  If not, then continue to the next step.


    In the case of my mysterious 20 GB difference, I found an old user profile from a previous installation.  Sure enough, I received an access denied error when browsing to it.  To access the folder, I acquired ownership.  This user was in the habit of collecting memory dumps and virtual machines in a subfolder on his desktop.  I recovered quite a bit of free space by using the delete key.  I rescanned the volume, but to my disappointment, there was still a significant difference.


    Step 4.

    Are there alternate streams?  Try using STREAMS.EXE from (http://technet.microsoft.com/en-us/sysinternals/default.aspx).  Streams will recurse subdirectories and report space usage by alternate named streams.  Pipe this to a text file and analyze it with a text editor or spreadsheet program.


    Step 5.

    Hard links can be problematic when calculating the total usage via the file scan method.  Hard links are not extensively used, but this is worth mentioning.  A hard link is an extra index entry that points to an existing file.  These are created via the CreateHardLink function.  Although the file actually exists in one location, each hard link has its own size field.  Because of this, a single file can be added twice to the running total.  Another thing to know is that hard links are not synchronized, so it is possible that only one link will show the actual file size (see the example below).

    Unfortunately, there are few options available for detecting hard link paradoxes, but it is something to consider when the file scan shows more usage than the bitmap.  Since we have the opposite situation here, hard links are not a significant factor.


    Step 6.

    Is Volume Shadow Copy Service maintaining diff area files for snapshots?  Use VSSADMIN LIST SHADOWSTORAGE to find out.  Add shadow storage to the running total.


    VSSVC pre-allocates space for making volume snapshots.  In order to support this feature, diff area files are kept in the “\System Volume Information” folder.  This pre-allocated space is used to maintain point-in-time information for the “Previous Versions” feature and for the “System Restore” application.  If you are the type of user who prefers to minimize the impact of such features, then you can resize your shadow storage with VSSADMIN so it has less impact on disk usage.  I prefer to leave these features at their default settings (and just make a note of how much disk space it is using).

    Step 7.


    If your numbers still don’t make sense, then it’s time to check for actively hidden files / folders.  There are many different rootkit scanners that can help you identify the presence of actively hidden files & folders.   Try using a scanner like Rootkit Revealer.  If you suspect that your machine has been compromised by a rootkit, refer to http://www.microsoft.com/uk/business/security/security-malware.mspx.


    Seeing that I did not have a significant amount of mysterious usage, I was satisfied that there was no malicious logic hiding my files.  After taking into account all of the factors, there was only a 1.57GB difference remaining (which was accounted for by the contents of a user’s recycle bin).


    Whenever you see a difference between the “Used space”, and “Size on disk”, always ask yourself “What am I not seeing?”



    Best regards,


    Dennis Middleton “The NTFS Doctor”


  • Ntdebugging Blog

    Using Netmon to figure out the source of high CPU in WMIprvse.exe


    Recently I was working with a customer who had a file server experiencing high CPU in the WMIprvse.exe process. We received multiple user dumps and noted that someone or something was running the same query again and again. We needed to figure out what was running the query in a tight loop, causing the high CPU.


    Figure 1 - Task Manager on FileServer

    Before we get into the exact troubleshooting steps, let me provide some background on WMI.  Winmgmt is the WMI service within a SVCHOST process running under the LocalSystem account.  In all cases, the WMI service automatically starts when the first management application or script requests connection to a WMI namespace. For more information, see Starting and Stopping the WMI Service. To avoid stopping all the services when a provider fails, each provider is loaded into a separate host process named "Wmiprvse.exe". This also allows each instance of Wmiprvse to run under a different account with varying security. For more details you can look at the MSDN documentation on WMI.


    I dumped out all the services in the various svchost.exe processes. You can do this from a command prompt by running the tasklist /svc command. In my instance, I found that the WinMgmt service was running in svchost, PID  452 (PID number will vary). Someone had to be making RPC calls to this svchost.exe process to run the WMI queries. It could be some local process on the machine; it could even be a process on a remote machine.


    At this point I requested user dumps of PID 452 from the customer.  This would allow me to determine who was making the RPC calls to svchost.exe to run the WMI queries.  While the customer was uploading the dumps, we decided to get a Network Monitor trace to see if the RPC calls were coming over the network.


    Immediately, I could see a lot of RPC traffic to the svchost.exe process(PID=452).



    Figure 2 - Network Monitor Output from the FileServer. Notice the Source and destination ports and IP addresses. IP addresses are hidden by the aliases

    Looking at the RPC payload, I could see the text of the WMI query. You can see this in the Hex Details Pane. The query that was running in a loop was Select * from Win32_Process. Looks like I found the source of the WMI queries.


    At this point, we got the source IP for the RPC packets. We logged into the machine, and brought up the Task Manager.



    Figure 3 - Task Manager on Remote Machine(Machine1)

    Immediately we saw that there was some script running inside a Wscript.exe process.  At this point I was pretty sure that this script was the culprit. The customer was not sure what this was, and was not comfortable terminating the process.  To prove my suspicion, I had him open a command prompt and run the following command, netstat –ano.



    Figure 4 - Netstat output from Remote Machine


    From the output in Fig. 4, I could see a TCP connection created by PID 3532 (wscript.exe). Looking at the local and foreign addresses from the above output, they matched up exactly to what we were seeing in the Network Monitor trace.


    In the above case, we already had our suspicions on the wscript.exe process; however, sometimes it might not be that easy. In that case, we could have used the netstat output to look at all connections to the file server ( If there were multiple connections, then we can also narrow it down by the port number. Based on that, we could have found the PID responsible.


    The customer called me later in the day, and told me that they had recently updated their scripts. One of their scripts had a bug which was running WMI scripts in a tight loop. Fixing the script caused the problem to go away.


    Had the query being coming from a local process, I would have had to debug the svchost.exe process, and figure out who was making the WMI calls. However, since we could see the traffic on netmon, we didn’t need to use the debugger. Interesting way to get to the root of the problem without using a debugger!



    Share this post :
  • Ntdebugging Blog

    Storport ETW Logging to Measure Requests Made to a Disk Unit


    Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

    ·         KB 979764 (Windows Server 2008)

    ·         KB 978000 (Windows Server 2008 R2)


    Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

    Configuring and Starting STORPORT ETW Logging

    To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

    Here is a walkthrough to enable this feature via the GUI:

    Click Start, type “perfmon”, and press ENTER to start Performance Monitor.



    Expand “Data Collector Sets” and select “Event Trace Sessions”.



    Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 



    The following dialog will appear:



    Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.



    After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.



    When the dialog below opens, select “Keywords (Any)” then click Edit.



    Check the box for IOPerfNotification, and then click OK. Note:  For Windows Server 2008 SP1 and SP2, some properties may not be present (e. g. Reset, PortNotification and PauseResume). 




    Next, select “Filter” and click Edit…



    This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.


    Select “Binary” and check the “Filter enabled” box


    In the “Filter data” field, enter the threshold value in the number of milliseconds. This must be entered in little endian format. Please refer to this table for the most common entries.



    Binary(little endian)









    01 00 00 00 00 00 00 00

    05 00 00 00 00 00 00 00

    0A 00 00 00 00 00 00 00

    0F 00 00 00 00 00 00 00


    Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms. 


    WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.


    Below is the correct way to enter the value (256ms) Note: The Filter value may have to be re-entered for next run because it could get reset.


    After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”



    Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 



    The target folder may be changed in the “Directory” tab



    I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”



    Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.


    The name for the file is selected on the “File” tab.


    When you are done setting the file and directory information click OK.

    Now right-click the log and select “Start” to begin tracing.


    The “Status” should indicate that it is running.


    How long to run the log depends on you issue.  You should let the log run until the issue or slow down occur then check the log for delays.

    Viewing the Log File

    First right-click and select “Stop” to flush any cached data and stop the logging.


    Then use Event Viewer to examine the log.



    Next choose "Open the saved log…"






    After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.



    Example: Event ID 13



    Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.


    Breaking down the Event Data we have …


    Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

    PortNumber:  This is the adapter port number (i.e. RaidPort1, etc.)

    PathID:  Path number if multi-pathing is in place

    TargetID:  Target LUN exposed to the Operating System

    LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

    *RequestDuration:  How long the request took in milliseconds (the number of note here for storage latency diagnostics)

    Command:  the SCSI command sent to the unit. (Usually read or write)  This is listed in decimal.  Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command

    ScsiStatus:  the status that came back from the drive

    SRBStatus:  the status as returned from adapter  (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)


     Example: Command 42 equals 2A which is a Write of 10 bytes.



    Here’s the respective SCSI command:


    When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

    Hope you enjoyed the walkthrough!









  • Ntdebugging Blog

    How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump


    Prior to Windows Vista and Windows Server 2008 we had to keep a large page file on the system drive (typically drive C:) in order to properly capture system memory dumps for troubleshooting.  This presented problems as systems with very large amounts of RAM became more common, resulting in requirements for very large amounts of free space on the C: drive, or requiring that system visible memory be artificially limited for troubleshooting purposes.  This is no longer a requirement thanks to the Dedicated Dump File feature, which is available for use in Windows Vista and later operating systems.


    What is it?

    The dedicated dump file is basically a page file that is reserved for use only by the system crash dump routines.  It is not used for paging virtual memory.  Like a page file, the system process keeps an open handle to the dedicated dump file, which prevents it from being deleted.  When you manually initiate a memory dump, or the system crashes on its own, the data is written into the dedicated dump file instead of the page file on the system drive.  This feature is enabled by setting the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DedicatedDumpFile

    Type:  REG_SZ

    Value: A dedicated dump file together with a full path, such as D:\dedicateddumpfile.sys

    Where can it be stored?

    The dedicated dump file can be stored on any local volume that can support a page file.  This volume must be considered a fixed disk device and must not be hosting Windows clustered storage.  For example, this means you can store this file on a second partition of your server’s primary hard disk, or on a SAN volume that is mounted early in the boot process. 

    Fun fact: Unlike a page file, you can actually direct the dedicated dump file to a sub-folder, even one that is actually a volume mount point.  However, the sub-folder must pre-exist during boot, and there is generally little reason to actually do this.


    How much space will it use?

    You can automatically allow the system to determine the size of the dedicated dump file, or you can manually set the size by using the DumpFileSize registry value. 


    System determined size:

    If you choose to let the system determine the size, then regardless of the dump file type set via the CrashDumpEnabled value, the dedicated dump file will be roughly the size of system visible RAM plus 256 MB.  The extra 256 MB allows room for secondary dump data that may be written by driver bug check callback routines.  In order to write this data, the drivers would have to first call KeRegisterBugCheckReasonCallback() to register the routines that would execute when the system bug checks.   This is currently not very common due to the limitations imposed by the high IRQL at which bug check callbacks run.  Allowing the system to determine the size is a good idea, especially when disk space is not of huge concern.  Also, the dedicated dump file size will be adjusted automatically at boot if additional RAM is installed or removed later.  This includes artificial memory reduction implemented via the BCDEdit option, truncatememory.  This eliminates the need to manually resize your page file to support proper collection of a memory dump.


    Manual size specified via the registry:

    The following registry value will allow you to manually set the dedicated dump file size in megabytes:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFileSize

    Type:  REG_DWORD

    Value: The dump file size in megabytes



    One good reason one might have to set this value manually is in the case of a very large memory system from which only a kernel memory dump is desired.  In this case, a dedicated dump file the size of RAM+256 MB is not required.  Generally, even on 64-bit systems, kernel memory dumps are no larger than a few gigabytes, although they have the potential to be much larger.  In nearly all cases reserving one third the size of RAM will be more than sufficient to capture kernel memory.  If you have plenty of room on the drive and want to play it safe, you can simply allow the system to determine the size for you by setting the DumpFileSize value to zero, or by not creating the value at all.

    Although you may set the DumpFileSize to a value larger than needed, or the system chooses a size larger than the type of dump selected, the resultant file size after a bug check will only be as large as the data that is stored in it.  This is most commonly the size of physical RAM + room for the dump header information.  If you choose to set the DumpFileSize value manually, then you should also set the IgnorePagefileSize value under the CrashControl key to 1.  This will help ensure a dump file is created.


    What about the final dump file?

    After a bug check, the data will be written to the dedicated dump file on disk.  However, the final memory dump location will be based on the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFile

    Type:  REG_EXPAND_SZ

    Value: The target dump file name together with the full path, such as D:\memory.dmp



    The system will extract the dump data from the dedicated dump file and save the resultant file to the location and file name specified for the DumpFile value.  The default for this value is %SYSTEMROOT%\MEMORY.DMP.  Since we are using the dedicated dump file to overcome space limitations of the system drive, we will likely want to also set the DumpFile value to point to an alternate location than the default. 

    If you set this value to point to the same volume as the dedicated dump file, it will save WERFAULT.EXE the step of performing an additional copy of the temporary dump file after reboot.  Instead the file will simply be renamed on the target volume.   If there is not enough room for both the dedicated dump file and the memory dump after the bug check, then the dedicated dump file will not be recreated, and only the memory dump will remain.  If this happens and the system bug checks again, only the original dump will remain because there will be no dedicated dump to write the second bug check to until space is freed up on the drive and the system is rebooted.


    DedicatedDumpFile Usage Example:

    The following example will set the dedicated dump file and resultant dump to be stored on the root if drive D:


    Windows Registry Editor Version 5.00








    (Note: The lines above can be saved as a .REG file and imported into the registry)


    The screenshot below demonstrates a system configured to capture a complete 4 GB memory dump to a manually sized dedicated dump file, and will store the final MEMORY.DMP on the D: drive.


    Important information:

    949052 - Kernel memory dump files may not be generated on Windows Server 2008-based and Windows Vista SP1 or later based computers when physical memory is larger than the size of the page file


    957517 - A dedicated complete memory dump file may not be successfully generated if the volume that stores the dedicated dump file has insufficient free space

    950858 - Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory


  • Ntdebugging Blog

    The Memory Shell Game


    Hello, this is Somak.  Today I’d like to drop some Memory Manager info on the blog that I’ve used to communicate in brief (believe it or not) how the system deals with memory.  If you are ever faced with checking how much Available Memory you have(or don’t have), poor system performance, questions about page faults, or having a Working Set Trimming performance issue, or just want a primer into how Windows manages memory on the system, be not afraid and read on!

    How this memory stuff works

    The fastest and most expensive memory is built into the CPU.  The next fastest and less expensive is physical RAM.  Next we have the hard drive, followed by remote storage or backup.  Each step down the ladder adds significantly to access time.  For example, physical RAM can be almost one million times faster than a hard disk.  Since it is volatile and cost considerably more than hard disk space, we are limited at how much we can put in the system.  This limit is far less than the much slower and non-volatile hard disk.  Due to these design constraints we have a tiered memory structure in the computer.  To achieve the fastest response times and best overall performance, an operating system must efficiently use this tiered memory structure.  It must do its best to reduce the need to retrieve data from a slower storage media like the hard disk.  It has to do this while juggling the memory and I/O demand from all running processes.  The following paragraphs are an overview of how the Memory Manager achieves this in Windows.  A more detailed description can be found in Chapter 7 of Microsoft Windows Internals, Fourth Edition (ISBN: 0-7356-1917-4).  This book is the great source for how stuff works in Windows.

    First let’s lay out some definitions.  They will be useful later on when I talk about the interactions.  These definitions are high level to maintain brevity.

    Virtual Memory – This is a memory that an operating system can address.  Regardless of the amount of physical RAM or hard drive space, this number is limited by your processor architecture.  On a 32 bit processor you are limited to 4 GB of addressable virtual memory (2^32).  With a default installation on a 32 bit box (not using /3GB) the kernel reserves 2GB for itself.  Applications are left with 2GB of addressable virtual memory.  When applications execute they are only presented with this 2GB of addressable memory.  Each application gets its own 2GB virtual memory to play with.  If you have 50 processes running, you’ll have 50 independent 2GB Virtual Memory address spaces and one 2GB Virtual Address space for kernel.  This is possible because Virtual Memory always exists, but doesn't really exist (hence the term virtual).  We basically lie to the application and say, here is 2GB of memory address space for you to use.  This memory isn’t allocated until the application explicitly uses it.  Once the application uses the page, it becomes committed.  A virtual memory page is then translated to a physical memory page.  From this translation, the virtual page can reside in physical RAM or on the hard disk.

    Physical Memory –  This is the physical storage media.  It can be physical RAM, the hard disk, optical disks, tape backups, etc.  This is anything that can store data.  Most times when people talk about Physical Memory, they refer to physical RAM (the memory sticks on your motherboard), but with virtual page translation, physical memory can also be on the hard drive (in your paging file).  Physical RAM is limited by your processor architecture.

    Committed Memory –  When an application touches a virtual memory page (reads/write/programmatically commits) the page becomes a committed page.  It is now backed by a physical memory page.  This will usually be a physical RAM page, but could eventually be a page in the page file on the hard disk, or it could be a page in a memory mapped file on the hard disk.  The memory manager handles the translations from the virtual memory page to the physical page.   A virtual page could be in located in physical RAM, while the page next to it could be on the hard drive in the page file.

    Commit Limit –  This is the maximum amount of memory that all your applications and the OS can commit.  If you had 50 applications fully allocate their 2 GB of virtual address space, you would need 100GB of commit limit (ignore kernel memory usage to keep the numbers simple).  So 100GB of committed pages can be backed by physical RAM or the hard drive.  If you have 100 GB of RAM, you could handle this memory load.   In most cases, 100 GB of RAM isn't economically feasible so the Commit Limit is comprised of physical RAM and the page file.  If you have 2 GB of physical RAM and 98 GB of page file, then your commit limit would be 100 GB.

    Page file –   This is the storage area for virtual memory that has been committed.  It is located on the hard drive.  Since hard drive space is cheaper than physical RAM, it is an inexpensive way to increase the commit limit.

    Working Set – This is a set of virtual memory pages (that are committed) for a process and are located in physical RAM.  These pages fully belong to the process.  A working set is like a "currently/recently working on these pages" list.

    Modified pages - Once a virtual memory page leaves the process's working set, it is moved to another list.  If the page has been modified, it is placed on the modified page list.  This page is in physical RAM.  A thread will then write the page to the page file and move it to the standby list.

    Standby pages - This is a page that has left the process' working set.  This page is in physical RAM.  A standby page is like a cache for virtual memory pages.  It is still associated with the process, but not in its working set.  If the process touches the page, it is quickly faulted back into the working set.  That page also has one foot out the door.  If another process or cache needs more memory, the process association is broken and it is moved to the free page list.  Most of the pages in available memory are actually standby pages.  This makes sense when you realize that these pages can be quickly given to another process (hence available), but you should also understand that they are page caches for working sets and can be quickly given back if the process touches the page again.  The vast majority of available memory is not wasted or empty memory.

    Free pages - When a page is taken off of the standby page list, it is moved to the Free page list.  This page is in physical RAM.  These pages are not associated with any process.   When a process exits, all of its pages are then dumped onto this list. Typically, there is a very small to no amount of free pages hanging around physical RAM. 

    Zeroed pages - When a free page is zeroed out, it is placed on the Zero page list.  This page is in physical RAM.  These are the pages that are given to processes that are making memory allocations.  Due to C2 security requirements, all pages must be scrubbed before handed to a new process.  When the system is idle, a thread will scrub free pages and put them on this list.  Only a small amount of zero pages are required to handle the typical small memory allocations of processes.  Once this list is depleted, and if there is demand for more pages, we pull pages off of the Free page list and scrub them on the fly.  If the Free page list is depleted, then we pull pages off of the standby list, scrub them on the fly and hand them to the new process.

    What is Task Manager telling me?

    Prior to Windows Vista, Task Manager reports memory usage using accounting methods that you probably are not expecting.  It is because of these accounting practices, that we rarely use Task Manager to gauge system performance and memory usage.  We typically use it for a quick overview or to kill processes.  I highly recommend using Performance Monitor (perfmon.msc) for investigating performance issues.  Here's the breakdown of the numbers on the Performance tab:

    Physical Memory

    Total - The is the total physical RAM installed in the system.

    Available - This is the total of the Standby, Free and Zeroed list.  Free and Zeroed makes sense, but Standby seems odd at first.  Standby pages were added to this number because they are available for a quick scrub and given to a new process.  So they are technically available (with minimal effort).

    System Cache- This is the total of the Standby list and the size of the system working set (which includes the file cache).  Standby pages are added to this list because they are cached pages for working sets.

    PF Usage - This is the total number of committed pages on the system.  It does not tell you how many are actually written to the page file.  It only tells you how much of the page file would be used if all committed pages had to be written out to the page file at the same time.

    Commit Charge

    Total - This is the total virtual memory that has been committed.  This includes all committed memory for all processes and the kernel.

    Limit - This is the maximum amount of committed memory this system can handle.  This is a combination of physical RAM and the page file.

    Peak – This is the highest amount of memory committed thus far on this system, since boot.

    How does this work?

    The memory manager optimizes physical RAM usage across the entire system.  Since physical RAM is a finite resource, it has to balance sharing this critical resource amongst all process, the kernel and file I/O.   It tries to keep disk I/O to a minimum, which results in a more responsive system.  It does this by moving pages around to meet the demand of the system.

    Typically, large sections of physical RAM are used for file cache.  This is cache is necessary to improve disk performance.  Without it, disk I/O would make the system crawl along at an nearly unusable pace.  The file system cache is just like a working set for a process.  Pages removed from the file cache are moved to the standby or modified page list.  Many of the standby pages in RAM are probably file cache pages that were removed from its working set.  For example, on a file server, if you see 8 GB of available memory, most of these pages are probably standby pages for the file cache.  The file cache's working set could be 500 MB, but the 8 GB of standby pages should also be considered part of the file cache.

    Now let's take a look at how the memory manager handles processes.  While an application is working with its virtual memory pages, the memory manager keeps the pages in the process' working set.  Since the vast majority of application do not use all of its memory all the time, some pages will age.  Old pages are removed from the working set.  If they are modified, they are moved to the modified list.  The page is saved to the page file and moved to the standby list.  If the page hasn't been modified, it is moved directly to the standby list.  These pages will remain on the standby page list until there is a demand for it.

    If the application touches the page again, it is soft faulted back into the process' working set.  If the process doesn't use the page for a very long time, or if the demand for the page is greater elsewhere, the page is moved off of the standby list.  It is disassociated with the process and moved to a the free page list.  From the free page list, the page is scrub on demand or lazily and placed on the zero page list.  It is from the zero page list that other processes or the kernel or the file cache will get a new page.

    If after a very long time the application once again needs a page that is not in its working set, the memory manager will handle the memory fault.  If the page is on the standby list, it is quickly put back into the process' working set.  If the page is no longer in the standby list, a hard fault occurs.  The memory manager issues I/O to the hard disk to read the page(s) from the page file.  Once the I/O complete, the page is placed back into the process' work set.

    All of this is done to keep physical RAM highly utilized and disk I/O to a minimum.  We don't want to allow process to horde physical RAM for pages that are rarely used.  The physical RAM must be shared with other processes, the kernel and the file cache.  If you see lots of available memory on your system, rest assured that it is not going to waste.  The vast majority is on standby lists for processes and the file cache.

    Also note that page file usage isn't that bad.  The page file allows the Memory Manager to save modified pages before placing the page on the standby list.  The page is still in physical RAM and can be quickly faulted back into the process.  This method gives the process a chance to reclaim an old page and it allows the page to be quickly used if there is demand elsewhere.

    The best way to see the totals of these lists is to use a kernel debugger (live or postmortem).  Use the !memusage command and you'll get an output like this:

    0: kd> !memusage

    loading PFN database

    loading (100% complete)

    Compiling memory usage data (99% Complete).

    Zeroed:    414 (  1656 kb)

                   Free:      2 (     8 kb)

    Standby: 864091 (3456364 kb)

    Modified:    560 (  2240 kb)

        ModifiedNoWrite:     30 (   120 kb)

    Active/Valid: 182954 (731816 kb)

             Transition:      2 (     8 kb)

                    Bad:      0 (     0 kb)

                Unknown:      0 (     0 kb)

    TOTAL: 1048053 (4192212 kb)

    Of the 4GB of physical RAM, only 1.6 MB are on Zeroed or free pages.  731 MB is in process, system and file cache working sets.  2 MB are on the modified page list.  The vast majority, 3.4 GB, is on the standby list.  On this server, most people will see 3.4 GB of wasted physical RAM, but you will know better.

    What should I be worried about?

    Typically you shouldn't be worried about these things, until you have a performance problem.  If your system is sluggish or slow to respond or you are getting errors about out of memory, then you need to rely on this information.  You will need to collect a performance monitor log of the problem time.  If the counter list is daunting, then use the Performance Monitor Wizard to configure the performance monitor log.

    Once the log is collected, you'll need to analyze several counters.  I'm not going into detail about how to review performance monitor logs this time.  I'll save that lengthy topic for another time.  For now I'll focus on the counters relevant to the Memory Manager.

    One of the biggest reasons for slow performance and sluggish system responsiveness is disk bottleneck.  Look at Physical Disk\% Idle Time, Avg. Disk sec/Read and Avg. Disk sec/Write counters.  If your system drive is under 50% idle or your disk response times are way above your drive specifications, then you need to investigate further.  Look at the Memory\Available Mbytes.  You should have a couple hundred Mbytes of Available Memory.  This is one of the most important performance monitor counters.  If this number drops too low, your standby lists, process working sets and cache will be greatly reduced.  You'll need to find out if a process is consuming physical RAM.  Check for large process working sets or for large file cache.

    You will also need to see if paging is really affecting system performance.  Take a look at Memory\Pages Input/sec and correlate that to Physical Disk\Avg. Disk sec/Read.  Pages Input/sec is the number of pages being read in from the page file.  These are the hard faults (when the page wasn't on the standby list).  If your Avg. Disk sec/Read is close to your drive's specification and the drive's idle time is high, than paging really isn't a problem.  Small amounts of hard faults are expected as applications will every once in a while re-touch an old page.  As long as this I/O is not consistent or the disk can't keep up, you probably will not notice this impact.

    You can also look at Memory\Pages Output/sec and Physical Disk\Avg. Disk sec/Write.  These are the page commits to the page file when a modified page is removed from a process' working set.  As long as the disk can keep up with the writes, this shouldn't be a problem.  Remember that once the page is saved to the page file, it is placed on the standby list.  If there isn't a great demand for new pages, it can remain on the standby list for a very long time.  When the process touches the old virtual page again, it can be soft faulted back into the working set.  If there is great demand for memory, you'll see process working sets aggressively being trimmed.  Unless there is memory pressure, this is all done with lazy I/O, so you should not see much of an impact from this activity.

    The Memory Manager works to meet current demand and prepares for future demand when possible.  You need to look at a performance monitor log to see if there is memory pressure on the system.  You'll see this in low Available Mbytes and reductions in process working sets.  You'll be able to correlate this to increase disk I/O to the page file.  If you have established that there is memory pressure on the box, you need to figure where that demand is coming from.  Check for working set increases from processes, file cache or bottlenecked disk I/O to data drives.

  • Ntdebugging Blog

    Understanding ARM Assembly Part 1


    My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group.  You may be wondering why Microsoft support would need to know ARM assembly.  Doesn’t Windows only run on x86 and x64 machines?  No.  Windows has ran on a variety of processors in the past.  Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one is ARM.  Most of these processors are antiquated now.  The common ones now are IA-32, x86, x64.  However Windows has started supporting ARM processors in order to jump into the portable devices arena.  You will find them in the Microsoft Surface RT, Windows Phones, and other things in the future I am sure.  So you may be saying that these devices are locked, and cannot be debugged.  That is true from a live debug perspective, but you can get memory dumps and application dumps from them and those can be debugged.




    There are limitations on ARM processors that Windows supports.  There are 3 System on Chip (SOC) vendors that are supported.  nVidia, Texas-Instruments, and Qualcomm. Windows only supports the ARMv7 (Cortex, Scorpion) architecture in ARMv7-A in (Application Profile) mode.  This implements a traditional ARM architecture with multiple modes and supporting a Virtual Memory System Architecture (VMSA) based on an MMU.  It supports the ARM and Thumb-2 instruction sets which allows for a mixture of 16 (Thumb) and 32 (ARM) bit opcodes.  So it will look strange in the assembly.  Luckily the debuggers know this and handle it for you.  This also helps to shrink the size of the assembly code in memory.  The processor also has to have the Optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).


    In order to understand the assembly that you will see you need to understand the processor internals.


    ARM is a Reduced Instruction Set Computer (RISC) much like some of the previous processors that Windows ran on.  It is a 32 bit load/store style processor.  It has a “Weakly-ordered” memory model: similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering.  In ARM devices these as ISB, DSB, and DMB instructions.




    The processor has 16 available registers r0 – r15.

    0: kd> r

    r0=00000001  r1=00000000  r2=00000000  r3=00000000  r4=e1820044  r5=e17d0580

    r6=00000001  r7=e17f89b9  r8=00000002  r9=00000000 r10=1afc38ec r11=e1263b78

    r12=e127813c  sp=e1263b20  lr=e16c12c3  pc=e178b6d0 psr=00000173 ----- Thumb


    r0, r1, r2, r3, and r12 are volatile registers.  Volatile registers are scratch registers presumed by the caller to be destroyed across a call.  Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used. 


    On Windows four of these registers have a designated purpose.  Those are:

    • PC (r15) – Program Counter (EIP on x86)
    • LR (r14) – Link Register.  Used as a return address to the caller.
    • SP (r13) – Stack Pointer (ESP on x86).
    • R11 – Frame Pointer (EBP on x86).
    • CPSR – Current Program Status Register (Flags on x86).


    In Windbg all but r11 will be labeled appropriately for you.  So you may be asking why r11 is not labeled “fp” in the debugger.  That is because r11 is only used as a frame pointer when you are calling a non-leaf subroutine.  The way it works is this: when a call to a non-leaf subroutine is made, the called subroutine pushes the value of the previous frame pointer (in r11) to the stack (right after the lr) and then r11 is set to point to this location in the stack, so eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack. The frame pointer is not pushed to the stack in leaf functions.  Will discuss leaf functions later.


    CPSR (Current Program Status Register)


    Now we need to understand some about the CPSR register.  Here is the bit breakdown:


















































    • Bits [31:28] – Condition Code Flags
      • N – bit 31 – If this bit is set, the result was negative.  If bit is cleared the result was positive or zero.
      • Z – bit 30 – If set this bit indicates the result was zero or values compared were equal.  If it is cleared, the value is non-zero or the compared values are not equal.
      • C – bit 29 – If this bit is set the instruction resulted in a carry condition.  E.g. Adding two unsigned values resulted in a value too large to be strored.
      • V – bit 28 – If this bit is set then the instruction resulted in an overflow condition.  E.g. An overflow of adding two signed values.
    • Instructions variants ending with ‘s’ set the condition codes (mov/movs)
    • E – bit 9 – Endianness (big = 1/Little = 0)
    • T – bit 5 – Set if executing Thumb instructions
    • M – bits [4:0] – CPU Mode (User 10000/Supervisor 10011)


    So why do I need to know about the CPSR (Current Program Status Register)?  You will need to know where some of these bits are due to how some of the assembly instruction affect these flags.  Example of this is:


    ADD will add two registers together, or add an immediate value to a register.  However it will not affect the flags.


    ADDS will do the same as ADD, but it does affect the flags.


    MOV will allow you to move a value into a register, and a value between registers.  This is not like the x86/x64.  MOV will not let you read or write to memory.  This does not affect the flags.


    MOVS does the same thing as MOV, but it does affect the flags.


    I hope you are seeing a trend here.  There are instructions that will look the same.  However if they end in “S” then you need to know that this will affect the flags.  I am not going to list all of those assembly instructions here.  Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.


    So now we have an idea of what can set the flags.  Now we need to understand what the flags are used for.  They are mainly used for branching instructions.  Here is an example:

    003a11d2 429a     cmp         r2,r3

    003a11d4 d104     bne         |MyApp!FirstFunc+0x28 (003a11e0)|


    The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.


    There are a few compare instructions.  “cmp” subtracts two register values, sets the flags, and discards the result.  “cmn” adds two register values, sets the flags, and discards the results.  “tst” does a bit wise AND of two register values, sets the flags, and discards the results.  There is even an If Then (it) instruction.  I am not going to discuss that one here as I have never seen it in any of the Windows code.


    So is “bne” the only branch instruction?  No.  There is a lot of them.  Here is a table of things that can be seen beside “b”, and what they check the CPSR register:



    Meaning (Integer)

    Condition Flags (in CPSR)





    Not Equal



    Negative (Minus)



    Positive or Zero (Plus)



    Unsigned higher

    C==1 and Z==0


    Unsigned lower or same

    C==0 or Z==1


    Signed greater than or equal



    Signed less than



    Signed greater than

    Z==0 and N==V


    Signed less than or equal

    Z==1 or N!=V





    No overflow



    Carry set



    Carry clear


    None (AL)

    Execute always



    Floating Point Registers


    As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).  Here is what they are.

    Floating Point


    As you can see this is 16 – 64bit regiters (d0-d15) that is overlaid with 32 – 32bit registers (s0-s31).  There are varieties of the ARM processor that has 32 – 64bit registers and 64 – 32bit registers.  Windows 8 will support both 16 and 32 register variants.  You have to be careful when using these, because if you access unaligned floats you may cause an exception.




    As you can see here the SIMD (NEON) extension adds 16 – 128 bit registers (q0-q15) onto the floating point registers.  So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.


    In part 2 we will discuss how Windows utilizes this processor.

  • Ntdebugging Blog

    Easily Resolving an Event Viewer Error using a Process Memory Dump


    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today I will be blogging about how I used the SOS  .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue.  Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.


    Issue Reported:  The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:


                           "MMC could not create the snap-in."

    MMC could not create the snap-in.

    The snap-in might not have been installed correctly

    Name: Event Viewer

                   CLSID: FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}


    First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before.  It may occur when the following registry key gets deleted or corrupted:


                    HKLM\software\Microsoft\MMC\SnapIns\FX:{ b05566ad-fe9c-4363-be05-7a4cbb7cb510}


    I had the customer export this key and found that it was not corrupted in any way. I verified that all data was as expected


    Next,  a memory dump of the mmc.exe process was collected.  The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File" . If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.




    Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL:  http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx



     Now let's have a look at the debug.


    1.          First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.


    It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.



    2.            I noticed the following output from the debugger after it loaded the dump file:


    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)



    3.            Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:


    0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols


    However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL.  Here’s the syntax: 


    0:011> .symfix c:\websymbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols


    4.            To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.


    0:005> !peb

    PEB at 000007fffffdb000


       CommandLine:  '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '


    You could alternatively dump the  CommandLine from the nt!_PEB using the dt command


    0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000


       +0x020 ProcessParameters              :

          +0x070 CommandLine                    : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "


    5.            Next, I then dumped out all of the threads in this process and found the following thread contained a stack that was raising a .Net Framework exception


    0:011> ~* kL

    ... (ommitted the non-relevent threads)


    # 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

    ChildEBP RetAddr 

    0691f03c 7343a91c kernel32!RaiseException+0x58

    0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8

    *** WARNING: Unable to verify checksum for MMCEx.ni.dll

    0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf

    *** WARNING: Unable to verify checksum for mscorlib.ni.dll

    0691f1e8 69926cf6 MMCEx_ni+0xd0b5a

    0691f1f4 6993019f mscorlib_ni+0x216cf6

    0691f208 69926c74 mscorlib_ni+0x22019f

    0691f220 733d1b4c mscorlib_ni+0x216c74

    0691f230 733e21b1 mscorwks!CallDescrWorker+0x33

    0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3

    0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c

    0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f

    0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a

    0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192

    0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a

    0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3

    0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a

    0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e


    6.            Out of curiosity, I also ran the Get Last Error command


    0:011> !gle

    LastErrorValue: (Win32) 0 (0) - The operation completed successfully.

    LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0


    7.            After this, I ran analyze -v to see what helpful information the debugger would provide. The debugger did output exception information but informed me that I needed to use the x86 debugger instead.


    0:011> !analyze -v


                            EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)

                            ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)

                              ExceptionCode: e0434f4d (CLR exception)

                              ExceptionFlags: 00000001

                            NumberParameters: 1

                            Parameter[0]: 80131604


                                                    Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.

                            FAULTING_THREAD:  00000a2c

                                                    PRIMARY_PROBLEM_CLASS:  CLR_EXCEPTION


    8.            I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension.  This extension ships in the Operating System along with the .Net Framework.  On most occasions, I would have initiated the loading of the extension through the use of the following syntax:


     0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll




    0:011> .load  c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll


    However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load.  This is due to the fact that the version of SOS must match the version of the CLR loaded into that process.  Here’s the recommended syntax: 


                            0:011>.loadby sos mscorwks


    I always verify that my extensions are loaded properly by using the  .chain command.


    0:011> .chain

    ...   Extension DLL chain:

                            C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007


    9.            Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:


    0:011> !help


    SOS is a debugger extension DLL designed to aid in the debugging of managed

    programs. Functions are listed by category, then roughly in order of

    importance. Shortcut names for popular functions are listed in parenthesis.

    Type "!help <functionname>" for detailed info on that function.


    Object Inspection                  Examining code and stacks

    -----------------------------      -----------------------------

    DumpObj (do)                       Threads

    DumpArray (da)                     CLRStack

    DumpStackObjects (dso)             IP2MD

    DumpHeap                           U

    DumpVC                             DumpStack

    GCRoot                             EEStack

    ObjSize                            GCInfo

    FinalizeQueue                      EHInfo

    PrintException (pe)                COMState

    TraverseHeap                       BPMD


    10.            Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:


    0:011> !pe 771a42eb

    Invalid object

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !pe -nested 771a42eb

    Invalid object


    Nested exception -------------------------------------------------------------

    Exception object: 040a676c

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more

    StackTrace (generated):

        SP       IP       Function


    StackTraceString: <none>

    HResult: 80131604


    0:011> !PrintException 040a6a20

    Exception object: 040a6a20

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131604

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !PrintException 040a6cf8

    Exception object: 040a6cf8

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Configuration system failed to initialize

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !PrintException 040a7174

    Exception object: 040a7174

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Unrecognized configuration section system.web/myInvalidData  

    InnerException: <none>

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details


    11.            Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData  inside of it. I then re-ran !analyze -v against the dump again (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically. It then automatically displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.


                0:011> !analyze -v


                            EXCEPTION_MESSAGE:  Unrecognized configuration section system.web/myInvalidData.

                                                    MANAGED_OBJECT_NAME:  System.Configuration.ConfigurationErrorsException

                                                    FAULTING_THREAD:  00000a2c


    0:011> ~

       0  Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen

       1  Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen

       2  Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen

       3  Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen

       4  Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen

       5  Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen

       6  Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen

       7  Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen

       8  Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen

       9  Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen

      10  Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen

    . 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen


    12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for     


                "unrecognized configuration section" system.web site:microsoft.com


    This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE


    By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications.  This is due to invalid entries in the various .config  files used in .Net. Looking through the posts, different configuration file names and paths were observed.


    So, I opened up the process monitor logfile to see which configuration files we were reading data from. I added filter criterion to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config,  and a successful status result. It's best to be as specific as possible.



    I found that we were reading in a large amount of settings over and over again from the .net Framework global configuration file:             


                    (on x64 this would be     C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)


    Final Step- Putting it all together, Reproducing the issue, & confirming resolution : Using notepad, a quick search of the suspect xml file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid section which may have been related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file. Doing so, I successfully reproduced the issue on my system. I then contacted the customer and asked if they had by any chance added any settings under the <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.


    The customer informed me that, per the request of their ASP.net developer, they had in fact added settings to that section of the file.  By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings that were present in this file should not have been present inside of <system.web> . The settings were moved to the proper location per the developer and the issue was resolved.


    Here are the steps I used to reproduce the issue in case you are attempting to replicate this at home-


    A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:

         (please make a backup copy first in case you make a mistake)  


         OR   (Open the version that matches the architecture of your platform )




    B. Find the section <system.web> in this file (you can use the find function in notepad):


    C.  Add the following line directly after <system.web> as shown in the example below:



    D. Save the file and then open eventvwr.msc and verify that the following error is displayed:



    Hopefully this blog has demonstrated an example of how you can use the "create dump file" feature of Windows 2008, windbg, and other related tools in an attempt to gain more specific data when your error message is not revealing the source of the problem. Feel free to post any questions, comments, or concerns.


  • Ntdebugging Blog

    Getting Ready for Windows Debugging



    Welcome to the Microsoft NTDebugging blog!  I’m Matthew Justice, an Escalation Engineer on Microsoft’s Platforms Critical Problem Resolution (CPR) team.  Our team will be blogging about troubleshooting Windows problems at a low level, often by using the Debugging Tools for Windows.  For more information about us and this blog, check out the about page.


    To get things started I want to provide you with a list of tools that we’ll be referencing in our upcoming blog posts, as well as links to some technical documents to help you get things configured.


    The big list of tools:


    The following tools are part of the “Debugging Tools for Windows” – you’ll definitely need these


    ·         windbg

    ·         cdb

    ·         ntsd

    ·         tlist

    ·         gflags

    ·         adplus

    ·         UMDH

    ·         symcheck


    Sysinternals provides some great tools that we’ll be discussing


    ·         Process Explorer

    ·         Process Monitor

    ·         Regmon

    ·         Filemon

    ·         DbgView

    ·         Handle.exe

    ·         Tcpview

    ·         LiveKD

    ·         AutoRuns

    ·         WinObj


    There are many tools contained in “MPS Reports” (MPSRPT_SETUPPerf.EXE), but I’m listing it here specifically for Checksym


    ·         Checksym


    “Windows Server 2003 Resource Kit Tools” is another great set of tools.  In particular Kernrate is a part of that package


    ·         Kernrate


    Windows XP SP2 Support Tools


    ·         netcap

    ·         poolmon

    ·         memsnap

    ·         tracefmt  (64-bit versions available in the DDK)

    ·         tracelog

    ·         tracepdb

    ·         depends

    ·         pstat


    “Visual Studio “ – in addition to the compilers and IDE, the following tools come in handy:

    ·         SPY++

    ·         dumpbin


    Perfwiz (Performance Monitor Wizard)






    Userdump (User Mode Process Dumper)



    Dheapmon (Desktop Heap Monitor)



    Netmon 3.0

    §  Go to http://connect.microsoft.com/

    §  Sign in with your passport account

    §  Choose "Available Connections" on the left

    §  Choose "Apply for Network Monitor 3.0” (once you've finished with the application, the selection appears in your "My Participation" page)

    §  Go to the Downloads page (On the left side), and select the appropriate build 32 or 64 bit build.




    Some articles you may find useful:


    Debugging Tools and Symbols: Getting Started



    Boot Parameters to Enable Debugging



    How to Generate a Memory Dump File When a Server Stops Responding (Hangs)



    After installing the “Debugging Tools for Windows”, you’ll find two documents at the root of the install folder that are helpful:


    ·         kernel_debugging_tutorial.doc - A guide to help you get started using the kernel debugger.


    ·         debugger.chm - The help file for the debuggers.  It details the commands you can use in the debugger.  Think of this as a reference manual, rather than a tutorial.

  • Ntdebugging Blog

    The Mystery of Lsass.exe Memory Consumption, (When all components get involved)


    Hi All, this is Karim Elsaid and I’m a Support Escalation Engineer working with the Dubai platforms support team.  Recently I was working on a very challenging and interesting case, and I wanted to share that experience with you.


    One of our customers was experiencing a problem on all his Domain Controllers (Running x86 Windows Server 2003), where intermittently the DCs ran out of resources.  When one of the servers ran out of resources it had to be restarted to restore its functionality.  The following event was logged on the DC when the problem occurred:


    Event Type: Error
    Event Source: NTDS General
    Event Category: Internal Processing
    Event ID: 1169
    User: N/A
    Active Directory could not allocate the needed amount of memory.

    Memory (bytes):

    Active Directory will continue to operate, but may not function correctly.

    User Action
    Restart this domain controller. If this condition continues, increase the available
    physical or virtual memory.

    Additional Data
    Internal ID:
    For more information, see Help and Support Center at


    This event gets logged when the Local Security Authority Subsystem (lsass.exe) process is unable to allocate memory from the heap.


    When this problem occurs, users are not able to authenticate and Exchange servers cease functioning until the DC gets rebooted.


    By the time we got the call, the DC that was exhibiting the issue had been rebooted, hence when we checked the lsass.exe memory usage it was very low, but when we checked another DC on the same site, the lsass.exe memory usage was around 1.7GB.  Well, that’s pretty high but it is worth mentioning that by default lsass.exe may utilize a large amount of virtual memory for the ESENT Cache (Active Directory database), nevertheless this should not cause any issues on DCs as it should back-off when needed.


    The issue here is that the memory usage for lsass.exe kept growing and never got released back again - this is called a process memory leak.


    To confirm our theory we obtained a Perfmon Log from the DC and checked how Lsass.exe is doing with memory, which proved the theory that lsass.exe is actually leaking memory.

    Performance Monitor


    We can see from the graph above a steady increase in lsass.exe private bytes.  The DC was booted without /3GB and hence the limit of the user mode process address space is 2GB.


    For more information on this topic, please check the previous post “The Memory Shell Game


    Now, we need to explore the options on how to track and address such leaks. We could obtain an lsass.exe process dump and check the heap, or obtain a Full memory dump of the server while in the failed state, to check the lsass.exe heap usage and overall system state, but the thing is none of the above may immediately reveal what’s going on.  We would need to be very lucky to get the dump while the actual memory call that happens inside lsass.exe process is executing, which is very slim.


    What we are looking for is a “pattern”, the pattern which is causing the actual leak inside the lsass.exe process.


    For that we will turn to a tool that I see many of the customers are not aware of, but in fact it is very handy and easy to troubleshoot such a memory leak problem. We will use the “User Mode Dump Heap” or UMDH. By using this tool it will do the job for us, it will monitor the all heap calls that happen in the lsass.exe process.  All we need to do is to download the “Debugging Tools For Windows”, enable user mode stack acquisition from the command line, download the symbols from Microsoft public symbol servers, reboot the server and we are good to go.


    You can find information on UMDH; how to set it up and use it “Here


    So now, all heap blocks that the lsass.exe process allocates will get monitored by UMDH, and the tool will tell us information about that specific call stack.  For example, the number of allocations using this stack, number of bytes consumed, etc…


    Back to our main problem: We have setup the DC for UMDH and started monitoring the lsass.exe process.  We verified UMDH is working after the reboot by generating a sample log -  it is very critical to generate this simple log to make sure that you are able to resolve the symbols correctly.


    We kept generating UMDH logs daily and then the interesting pattern started to appear, which shows:


    + 123097760 ( 123098480 -    720) 1538731 allocs      BackTrace4BA0

    + 1538722 ( 1538731 -      9) BackTrace4BA0     allocations

























    From the backtrace 4BA0 we can see that there are ~1.5 heap allocation calls that have occurred without being freed from the first day we started to monitor lsass using UMDH.


    There is also another call stack with exactly the same symptoms:


    + 51525199 ( 51525506 -    307) 1538802 allocs  BackTrace4B92

    + 1538793 ( 1538802 -      9) BackTrace4B92     allocations





          RPCRT4!operator new+0000000D























    From both call stacks above we can see that the originating function that leads to the heap allocation is “NTDSA!NspiBind


    NSPI is the interface which the Exchange Address Book provider uses to talk to the Global Catalogs. Typically when Outlook clients want to talk to the directory for address book, they talk to the Exchange server directly and then the Exchange server will look up the appropriate GC, forward the request to it and finally send the result back to the originating clients.  This method is called the DSproxy method.  Outlook clients can also send referral requests (RFR) to the Exchange server to get a referral to a GC and then start talking to this GC directly.  You can find more information about this operation “Here



    With this new information in hand, we turned our attention to the Exchange server for a while. We did so by obtaining a netmon trace from the Exchange server, and started examining it to reveal a massive number of RFR requests originating from outlook client machines toward the Exchange server:

    3795  11:50:59.1426170  10.3876170        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:535, TCP:488, IPv4:118}

    3796  11:50:59.1426170  10.3876170        EXCHSERVER    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:502, TCP:496, IPv4:326}

    3797  11:50:59.1426170  10.3876170        EXCHSERVER    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:509, TCP:508, IPv4:326}

    3798  11:50:59.1426170  10.3876170        EXCHSERVER    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:494, TCP:484, IPv4:326}

    9807  11:51:13.1894920  24.4344920    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1144, TCP:1142, IPv4:326}

    10084 11:51:13.3769920  24.6219920        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:1145, TCP:1143, IPv4:118}

    10085 11:51:13.3769920  24.6219920        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:1145, TCP:1143, IPv4:118}

    10086 11:51:13.3926170  24.6376170        EXCHSERVER    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:1144, TCP:1142, IPv4:326}

    10103 11:51:13.3926170  24.6376170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1182, TCP:1180, IPv4:326}

    10108 11:51:13.3926170  24.6376170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1151, TCP:1149, IPv4:326}

    10111 11:51:13.3926170  24.6376170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1200, TCP:1196, IPv4:326}

    10115 11:51:13.3926170  24.6376170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1164, TCP:1163, IPv4:326}

    10148 11:51:13.4863670  24.7313670 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10149 11:51:13.4863670  24.7313670        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10172 11:51:13.4863670  24.7313670    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1199, TCP:1197, IPv4:326}

    10174 11:51:13.4863670  24.7313670    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1187, TCP:1186, IPv4:326}

    10180 11:51:13.4863670  24.7313670    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1170, TCP:1168, IPv4:326}

    10183 11:51:13.4863670  24.7313670    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1153, TCP:1152, IPv4:326}

    10185 11:51:13.4863670  24.7313670        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10186 11:51:13.4863670  24.7313670        EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    10224 11:51:13.5176170  24.7626170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1209, TCP:1208, IPv4:326}

    10227 11:51:13.5176170  24.7626170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1192, TCP:1190, IPv4:326}

    10232 11:51:13.5176170  24.7626170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1159, TCP:1157, IPv4:326}

    10235 11:51:13.5176170  24.7626170    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1175, TCP:1173, IPv4:326}

    10240 11:51:13.5176170  24.7626170 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10241 11:51:13.5176170  24.7626170        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10251 11:51:13.5332420  24.7782420        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10255 11:51:13.5332420  24.7782420        EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}



    In the netmon capture we saw an excessive number of NspiBind requests been sent from the Outlook client machines to the Exchange server and Exchange server in turn sending these NspiBind requests to the GC, in the netmon capture we didn’t find any frames for NspiUnBind requests, which should be used to destroy that connection.


    When we checked the Exchange server security logs we found a massive number of logon events that are occurring for every single client:

    Event Type: Success Audit

    Event Source:     Security

    Event Category:   (2)

    Event ID:   540

    Date:       5/26/2010

    Time:       3:49:09 PM


    Computer:   EXCHSERVER


    Successful Network Logon:

          User Name:  User_Name

          Domain:           CONTOSO.COM

          Logon ID:         (0x0,0x3DA1F42)

          Logon Type: 3

          Logon Process:    Kerberos

          Authentication Package: Kerberos

          Workstation Name:

          Logon GUID: {b03f9ad2-3150-e1be-f50a-b3abb9d2c09d}

          Caller User Name: -

          Caller Domain:    -

          Caller Logon ID:  -

          Caller Process ID: -

          Transited Services: -

          Source Network Address:

          Source Port:      0


    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


    So, now we know the problem of lsass.exe memory consumption on the DCs is actually occurring because of a massive number of NspiBind requests that are being sent to the DCs without proper NspiUnbind requests.


    So we turned our attention to the client machines that have Outlook installed and started to list all 3rd party add-ins that are installed on them.


    We found that all affected client machines are actually sharing 3rd party add-ins for email content archival. Further investigation on this software revealed a strange MAPI Wrapper DLL library, and our customer confirmed that the start time of the problem approximately matches the time they have rolled out the content archival software on the client machines.


    We uninstalled this content archival add-in from a couple of machines and these machines stopped hammering the Exchange Server and GC with NspiBind requests.


    Later we uninstalled it from the rest of the client machines and the lsass.exe process on the DCs has finally stabilized and we no longer have any issues with its memory usage.


    I’d like to mention that starting with Windows Server 2008 we have implemented a “NSPI Max Sessions Per User” where you can limit the number of NSPI requests from a client to the server.


    You can read more information about this new functionality “Here


    That’s it! I hope you enjoyed reading and hopefully this should help you to troubleshoot any memory leaks you may encounter.  The information contained here can be applied to other processes when you suspect heap leaks.


    See you again on another adventure.


    Best Regards,

    Karim Elsaid

Page 2 of 25 (242 items) 12345»