• Ntdebugging Blog

    Using Netmon to figure out the source of high CPU in WMIprvse.exe

    • 7 Comments

    Recently I was working with a customer who had a file server experiencing high CPU in the WMIprvse.exe process. We received multiple user dumps and noted that someone or something was running the same query again and again. We needed to figure out what was running the query in a tight loop, causing the high CPU.

    image

    Figure 1 - Task Manager on FileServer

    Before we get into the exact troubleshooting steps, let me provide some background on WMI.  Winmgmt is the WMI service within a SVCHOST process running under the LocalSystem account.  In all cases, the WMI service automatically starts when the first management application or script requests connection to a WMI namespace. For more information, see Starting and Stopping the WMI Service. To avoid stopping all the services when a provider fails, each provider is loaded into a separate host process named "Wmiprvse.exe". This also allows each instance of Wmiprvse to run under a different account with varying security. For more details you can look at the MSDN documentation on WMI.

     

I dumped out all the services in the various svchost.exe processes. You can do this from a command prompt by running the tasklist /svc command. In my instance, I found that the WinMgmt service was running in svchost.exe with PID 452 (the PID number will vary). Someone had to be making RPC calls to this svchost.exe process to run the WMI queries. It could be some local process on the machine; it could even be a process on a remote machine.
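If you ever want to script that lookup rather than eyeball the tasklist output, something along these lines works. This is just a convenience sketch (assuming Python is available on the box); the only input is the service name winmgmt:

import csv
import subprocess

# Ask tasklist for every process plus the services it hosts, in CSV form.
output = subprocess.check_output(["tasklist", "/svc", "/fo", "csv"], text=True)

# Columns are: Image Name, PID, Services
for image, pid, services in csv.reader(output.splitlines()[1:]):
    if "winmgmt" in services.lower():
        print(f"{image} (PID {pid}) is hosting the WinMgmt service")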

     

    At this point I requested user dumps of PID 452 from the customer.  This would allow me to determine who was making the RPC calls to svchost.exe to run the WMI queries.  While the customer was uploading the dumps, we decided to get a Network Monitor trace to see if the RPC calls were coming over the network.

     

Immediately, I could see a lot of RPC traffic to the svchost.exe process (PID 452).

     

    image

Figure 2 - Network Monitor Output from the FileServer. Notice the source and destination ports and IP addresses. The IP addresses are hidden by the aliases.

    Looking at the RPC payload, I could see the text of the WMI query. You can see this in the Hex Details Pane. The query that was running in a loop was Select * from Win32_Process. Looks like I found the source of the WMI queries.
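For illustration, a script that hammers WMI like this does not have to be complicated. The customer's culprit turned out to be a WSH script running under wscript.exe (as we will see below), but a hypothetical reconstruction in Python (using the pywin32 COM bindings, and a made-up server name) would boil down to:

import win32com.client   # pywin32; purely illustrative, the real culprit was a WSH script

# Connect to the default WMI namespace on the remote file server (server name is made up).
wmi = win32com.client.GetObject(r"winmgmts:\\FILESERVER\root\cimv2")

while True:   # no sleep and no exit condition: the same expensive query runs back to back
    for process in wmi.ExecQuery("Select * from Win32_Process"):
        pass   # the results were never the problem; the constant re-querying was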

     

    At this point, we got the source IP for the RPC packets. We logged into the machine, and brought up the Task Manager.

     

    image

Figure 3 - Task Manager on Remote Machine (Machine1)

Immediately we saw that there was some script running inside a Wscript.exe process.  At this point I was pretty sure that this script was the culprit. The customer was not sure what this was, and was not comfortable terminating the process.  To prove my suspicion, I had him open a command prompt and run the following command: netstat -ano.

     

    image

    Figure 4 - Netstat output from Remote Machine

     

From the output in Fig. 4, I could see a TCP connection created by PID 3532 (wscript.exe). The local and foreign addresses in that output matched exactly what we were seeing in the Network Monitor trace.

     

    In the above case, we already had our suspicions on the wscript.exe process; however, sometimes it might not be that easy. In that case, we could have used the netstat output to look at all connections to the file server (157.59.123.121). If there were multiple connections, then we can also narrow it down by the port number. Based on that, we could have found the PID responsible.
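If the owning process had not been as obvious as a lone wscript.exe, that narrowing-down could also be scripted. Here is a rough sketch; the file server address comes from the trace, the rest is assumption:

import subprocess

FILE_SERVER = "157.59.123.121"   # address of the file server seen in the Network Monitor trace

output = subprocess.check_output(["netstat", "-ano"], text=True)

for line in output.splitlines():
    fields = line.split()
    # TCP rows look like: Proto  Local Address  Foreign Address  State  PID
    if len(fields) == 5 and fields[0] == "TCP" and fields[2].startswith(FILE_SERVER):
        print(f"PID {fields[4]} owns the connection {fields[1]} -> {fields[2]}")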

     

The customer called me later in the day and told me that they had recently updated their scripts. One of their scripts had a bug that ran a WMI query in a tight loop. Fixing the script made the problem go away.

     

Had the query been coming from a local process, I would have had to debug the svchost.exe process and figure out who was making the WMI calls. However, since we could see the traffic in Network Monitor, we didn't need to use the debugger. An interesting way to get to the root of the problem without using a debugger!

     

     

  • Ntdebugging Blog

    Event ID 157 "Disk # has been surprise removed"

    • 14 Comments

Hello, my name is Bob Golding, and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk.  This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.

     

    What does this error mean?

The PNP manager performs what are called enumerations.  An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of those devices.  The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.

     

These enumerations can happen for a number of reasons.  For example, hardware can request an enumeration when it detects a change in configuration.  A user can also initiate an enumeration by selecting “Scan for hardware changes” in Device Manager.

     

    When an enumeration request is received, the bus driver will rescan the bus for all devices.  It will issue commands to the existing devices as though it was looking for new ones.  If these commands fail on an existing unit, the driver will mark the device as “missing”.  When the device is marked “missing”, it will not be reported back to PNP in the inventory.  When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.

     

    Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable.  An example of a non-removable disk is a hard drive on a SCSI or IDE bus.  An example of a removable disk is a USB thumb drive.

     

Previously, nothing was logged when a non-removable disk was removed; as a result, disks would disappear from the system with no indication.  The Event ID 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.

     

    Why does this error happen?

These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem.  The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running.  An administrator who sees these errors needs to verify the health of the disk subsystem.

     

    Event ID 157 Example:

     

  • Ntdebugging Blog

    The Memory Shell Game

    • 12 Comments

Hello, this is Somak.  Today I’d like to drop some Memory Manager info on the blog that I’ve used to communicate in brief (believe it or not) how the system deals with memory.  If you are ever faced with checking how much available memory you have (or don’t have), poor system performance, questions about page faults, a working set trimming performance issue, or if you just want a primer into how Windows manages memory on the system, be not afraid and read on!

    How this memory stuff works

The fastest and most expensive memory is built into the CPU.  The next fastest and less expensive is physical RAM.  Next we have the hard drive, followed by remote storage or backup.  Each step down the ladder adds significantly to access time.  For example, physical RAM can be almost one million times faster than a hard disk.  Since it is volatile and costs considerably more than hard disk space, we are limited in how much we can put in the system.  This limit is far less than the much slower and non-volatile hard disk.  Due to these design constraints we have a tiered memory structure in the computer.  To achieve the fastest response times and best overall performance, an operating system must efficiently use this tiered memory structure.  It must do its best to reduce the need to retrieve data from a slower storage medium like the hard disk.  It has to do this while juggling the memory and I/O demands of all running processes.  The following paragraphs are an overview of how the Memory Manager achieves this in Windows.  A more detailed description can be found in Chapter 7 of Microsoft Windows Internals, Fourth Edition (ISBN: 0-7356-1917-4).  This book is a great source for how stuff works in Windows.

    First let’s lay out some definitions.  They will be useful later on when I talk about the interactions.  These definitions are high level to maintain brevity.

Virtual Memory – This is the memory that an operating system can address.  Regardless of the amount of physical RAM or hard drive space, this number is limited by your processor architecture.  On a 32 bit processor you are limited to 4 GB of addressable virtual memory (2^32).  With a default installation on a 32 bit box (not using /3GB) the kernel reserves 2GB for itself.  Applications are left with 2GB of addressable virtual memory.  When applications execute they are only presented with this 2GB of addressable memory.  Each application gets its own 2GB virtual memory to play with.  If you have 50 processes running, you’ll have 50 independent 2GB virtual memory address spaces and one 2GB virtual address space for the kernel.  This is possible because virtual memory always exists, but doesn't really exist (hence the term virtual).  We basically lie to the application and say, here is 2GB of memory address space for you to use.  This memory isn’t allocated until the application explicitly uses it.  Once the application uses the page, it becomes committed.  A virtual memory page is then translated to a physical memory page.  From this translation, the virtual page can reside in physical RAM or on the hard disk.

    Physical Memory –  This is the physical storage media.  It can be physical RAM, the hard disk, optical disks, tape backups, etc.  This is anything that can store data.  Most times when people talk about Physical Memory, they refer to physical RAM (the memory sticks on your motherboard), but with virtual page translation, physical memory can also be on the hard drive (in your paging file).  Physical RAM is limited by your processor architecture.

Committed Memory –  When an application touches a virtual memory page (reads/writes/programmatically commits), the page becomes a committed page.  It is now backed by a physical memory page.  This will usually be a physical RAM page, but could eventually be a page in the page file on the hard disk, or it could be a page in a memory mapped file on the hard disk.  The memory manager handles the translations from the virtual memory page to the physical page.  A virtual page could be located in physical RAM, while the page next to it could be on the hard drive in the page file.

    Commit Limit –  This is the maximum amount of memory that all your applications and the OS can commit.  If you had 50 applications fully allocate their 2 GB of virtual address space, you would need 100GB of commit limit (ignore kernel memory usage to keep the numbers simple).  So 100GB of committed pages can be backed by physical RAM or the hard drive.  If you have 100 GB of RAM, you could handle this memory load.   In most cases, 100 GB of RAM isn't economically feasible so the Commit Limit is comprised of physical RAM and the page file.  If you have 2 GB of physical RAM and 98 GB of page file, then your commit limit would be 100 GB.

    Page file –   This is the storage area for virtual memory that has been committed.  It is located on the hard drive.  Since hard drive space is cheaper than physical RAM, it is an inexpensive way to increase the commit limit.

    Working Set – This is a set of virtual memory pages (that are committed) for a process and are located in physical RAM.  These pages fully belong to the process.  A working set is like a "currently/recently working on these pages" list.

    Modified pages - Once a virtual memory page leaves the process's working set, it is moved to another list.  If the page has been modified, it is placed on the modified page list.  This page is in physical RAM.  A thread will then write the page to the page file and move it to the standby list.

    Standby pages - This is a page that has left the process' working set.  This page is in physical RAM.  A standby page is like a cache for virtual memory pages.  It is still associated with the process, but not in its working set.  If the process touches the page, it is quickly faulted back into the working set.  That page also has one foot out the door.  If another process or cache needs more memory, the process association is broken and it is moved to the free page list.  Most of the pages in available memory are actually standby pages.  This makes sense when you realize that these pages can be quickly given to another process (hence available), but you should also understand that they are page caches for working sets and can be quickly given back if the process touches the page again.  The vast majority of available memory is not wasted or empty memory.

Free pages - When a page is taken off of the standby page list, it is moved to the Free page list.  This page is in physical RAM.  These pages are not associated with any process.  When a process exits, all of its pages are then dumped onto this list.  Typically, there are very few (if any) free pages hanging around in physical RAM.

Zeroed pages - When a free page is zeroed out, it is placed on the Zero page list.  This page is in physical RAM.  These are the pages that are given to processes that are making memory allocations.  Due to C2 security requirements, all pages must be scrubbed before being handed to a new process.  When the system is idle, a thread will scrub free pages and put them on this list.  Only a small amount of zero pages are required to handle the typical small memory allocations of processes.  Once this list is depleted, and if there is demand for more pages, we pull pages off of the Free page list and scrub them on the fly.  If the Free page list is depleted, then we pull pages off of the standby list, scrub them on the fly and hand them to the new process.

    What is Task Manager telling me?

Prior to Windows Vista, Task Manager reported memory usage using accounting methods that you probably are not expecting.  It is because of these accounting practices that we rarely use Task Manager to gauge system performance and memory usage.  We typically use it for a quick overview or to kill processes.  I highly recommend using Performance Monitor (perfmon.msc) for investigating performance issues.  Here's the breakdown of the numbers on the Performance tab:

    Physical Memory

Total - This is the total physical RAM installed in the system.

Available - This is the total of the Standby, Free and Zeroed lists.  Free and Zeroed make sense, but Standby seems odd at first.  Standby pages were added to this number because they can be quickly scrubbed and given to a new process.  So they are technically available (with minimal effort).

System Cache - This is the total of the Standby list and the size of the system working set (which includes the file cache).  Standby pages are added to this number because they are cached pages for working sets.

    PF Usage - This is the total number of committed pages on the system.  It does not tell you how many are actually written to the page file.  It only tells you how much of the page file would be used if all committed pages had to be written out to the page file at the same time.

    Commit Charge

    Total - This is the total virtual memory that has been committed.  This includes all committed memory for all processes and the kernel.

    Limit - This is the maximum amount of committed memory this system can handle.  This is a combination of physical RAM and the page file.

    Peak – This is the highest amount of memory committed thus far on this system, since boot.

    How does this work?

The memory manager optimizes physical RAM usage across the entire system.  Since physical RAM is a finite resource, it has to balance sharing this critical resource amongst all processes, the kernel and file I/O.  It tries to keep disk I/O to a minimum, which results in a more responsive system.  It does this by moving pages around to meet the demand of the system.

Typically, large sections of physical RAM are used for file cache.  This cache is necessary to improve disk performance.  Without it, disk I/O would make the system crawl along at a nearly unusable pace.  The file system cache is just like a working set for a process.  Pages removed from the file cache are moved to the standby or modified page list.  Many of the standby pages in RAM are probably file cache pages that were removed from its working set.  For example, on a file server, if you see 8 GB of available memory, most of these pages are probably standby pages for the file cache.  The file cache's working set could be 500 MB, but the 8 GB of standby pages should also be considered part of the file cache.

Now let's take a look at how the memory manager handles processes.  While an application is working with its virtual memory pages, the memory manager keeps the pages in the process' working set.  Since the vast majority of applications do not use all of their memory all the time, some pages will age.  Old pages are removed from the working set.  If they are modified, they are moved to the modified list.  The page is saved to the page file and moved to the standby list.  If the page hasn't been modified, it is moved directly to the standby list.  These pages will remain on the standby page list until there is demand for them.

If the application touches the page again, it is soft faulted back into the process' working set.  If the process doesn't use the page for a very long time, or if the demand for the page is greater elsewhere, the page is moved off of the standby list.  It is disassociated from the process and moved to the free page list.  From the free page list, the page is scrubbed (on demand or lazily) and placed on the zero page list.  It is from the zero page list that other processes, the kernel, or the file cache will get a new page.

If after a very long time the application once again needs a page that is not in its working set, the memory manager will handle the memory fault.  If the page is on the standby list, it is quickly put back into the process' working set.  If the page is no longer in the standby list, a hard fault occurs.  The memory manager issues I/O to the hard disk to read the page(s) from the page file.  Once the I/O completes, the page is placed back into the process' working set.

All of this is done to keep physical RAM highly utilized and disk I/O to a minimum.  We don't want to allow processes to hoard physical RAM for pages that are rarely used.  The physical RAM must be shared with other processes, the kernel and the file cache.  If you see lots of available memory on your system, rest assured that it is not going to waste.  The vast majority is on standby lists for processes and the file cache.

    Also note that page file usage isn't that bad.  The page file allows the Memory Manager to save modified pages before placing the page on the standby list.  The page is still in physical RAM and can be quickly faulted back into the process.  This method gives the process a chance to reclaim an old page and it allows the page to be quickly used if there is demand elsewhere.

    The best way to see the totals of these lists is to use a kernel debugger (live or postmortem).  Use the !memusage command and you'll get an output like this:

    0: kd> !memusage

    loading PFN database

    loading (100% complete)

    Compiling memory usage data (99% Complete).

Zeroed:              414 (   1656 kb)
Free:                  2 (      8 kb)
Standby:          864091 (3456364 kb)
Modified:            560 (   2240 kb)
ModifiedNoWrite:      30 (    120 kb)
Active/Valid:     182954 ( 731816 kb)
Transition:            2 (      8 kb)
Bad:                   0 (      0 kb)
Unknown:               0 (      0 kb)
TOTAL:           1048053 (4192212 kb)

    Of the 4GB of physical RAM, only 1.6 MB are on Zeroed or free pages.  731 MB is in process, system and file cache working sets.  2 MB are on the modified page list.  The vast majority, 3.4 GB, is on the standby list.  On this server, most people will see 3.4 GB of wasted physical RAM, but you will know better.
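The conversion from the kb figures above to that summary is simple arithmetic; here is a quick sketch if you want to sanity-check a !memusage output yourself:

# kb values copied from the !memusage output above
lists_kb = {
    "Zeroed": 1656,
    "Free": 8,
    "Standby": 3456364,
    "Modified": 2240,
    "ModifiedNoWrite": 120,
    "Active/Valid": 731816,
    "Transition": 8,
}

total_kb = sum(lists_kb.values())   # 4192212 kb, matching the TOTAL line
for name, kb in lists_kb.items():
    print(f"{name:16} {kb / (1024 * 1024):7.2f} GB  ({kb / total_kb:6.1%} of RAM)")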

    What should I be worried about?

    Typically you shouldn't be worried about these things, until you have a performance problem.  If your system is sluggish or slow to respond or you are getting errors about out of memory, then you need to rely on this information.  You will need to collect a performance monitor log of the problem time.  If the counter list is daunting, then use the Performance Monitor Wizard to configure the performance monitor log.

    Once the log is collected, you'll need to analyze several counters.  I'm not going into detail about how to review performance monitor logs this time.  I'll save that lengthy topic for another time.  For now I'll focus on the counters relevant to the Memory Manager.

    One of the biggest reasons for slow performance and sluggish system responsiveness is disk bottleneck.  Look at Physical Disk\% Idle Time, Avg. Disk sec/Read and Avg. Disk sec/Write counters.  If your system drive is under 50% idle or your disk response times are way above your drive specifications, then you need to investigate further.  Look at the Memory\Available Mbytes.  You should have a couple hundred Mbytes of Available Memory.  This is one of the most important performance monitor counters.  If this number drops too low, your standby lists, process working sets and cache will be greatly reduced.  You'll need to find out if a process is consuming physical RAM.  Check for large process working sets or for large file cache.

You will also need to see if paging is really affecting system performance.  Take a look at Memory\Pages Input/sec and correlate that to Physical Disk\Avg. Disk sec/Read.  Pages Input/sec is the number of pages being read in from the page file.  These are the hard faults (when the page wasn't on the standby list).  If your Avg. Disk sec/Read is close to your drive's specification and the drive's idle time is high, then paging really isn't a problem.  Small numbers of hard faults are expected, as applications will every once in a while re-touch an old page.  As long as this I/O is not constant and the disk can keep up, you probably will not notice the impact.

    You can also look at Memory\Pages Output/sec and Physical Disk\Avg. Disk sec/Write.  These are the page commits to the page file when a modified page is removed from a process' working set.  As long as the disk can keep up with the writes, this shouldn't be a problem.  Remember that once the page is saved to the page file, it is placed on the standby list.  If there isn't a great demand for new pages, it can remain on the standby list for a very long time.  When the process touches the old virtual page again, it can be soft faulted back into the working set.  If there is great demand for memory, you'll see process working sets aggressively being trimmed.  Unless there is memory pressure, this is all done with lazy I/O, so you should not see much of an impact from this activity.
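If you relog the binary perfmon log to CSV (relog.exe can do this), a quick pass over the samples can flag the conditions described above.  A rough sketch follows; the file name and the exact counter paths (which will include your machine name in a real export) are assumptions:

import csv

AVAILABLE_MB_FLOOR = 200    # "a couple hundred MBytes" of available memory
DISK_IDLE_FLOOR = 50.0      # flag samples where the disk is under 50% idle

with open("perflog.csv", newline="") as f:          # CSV produced by relog.exe from the .blg
    reader = csv.DictReader(f)
    timestamp_column = reader.fieldnames[0]          # first column holds the sample timestamp
    for row in reader:
        for counter, value in row.items():
            value = value.strip() if value else ""
            if counter == timestamp_column or not value:
                continue
            if counter.endswith(r"\Memory\Available MBytes") and float(value) < AVAILABLE_MB_FLOOR:
                print(f"{row[timestamp_column]}: Available MBytes low ({value})")
            elif "PhysicalDisk" in counter and counter.endswith("% Idle Time") and float(value) < DISK_IDLE_FLOOR:
                print(f"{row[timestamp_column]}: disk under {DISK_IDLE_FLOOR}% idle ({value})")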

The Memory Manager works to meet current demand and prepares for future demand when possible.  You need to look at a performance monitor log to see if there is memory pressure on the system.  You'll see this in low Available Mbytes and reductions in process working sets.  You'll be able to correlate this to increased disk I/O to the page file.  If you have established that there is memory pressure on the box, you need to figure out where that demand is coming from.  Check for working set increases from processes, file cache or bottlenecked disk I/O to data drives.

  • Ntdebugging Blog

    Storport ETW Logging to Measure Requests Made to a Disk Unit

    • 10 Comments

    Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

    ·         KB 979764 (Windows Server 2008)

    ·         KB 978000 (Windows Server 2008 R2)

     

    Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

    Configuring and Starting STORPORT ETW Logging

    To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

    Here is a walkthrough to enable this feature via the GUI:

    Click Start, type “perfmon”, and press ENTER to start Performance Monitor.

    image

     

    Expand “Data Collector Sets” and select “Event Trace Sessions”.

    image

     

    Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 

    image

     

    The following dialog will appear:

    image

     

    Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.

    image

     

    After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.

    image

     

    When the dialog below opens, select “Keywords (Any)” then click Edit.

    image

     

Check the box for IOPerfNotification, and then click OK.  Note: For Windows Server 2008 SP1 and SP2, some properties may not be present (e.g. Reset, PortNotification and PauseResume).

    image

     

     

    Next, select “Filter” and click Edit…

    image

     

    This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.

     

    Select “Binary” and check the “Filter enabled” box

    image

In the “Filter data” field, enter the threshold value as a number of milliseconds. This must be entered in little-endian format. Please refer to the table below for the most common entries.

Decimal     Hexadecimal     Binary (little endian)
1 ms        1               01 00 00 00 00 00 00 00
5 ms        5               05 00 00 00 00 00 00 00
10 ms       A               0A 00 00 00 00 00 00 00
15 ms       F               0F 00 00 00 00 00 00 00

     

    Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms. 
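Rather than memorizing the table, the little-endian byte string for any threshold can be computed; here is a quick sketch (the filter data is simply the millisecond count expressed as an 8-byte little-endian value):

import struct

def storport_filter_bytes(milliseconds: int) -> str:
    """Return the 8-byte little-endian filter value, formatted the way it is typed into the dialog."""
    return " ".join(f"{b:02X}" for b in struct.pack("<Q", milliseconds))

print(storport_filter_bytes(1))    # 01 00 00 00 00 00 00 00
print(storport_filter_bytes(15))   # 0F 00 00 00 00 00 00 00
print(storport_filter_bytes(256))  # 00 01 00 00 00 00 00 00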

     

    WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.

    image 

Below is the correct way to enter the value (256 ms). Note: The filter value may have to be re-entered for the next run because it could get reset.

    image

    After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”

    image

     

    Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 

    image

     

    The target folder may be changed in the “Directory” tab

    image

     

    I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”

    image

     

    Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.

    image

    The name for the file is selected on the “File” tab.

    image

    When you are done setting the file and directory information click OK.

    Now right-click the log and select “Start” to begin tracing.

    image

    The “Status” should indicate that it is running.

    image

How long to run the log depends on your issue.  You should let the log run until the issue or slowdown occurs, then check the log for delays.

    Viewing the Log File

    First right-click and select “Stop” to flush any cached data and stop the logging.

    image

    Then use Event Viewer to examine the log.

    image

     

    Next choose "Open the saved log…"

    image

     

    image

    image

     

    After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.

    image

     

    Example: Event ID 13

    image

     

Let's examine the “Details” tab, which shows information about the I/O that exceeded or equaled the threshold we set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.

    image

    Breaking down the Event Data we have …

     

    Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

PortNumber:  This is the adapter port number (e.g. RaidPort1)

    PathID:  Path number if multi-pathing is in place

TargetID:  The target ID exposed to the Operating System

    LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

*RequestDuration:  How long the request took in milliseconds (the number of interest here for storage latency diagnostics)

Command:  The SCSI command sent to the unit (usually a read or write).  This is listed in decimal; convert the value to hexadecimal with Calc.exe when referencing SCSI commands (http://en.wikipedia.org/wiki/SCSI_command), or see the sketch after this list

    ScsiStatus:  the status that came back from the drive

SRBStatus:  The status as returned from the adapter (see srb.h and scsi.h in the Microsoft WDK, or http://en.wikipedia.org/wiki/SCSI_Status_Code)
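Since the Command field is logged in decimal, the Calc.exe conversion mentioned above can also be done with a few lines of code; here is a small sketch covering the two most common opcodes (anything else can be looked up in scsi.h in the WDK):

# Common SCSI operation codes; see scsi.h in the WDK for the full list
SCSI_OPS = {0x28: "SCSIOP_READ", 0x2A: "SCSIOP_WRITE"}

def describe_command(decimal_value: int) -> str:
    return f"{decimal_value} = 0x{decimal_value:02X} ({SCSI_OPS.get(decimal_value, 'see scsi.h')})"

print(describe_command(42))   # 42 = 0x2A (SCSIOP_WRITE)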

     

Example: Command 42 (decimal) equals 0x2A, which is a SCSI WRITE (10) command.

    image

    image

    Here’s the respective SCSI command:

    image

    When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

    Hope you enjoyed the walkthrough!

    -Bob

     

     

     

     

     

     

     

  • Ntdebugging Blog

    How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump

    • 6 Comments

    Prior to Windows Vista and Windows Server 2008 we had to keep a large page file on the system drive (typically drive C:) in order to properly capture system memory dumps for troubleshooting.  This presented problems as systems with very large amounts of RAM became more common, resulting in requirements for very large amounts of free space on the C: drive, or requiring that system visible memory be artificially limited for troubleshooting purposes.  This is no longer a requirement thanks to the Dedicated Dump File feature, which is available for use in Windows Vista and later operating systems.

     

    What is it?

    The dedicated dump file is basically a page file that is reserved for use only by the system crash dump routines.  It is not used for paging virtual memory.  Like a page file, the system process keeps an open handle to the dedicated dump file, which prevents it from being deleted.  When you manually initiate a memory dump, or the system crashes on its own, the data is written into the dedicated dump file instead of the page file on the system drive.  This feature is enabled by setting the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DedicatedDumpFile

    Type:  REG_SZ

    Value: A dedicated dump file together with a full path, such as D:\dedicateddumpfile.sys

    Where can it be stored?

    The dedicated dump file can be stored on any local volume that can support a page file.  This volume must be considered a fixed disk device and must not be hosting Windows clustered storage.  For example, this means you can store this file on a second partition of your server’s primary hard disk, or on a SAN volume that is mounted early in the boot process. 

    Fun fact: Unlike a page file, you can actually direct the dedicated dump file to a sub-folder, even one that is actually a volume mount point.  However, the sub-folder must pre-exist during boot, and there is generally little reason to actually do this.

     

    How much space will it use?

    You can automatically allow the system to determine the size of the dedicated dump file, or you can manually set the size by using the DumpFileSize registry value. 

     

    System determined size:

    If you choose to let the system determine the size, then regardless of the dump file type set via the CrashDumpEnabled value, the dedicated dump file will be roughly the size of system visible RAM plus 256 MB.  The extra 256 MB allows room for secondary dump data that may be written by driver bug check callback routines.  In order to write this data, the drivers would have to first call KeRegisterBugCheckReasonCallback() to register the routines that would execute when the system bug checks.   This is currently not very common due to the limitations imposed by the high IRQL at which bug check callbacks run.  Allowing the system to determine the size is a good idea, especially when disk space is not of huge concern.  Also, the dedicated dump file size will be adjusted automatically at boot if additional RAM is installed or removed later.  This includes artificial memory reduction implemented via the BCDEdit option, truncatememory.  This eliminates the need to manually resize your page file to support proper collection of a memory dump.

     

    Manual size specified via the registry:

    The following registry value will allow you to manually set the dedicated dump file size in megabytes:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFileSize

    Type:  REG_DWORD

    Value: The dump file size in megabytes

     

     

One good reason to set this value manually is a system with a very large amount of RAM from which only a kernel memory dump is desired.  In this case, a dedicated dump file the size of RAM+256 MB is not required.  Generally, even on 64-bit systems, kernel memory dumps are no larger than a few gigabytes, although they have the potential to be much larger.  In nearly all cases reserving one third the size of RAM will be more than sufficient to capture kernel memory.  If you have plenty of room on the drive and want to play it safe, you can simply allow the system to determine the size for you by setting the DumpFileSize value to zero, or by not creating the value at all.

Even if you set DumpFileSize to a value larger than needed, or the system chooses a size larger than the selected dump type requires, the resultant file after a bug check will only be as large as the data stored in it.  This is most commonly the size of physical RAM plus room for the dump header information.  If you choose to set the DumpFileSize value manually, then you should also set the IgnorePagefileSize value under the CrashControl key to 1.  This will help ensure a dump file is created.
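For reference, the same values can also be set from a script instead of regedit.  Below is a minimal sketch using Python's winreg module; it must be run elevated, and the D: path and 4096 MB size are placeholder values for illustration, not recommendations.  As with the .REG example later in this post, the dedicated dump file is created at the next boot.

import winreg

CRASH_CONTROL = r"SYSTEM\CurrentControlSet\Control\CrashControl"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL, 0, winreg.KEY_SET_VALUE) as key:
    # Dedicated dump file location (REG_SZ) and manual size in MB (REG_DWORD)
    winreg.SetValueEx(key, "DedicatedDumpFile", 0, winreg.REG_SZ, r"D:\DedicatedDump.sys")
    winreg.SetValueEx(key, "DumpFileSize", 0, winreg.REG_DWORD, 4096)
    # Recommended above when DumpFileSize is set manually
    winreg.SetValueEx(key, "IgnorePagefileSize", 0, winreg.REG_DWORD, 1)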

     

    What about the final dump file?

    After a bug check, the data will be written to the dedicated dump file on disk.  However, the final memory dump location will be based on the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFile

    Type:  REG_EXPAND_SZ

    Value: The target dump file name together with the full path, such as D:\memory.dmp

     

     

    The system will extract the dump data from the dedicated dump file and save the resultant file to the location and file name specified for the DumpFile value.  The default for this value is %SYSTEMROOT%\MEMORY.DMP.  Since we are using the dedicated dump file to overcome space limitations of the system drive, we will likely want to also set the DumpFile value to point to an alternate location than the default. 

    If you set this value to point to the same volume as the dedicated dump file, it will save WERFAULT.EXE the step of performing an additional copy of the temporary dump file after reboot.  Instead the file will simply be renamed on the target volume.   If there is not enough room for both the dedicated dump file and the memory dump after the bug check, then the dedicated dump file will not be recreated, and only the memory dump will remain.  If this happens and the system bug checks again, only the original dump will remain because there will be no dedicated dump to write the second bug check to until space is freed up on the drive and the system is rebooted.

     

    DedicatedDumpFile Usage Example:

The following example will set the dedicated dump file and the resultant dump to be stored on the root of drive D:

     

    Windows Registry Editor Version 5.00

     

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

    "DumpFile"=hex(2):44,00,3a,00,5c,00,4d,00,45,00,4d,00,4f,00,52,00,59,00,2e,00,\

      44,00,4d,00,50,00,00,00

    "DedicatedDumpFile"="D:\\DedicatedDump.sys"

     

     

    (Note: The lines above can be saved as a .REG file and imported into the registry)

     

    The screenshot below demonstrates a system configured to capture a complete 4 GB memory dump to a manually sized dedicated dump file, and will store the final MEMORY.DMP on the D: drive.

     image

    Important information:

    949052 - Kernel memory dump files may not be generated on Windows Server 2008-based and Windows Vista SP1 or later based computers when physical memory is larger than the size of the page file

     

    957517 - A dedicated complete memory dump file may not be successfully generated if the volume that stores the dedicated dump file has insufficient free space

    950858 - Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory

     

  • Ntdebugging Blog

    Getting Ready for Windows Debugging

    • 8 Comments

     

    Welcome to the Microsoft NTDebugging blog!  I’m Matthew Justice, an Escalation Engineer on Microsoft’s Platforms Critical Problem Resolution (CPR) team.  Our team will be blogging about troubleshooting Windows problems at a low level, often by using the Debugging Tools for Windows.  For more information about us and this blog, check out the about page.

     

    To get things started I want to provide you with a list of tools that we’ll be referencing in our upcoming blog posts, as well as links to some technical documents to help you get things configured.

     

    The big list of tools:

     

    The following tools are part of the “Debugging Tools for Windows” – you’ll definitely need these

    http://www.microsoft.com/whdc/devtools/debugging/

    ·         windbg

    ·         cdb

    ·         ntsd

    ·         tlist

    ·         gflags

    ·         adplus

    ·         UMDH

    ·         symcheck

     

    Sysinternals provides some great tools that we’ll be discussing

    http://www.sysinternals.com

    ·         Process Explorer

    ·         Process Monitor

    ·         Regmon

    ·         Filemon

    ·         DbgView

    ·         Handle.exe

    ·         Tcpview

    ·         LiveKD

    ·         AutoRuns

    ·         WinObj

     

    There are many tools contained in “MPS Reports” (MPSRPT_SETUPPerf.EXE), but I’m listing it here specifically for Checksym

    http://www.microsoft.com/downloads/details.aspx?FamilyID=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en

    ·         Checksym

     

    “Windows Server 2003 Resource Kit Tools” is another great set of tools.  In particular Kernrate is a part of that package

    http://www.microsoft.com/downloads/details.aspx?displaylang=en&familyid=9D467A69-57FF-4AE7-96EE-B18C4790CFFD

    ·         Kernrate

     

    Windows XP SP2 Support Tools

    http://www.microsoft.com/downloads/details.aspx?FamilyID=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en

    ·         netcap

    ·         poolmon

    ·         memsnap

    ·         tracefmt  (64-bit versions available in the DDK)

    ·         tracelog

    ·         tracepdb

    ·         depends

    ·         pstat

     

“Visual Studio” – in addition to the compilers and IDE, the following tools come in handy:

    ·         SPY++

    ·         dumpbin

     

    Perfwiz (Performance Monitor Wizard)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&DisplayLang=en

     

    DebugDiag

    http://www.iis.net/handlers/895/ItemPermaLink.ashx

     

    Userdump (User Mode Process Dumper)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF69-28AC08570B7E&displaylang=en

     

    Dheapmon (Desktop Heap Monitor)

    http://www.microsoft.com/downloads/details.aspx?familyid=5CFC9B74-97AA-4510-B4B9-B2DC98C8ED8B&displaylang=en

     

    Netmon 3.0

    §  Go to http://connect.microsoft.com/

    §  Sign in with your passport account

    §  Choose "Available Connections" on the left

    §  Choose "Apply for Network Monitor 3.0” (once you've finished with the application, the selection appears in your "My Participation" page)

§  Go to the Downloads page (on the left side), and select the appropriate 32-bit or 64-bit build.

     

     

     

    Some articles you may find useful:

     

    Debugging Tools and Symbols: Getting Started

    http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

     

    Boot Parameters to Enable Debugging

    http://msdn2.microsoft.com/en-us/library/ms791527.aspx

     

    How to Generate a Memory Dump File When a Server Stops Responding (Hangs)

    http://support.microsoft.com/kb/303021/

     

    After installing the “Debugging Tools for Windows”, you’ll find two documents at the root of the install folder that are helpful:

     

    ·         kernel_debugging_tutorial.doc - A guide to help you get started using the kernel debugger.

     

    ·         debugger.chm - The help file for the debuggers.  It details the commands you can use in the debugger.  Think of this as a reference manual, rather than a tutorial.





  • Ntdebugging Blog

    Interpreting Event 153 Errors

    • 16 Comments

Hello, my name is Bob Golding, and I would like to share with you a new event that you may see in the system event log.  Event ID 153 is an error associated with the storage subsystem.  This event was new in Windows 8 and Windows Server 2012, and was added to Windows 7 and Windows Server 2008 R2 starting with hotfix KB2819485.

     

An event 153 is similar to an event 129.  An event 129 is logged when the storport driver times out a request to the disk; I described event 129 messages in a previous article.  The difference between a 153 and a 129 is that a 129 is logged when storport times out a request, while a 153 is logged when the storport miniport driver times out a request.  The miniport driver may also be referred to as an adapter driver or HBA driver; this driver is typically written by the hardware vendor.

     

Because the miniport driver has better knowledge of the request execution environment, some miniport drivers time the requests themselves instead of letting storport handle request timing.  This is because the miniport driver can abort the individual request and return an error rather than storport resetting the drive after a timeout.  Resetting the drive is disruptive to the I/O subsystem and may not be necessary if only one request has timed out.  The error returned from the miniport driver is bubbled up to the class driver, which can log an event 153 and retry the request.

     

    Below is an example event 153:

     

    Event 153 Example

     

    This error means that a request failed and was retried by the class driver.  In the past no message would be logged in this situation because storport did not timeout the request.  The lack of messages resulted in confusion when troubleshooting disk errors because timeouts would occur but there would be no evidence of the error.

     

The details section of the event log record will show which error caused the retry and whether the request was a read or a write.  Below is the details output:

     

    Event 153 Details

     

    In the example above at byte offset 29 is the SCSI status, at offset 30 is the SRB status that caused the retry, and at offset 31 is the SCSI command that is being retried.  In this case the SCSI status was 00 (SCSISTAT_GOOD), the SRB status was 09 (SRB_STATUS_TIMEOUT), and the command was 28 (SCSIOP_READ). 

     

    The most common SCSI commands are:

    SCSIOP_READ - 0x28

    SCSIOP_WRITE - 0x2A

     

    The most common SRB statuses are below:

    SRB_STATUS_TIMEOUT - 0x09

    SRB_STATUS_BUS_RESET - 0x0E

    SRB_STATUS_COMMAND_TIMEOUT - 0x0B

     

    A complete list of SCSI operations and statuses can be found in scsi.h in the WDK.  A list of SRB statuses can be found in srb.h.
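To save counting hex bytes by eye, those three fields can be pulled straight out of the event's binary data.  Here is a minimal sketch, assuming you have copied the raw bytes from the event's details view into a Python bytes object (the sample data below is placeholder padding with the values from the example above at offsets 29-31):

SRB_STATUS = {0x09: "SRB_STATUS_TIMEOUT", 0x0B: "SRB_STATUS_COMMAND_TIMEOUT", 0x0E: "SRB_STATUS_BUS_RESET"}
SCSI_OPS = {0x28: "SCSIOP_READ", 0x2A: "SCSIOP_WRITE"}

def decode_event_153(data: bytes) -> str:
    scsi_status, srb_status, command = data[29], data[30], data[31]
    return (f"SCSI status 0x{scsi_status:02X}, "
            f"SRB status 0x{srb_status:02X} ({SRB_STATUS.get(srb_status, 'see srb.h')}), "
            f"command 0x{command:02X} ({SCSI_OPS.get(command, 'see scsi.h')})")

# 29 bytes of placeholder padding, then the values from the example: 00 (GOOD), 09 (TIMEOUT), 28 (READ)
sample = bytes(29) + bytes([0x00, 0x09, 0x28])
print(decode_event_153(sample))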

     

    The timeout errors (SRB_STATUS_TIMEOUT and SRB_STATUS_COMMAND_TIMEOUT) indicate a request timed out in the adapter. In other words a request was sent to the drive and there was no response within the timeout period.  The bus reset error (SRB_STATUS_BUS_RESET) indicates that the device was reset and that the request is being retried due to the reset since all outstanding requests are aborted when a drive receives a reset.

     

    A system administrator who encounters event 153 errors should investigate the health of the computer’s disk subsystem.  Although an occasional timeout may be part of the normal operation of a system, the frequent need to retry requests indicates a performance issue with the storage that should be corrected.

  • Ntdebugging Blog

    Easily Resolving an Event Viewer Error using a Process Memory Dump

    • 11 Comments

My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today I will be blogging about how I used the SOS .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception.  This exception was preventing Event Viewer from displaying properly.  Event Viewer was returning an error that provided very little information about what was actually causing the issue.  I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong.  I did not have to perform a live debug on the issue.  Instead, I used a process dump to obtain very exact information, returned by the debugger, relating to the root cause.  I was then able to use Process Monitor to identify the file that needed to be examined.  These actions led me to the source of the problem, which was easily corrected.  The issue discussed in this blog was also easily reproduced on Windows Server 2008, which means you should be able to practice this debug exercise on your non-production Windows Server 2008 SP1 machine for learning purposes if you are interested in doing so.

     

    Issue Reported:  The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:

     

                           "MMC could not create the snap-in."

    MMC could not create the snap-in.

    The snap-in might not have been installed correctly

    Name: Event Viewer

                   CLSID: FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}

    clip_image001

First Step - Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues and found that we have seen this error before.  It may occur when the following registry key gets deleted or corrupted:

     

HKLM\software\Microsoft\MMC\SnapIns\FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}

     

I had the customer export this key and found that it was not corrupted in any way.  I verified that all data was as expected.

     

Next, a memory dump of the mmc.exe process was collected.  The mmc.exe process is used to host the eventvwr.msc snap-in.  This was easily obtained using the built-in Windows Server 2008 "Windows Task Manager" feature: "Create Dump File".  If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you identify the correct instance.

    clip_image002clip_image003

    clip_image004

     

    Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL:  http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx

     

     

     Now let's have a look at the debug.

     

    1.          First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.

    clip_image005

    It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.

    clip_image002[5]

    clip_image002[9]

    2.            I noticed the following output from the debugger after it loaded the dump file:

     

    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)

     

     

    3.            Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:

     

    0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL.  Here’s the syntax: 

     

    0:011> .symfix c:\websymbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    4.            To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.

     

    0:005> !peb

    PEB at 000007fffffdb000

    ....

       CommandLine:  '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '

     

    You could alternatively dump the  CommandLine from the nt!_PEB using the dt command

     

    0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000

    ntdll!_PEB

       +0x020 ProcessParameters              :

          +0x070 CommandLine                    : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "

     

5.            Next, I dumped out all of the threads in this process and found that the following thread contained a stack that was raising a .Net Framework exception:

     

    0:011> ~* kL

... (omitted the non-relevant threads)

     

    # 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

    ChildEBP RetAddr 

    0691f03c 7343a91c kernel32!RaiseException+0x58

    0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8

    *** WARNING: Unable to verify checksum for MMCEx.ni.dll

    0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf

    *** WARNING: Unable to verify checksum for mscorlib.ni.dll

    0691f1e8 69926cf6 MMCEx_ni+0xd0b5a

    0691f1f4 6993019f mscorlib_ni+0x216cf6

    0691f208 69926c74 mscorlib_ni+0x22019f

    0691f220 733d1b4c mscorlib_ni+0x216c74

    0691f230 733e21b1 mscorwks!CallDescrWorker+0x33

    0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3

    0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c

    0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f

    0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a

    0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192

    0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a

    0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3

    0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a

    0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e

     

    6.            Out of curiosity, I also ran the Get Last Error command

     

    0:011> !gle

    LastErrorValue: (Win32) 0 (0) - The operation completed successfully.

    LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0

     

7.            After this, I ran !analyze -v to see what helpful information the debugger would provide.  The debugger did output exception information, but informed me that I needed to use the x86 debugger instead.

     

    0:011> !analyze -v

    ...

                            EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)

                            ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)

                              ExceptionCode: e0434f4d (CLR exception)

                              ExceptionFlags: 00000001

                            NumberParameters: 1

                            Parameter[0]: 80131604

                            MANAGED_BITNESS_MISMATCH:

                                                    Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.

                            FAULTING_THREAD:  00000a2c

                                                    PRIMARY_PROBLEM_CLASS:  CLR_EXCEPTION

     

    8.            I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension.  This extension ships in the Operating System along with the .Net Framework.  On most occasions, I would have initiated the loading of the extension through the use of the following syntax:

     

     0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll

     

    OR

     

    0:011> .load  c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll

     

    However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load.  This is due to the fact that the version of SOS must match the version of the CLR loaded into that process.  Here’s the recommended syntax: 

     

                            0:011>.loadby sos mscorwks

     

    I always verify that my extensions are loaded properly by using the  .chain command.

     

    0:011> .chain

    ...   Extension DLL chain:

                            C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007

     

    9.            Running !help printed the following information about the SOS extension, since sos.dll was at the top of the .chain output:

     

    0:011> !help

    -------------------------------------------------------------------------------

    SOS is a debugger extension DLL designed to aid in the debugging of managed

    programs. Functions are listed by category, then roughly in order of

    importance. Shortcut names for popular functions are listed in parenthesis.

    Type "!help <functionname>" for detailed info on that function.

     

    Object Inspection                  Examining code and stacks

    -----------------------------      -----------------------------

    DumpObj (do)                       Threads

    DumpArray (da)                     CLRStack

    DumpStackObjects (dso)             IP2MD

    DumpHeap                           U

    DumpVC                             DumpStack

    GCRoot                             EEStack

    ObjSize                            GCInfo

    FinalizeQueue                      EHInfo

    PrintException (pe)                COMState

    TraverseHeap                       BPMD

     

    10.            Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:

                   

    0:011> !pe 771a42eb

    Invalid object

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !pe -nested 771a42eb

    Invalid object

     

    Nested exception -------------------------------------------------------------

    Exception object: 040a676c

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more

    StackTrace (generated):

        SP       IP       Function

     

    StackTraceString: <none>

    HResult: 80131604

     

    0:011> !PrintException 040a6a20

    Exception object: 040a6a20

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131604

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a6cf8

    Exception object: 040a6cf8

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Configuration system failed to initialize

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a7174

    Exception object: 040a7174

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Unrecognized configuration section system.web/myInvalidData  

    InnerException: <none>

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    11.            Based on the exception information above, it appeared that the .NET Framework configuration section system.web contained an invalid section named myInvalidData. I then re-ran !analyze -v against the dump (now under the x86 debugger) and found that it loads the sos.dll extension and runs !pe automatically, displaying the exception record information as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.

     

                0:011> !analyze -v

                    ...

                            EXCEPTION_MESSAGE:  Unrecognized configuration section system.web/myInvalidData.

                                                    MANAGED_OBJECT_NAME:  System.Configuration.ConfigurationErrorsException

                                                    FAULTING_THREAD:  00000a2c

     

    0:011> ~

       0  Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen

       1  Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen

       2  Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen

       3  Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen

       4  Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen

       5  Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen

       6  Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen

       7  Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen

       8  Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen

       9  Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen

      10  Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen

    . 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

     

    12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .NET support team, I started with a quick search on www.live.com for:

     

                "unrecognized configuration section" system.web site:microsoft.com

                 

    This returned the following results: http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE

     

    Quickly reviewing some of the hits, I found that others had encountered this exception in their own applications due to invalid entries in the various .config files used by .NET. Looking through the posts, different configuration file names and paths were mentioned.

     

    So, I opened the Process Monitor log file to see which configuration files we were reading data from. I added filter criteria to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, paths containing .config, and a SUCCESS status result. It's best to be as specific as possible.

     

    [Screenshot: Process Monitor filter settings]

    I found that we were repeatedly reading in a large number of settings from the .NET Framework global configuration file:

                                    c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

                    (on x64 this would be     C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)

    [Screenshot: Process Monitor output showing repeated reads from machine.config]

    Final Step - Putting it all together, reproducing the issue, and confirming the resolution: Using Notepad, a quick search of the suspect XML file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid entry that was related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting, <myInvalidData/>, to my own global configuration file and successfully reproduced the issue on my system. I then contacted the customer and asked whether they had added any settings under <system.web> in the configuration file c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.

     

    The customer informed me that, at the request of their ASP.NET developer, they had in fact added settings to that section of the file. By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we determined that the settings present in this file did not belong inside <system.web>. The developer moved the settings to the proper location and the issue was resolved.

     

    Here are the steps I used to reproduce the issue, in case you want to replicate this at home:

     

    A. Using Notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:

         (please make a backup copy first in case you make a mistake)  

       c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

         OR (open the version that matches the architecture of your platform)

       C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config

     

    [Screenshot: machine.config opened in Notepad]

    B. Find the section <system.web> in this file (you can use the find function in notepad):

    [Screenshot: the <system.web> section in machine.config]

    C.  Add the following line directly after <system.web> as shown in the example below:

                         <myInvalidData/>   

    [Screenshot: <myInvalidData/> added directly after <system.web>]
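    For reference, here is a minimal sketch of roughly what the edited section looks like. The existing contents of machine.config are abbreviated here and vary between versions; the only line being added for the repro is <myInvalidData/>.

    <configuration>
      <!-- ... existing sections ... -->
      <system.web>
        <myInvalidData/>  <!-- invalid element added only to reproduce the error -->
        <!-- ... existing system.web settings ... -->
      </system.web>
      <!-- ... existing sections ... -->
    </configuration>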

    D. Save the file and then open eventvwr.msc and verify that the following error is displayed:

    [Screenshot: error event displayed in Event Viewer]

    Conclusion

    Hopefully this post has demonstrated how you can use the "create dump file" feature of Windows Server 2008, WinDbg, and other related tools to gather more specific data when an error message does not reveal the source of the problem. Feel free to post any questions, comments, or concerns.

     

  • Ntdebugging Blog

    The Mystery of Lsass.exe Memory Consumption, (When all components get involved)

    • 4 Comments

    Hi All, this is Karim Elsaid and I’m a Support Escalation Engineer working with the Dubai platforms support team.  Recently I was working on a very challenging and interesting case, and I wanted to share that experience with you.

     

    One of our customers was experiencing a problem on all of their domain controllers (running x86 Windows Server 2003), where the DCs intermittently ran out of resources.  When a server ran out of resources it had to be restarted to restore functionality.  The following event was logged on the DC when the problem occurred:

     

    Event Type: Error
    Event Source: NTDS General
    Event Category: Internal Processing
    Event ID: 1169
    Date:
    Time:
    User: N/A
    Computer:
    Description:
    Active Directory could not allocate the needed amount of memory.

    Memory (bytes):
    4568

    Active Directory will continue to operate, but may not function correctly.

    User Action
    Restart this domain controller. If this condition continues, increase the available
    physical or virtual memory.

    Additional Data
    Internal ID:
    30205bd
    For more information, see Help and Support Center at
    http://go.microsoft.com/fwlink/events.asp.

     

    This event gets logged when the Local Security Authority Subsystem (lsass.exe) process is unable to allocate memory from the heap.

     

    When this problem occurs, users are not able to authenticate and Exchange servers cease functioning until the DC gets rebooted.

     

    By the time we got the call, the DC that was exhibiting the issue had been rebooted, so when we checked it the lsass.exe memory usage was very low. On another DC in the same site, however, lsass.exe memory usage was around 1.7 GB.  That is pretty high, but it is worth mentioning that by default lsass.exe may use a large amount of virtual memory for the ESENT cache (the Active Directory database); this normally does not cause problems on DCs because the cache backs off when memory is needed elsewhere.

     

    The issue here was that the memory usage for lsass.exe kept growing and was never released back; in other words, a process memory leak.

     

    To confirm our theory we obtained a Performance Monitor log from the DC and checked how lsass.exe was doing with memory; the data confirmed that lsass.exe was indeed leaking.

    [Performance Monitor graph showing lsass.exe Private Bytes steadily increasing]

     

    The graph above shows a steady increase in lsass.exe private bytes.  The DC was booted without the /3GB switch, so the user-mode address space of the process is limited to 2 GB.

     

    For more information on this topic, please check the previous post "The Memory Shell Game".

     

    Now we needed to explore the options for tracking down such a leak. We could obtain an lsass.exe process dump and check the heap, or obtain a full memory dump of the server in the failed state to check the lsass.exe heap usage and overall system state, but neither of these would necessarily reveal what was going on.  We would have to be very lucky to capture the dump while the leaking allocation was actually executing inside the lsass.exe process, and the chances of that are slim.

     

    What we were really looking for was a pattern: the allocation pattern causing the leak inside the lsass.exe process.

     

    For that we turned to a tool that many customers are not aware of, but which is very handy and makes troubleshooting this kind of memory leak easy: the "User-Mode Dump Heap", or UMDH. UMDH monitors all of the heap allocations made by the lsass.exe process.  All we needed to do was download the "Debugging Tools for Windows", enable user-mode stack trace acquisition from the command line, point the tools at the Microsoft public symbol server, reboot the server, and we were good to go.

     

    You can find information on UMDH, including how to set it up and use it, "Here". A sketch of the typical setup commands is shown below.
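    As a rough sketch, the setup typically looks something like the following from an elevated command prompt on the DC (the process ID, paths, and file names below are placeholders, and the exact UMDH options can vary between versions of the Debugging Tools, so check the documentation linked above):

    rem Enable the user-mode stack trace database for lsass.exe (takes effect at the next reboot)
    gflags /i lsass.exe +ust

    rem Point the tools at the Microsoft public symbol server so the stacks resolve
    set _NT_SYMBOL_PATH=srv*c:\symbols*http://msdl.microsoft.com/download/symbols

    rem Take a baseline snapshot of the lsass.exe heap allocations (replace <PID> with the lsass.exe process ID)
    umdh -p:<PID> -f:c:\umdh\lsass_day1.txt

    rem On following days, take new snapshots and compare them against the baseline
    umdh -p:<PID> -f:c:\umdh\lsass_day2.txt
    umdh c:\umdh\lsass_day1.txt c:\umdh\lsass_day2.txt > c:\umdh\lsass_diff.txt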

     

    So now every heap block that the lsass.exe process allocates is tracked by UMDH, and the tool reports information about each allocating call stack: for example, the number of allocations made through that stack, the number of bytes consumed, and so on.

     

    Back to our main problem: we set up the DC for UMDH and started monitoring the lsass.exe process.  After the reboot we verified that UMDH was working by generating a sample log; it is very important to generate this initial log to make sure that symbols resolve correctly.

     

    We kept generating UMDH logs daily and comparing them, and then an interesting pattern started to appear:

     

    + 123097760 ( 123098480 -    720) 1538731 allocs      BackTrace4BA0

    + 1538722 ( 1538731 -      9) BackTrace4BA0     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          msvcrt!malloc+0000006C

          NTDSA!ABBind_local+0000031C

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From backtrace 4BA0 we can see that roughly 1.5 million heap allocations (1,538,722 more than in the baseline log, accounting for about 123 million bytes) occurred without being freed since the first day we started monitoring lsass.exe with UMDH. In this comparison output, the first line shows the growth in bytes attributed to the backtrace (new total minus old total) and the second line shows the growth in the allocation count.

     

    There is also another call stack with exactly the same symptoms:

     

    + 51525199 ( 51525506 -    307) 1538802 allocs  BackTrace4B92

    + 1538793 ( 1538802 -      9) BackTrace4B92     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          LSASRV!LsapAllocatePrivateHeapNoZero+0000004F

          RPCRT4!operator new+0000000D

          RPCRT4!W2AAttachHelper+00000032

          RPCRT4!CNlDelUnicodeAnsi::Convert+00000014

          NTDSA!ABBind_local+0000005D

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From both call stacks above we can see that the originating function that leads to the heap allocation is NTDSA!NspiBind.

     

    NSPI is the interface that the Exchange address book provider uses to talk to the global catalog servers. Typically, when Outlook clients want to query the directory for address book information they talk to the Exchange server directly; the Exchange server looks up an appropriate GC, forwards the request to it, and sends the result back to the originating client.  This is called the DSProxy method.  Outlook clients can also send a referral request (RFR) to the Exchange server to obtain a referral to a GC and then start talking to that GC directly.  You can find more information about this operation "Here".

     

     

    With this new information in hand, we turned our attention to the Exchange server for a while. We obtained a Netmon trace from the Exchange server, and examining it revealed a massive number of RFR requests originating from the Outlook client machines toward the Exchange server:

    3795  11:50:59.1426170  10.3876170        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:535, TCP:488, IPv4:118}

    3796  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:502, TCP:496, IPv4:326}

    3797  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:509, TCP:508, IPv4:326}

    3798  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:494, TCP:484, IPv4:326}

    9807  11:51:13.1894920  24.4344920        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1144, TCP:1142, IPv4:326}

    10084 11:51:13.3769920  24.6219920        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:1145, TCP:1143, IPv4:118}

    10085 11:51:13.3769920  24.6219920        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:1145, TCP:1143, IPv4:118}

    10086 11:51:13.3926170  24.6376170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:1144, TCP:1142, IPv4:326}

    10103 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1182, TCP:1180, IPv4:326}

    10108 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1151, TCP:1149, IPv4:326}

    10111 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1200, TCP:1196, IPv4:326}

    10115 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1164, TCP:1163, IPv4:326}

    10148 11:51:13.4863670  24.7313670        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10149 11:51:13.4863670  24.7313670        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10172 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1199, TCP:1197, IPv4:326}

    10174 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1187, TCP:1186, IPv4:326}

    10180 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1170, TCP:1168, IPv4:326}

    10183 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1153, TCP:1152, IPv4:326}

    10185 11:51:13.4863670  24.7313670        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10186 11:51:13.4863670  24.7313670        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    10224 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1209, TCP:1208, IPv4:326}

    10227 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1192, TCP:1190, IPv4:326}

    10232 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1159, TCP:1157, IPv4:326}

    10235 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1175, TCP:1173, IPv4:326}

    10240 11:51:13.5176170  24.7626170        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10241 11:51:13.5176170  24.7626170        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10251 11:51:13.5332420  24.7782420        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10255 11:51:13.5332420  24.7782420        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    Snipp…..

     

    In the Netmon capture we saw an excessive number of NspiBind requests being sent from the Outlook client machines to the Exchange server, which the Exchange server in turn forwarded to the GC. We did not find any frames for the NspiUnbind requests that should be used to tear down those connections.

     

    When we checked the Exchange server security log we found a massive number of logon events occurring for every single client:

    Event Type: Success Audit

    Event Source:     Security

    Event Category:   (2)

    Event ID:   540

    Date:       5/26/2010

    Time:       3:49:09 PM

    User:      

    Computer:   EXCHSERVER

    Description:

    Successful Network Logon:

          User Name:  User_Name

          Domain:           CONTOSO.COM

          Logon ID:         (0x0,0x3DA1F42)

          Logon Type: 3

          Logon Process:    Kerberos

          Authentication Package: Kerberos

          Workstation Name:

          Logon GUID: {b03f9ad2-3150-e1be-f50a-b3abb9d2c09d}

          Caller User Name: -

          Caller Domain:    -

          Caller Logon ID:  -

          Caller Process ID: -

          Transited Services: -

          Source Network Address: 10.11.12.13

          Source Port:      0

     

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

     

    So now we knew that the lsass.exe memory consumption on the DCs was caused by a massive number of NspiBind requests being sent to the DCs without corresponding NspiUnbind requests.

     

    We then turned our attention to the client machines that had Outlook installed and listed all of the third-party add-ins installed on them.

     

    We found that all of the affected client machines shared the same third-party add-in for e-mail content archival. Further investigation of this software revealed a strange MAPI wrapper DLL, and our customer confirmed that the problem started at approximately the same time they rolled out the content archival software to the client machines.

     

    We uninstalled this content archival add-in from a couple of machines and these machines stopped hammering the Exchange Server and GC with NspiBind requests.

     

    Later we uninstalled it from the rest of the client machines; the lsass.exe process on the DCs finally stabilized and the memory usage issues did not return.

     

    I'd like to mention that starting with Windows Server 2008 we implemented an "NSPI max sessions per user" setting, which lets you limit the number of NSPI sessions a single client can open against the server.

     

    You can read more about this new functionality "Here". A hedged sketch of how the limit is typically configured is shown below.
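    As a sketch only (the registry value name and default shown here are based on my recollection of the public documentation for this setting and should be verified before changing anything on a production DC), the limit is exposed as a registry value on the domain controller, along these lines:

    rem Assumed example - limit each user to 50 concurrent NSPI sessions on this DC
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "NSPI max sessions per user" /t REG_DWORD /d 50 /f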

     

    That's it! I hope you enjoyed reading, and hopefully this will help you troubleshoot any memory leaks you encounter.  The same approach can be applied to other processes when you suspect a heap leak.

     

    See you again on another adventure.

     

    Best Regards,

    Karim Elsaid

  • Ntdebugging Blog

    Using Xperf to investigate slow I/O issues

    • 2 Comments

    Debuggers,

    Here's another short video of the type of triage possible with Xperf; this time Bob shows us how to track down a slow I/O issue. Since the I/O manager is instrumented at IoCallDriver and IoCompleteRequest, we can often use this information to diagnose a slow I/O or SAN issue. The steps to enable the tracing are in the video description and repeated below.

    Thanks,

    Tate

                                                                                                                                                                                                                                                                                            

    Video Description: (http://www.youtube.com/watch?v=0h-i_EfzadA)

    Here's how you can use the Windows Performance Toolkit (aka Xperf) to investigate slow I/O issues on Windows Vista, Windows Server 2008, and later versions. Put the following into a batch file (modifying -f so the .etl file is written somewhere other than the slow storage) and it will start a circular trace that can be stopped when the slow I/O occurs.

    @echo off

    xperf -on PROC_THREAD+LOADER+CSWITCH+FILENAME+FILE_IO+FILE_IO_INIT+DRIVERS -f kernel.etl -stackwalk CSwitch+DiskReadInit+DiskWriteInit+DiskFlushInit+FileCreate+FileCleanup+FileClose+FileRead+FileWrite -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

    echo Reproduce the problem now, hit any key to finish, collect Merged Etl file

    pause

    echo merging etl...

    xperf -d XperfSlowIOcir.etl

    @echo on
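    Once the merged file has been written, you can open it for analysis; running xperf with the trace file name (the merged file name below comes from the batch file above) typically launches the Performance Analyzer viewer:

    xperf XperfSlowIOcir.etl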

     
