• Ntdebugging Blog

    Understanding Storage Timeouts and Event 129 Errors

    • 8 Comments

    Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out; the HBA driver’s name is used in the error because it is the miniport associated with storport.

     

    Below is an example 129 event:

     

    Event Type:       Warning

    Event Source:     <HBA_Name>

    Event Category:   None

    Event ID:         129

    Date:             4/9/2009

    Time:             1:15:45 AM

    User:             N/A

    Computer:         <Computer_Name>

    Description:

    Reset to device, \Device\RaidPort1, was issued.

     

    So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

     

    Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume.  The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver it will create a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver embeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).

     

    The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

    • Providing timing services for requests.
    • Enforcing queue depth (making sure that a device does not get more requests than it can handle).
    • Building scatter gather arrays for data buffers.

    The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (Fibre Channel, Ethernet, etc.) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.
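
    To make the hand-off between the port driver and the miniport concrete, below is a minimal, hypothetical sketch of the two halves of that exchange.  The routine names MyHwStartIo and MyCompletionPath are invented for illustration; a real miniport has considerably more bookkeeping, but the shape of the StorPortNotification() call is the same.

    #include <storport.h>

    // Hypothetical miniport sketch: STORPORT calls our HwStorStartIo entry point
    // with an SRB; we hand it to the adapter and return.  Later, when the hardware
    // signals completion, we give the SRB back to STORPORT with
    // StorPortNotification(RequestComplete, ...), which stops the timing for it.

    BOOLEAN MyHwStartIo(PVOID DeviceExtension, PSCSI_REQUEST_BLOCK Srb)
    {
        // Program the adapter with the CDB in Srb->Cdb (details omitted).
        // From STORPORT's point of view the request is now pending and is
        // being timed against Srb->TimeOutValue.
        return TRUE;    // request accepted by the miniport
    }

    VOID MyCompletionPath(PVOID DeviceExtension, PSCSI_REQUEST_BLOCK Srb)
    {
        Srb->SrbStatus = SRB_STATUS_SUCCESS;    // status reported by the device

        // Tell STORPORT the request is done; it removes the SRB from its pending
        // queue and completes the original IRP back up the device stack.
        StorPortNotification(RequestComplete, DeviceExtension, Srb);
    }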

     

    When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

     

    The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware; we do not advise changing this value without guidance from your storage vendor.
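
    If you want to check whether a vendor has customized this timeout on a given machine, the value can be read like any other registry DWORD.  Below is a quick user-mode sketch (my own illustration, not part of the original post); if the value is absent, the disk class driver simply uses its built-in default.

    #include <windows.h>
    #include <stdio.h>

    // Reads HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue, if present.
    int main(void)
    {
        DWORD timeout = 0, cb = sizeof(timeout);
        LSTATUS rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                                  L"SYSTEM\\CurrentControlSet\\Services\\Disk",
                                  L"TimeOutValue",
                                  RRF_RT_REG_DWORD, NULL, &timeout, &cb);
        if (rc == ERROR_SUCCESS)
            printf("TimeOutValue = %lu seconds\n", timeout);
        else
            printf("TimeOutValue is not set (status %ld)\n", rc);
        return 0;
    }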

     

    The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.

    image002

    Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.
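
    The timing behavior described above can be summarized with a simplified sketch.  This is illustrative pseudocode in C, not actual STORPORT source; the types and routine names are invented.

    #define TIMER_IDLE (-1)

    typedef struct _PENDING_SRB {
        int TimeOutValue;                  // seconds, taken from the SRB
        struct _PENDING_SRB *Next;
    } PENDING_SRB;

    typedef struct _LOGICAL_UNIT {
        int Timer;                         // -1 (TIMER_IDLE) when nothing is outstanding
        PENDING_SRB *PendingQueueHead;     // requests currently at the miniport
    } LOGICAL_UNIT;

    static void OnRequestSent(LOGICAL_UNIT *lun, PENDING_SRB *srb)
    {
        if (lun->Timer == TIMER_IDLE)
            lun->Timer = srb->TimeOutValue;    // first pending request arms the timer
    }

    static void OnRequestCompleted(LOGICAL_UNIT *lun)
    {
        if (lun->PendingQueueHead == NULL)
            lun->Timer = TIMER_IDLE;           // queue drained, back to the initial value
        else
            lun->Timer = lun->PendingQueueHead->TimeOutValue;   // refresh from the head request
    }

    static void OnOneSecondTick(LOGICAL_UNIT *lun)
    {
        if (lun->Timer == TIMER_IDLE)
            return;
        if (--lun->Timer == 0) {
            // The device has stopped responding: log Event ID 129, reset the unit,
            // and complete the outstanding requests with an error so they are retried.
        }
    }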

     

    The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

     

    I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

  • Ntdebugging Blog

    Troubleshooting Pool Leaks Part 2 – Poolmon

    • 8 Comments

    In our previous article we discussed how to identify a pool leak using perfmon.  Although it may be interesting to know that you have a pool leak, most customers are interested in identifying the cause of the leak so that it can be corrected.  In this article we will begin the process of identifying what kernel mode driver is leaking pool, and possibly identify why.

     

    Often when we are collecting data for a poor performance scenario there are two pieces of data that we collect.  Perfmon log data is one, as we discussed in our previous article.  The other piece of data is poolmon logs.  The memory manager tracks pool usage according to the tag associated with the pool allocations, using a technique called pool tagging.  Poolmon gathers this data and displays it in an easy-to-use format.  Poolmon can also be configured to dump data to a log, and in some scenarios it is beneficial to schedule poolmon to periodically collect such logs.  There are several available techniques to schedule poolmon; however, that is beyond the scope of this article.

     

    Poolmon has shipped with many different packages over the years; it is currently available with the Windows Driver Kit.  If you install the WDK to the default folder, poolmon will be in “C:\Program Files (x86)\Windows Kits\8.0\Tools\x64”.  Poolmon does not have dependencies on other modules in this folder; you can copy it to your other computers when you need to investigate pool usage.

     

    How does pool tagging work?  When a driver allocates pool it calls the ExAllocatePoolWithTag API.  This API accepts a tag - a four-character string - that will be used to label the allocation.  It is up to a driver developer to choose this tag.  Ideally each developer will choose a tag that is unique to their driver and use a different tag for each code path which calls ExAllocatePoolWithTag.  Because each tag should be unique to each driver, if we can identify the tag whose usage corresponds with the leak we can then begin to identify the driver which is leaking the memory.  The tag may also give the driver developer clues as to why the memory is being leaked, if they use a unique tag for each code path.
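
    For reference, this is roughly what tagged allocation looks like from the driver side.  The tag values and routine names below are examples made up for illustration; note that the four characters are written reversed in source because the 32-bit constant is stored little-endian, so tools such as poolmon display them in the order you expect.

    #include <ntddk.h>

    #define MYDRIVER_TAG_BUFFER   '1bcD'   // displays as "Dcb1" in poolmon
    #define MYDRIVER_TAG_CONTEXT  '2bcD'   // displays as "Dcb2" in poolmon

    NTSTATUS AllocateWorkBuffer(SIZE_T Length, PVOID *Buffer)
    {
        // Each code path uses its own tag, so a leak in this path shows up
        // under its own name in poolmon.
        *Buffer = ExAllocatePoolWithTag(NonPagedPool, Length, MYDRIVER_TAG_BUFFER);
        if (*Buffer == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;
        return STATUS_SUCCESS;
    }

    VOID FreeWorkBuffer(PVOID Buffer)
    {
        // Freeing with the same tag lets Driver Verifier catch tag mismatches.
        ExFreePoolWithTag(Buffer, MYDRIVER_TAG_BUFFER);
    }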

     

    To view the pool usage associated with each tag run “poolmon -b” from a command prompt.  This will sort by the number of bytes associated with each tag.  If you are tracking pool usage over a period of time, you can log the data to a file with “poolmon -b -n poolmonlog1.txt”, replacing 1 with increasing numbers to obtain a series of logs.  Once you have a series of logs you may be able to view usage increasing for a specific tag, in a corresponding fashion to what you see in perfmon.

     

    When analyzing poolmon the important data is at the top.  Typically the tag with the largest usage in bytes is the cause of the leak.

     

    Poolmon -b

     

    In the above data we can see that the tag with the most pool usage is “Leak”.  Now that we know what tag is leaking we need to identify what driver is using this tag.  Techniques for associating a tag with a driver vary, but findstr is often effective.  Most drivers are located in c:\windows\system32\drivers, so that is a good starting point when looking for the driver.  If you don’t find a result in that folder, go up a folder and try again, repeating until you get to the root of the drive.

     

    C:\>findstr /s Leak *.sys

    Users\Administrator\Desktop\myfault.sys:

    AìIALeak;┴☼BAì

    ·∟   §£ Θ─  A╗☻   E"Θ╡☺ Hï♣╔♂  $Θª  Hï:Hc ê\@ë

      └δ_Aï ╞♣@  ë♣▓  Aï@ë¼←  δCAâ∙uAïAìIALeak;┴☼B3ï §

     

    In the above output we can see that “Leak” is used in myfault.sys.  If we hadn’t forced this leak with notmyfault, the next step in troubleshooting would be an internet search for the tag and the driver.  Often such a search will allow you to identify a specific fault within the driver and a solution.

     

    Don’t panic if findstr doesn’t find your tag, or if you find the tag but it is not unique to one driver.  In future articles we will cover additional techniques for associating drivers with tags, and for associating allocations with specific code within a driver.

  • Ntdebugging Blog

    Getting Ready for Windows Debugging

    • 8 Comments

     

    Welcome to the Microsoft NTDebugging blog!  I’m Matthew Justice, an Escalation Engineer on Microsoft’s Platforms Critical Problem Resolution (CPR) team.  Our team will be blogging about troubleshooting Windows problems at a low level, often by using the Debugging Tools for Windows.  For more information about us and this blog, check out the about page.

     

    To get things started I want to provide you with a list of tools that we’ll be referencing in our upcoming blog posts, as well as links to some technical documents to help you get things configured.

     

    The big list of tools:

     

    The following tools are part of the “Debugging Tools for Windows” – you’ll definitely need these

    http://www.microsoft.com/whdc/devtools/debugging/

    ·         windbg

    ·         cdb

    ·         ntsd

    ·         tlist

    ·         gflags

    ·         adplus

    ·         UMDH

    ·         symchk

     

    Sysinternals provides some great tools that we’ll be discussing

    http://www.sysinternals.com

    ·         Process Explorer

    ·         Process Monitor

    ·         Regmon

    ·         Filemon

    ·         DbgView

    ·         Handle.exe

    ·         Tcpview

    ·         LiveKD

    ·         AutoRuns

    ·         WinObj

     

    There are many tools contained in “MPS Reports” (MPSRPT_SETUPPerf.EXE), but I’m listing it here specifically for Checksym

    http://www.microsoft.com/downloads/details.aspx?FamilyID=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en

    ·         Checksym

     

    “Windows Server 2003 Resource Kit Tools” is another great set of tools.  In particular Kernrate is a part of that package

    http://www.microsoft.com/downloads/details.aspx?displaylang=en&familyid=9D467A69-57FF-4AE7-96EE-B18C4790CFFD

    ·         Kernrate

     

    Windows XP SP2 Support Tools

    http://www.microsoft.com/downloads/details.aspx?FamilyID=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en

    ·         netcap

    ·         poolmon

    ·         memsnap

    ·         tracefmt  (64-bit versions available in the DDK)

    ·         tracelog

    ·         tracepdb

    ·         depends

    ·         pstat

     

    “Visual Studio” – in addition to the compilers and IDE, the following tools come in handy:

    ·         SPY++

    ·         dumpbin

     

    Perfwiz (Performance Monitor Wizard)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&DisplayLang=en

     

    DebugDiag

    http://www.iis.net/handlers/895/ItemPermaLink.ashx

     

    Userdump (User Mode Process Dumper)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF69-28AC08570B7E&displaylang=en

     

    Dheapmon (Desktop Heap Monitor)

    http://www.microsoft.com/downloads/details.aspx?familyid=5CFC9B74-97AA-4510-B4B9-B2DC98C8ED8B&displaylang=en

     

    Netmon 3.0

    ·         Go to http://connect.microsoft.com/

    ·         Sign in with your Passport account

    ·         Choose "Available Connections" on the left

    ·         Choose "Apply for Network Monitor 3.0" (once you've finished with the application, the selection appears in your "My Participation" page)

    ·         Go to the Downloads page (on the left side) and select the appropriate 32-bit or 64-bit build.

     

     

     

    Some articles you may find useful:

     

    Debugging Tools and Symbols: Getting Started

    http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

     

    Boot Parameters to Enable Debugging

    http://msdn2.microsoft.com/en-us/library/ms791527.aspx

     

    How to Generate a Memory Dump File When a Server Stops Responding (Hangs)

    http://support.microsoft.com/kb/303021/

     

    After installing the “Debugging Tools for Windows”, you’ll find two documents at the root of the install folder that are helpful:

     

    ·         kernel_debugging_tutorial.doc - A guide to help you get started using the kernel debugger.

     

    ·         debugger.chm - The help file for the debuggers.  It details the commands you can use in the debugger.  Think of this as a reference manual, rather than a tutorial.





  • Ntdebugging Blog

    Storport ETW Logging to Measure Requests Made to a Disk Unit

    • 10 Comments

    Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

    ·         KB 979764 (Windows Server 2008)

    ·         KB 978000 (Windows Server 2008 R2)

     

    Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

    Configuring and Starting STORPORT ETW Logging

    To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

    Here is a walkthrough to enable this feature via the GUI:

    Click Start, type “perfmon”, and press ENTER to start Performance Monitor.

    image

     

    Expand “Data Collector Sets” and select “Event Trace Sessions”.

    image

     

    Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 

    image

     

    The following dialog will appear:

    image

     

    Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.

    image

     

    After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.

    image

     

    When the dialog below opens, select “Keywords (Any)” then click Edit.

    image

     

    Check the box for IOPerfNotification, and then click OK.  Note:  For Windows Server 2008 SP1 and SP2, some properties may not be present (e.g. Reset, PortNotification and PauseResume).

    image

     

     

    Next, select “Filter” and click Edit…

    image

     

    This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.

     

    Select “Binary” and check the “Filter enabled” box

    image

    In the “Filter data” field, enter the threshold value in milliseconds.  This must be entered in little-endian format.  Please refer to this table for the most common entries.

    Decimal    Hexadecimal    Binary (little endian)
    1 ms       1              01 00 00 00 00 00 00 00
    5 ms       5              05 00 00 00 00 00 00 00
    10 ms      A              0A 00 00 00 00 00 00 00
    15 ms      F              0F 00 00 00 00 00 00 00

     

    Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms. 

     

    WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.

    image 

    Below is the correct way to enter the value (256 ms).  Note: The filter value may have to be re-entered for the next run because it could get reset.

    image
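
    If you need the eight-byte string for a threshold that is not in the table, the conversion is just the 64-bit millisecond count written least-significant byte first.  A trivial sketch (my own illustration, not part of the original walkthrough):

    #include <stdio.h>
    #include <stdint.h>

    // Prints the millisecond threshold as the little-endian byte string expected
    // by the "Filter data" field, e.g. 256 -> 00 01 00 00 00 00 00 00.
    int main(void)
    {
        uint64_t thresholdMs = 256;    // threshold in milliseconds

        for (int i = 0; i < 8; i++)
            printf("%02X ", (unsigned)((thresholdMs >> (8 * i)) & 0xFF));
        printf("\n");
        return 0;
    }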

    After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”

    image

     

    Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 

    image

     

    The target folder may be changed in the “Directory” tab

    image

     

    I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”

    image

     

    Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.

    image

    The name for the file is selected on the “File” tab.

    image

    When you are done setting the file and directory information click OK.

    Now right-click the log and select “Start” to begin tracing.

    image

    The “Status” should indicate that it is running.

    image

    How long to run the log depends on your issue.  You should let the log run until the issue or slowdown occurs, then check the log for delays.

    Viewing the Log File

    First right-click and select “Stop” to flush any cached data and stop the logging.

    image

    Then use Event Viewer to examine the log.

    image

     

    Next choose "Open the saved log…"

    image

     

    image

    image

     

    After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.

    image

     

    Example: Event ID 13

    image

     

    Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.

    image

    Breaking down the Event Data we have …

     

    Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

    PortNumber:  This is the adapter port number (e.g. RaidPort1)

    PathID:  Path number if multi-pathing is in place

    TargetID:  Target LUN exposed to the Operating System

    LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

    *RequestDuration:  How long the request took in milliseconds (the value of note here for storage latency diagnostics)

    Command:  the SCSI command sent to the unit (usually a read or write).  This is listed in decimal; convert the value to hexadecimal with Calc.exe when referencing SCSI commands: http://en.wikipedia.org/wiki/SCSI_command

    ScsiStatus:  the status that came back from the drive

    SRBStatus:  the status as returned from the adapter  (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)

     

     Example: Command 42 in decimal equals 2A in hexadecimal, which is WRITE(10) - a write command with a 10-byte CDB.

    image

    image

    Here’s the respective SCSI command:

    image

    When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

    Hope you enjoyed the walkthrough!

    -Bob

     

     

     

     

     

     

     

  • Ntdebugging Blog

    How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump

    • 5 Comments

    Prior to Windows Vista and Windows Server 2008 we had to keep a large page file on the system drive (typically drive C:) in order to properly capture system memory dumps for troubleshooting.  This presented problems as systems with very large amounts of RAM became more common, resulting in requirements for very large amounts of free space on the C: drive, or requiring that system visible memory be artificially limited for troubleshooting purposes.  This is no longer a requirement thanks to the Dedicated Dump File feature, which is available for use in Windows Vista and later operating systems.

     

    What is it?

    The dedicated dump file is basically a page file that is reserved for use only by the system crash dump routines.  It is not used for paging virtual memory.  Like a page file, the system process keeps an open handle to the dedicated dump file, which prevents it from being deleted.  When you manually initiate a memory dump, or the system crashes on its own, the data is written into the dedicated dump file instead of the page file on the system drive.  This feature is enabled by setting the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DedicatedDumpFile

    Type:  REG_SZ

    Value: A dedicated dump file together with a full path, such as D:\dedicateddumpfile.sys

    Where can it be stored?

    The dedicated dump file can be stored on any local volume that can support a page file.  This volume must be considered a fixed disk device and must not be hosting Windows clustered storage.  For example, this means you can store this file on a second partition of your server’s primary hard disk, or on a SAN volume that is mounted early in the boot process. 

    Fun fact: Unlike a page file, you can actually direct the dedicated dump file to a sub-folder, even one that is actually a volume mount point.  However, the sub-folder must pre-exist during boot, and there is generally little reason to actually do this.

     

    How much space will it use?

    You can automatically allow the system to determine the size of the dedicated dump file, or you can manually set the size by using the DumpFileSize registry value. 

     

    System determined size:

    If you choose to let the system determine the size, then regardless of the dump file type set via the CrashDumpEnabled value, the dedicated dump file will be roughly the size of system visible RAM plus 256 MB.  The extra 256 MB allows room for secondary dump data that may be written by driver bug check callback routines.  In order to write this data, the drivers would have to first call KeRegisterBugCheckReasonCallback() to register the routines that would execute when the system bug checks.   This is currently not very common due to the limitations imposed by the high IRQL at which bug check callbacks run.  Allowing the system to determine the size is a good idea, especially when disk space is not of huge concern.  Also, the dedicated dump file size will be adjusted automatically at boot if additional RAM is installed or removed later.  This includes artificial memory reduction implemented via the BCDEdit option, truncatememory.  This eliminates the need to manually resize your page file to support proper collection of a memory dump.
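
    For completeness, this is roughly what registering such a callback looks like in a driver.  It is a minimal sketch with an invented component name, and the callback body is left empty because writing secondary dump data at bug check time is heavily constrained (high IRQL, no pageable code or data).

    #include <ntddk.h>

    // The callback record must stay resident (nonpaged) while it is registered.
    static KBUGCHECK_REASON_CALLBACK_RECORD g_CallbackRecord;

    VOID MySecondaryDumpCallback(
        KBUGCHECK_CALLBACK_REASON Reason,
        PKBUGCHECK_REASON_CALLBACK_RECORD Record,
        PVOID ReasonSpecificData,          // KBUGCHECK_SECONDARY_DUMP_DATA for this reason
        ULONG ReasonSpecificDataLength)
    {
        UNREFERENCED_PARAMETER(Reason);
        UNREFERENCED_PARAMETER(Record);
        UNREFERENCED_PARAMETER(ReasonSpecificData);
        UNREFERENCED_PARAMETER(ReasonSpecificDataLength);
        // Describe the driver's secondary dump buffer here - nothing pageable.
    }

    VOID RegisterSecondaryDumpCallback(VOID)
    {
        KeInitializeCallbackRecord(&g_CallbackRecord);
        KeRegisterBugCheckReasonCallback(&g_CallbackRecord,
                                         MySecondaryDumpCallback,
                                         KbCallbackSecondaryDumpData,
                                         (PUCHAR)"MyDrv");   // component name (example)
    }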

     

    Manual size specified via the registry:

    The following registry value will allow you to manually set the dedicated dump file size in megabytes:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFileSize

    Type:  REG_DWORD

    Value: The dump file size in megabytes

     

     

    One good reason to set this value manually is a system with a very large amount of RAM from which only a kernel memory dump is desired.  In this case, a dedicated dump file the size of RAM+256 MB is not required.  Generally, even on 64-bit systems, kernel memory dumps are no larger than a few gigabytes, although they have the potential to be much larger.  In nearly all cases reserving one third the size of RAM will be more than sufficient to capture kernel memory.  If you have plenty of room on the drive and want to play it safe, you can simply allow the system to determine the size for you by setting the DumpFileSize value to zero, or by not creating the value at all.

    Although you may set DumpFileSize to a value larger than needed, or the system may choose a size larger than the selected dump type requires, the resultant file after a bug check will only be as large as the data that is stored in it.  This is most commonly the size of physical RAM plus room for the dump header information.  If you choose to set the DumpFileSize value manually, then you should also set the IgnorePagefileSize value under the CrashControl key to 1.  This will help ensure a dump file is created.

     

    What about the final dump file?

    After a bug check, the data will be written to the dedicated dump file on disk.  However, the final memory dump location will be based on the following registry value:

    Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

    Name: DumpFile

    Type:  REG_EXPAND_SZ

    Value: The target dump file name together with the full path, such as D:\memory.dmp

     

     

    The system will extract the dump data from the dedicated dump file and save the resultant file to the location and file name specified for the DumpFile value.  The default for this value is %SYSTEMROOT%\MEMORY.DMP.  Since we are using the dedicated dump file to overcome space limitations of the system drive, we will likely want to also set the DumpFile value to point to an alternate location than the default. 

    If you set this value to point to the same volume as the dedicated dump file, it will save WERFAULT.EXE the step of performing an additional copy of the temporary dump file after reboot.  Instead the file will simply be renamed on the target volume.   If there is not enough room for both the dedicated dump file and the memory dump after the bug check, then the dedicated dump file will not be recreated, and only the memory dump will remain.  If this happens and the system bug checks again, only the original dump will remain because there will be no dedicated dump to write the second bug check to until space is freed up on the drive and the system is rebooted.

     

    DedicatedDumpFile Usage Example:

    The following example will set the dedicated dump file and resultant dump to be stored in the root of drive D:

     

    Windows Registry Editor Version 5.00

     

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

    "DumpFile"=hex(2):44,00,3a,00,5c,00,4d,00,45,00,4d,00,4f,00,52,00,59,00,2e,00,\

      44,00,4d,00,50,00,00,00

    "DedicatedDumpFile"="D:\\DedicatedDump.sys"

     

     

    (Note: The lines above can be saved as a .REG file and imported into the registry)

     

    The screenshot below demonstrates a system configured to capture a complete 4 GB memory dump to a manually sized dedicated dump file, and will store the final MEMORY.DMP on the D: drive.

     image

    Important information:

    949052 - Kernel memory dump files may not be generated on Windows Server 2008-based and Windows Vista SP1 or later based computers when physical memory is larger than the size of the page file

     

    957517 - A dedicated complete memory dump file may not be successfully generated if the volume that stores the dedicated dump file has insufficient free space

    950858 - Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory

     

  • Ntdebugging Blog

    Event ID 157 "Disk # has been surprise removed"

    • 10 Comments

    Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk.  This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.

     

    What does this error mean?

    The PNP manager does what is called enumerations.  An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices.  The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.

     

    These enumerations can happen for a number of reasons.  For example, hardware can request an enumeration when it detects a change in configuration.  Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.  

     

    When an enumeration request is received, the bus driver will rescan the bus for all devices.  It will issue commands to the existing devices as though it was looking for new ones.  If these commands fail on an existing unit, the driver will mark the device as “missing”.  When the device is marked “missing”, it will not be reported back to PNP in the inventory.  When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.

     

    Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable.  An example of a non-removable disk is a hard drive on a SCSI or IDE bus.  An example of a removable disk is a USB thumb drive.

     

    Previously nothing was logged when a non-removable disk was removed; as a result, disks would disappear from the system with no indication.  The Event ID 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.

     

    Why does this error happen?

    These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem.  The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running.  An administrator who sees these errors needs to verify the health of the disk subsystem.

     

    Event ID 157 Example:

     

  • Ntdebugging Blog

    How the Clipboard Works, Part 1

    • 4 Comments

    Recently I had the opportunity to debug the clipboard in Windows, and I thought I’d share some of the things I learned.  The clipboard is one of those parts of Windows that many of us use dozens (hundreds?) of times a day and don’t really think about. Before working on this case, I had never even considered how it works under the hood.  It turns out that there’s more to it than you might think. In this first article, I’ll describe how applications store different types of data to the clipboard and how it is retrieved.  My next post will describe how applications can hook the clipboard to monitor it for changes.  In both, I’ll include debug notes to show you how to access the data from the debugger.

     

    Let’s start by discussing clipboard formats.  A clipboard format is used to describe what type of data is placed on the clipboard.  There are a number of predefined standard formats that an application can use, such as bitmap, ANSI text, Unicode text, and TIFF.  Windows also allows an application to specify its own formats. For example, a word processor may want to register a format that includes text, formatting, and images.  Of course, this leads to one problem, what happens if you want to copy from your word processor and paste into Notepad, which doesn’t understand all of the formatting and pictures?
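
    As an aside, registering a private format is a one-liner; the format name below is hypothetical.  Any application that registers the same name in the same session gets back the same format identifier, which is how two cooperating programs can exchange a rich private format.

    #include <windows.h>

    // Returns the format id for a private clipboard format (the name is an example).
    UINT GetRichDataFormat(void)
    {
        return RegisterClipboardFormatW(L"MyWordProcessor.RichData");
    }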

     

    The answer is to allow multiple formats to be stored in the clipboard at one time.  When I used to think of the clipboard I thought of it as a single object (“my text” or “my image”), but in reality the clipboard usually has my data in several different forms.  The destination program gets a format it can use when I paste.

     

    So how does this data end up on the clipboard?  Simple: an application first takes ownership of the clipboard via the OpenClipboard function.   Once it has done that, it can empty the clipboard with EmptyClipboard.  Finally, it is ready to place data on the clipboard using SetClipboardData.  SetClipboardData takes two parameters; the first is the identifier of one of the clipboard formats we discussed above.  The second is a handle to the memory containing the data in that format.  The application can continue to call SetClipboardData for each of the various formats it wishes to provide, going from best to worst (since the destination application will select the first format in the list it recognizes).  To make things easier for the developer, Windows will automatically provide converted formats for some clipboard format types.  Once the program is done, it calls CloseClipboard.
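
    Here is a minimal sketch of the copy side in Win32, with error handling trimmed for brevity; hwnd is assumed to be a window owned by the copying application.

    #include <windows.h>
    #include <string.h>
    #include <wchar.h>

    // Place a Unicode string on the clipboard.
    void CopyTextToClipboard(HWND hwnd, const wchar_t *text)
    {
        size_t bytes = (wcslen(text) + 1) * sizeof(wchar_t);

        // The clipboard takes ownership of the HGLOBAL once SetClipboardData succeeds,
        // so it must be GMEM_MOVEABLE and must not be freed by us afterwards.
        HGLOBAL hMem = GlobalAlloc(GMEM_MOVEABLE, bytes);
        memcpy(GlobalLock(hMem), text, bytes);
        GlobalUnlock(hMem);

        if (OpenClipboard(hwnd)) {
            EmptyClipboard();                        // we become the clipboard owner
            SetClipboardData(CF_UNICODETEXT, hMem);  // best format first; add others as needed
            CloseClipboard();
        }
    }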

     

    When a user hits paste, the destination application will call OpenClipboard and one of these functions to determine what data format(s) are available: IsClipboardFormatAvailable, GetPriorityClipboardFormat, or EnumClipboardFormats. Assuming the application finds a format it can use, it will then call GetClipboardData with the desired format identifier as a parameter to get a handle to the data.  Finally, it will call CloseClipboard.
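
    And the paste side, again as a bare sketch under the same assumptions:

    #include <windows.h>
    #include <string.h>

    // Read Unicode text from the clipboard, if available.  Returns a copy the
    // caller must free(), or NULL.
    wchar_t *PasteTextFromClipboard(HWND hwnd)
    {
        wchar_t *result = NULL;

        if (!IsClipboardFormatAvailable(CF_UNICODETEXT))
            return NULL;

        if (OpenClipboard(hwnd)) {
            HGLOBAL hMem = GetClipboardData(CF_UNICODETEXT);   // the clipboard still owns this handle
            if (hMem != NULL) {
                const wchar_t *src = (const wchar_t *)GlobalLock(hMem);
                if (src != NULL) {
                    result = _wcsdup(src);                     // copy before closing the clipboard
                    GlobalUnlock(hMem);
                }
            }
            CloseClipboard();
        }
        return result;
    }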

     

    Now let’s take a look at how you can find what data is being written to the clipboard using the debugger. (Note that all of my notes are taken from a Win7/2008 R2 system – so things might vary slightly in different versions of the OS.)   Since the clipboard is part of Win32k.sys, you’ll need to use a kernel debugger.  I like to use win32k!InternalSetClipboardData+0xe4 as a breakpoint.  The nice thing about this offset is that it is right after we’ve populated the RDI register with data from SetClipboardData in a structure known as tagCLIP.

     

    kd> u win32k!InternalSetClipboardData+0xe4-c L5

    win32k!InternalSetClipboardData+0xd8:

    fffff960`0011e278 894360          mov     dword ptr [rbx+60h],eax

    fffff960`0011e27b 8937            mov     dword ptr [rdi],esi

    fffff960`0011e27d 4c896708        mov     qword ptr [rdi+8],r12

    fffff960`0011e281 896f10          mov     dword ptr [rdi+10h],ebp

    fffff960`0011e284 ff15667e1900    call    qword ptr[win32k!_imp_PsGetCurrentProcessWin32Process (fffff960`002b60f0)]

     

    kd> dt win32k!tagCLIP

       +0x000 fmt              : Uint4B

       +0x008 hData            : Ptr64 Void

       +0x010 fGlobalHandle    : Int4B

     

    Here’s what a call to SetClipboardData from Notepad looks like:

    kd> k

    Child-SP          RetAddr           Call Site

    fffff880`0513a940 fffff960`0011e14f win32k!InternalSetClipboardData+0xe4

    fffff880`0513ab90 fffff960`000e9312 win32k!SetClipboardData+0x57

    fffff880`0513abd0 fffff800`01482ed3 win32k!NtUserSetClipboardData+0x9e

    fffff880`0513ac20 00000000`7792e30a nt!KiSystemServiceCopyEnd+0x13

    00000000`001dfad8 00000000`7792e494 USER32!ZwUserSetClipboardData+0xa

    00000000`001dfae0 000007fe`fc5b892b USER32!SetClipboardData+0xdf

    00000000`001dfb20 000007fe`fc5ba625 COMCTL32!Edit_Copy+0xdf

    00000000`001dfb60 00000000`77929bd1 COMCTL32!Edit_WndProc+0xec9

    00000000`001dfc00 00000000`779298da USER32!UserCallWinProcCheckWow+0x1ad

    00000000`001dfcc0 00000000`ff5110bc USER32!DispatchMessageWorker+0x3b5

    00000000`001dfd40 00000000`ff51133c notepad!WinMain+0x16f

    00000000`001dfdc0 00000000`77a2652d notepad!DisplayNonGenuineDlgWorker+0x2da

    00000000`001dfe80 00000000`77b5c521 kernel32!BaseThreadInitThunk+0xd

    00000000`001dfeb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    So here, we can dt RDI as a tagCLIP to see what was written:

    kd> dt win32k!tagCLIP @rdi

       +0x000 fmt              : 0xd

       +0x008 hData            : 0x00000000`00270235 Void

       +0x010 fGlobalHandle    : 0n1

     

    Fmt is the clipboard format. 0xd is 13, which indicates this data is Unicode text.  We can’t just ‘du’ the value in hData, however, because this is a handle, not a direct pointer to the data.  So now we need to look up the handle.  To do that, we need to look at a win32k global structure – gSharedInfo:

    kd> ?win32k!gSharedInfo

    Evaluate expression: -7284261440224 = fffff960`002f3520

    kd> dt win32k!tagSHAREDINFO fffff960`002f3520

       +0x000 psi              : 0xfffff900`c0980a70 tagSERVERINFO

       +0x008 aheList          : 0xfffff900`c0800000 _HANDLEENTRY

       +0x010 HeEntrySize      : 0x18

       +0x018 pDispInfo        : 0xfffff900`c0981e50 tagDISPLAYINFO

       +0x020 ulSharedDelta    : 0

       +0x028 awmControl       : [31] _WNDMSG

       +0x218 DefWindowMsgs    : _WNDMSG

       +0x228 DefWindowSpecMsgs : _WNDMSG

     

    aheList in gSharedInfo contains an array of handle entries; the last 2 bytes of hData, multiplied by the size of a handle entry and added to the array base, give the address of our handle entry:

    kd> ?0x00000000`00270235 & FFFF

    Evaluate expression: 565 = 00000000`00000235

    kd> ??sizeof(win32k!_HANDLEENTRY)

    unsigned int64 0x18

    kd> ? 0xfffff900`c0800000 + (0x235*0x18)

    Evaluate expression: -7693351766792 = fffff900`c08034f8

     

    kd> dt win32k!_HANDLEENTRY fffff900`c08034f8

       +0x000 phead            : 0xfffff900`c0de0fb0 _HEAD

       +0x008 pOwner           : (null)

       +0x010 bType            : 0x6 ''

       +0x011 bFlags           : 0 ''

       +0x012 wUniq            : 0x27

     

    If we look in phead at offset 0x14, we’ll get our data (this offset may vary on different platforms):

    kd> du fffff900`c0de0fb0 + 0x14

    fffff900`c0de0fc4 "Hi NTDebugging readers!"

     

    Let’s consider one other scenario.  I copied some text out of Wordpad, and a number of SetClipboardData calls were made to accommodate different formats. The Unicode format entry looks like this:

    Breakpoint 0 hit

    win32k!InternalSetClipboardData+0xe4:

    fffff960`0011e284 ff15667e1900   call    qword ptr[win32k!_imp_PsGetCurrentProcessWin32Process (fffff960`002b60f0)]

    kd> dt win32k!tagCLIP @rdi

       +0x000 fmt              : 0xd

       +0x008 hData            : (null)

       +0x010 fGlobalHandle    : 0n0

     

    hData is null!  Why is that?  It turns out that the clipboard allows an application to pass in null to SetClipboardData for a given format.  This indicates that the application can provide the data in that format, but is deferring doing so until it is actually needed.  Sure enough, if I paste the text into Notepad, which needs the text in Unicode, Windows sends a WM_RENDERFORMAT message to the WordPad window, and WordPad provides the data in the new format.  Of course, if the application exits before populating all of its formats, Windows needs all of the formats rendered.  In this case, Windows will send the WM_RENDERALLFORMATS message so other applications can use the clipboard data after the source application has exited.
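
    A source window that defers rendering handles those two messages roughly as follows.  This is a sketch; RenderMyTextAsUnicode is a hypothetical helper that builds the HGLOBAL on demand.

    #include <windows.h>

    HGLOBAL RenderMyTextAsUnicode(void);   // hypothetical helper, defined elsewhere

    LRESULT CALLBACK SourceWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        switch (msg) {
        case WM_RENDERFORMAT:
            // Windows opened the clipboard on our behalf - just supply the promised format.
            if (wParam == CF_UNICODETEXT)
                SetClipboardData(CF_UNICODETEXT, RenderMyTextAsUnicode());
            return 0;

        case WM_RENDERALLFORMATS:
            // The source is exiting; render everything it promised so others can still paste.
            if (OpenClipboard(hwnd)) {
                if (GetClipboardOwner() == hwnd)
                    SetClipboardData(CF_UNICODETEXT, RenderMyTextAsUnicode());
                CloseClipboard();
            }
            return 0;
        }
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }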

     

    That’s all for now.  Next time we’ll look at how applications can monitor the clipboard for changes using two hooks.  If you want to know more about using the clipboard in your code, this is a great reference.

     

    -Matt Burrough

     

    Part 2 of this article can be found here:

    http://blogs.msdn.com/b/ntdebugging/archive/2012/03/29/how-the-clipboard-works-part-2.aspx

  • Ntdebugging Blog

    Easily Resolving an Event Viewer Error using a Process Memory Dump

    • 11 Comments

    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today I will be blogging about how I used the SOS  .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue.  Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.

     

    Issue Reported:  The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:

     

                           "MMC could not create the snap-in."

    MMC could not create the snap-in.

    The snap-in might not have been installed correctly

    Name: Event Viewer

                   CLSID: FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}

    clip_image001

    First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before.  It may occur when the following registry key gets deleted or corrupted:

     

                    HKLM\software\Microsoft\MMC\SnapIns\FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}

     

    I had the customer export this key and found that it was not corrupted in any way.  I verified that all data was as expected.

     

    Next, a memory dump of the mmc.exe process was collected.  The mmc.exe process is used to host the eventvwr.msc snap-in.  This was easily obtained using the built-in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File".  If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.

    clip_image002clip_image003

    clip_image004

     

    Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL:  http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx

     

     

     Now let's have a look at the debug.

     

    1.          First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.

    clip_image005

    It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.

    clip_image002[5]

    clip_image002[9]

    2.            I noticed the following output from the debugger after it loaded the dump file:

     

    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)

     

     

    3.            Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:

     

    0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL.  Here’s the syntax: 

     

    0:011> .symfix c:\websymbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    4.            To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.

     

    0:005> !peb

    PEB at 000007fffffdb000

    ....

       CommandLine:  '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '

     

    You could alternatively dump the  CommandLine from the nt!_PEB using the dt command

     

    0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000

    ntdll!_PEB

       +0x020 ProcessParameters              :

          +0x070 CommandLine                    : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "

     

    5.            Next, I dumped out all of the threads in this process and found that the following thread contained a stack that was raising a .Net Framework exception:

     

    0:011> ~* kL

    ... (omitted the non-relevant threads)

     

    # 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

    ChildEBP RetAddr 

    0691f03c 7343a91c kernel32!RaiseException+0x58

    0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8

    *** WARNING: Unable to verify checksum for MMCEx.ni.dll

    0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf

    *** WARNING: Unable to verify checksum for mscorlib.ni.dll

    0691f1e8 69926cf6 MMCEx_ni+0xd0b5a

    0691f1f4 6993019f mscorlib_ni+0x216cf6

    0691f208 69926c74 mscorlib_ni+0x22019f

    0691f220 733d1b4c mscorlib_ni+0x216c74

    0691f230 733e21b1 mscorwks!CallDescrWorker+0x33

    0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3

    0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c

    0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f

    0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a

    0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192

    0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a

    0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3

    0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a

    0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e

     

    6.            Out of curiosity, I also ran the Get Last Error command

     

    0:011> !gle

    LastErrorValue: (Win32) 0 (0) - The operation completed successfully.

    LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0

     

    7.            After this, I ran !analyze -v to see what helpful information the debugger would provide.  The debugger did output exception information, but informed me that I needed to use the x86 debugger instead.

     

    0:011> !analyze -v

    ...

                            EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)

                            ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)

                              ExceptionCode: e0434f4d (CLR exception)

                              ExceptionFlags: 00000001

                            NumberParameters: 1

                            Parameter[0]: 80131604

                            MANAGED_BITNESS_MISMATCH:

                                                    Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.

                            FAULTING_THREAD:  00000a2c

                                                    PRIMARY_PROBLEM_CLASS:  CLR_EXCEPTION

     

    8.            I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension.  This extension ships in the Operating System along with the .Net Framework.  On most occasions, I would have initiated the loading of the extension through the use of the following syntax:

     

     0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll

     

    OR

     

    0:011> .load  c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll

     

    However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load.  This is due to the fact that the version of SOS must match the version of the CLR loaded into that process.  Here’s the recommended syntax: 

     

                            0:011>.loadby sos mscorwks

     

    I always verify that my extensions are loaded properly by using the  .chain command.

     

    0:011> .chain

    ...   Extension DLL chain:

                            C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007

     

    9.            Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:

     

    0:011> !help

    -------------------------------------------------------------------------------

    SOS is a debugger extension DLL designed to aid in the debugging of managed

    programs. Functions are listed by category, then roughly in order of

    importance. Shortcut names for popular functions are listed in parenthesis.

    Type "!help <functionname>" for detailed info on that function.

     

    Object Inspection                  Examining code and stacks

    -----------------------------      -----------------------------

    DumpObj (do)                       Threads

    DumpArray (da)                     CLRStack

    DumpStackObjects (dso)             IP2MD

    DumpHeap                           U

    DumpVC                             DumpStack

    GCRoot                             EEStack

    ObjSize                            GCInfo

    FinalizeQueue                      EHInfo

    PrintException (pe)                COMState

    TraverseHeap                       BPMD

     

    10.            Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:

                   

    0:011> !pe 771a42eb

    Invalid object

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !pe -nested 771a42eb

    Invalid object

     

    Nested exception -------------------------------------------------------------

    Exception object: 040a676c

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more

    StackTrace (generated):

        SP       IP       Function

     

    StackTraceString: <none>

    HResult: 80131604

     

    0:011> !PrintException 040a6a20

    Exception object: 040a6a20

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131604

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a6cf8

    Exception object: 040a6cf8

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Configuration system failed to initialize

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a7174

    Exception object: 040a7174

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Unrecognized configuration section system.web/myInvalidData  

    InnerException: <none>

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    11.            Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData inside of it.  I then re-ran !analyze -v against the dump (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically.  It then displayed the exception record information for me as well.  Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.

     

                0:011> !analyze -v

                    ...

                            EXCEPTION_MESSAGE:  Unrecognized configuration section system.web/myInvalidData.

                                                    MANAGED_OBJECT_NAME:  System.Configuration.ConfigurationErrorsException

                                                    FAULTING_THREAD:  00000a2c

     

    0:011> ~

       0  Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen

       1  Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen

       2  Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen

       3  Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen

       4  Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen

       5  Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen

       6  Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen

       7  Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen

       8  Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen

       9  Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen

      10  Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen

    . 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

     

    12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for     

     

                "unrecognized configuration section" system.web site:microsoft.com

                 

    This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE

     

    By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications.  This is due to invalid entries in the various .config  files used in .Net. Looking through the posts, different configuration file names and paths were observed.

     

    So, I opened up the process monitor logfile to see which configuration files we were reading data from.  I added filter criteria to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config, and a successful status result.  It's best to be as specific as possible.

     

    clip_image002[7]

    I found that we were reading in a large number of settings over and over again from the .Net Framework global configuration file:

                                    c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

                    (on x64 this would be     C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)

    clip_image002[11]

    Final Step - Putting it all together, reproducing the issue, and confirming resolution: Using notepad, a quick search of the suspect XML file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section.  At this point, I suspected that this section contained an invalid entry which may have been related to the problem.  To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file.  Doing so, I successfully reproduced the issue on my system.  I then contacted the customer and asked if they had by any chance added any settings under <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.

     

    The customer informed me that, at the request of their ASP.NET developer, they had in fact added settings to that section of the file. By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings present in this file did not belong inside <system.web>. The developer moved the settings to the proper location and the issue was resolved.

     

    Here are the steps I used to reproduce the issue, in case you are attempting to replicate this at home:

     

    A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:

         (please make a backup copy first in case you make a mistake)  

       c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

         OR   (Open the version that matches the architecture of your platform )

       C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config

     

    [Screenshot: machine.config opened in Notepad]

    B. Find the section <system.web> in this file (you can use the find function in notepad):

    [Screenshot: the <system.web> section in machine.config]

    C.  Add the following line directly after <system.web> as shown in the example below:

                         <myInvalidData/>   

    [Screenshot: <myInvalidData/> added directly after <system.web>]
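
    As text rather than a screenshot, the edited region of machine.config should look roughly like the sketch below; only the <myInvalidData/> line is added, and the rest of the file is left untouched (other sections and the existing <system.web> settings are omitted here):

                <configuration>
                  <!-- other sections omitted -->
                  <system.web>
                    <myInvalidData/>
                    <!-- existing <system.web> settings continue here -->
                  </system.web>
                  <!-- other sections omitted -->
                </configuration>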

    D. Save the file and then open eventvwr.msc and verify that the following error is displayed:

    [Screenshot: the resulting error in Event Viewer]

    Conclusion

    Hopefully this post has demonstrated how you can use the "create dump file" feature of Windows Server 2008, windbg, and other related tools to gather more specific data when an error message does not reveal the source of the problem. Feel free to post any questions, comments, or concerns.

     

  • Ntdebugging Blog

    The Mystery of Lsass.exe Memory Consumption, (When all components get involved)

    • 4 Comments

    Hi All, this is Karim Elsaid and I’m a Support Escalation Engineer working with the Dubai platforms support team.  Recently I was working on a very challenging and interesting case, and I wanted to share that experience with you.

     

    One of our customers was experiencing a problem on all his Domain Controllers (Running x86 Windows Server 2003), where intermittently the DCs ran out of resources.  When one of the servers ran out of resources it had to be restarted to restore its functionality.  The following event was logged on the DC when the problem occurred:

     

    Event Type: Error
    Event Source: NTDS General
    Event Category: Internal Processing
    Event ID: 1169
    Date:
    Time:
    User: N/A
    Computer:
    Description:
    Active Directory could not allocate the needed amount of memory.

    Memory (bytes):
    4568

    Active Directory will continue to operate, but may not function correctly.

    User Action
    Restart this domain controller. If this condition continues, increase the available
    physical or virtual memory.

    Additional Data
    Internal ID:
    30205bd
    For more information, see Help and Support Center at
    http://go.microsoft.com/fwlink/events.asp.

     

    This event gets logged when the Local Security Authority Subsystem (lsass.exe) process is unable to allocate memory from the heap.

     

    When this problem occurs, users are not able to authenticate and Exchange servers cease functioning until the DC gets rebooted.

     

    By the time we got the call, the DC that was exhibiting the issue had been rebooted, so when we checked the lsass.exe memory usage it was very low. When we checked another DC on the same site, however, the lsass.exe memory usage was around 1.7GB. That's pretty high, but it is worth mentioning that by default lsass.exe may use a large amount of virtual memory for the ESENT cache (the Active Directory database); this by itself should not cause any issues on DCs, as the cache should back off when memory is needed elsewhere.

     

    The issue here is that the memory usage of lsass.exe kept growing and was never released - in other words, a process memory leak.

     

    To confirm our theory we obtained a Performance Monitor log from the DC and checked how lsass.exe was using memory, which proved that lsass.exe was indeed leaking memory.

    [Performance Monitor graph: lsass.exe Private Bytes over time]

     

    We can see from the graph above a steady increase in lsass.exe private bytes. The DC was booted without /3GB, hence the user-mode address space available to the process is limited to 2GB.
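
    If you want to capture a similar log yourself, a rough sketch using logman might look like the following (the collector name, sample interval, and output path are just illustrative choices):

                rem Create a counter log that samples lsass.exe private bytes once per minute
                logman create counter lsass_priv -c "\Process(lsass)\Private Bytes" -si 00:01:00 -o C:\PerfLogs\lsass_priv

                rem Start the collector now, and stop it after the growth pattern has been captured
                logman start lsass_priv
                logman stop lsass_priv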

     

    For more information on this topic, please check the previous post “The Memory Shell Game”.

     

    Now, we need to explore the options for tracking down and addressing such a leak. We could obtain an lsass.exe process dump and check the heap, or obtain a full memory dump of the server while in the failed state to check the lsass.exe heap usage and overall system state, but none of the above may immediately reveal what’s going on. We would need to be very lucky to capture the dump while the allocation call inside the lsass.exe process is actually executing, and the chances of that are very slim.

     

    What we are looking for is a “pattern” - the pattern of allocations that is causing the actual leak inside the lsass.exe process.

     

    For that we will turn to a tool that many customers are not aware of, but which is in fact very handy and makes it easy to troubleshoot this kind of memory leak. We will use the “User Mode Dump Heap”, or UMDH. This tool does the job for us: it monitors all the heap calls that happen in the lsass.exe process.  All we need to do is download the “Debugging Tools for Windows”, enable user-mode stack acquisition from the command line, download the symbols from the Microsoft public symbol servers, reboot the server, and we are good to go.

     

    You can find information on UMDH, and how to set it up and use it, “Here”.
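
    As a rough sketch of the setup described above (the symbol path, log file names, and placeholder PID are just examples):

                rem Point the debugging tools at the Microsoft public symbol server
                set _NT_SYMBOL_PATH=srv*C:\Symbols*http://msdl.microsoft.com/download/symbols

                rem Enable user-mode stack trace collection for lsass.exe, then reboot the server
                gflags /i lsass.exe +ust

                rem After the reboot, find the lsass.exe PID and take a baseline snapshot
                tasklist /fi "imagename eq lsass.exe"
                umdh -p:<lsass PID> -f:C:\umdh\lsass_day1.txt

                rem Take later snapshots the same way, then compare any two of them
                umdh C:\umdh\lsass_day1.txt C:\umdh\lsass_day5.txt > C:\umdh\lsass_diff.txt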

     

    So now, all heap blocks that the lsass.exe process allocates will be monitored by UMDH, and the tool will give us information about each allocating call stack: for example, the number of allocations made from that stack, the number of bytes consumed, and so on.

     

    Back to our main problem: we set up the DC for UMDH and started monitoring the lsass.exe process.  We verified UMDH was working after the reboot by generating a sample log - it is critical to generate this initial log to make sure that you are able to resolve symbols correctly.

     

    We kept generating UMDH logs daily, and then an interesting pattern started to appear:

     

    + 123097760 ( 123098480 -    720) 1538731 allocs      BackTrace4BA0

    + 1538722 ( 1538731 -      9) BackTrace4BA0     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          msvcrt!malloc+0000006C

          NTDSA!ABBind_local+0000031C

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From backtrace 4BA0 we can see that roughly 1.5 million heap allocations (1,538,722 of them, about 123 MB in total) have occurred without being freed since the first day we started to monitor lsass.exe using UMDH.

     

    There is also another call stack with exactly the same symptoms:

     

    + 51525199 ( 51525506 -    307) 1538802 allocs  BackTrace4B92

    + 1538793 ( 1538802 -      9) BackTrace4B92     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          LSASRV!LsapAllocatePrivateHeapNoZero+0000004F

          RPCRT4!operator new+0000000D

          RPCRT4!W2AAttachHelper+00000032

          RPCRT4!CNlDelUnicodeAnsi::Convert+00000014

          NTDSA!ABBind_local+0000005D

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From both call stacks above we can see that the originating function that leads to the heap allocation is “NTDSA!NspiBind”.

     

    NSPI is the interface that the Exchange Address Book provider uses to talk to the Global Catalogs.  Typically, when Outlook clients want to query the directory for address book information, they talk to the Exchange server directly; the Exchange server then looks up an appropriate GC, forwards the request to it, and finally sends the result back to the originating client.  This method is called the DSProxy method.  Outlook clients can also send referral requests (RFR) to the Exchange server to get a referral to a GC and then start talking to that GC directly.  You can find more information about this operation “Here”.

     

     

    With this new information in hand, we turned our attention to the Exchange server for a while. We obtained a Netmon trace from the Exchange server and, on examining it, found a massive number of RFR requests originating from Outlook client machines toward the Exchange server:

    3795  11:50:59.1426170  10.3876170        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:535, TCP:488, IPv4:118}

    3796  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:502, TCP:496, IPv4:326}

    3797  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:509, TCP:508, IPv4:326}

    3798  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:494, TCP:484, IPv4:326}

    9807  11:51:13.1894920  24.4344920        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1144, TCP:1142, IPv4:326}

    10084 11:51:13.3769920  24.6219920        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:1145, TCP:1143, IPv4:118}

    10085 11:51:13.3769920  24.6219920        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:1145, TCP:1143, IPv4:118}

    10086 11:51:13.3926170  24.6376170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:1144, TCP:1142, IPv4:326}

    10103 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1182, TCP:1180, IPv4:326}

    10108 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1151, TCP:1149, IPv4:326}

    10111 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1200, TCP:1196, IPv4:326}

    10115 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1164, TCP:1163, IPv4:326}

    10148 11:51:13.4863670  24.7313670        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10149 11:51:13.4863670  24.7313670        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10172 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1199, TCP:1197, IPv4:326}

    10174 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1187, TCP:1186, IPv4:326}

    10180 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1170, TCP:1168, IPv4:326}

    10183 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1153, TCP:1152, IPv4:326}

    10185 11:51:13.4863670  24.7313670        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10186 11:51:13.4863670  24.7313670        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    10224 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1209, TCP:1208, IPv4:326}

    10227 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1192, TCP:1190, IPv4:326}

    10232 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1159, TCP:1157, IPv4:326}

    10235 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1175, TCP:1173, IPv4:326}

    10240 11:51:13.5176170  24.7626170        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10241 11:51:13.5176170  24.7626170        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10251 11:51:13.5332420  24.7782420        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10255 11:51:13.5332420  24.7782420        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    Snipp…..

     

    In the Netmon capture we saw an excessive number of NspiBind requests being sent from the Outlook client machines to the Exchange server, and the Exchange server in turn sending these NspiBind requests to the GC. We did not find any frames for NspiUnbind requests, which should be sent to tear those connections down.

     

    When we checked the Exchange server security logs we found a massive number of logon events occurring for every single client:

    Event Type: Success Audit

    Event Source:     Security

    Event Category:   (2)

    Event ID:   540

    Date:       5/26/2010

    Time:       3:49:09 PM

    User:      

    Computer:   EXCHSERVER

    Description:

    Successful Network Logon:

          User Name:  User_Name

          Domain:           CONTOSO.COM

          Logon ID:         (0x0,0x3DA1F42)

          Logon Type: 3

          Logon Process:    Kerberos

          Authentication Package: Kerberos

          Workstation Name:

          Logon GUID: {b03f9ad2-3150-e1be-f50a-b3abb9d2c09d}

          Caller User Name: -

          Caller Domain:    -

          Caller Logon ID:  -

          Caller Process ID: -

          Transited Services: -

          Source Network Address: 10.11.12.13

          Source Port:      0

     

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

     

    So now we know that the lsass.exe memory consumption on the DCs was actually caused by a massive number of NspiBind requests being sent to the DCs without corresponding NspiUnbind requests.

     

    So we turned our attention to the client machines that had Outlook installed and started to list all third-party add-ins installed on them.

     

    We found that all affected client machines shared the same third-party add-in for email content archival. Further investigation of this software revealed a strange MAPI wrapper DLL, and our customer confirmed that the start of the problem approximately matched the time they rolled out the content archival software to the client machines.

     

    We uninstalled this content archival add-in from a couple of machines and these machines stopped hammering the Exchange Server and GC with NspiBind requests.

     

    Later we uninstalled it from the rest of the client machines; the lsass.exe process on the DCs finally stabilized and we no longer had any issues with its memory usage.

     

    I’d like to mention that starting with Windows Server 2008 we have implemented an “NSPI max sessions per user” setting, which lets you limit the number of NSPI connections a single client can open against the server.

     

    You can read more information about this new functionality “Here”.
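
    As a rough sketch only (please confirm the exact value name and supported range against the documentation linked above), the limit is a REG_DWORD under the NTDS parameters key on the Global Catalog, for example:

                rem Limit each user to at most 50 concurrent NSPI sessions on this GC (value name assumed from the Windows Server 2008 documentation)
                reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "NSPI max sessions per user" /t REG_DWORD /d 50

    A restart of the domain controller is typically required for the new limit to take effect.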

     

    That’s it! I hope you enjoyed reading, and hopefully this will help you troubleshoot any memory leaks you may encounter.  The information here can be applied to other processes when you suspect heap leaks.

     

    See you again on another adventure.

     

    Best Regards,

    Karim Elsaid

  • Ntdebugging Blog

    Using Xperf to investigate slow I/O issues

    • 2 Comments

    Debuggers,

    Here’s another short video of the type of triage possible with Xperf; this time Bob shows us how to track down a slow I/O issue. Since the I/O Manager is instrumented at IoCallDriver and IoCompleteRequest, we can often use this information to diagnose a slow I/O or SAN issue. The steps to enable the tracing are in the video description and are repeated below.

    Thanks,

    Tate

                                                                                                                                                                                                                                                                                            

    Video Description: (http://www.youtube.com/watch?v=0h-i_EfzadA)

    Here's how you can use the Windows Performance Toolkit, aka Xperf, to investigate slow I/O issues on Windows 2008, Vista, and higher versions. Put the following into a batch file (modifying -f so that the .etl file is written somewhere other than the slow storage) and it will start a circular trace which can be stopped when the slow I/O occurs.

    @echo off

    xperf -on PROC_THREAD+LOADER+CSWITCH+FILENAME+FILE_IO+FILE_IO_INIT+DRIVERS -f kernel.etl -stackwalk CSwitch+DiskReadInit+DiskWriteInit+DiskFlushInit+FileCreate+FileCleanup+FileClose+FileRead+FileWrite -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

    echo Reproduce the problem now, hit any key to finish, collect Merged Etl file

    pause

    echo merging etl...

    xperf -d XperfSlowIOcir.etl

    @echo on
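
    Once the batch file has finished, the merged trace can be opened in the Performance Analyzer viewer that ships with the toolkit, for example:

    xperf XperfSlowIOcir.etl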

     
