• Ntdebugging Blog

    Missing System Writer Case Explained


    I worked on a case the other day where all I had was a procmon log and event logs to troubleshoot a problem where the System Writer did not appear in the VSSADMIN LIST WRITERS output. This might be review for the folks that know this component pretty well but I figured I would share how I did it for those that are not so familiar with the System Writer.

     

    WHAT WE KNOW:

    1. System State Backups fail
    2. Running VSSADMIN LIST WRITERS does not list the System Writer

     

    Looking at the event logs I found the error shown below. This error indicates there was a failure while “Writer Exposing its Metadata Context”. Each writer is responsible for providing a list of files, volumes, and other resources it is designed to protect. This list is called metadata and is formatted as XML. In the example we are working with the error is “Unexpected error calling routine XML document is too long”.  While helpful, this message alone does not provide us with a clear reason why the XML document is too long.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine XML document is too long. hr = 0x80070018, The program issued a command but the command length is incorrect. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer

     

    The second event that was logged was also not very helpful as it only indicates that the writer did have a failure. It looks like we are going to need to collect more data to figure this out.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine CreateVssExamineWriterMetadata. hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error. Check the Application event log for more information. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer
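The two hr values in these events can be split into their standard HRESULT fields, which shows where each one comes from. The following is a minimal sketch, not part of the original troubleshooting; the helper name `decode_hresult` is made up for illustration. It shows that 0x80070018 is a wrapped Win32 error 24 (ERROR_BAD_LENGTH, the "command length is incorrect" text in the event), while 0x80042302 is an interface-specific VSS code.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its standard bit fields."""
    hr &= 0xFFFFFFFF
    return {
        "failed": bool(hr >> 31),        # bit 31: severity, 1 means failure
        "facility": (hr >> 16) & 0x7FF,  # bits 16-26: facility code
        "code": hr & 0xFFFF,             # bits 0-15: status code
    }


# Only the two facilities that appear in the events above are named here.
FACILITY_NAMES = {4: "FACILITY_ITF", 7: "FACILITY_WIN32"}

if __name__ == "__main__":
    for hr in (0x80070018, 0x80042302):
        fields = decode_hresult(hr)
        name = FACILITY_NAMES.get(fields["facility"], "other")
        print(f"{hr:#010x}: facility={name} code={fields['code']:#x}")
```

Because the first value carries FACILITY_WIN32, its low 16 bits can be looked up as an ordinary Win32 error code.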

     

    From the errors above we learned that there was an issue with the metadata file for the System Writer. These errors are among the most common issues seen with this writer. The writer has some poorly documented limitations caused by hard-coded limits on path depth and on the number of files in a given path. These limitations are frequently exposed by the C:\Windows\Microsoft.Net\ path, which is often used by development software like Visual Studio as well as server applications like IIS. Below I have listed a few known issues that should help provide some scope when troubleshooting System Writer issues.

     

    Known limitations and common points of failure:

    • More than 1,000 folders in a folder causes writer to fail during OnIdentify
    • More than 10,000 files in a folder causes writer to fail during OnIdentify (frequently C:\Windows\Microsoft.Net)
    • Permissions issues (frequently in C:\Windows\WinSXS and C:\Windows\Microsoft.Net)
    • Permissions issues with COM+ Event System Service
      • This service needs to be running and needs to have Network Service with Service User Rights
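The two count limits in the list above can be checked with a short script. This is a rough sketch rather than a supported tool: the function name is hypothetical, the thresholds are simply the figures quoted in the list, and it counts the immediate children of each folder, which is how I read those limits.

```python
import os

MAX_SUBFOLDERS = 1_000  # folder limit quoted in the list above
MAX_FILES = 10_000      # file limit quoted in the list above


def find_crowded_folders(root, max_dirs=MAX_SUBFOLDERS, max_files=MAX_FILES):
    """Yield (path, subfolder_count, file_count) for folders over either limit."""
    # os.walk skips unreadable folders by default, so permission problems
    # do not stop the scan; they just go unreported here.
    for path, dirs, files in os.walk(root):
        if len(dirs) > max_dirs or len(files) > max_files:
            yield path, len(dirs), len(files)


if __name__ == "__main__":
    for path, n_dirs, n_files in find_crowded_folders(r"C:\Windows\Microsoft.NET"):
        print(f"{n_dirs:>6} folders {n_files:>7} files  {path}")
```

Run it from an elevated prompt so that fewer folders are skipped for lack of access.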

     

    What data can I capture to help me find where the issue is?

     

    The best place to start is with a Process Monitor (Procmon) capture. To prepare for this capture you will need to download Process Monitor, open the Services MMC snap-in, as well as open an administrative command prompt which will be used in a later step of the process.

     

    You should first stop the Cryptographic Services service using the Services MMC.

     

     

    Once the service is stopped, open Procmon. Note that by default Procmon starts capturing as soon as it is opened. Now that Procmon is open and capturing data, start the Cryptographic Services service. This will allow you to capture any errors during service initialization. Once the service is started, use the command prompt opened earlier to run “vssadmin.exe list writers”. This signals the writers on the system to gather their metadata, which is a common place to see failures with the System Writer. When the vssadmin command completes, stop the Procmon capture and save the data to disk.
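The same capture can be scripted instead of driven through the GUI. The sketch below only builds the command sequence and runs nothing unless `run_capture` is called from an elevated session. It assumes the service short name CryptSvc and uses Procmon's documented command-line switches; the executable and log paths are placeholders.

```python
import subprocess


def build_capture_plan(procmon_exe, log_path):
    """Return the capture commands in the order described above."""
    return [
        ["net", "stop", "CryptSvc"],
        # Start Procmon capturing straight into a backing file.
        [procmon_exe, "/AcceptEula", "/Quiet", "/Minimized", "/BackingFile", log_path],
        # Block until the running Procmon instance is ready.
        [procmon_exe, "/WaitForIdle"],
        ["net", "start", "CryptSvc"],
        ["vssadmin", "list", "writers"],
        # Stop the capture and close Procmon.
        [procmon_exe, "/Terminate"],
    ]


def run_capture(plan):
    """Execute the plan; requires an elevated prompt on the affected server."""
    for cmd in plan:
        if "/BackingFile" in cmd:
            subprocess.Popen(cmd)  # Procmon keeps running during the later steps
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths; printing the plan is safe on any machine.
    for step in build_capture_plan(r"C:\Tools\Procmon.exe", r"C:\Temp\syswriter.pml"):
        print(" ".join(step))
```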

     

    Now that we have data how do we find the errors?

     

    Open your newly created Procmon file. First, add a new filter for the path field that contains “VSS\Diag”.

     

     

    We do this because this is the registry path that all writers will log to when entering and leaving major functions. Now that we have our filtered view we need to look for an entry from the System Writer. You can see the highlighted line below shows the “IDENTIFY” entry for the System Writer. From here we can ascertain the PID of the svchost.exe that the system writer is running in. We now want to include only this svchost. To accomplish this you can right click on the PID for the svchost.exe and select “Include ‘(PID)’”. 

     

     

    Now that we have found the System Writer’s svchost, we want to remove the filter for “VSS\Diag”. To do that, return to the filter tool in Procmon and uncheck the box next to the entry.

     

     

    We now have a complete view of what this service was doing from the time it started until it completed OnIdentify. Our next step is to locate the IDENTIFY (Leave) entry, as this is often a great marker for where your next clue will be. While in most cases we can’t directly see the error the writer hit, we can make some educated connections based on the common issues listed above. If we look at the events that took place just before the IDENTIFY (Leave), we can see that the writer was working in the C:\Windows\Microsoft.NET\assembly\ directory. This is one of the paths that the System Writer is responsible for protecting. As mentioned above, there are known limitations on the depth of paths and the number of files in the “C:\Windows\Microsoft.NET” folder, and this Procmon capture is a good example of hitting them. The example below shows the IDENTIFY (Leave) entry, and the line before it shows where the last work was taking place. In other words, this is what the writer was touching when it failed.
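If the capture is saved as CSV (File, Save, Comma-Separated Values in Procmon), the same search can be automated. This is a sketch under two assumptions: the export has Procmon's default PID and Path columns, and the writer's marker appears in a registry path ending in `VSS\Diag\System Writer\IDENTIFY (Leave)`. The function names are invented for the example.

```python
import csv


def last_path_before_identify_leave(rows, writer="System Writer"):
    """Return the last non-marker path the writer's process touched before IDENTIFY (Leave)."""
    rows = list(rows)
    marker = f"VSS\\Diag\\{writer}\\IDENTIFY (Leave)".lower()
    for idx, row in enumerate(rows):
        if marker in row["Path"].lower():
            pid = row["PID"]  # the svchost.exe instance hosting the writer
            for prev in reversed(rows[:idx]):
                if prev["PID"] == pid and "vss\\diag" not in prev["Path"].lower():
                    return prev["Path"]
    return None


def load_procmon_csv(csv_path):
    """Read a Procmon CSV export into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

Calling `last_path_before_identify_leave(load_procmon_csv("capture.csv"))` should return the path that sits just above the IDENTIFY (Leave) line in the filtered Procmon view, assuming the export matches the layout described.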

     

     

    What does this tell us and what should we do next?

     

    Given the known path limitations, we need to count the files and folders under the C:\Windows\Microsoft.Net\ path and see where the bloat is. Not everything found there can be removed: only files located in the temporary locations (Temporary ASP.NET Files) are safe to delete.

     

    Recently we released KB2807849 which addresses the issue shown above.

     

    There are other possible causes of the event log errors mentioned above, such as issues with file permissions. For those problems, follow the same steps as above; you are likely to see the IDENTIFY (Leave) entry just after a file access error in your Procmon log. For these failures you will need to investigate the permissions on the file the writer failed on. Most likely the file is missing permissions for the writer’s service account, Network Service or Local System. All that is needed here is to add back the missing permissions on that file.

     

    While these issues can halt your nightly backups, it is often fairly easy to find the root cause. It just takes time and a little bit of experience with Process Monitor. 

     

    Good luck and successful backups!


    Understanding Pool Corruption Part 2 – Special Pool for Buffer Overruns


    In our previous article we discussed pool corruption that occurs when a driver writes too much data in a buffer.  In this article we will discuss how special pool can help identify the driver that writes too much data.

     

    Pool is typically organized to allow multiple drivers to store data in the same page of memory, as shown in Figure 1.  By allowing multiple drivers to share the same page, pool provides for an efficient use of the available kernel memory space.  However, this sharing requires that each driver be careful in how it uses pool.  Any bug where a driver uses pool improperly may corrupt the pool of other drivers and cause a crash.

     

    Figure 1 – Uncorrupted Pool

     

    With pool organized as shown in Figure 1, if DriverA allocates 100 bytes but writes 120 bytes it will overwrite the pool header and data stored by DriverB.  In Part 1 we demonstrated this type of buffer overflow using NotMyFault, but we were not able to identify which code had corrupted the pool.

     

    Figure 2 – Corrupted Pool

     

    To catch the driver that corrupted pool we can use special pool.  Special pool changes the organization of the pool so that each driver’s allocation is in a separate page of memory.  This helps prevent drivers from accidentally writing to another driver’s memory.  Special pool also configures the driver’s allocation at the end of the page and sets the next virtual page as a guard page by marking it as invalid.  The guard page causes an attempt to write past the end of the allocation to result in an immediate bugcheck.

     

    Special pool also fills the unused portion of the page with a repeating pattern, referred to as “slop bytes”.  These slop bytes are checked when the page is freed.  If any errors are found in the pattern, a bugcheck is generated to indicate that the memory was corrupted.  This type of corruption is not a buffer overflow; it may be an underflow or some other form of corruption.
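The two detection mechanisms described above can be illustrated with a toy model. This is not how the memory manager is implemented. It is a sketch in which one Python object stands for one special pool page, the pattern byte is arbitrary, and the class and exception names are invented for the example.

```python
PAGE_SIZE = 4096
SLOP_BYTE = 0xA5  # arbitrary pattern for this model; not the real slop pattern


class GuardPageFault(Exception):
    """Stands in for the immediate bugcheck raised by touching the guard page."""


class SpecialPoolPage:
    def __init__(self, size):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("allocation must fit in one page")
        self.start = PAGE_SIZE - size  # the allocation ends at the page boundary
        self.page = bytearray([SLOP_BYTE]) * PAGE_SIZE

    def write(self, offset, data):
        """Write data at an offset relative to the start of the allocation."""
        first = self.start + offset
        last = first + len(data)
        if first < 0:
            raise ValueError("write starts before this page")
        if last > PAGE_SIZE:
            # An overrun is caught at the moment of the write.
            raise GuardPageFault(f"write ends {last - PAGE_SIZE} bytes past the allocation")
        self.page[first:last] = data

    def free(self):
        """Check the slop bytes in front of the allocation, as done at free time."""
        if any(b != SLOP_BYTE for b in self.page[:self.start]):
            raise RuntimeError("slop bytes modified: corruption before the allocation")
```

In this model, a 120 byte write into a 100 byte allocation raises `GuardPageFault` immediately, while a write to a negative offset succeeds silently and is only reported by `free()`, which mirrors the difference between the two checks.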

     

    Figure 3 – Special Pool

     

    Because special pool stores each pool allocation in its own 4KB page, it causes an increase in memory usage.  When special pool is enabled, the memory manager configures a limit on how much special pool may be allocated on the system.  When this limit is reached, the normal pools are used instead.  This limitation may be especially pronounced on 32-bit systems, which have less kernel space than 64-bit systems.

     

    Now that we have explained how special pool works, we should use it.

     

    There are two methods to enable special pool.  Driver verifier allows special pool to be enabled on specific drivers.  The PoolTag registry value described in KB188831 allows special pool to be enabled for a particular pool tag.  Starting in Windows Vista and Windows Server 2008, driver verifier captures additional information for special pool allocations so this is typically the recommended method.

     

    To enable special pool using driver verifier, use the following command line or choose the option from the verifier GUI.  Use the /driver flag to specify the drivers you want to verify; this is the place to list drivers you suspect as the cause of the problem.  You may want to verify drivers you have written and want to test, or drivers you have recently updated on the system.  In the command line below I am only verifying myfault.sys.  A reboot is required to enable special pool.

     

    verifier /flags 1 /driver myfault.sys

     

    After enabling verifier and rebooting the system, repeat the activity that causes the crash.  For some problems the activity may just be to wait for a period of time.  For our demonstration we are running NotMyFault (see Part 1 for details).

     

    The crash resulting from a buffer overflow in special pool will be a stop 0xD6, DRIVER_PAGE_FAULT_BEYOND_END_OF_ALLOCATION.

     

    kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************

    DRIVER_PAGE_FAULT_BEYOND_END_OF_ALLOCATION (d6)
    N bytes of memory was allocated and more than N bytes are being referenced.
    This cannot be protected by try-except.
    When possible, the guilty driver's name (Unicode string) is printed on
    the bugcheck screen and saved in KiBugCheckDriver.
    Arguments:
    Arg1: fffff9800b5ff000, memory referenced
    Arg2: 0000000000000001, value 0 = read operation, 1 = write operation
    Arg3: fffff88004f834eb, if non-zero, the address which referenced memory.
    Arg4: 0000000000000000, (reserved)
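The bugcheck arguments already say a lot before any further debugging. As a small illustration (the helper name is made up, and the field meanings are taken from the !analyze text above), the sketch below decodes them and computes the offset of the referenced address within its 4KB page.

```python
PAGE_SIZE = 0x1000


def describe_d6(arg1, arg2, arg3):
    """Decode the first three arguments of bugcheck 0xD6."""
    return {
        "referenced": arg1,
        "operation": "write" if arg2 == 1 else "read",
        "page_offset": arg1 & (PAGE_SIZE - 1),
        "faulting_instruction": arg3 or None,  # zero means not available
    }


if __name__ == "__main__":
    info = describe_d6(0xFFFFF9800B5FF000, 1, 0xFFFFF88004F834EB)
    print(f"{info['operation']} at {info['referenced']:#x}, "
          f"page offset {info['page_offset']:#x}, "
          f"from {info['faulting_instruction']:#x}")
```

A page offset of 0 means the write landed on the first byte of a page, which is consistent with a buffer that ends at a page boundary being overrun by a forward-moving write.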

     

    We can debug this crash and determine that myfault.sys wrote beyond its pool buffer.

     

    The call stack shows that myfault.sys accessed invalid memory and this generated a page fault.

     

    kd> k
    Child-SP          RetAddr           Call Site
    fffff880`04822658 fffff803`721333f1 nt!KeBugCheckEx
    fffff880`04822660 fffff803`720acacb nt! ?? ::FNODOBFM::`string'+0x33c2b
    fffff880`04822700 fffff803`7206feee nt!MmAccessFault+0x55b
    fffff880`04822840 fffff880`04f834eb nt!KiPageFault+0x16e
    fffff880`048229d0 fffff880`04f83727 myfault+0x14eb
    fffff880`04822b20 fffff803`72658a4a myfault+0x1727
    fffff880`04822b80 fffff803`724476c7 nt!IovCallDriver+0xba
    fffff880`04822bd0 fffff803`7245c8a6 nt!IopXxxControlFile+0x7e5
    fffff880`04822d60 fffff803`72071453 nt!NtDeviceIoControlFile+0x56
    fffff880`04822dd0 000007fc`4fe22c5a nt!KiSystemServiceCopyEnd+0x13
    00000000`004debb8 00000000`00000000 0x000007fc`4fe22c5a

     

    The !pool command shows that the address being referenced by myfault.sys is special pool.

     

    kd> !pool fffff9800b5ff000
    Pool page fffff9800b5ff000 region is Special pool
    fffff9800b5ff000: Unable to get contents of special pool block

     

    The page table entry shows that the address is not valid.  This is the guard page used by special pool to catch overruns.

     

    kd> !pte fffff9800b5ff000
                                               VA fffff9800b5ff000
    PXE at FFFFF6FB7DBEDF98    PPE at FFFFF6FB7DBF3000    PDE at FFFFF6FB7E6002D0    PTE at FFFFF6FCC005AFF8
    contains 0000000001B8F863  contains 000000000138E863  contains 000000001A6A1863  contains 0000000000000000
    pfn 1b8f      ---DA--KWEV  pfn 138e      ---DA--KWEV  pfn 1a6a1     ---DA--KWEV  not valid
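The numbers in the !pte output can be reproduced by hand. The sketch below assumes the fixed x64 PTE base of 0xFFFFF68000000000, which applies to the Windows versions of this era (later Windows 10 releases randomize it). It also treats bit 0 of an entry as the valid bit and bits 12 and up as the page frame number. The function names are illustrative.

```python
PTE_BASE = 0xFFFFF68000000000  # fixed base assumed for this era of x64 Windows


def pte_address(va):
    """Return the virtual address of the PTE that maps va."""
    # One 8-byte PTE per 4KB page, indexed by the low 36 bits of the page number.
    return PTE_BASE + (((va >> 12) & 0xFFFFFFFFF) << 3)


def decode_entry(value):
    """Pull the valid bit and page frame number out of a page table entry."""
    return {"valid": bool(value & 1), "pfn": (value >> 12) & 0xFFFFFFFFF}


if __name__ == "__main__":
    print(f"PTE at {pte_address(0xFFFFF9800B5FF000):016X}")
    print(decode_entry(0x0000000001B8F863))  # the PXE contents shown above
    print(decode_entry(0x0000000000000000))  # the PTE contents: guard page
```

The computed PTE address matches the FFFFF6FCC005AFF8 shown by the debugger, and an all-zero entry has its valid bit clear, which is why any access to that page faults.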

     

    The allocation prior to this memory is a 0x800 byte (2,048 byte) block of nonpaged pool tagged as “Wrap”; note that !pool reports sizes in hexadecimal.  “Wrap” is the tag used by verifier when pool is allocated without a tag.  It is the equivalent of the “None” tag we saw in Part 1.

     

    kd> !pool fffff9800b5ff000-1000
    Pool page fffff9800b5fe000 region is Special pool
    *fffff9800b5fe000 size:  800 data: fffff9800b5fe800 (NonPaged) *Wrap
                Owning component : Unknown (update pooltag.txt)
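A quick arithmetic check ties this output back to the faulting address. The sketch below reads the size as hexadecimal (0x800), which is how !pool prints it, and confirms that the allocation ends exactly where the guard page begins. The variable and function names are just for the example.

```python
PAGE_SIZE = 0x1000

page_base = 0xFFFFF9800B5FE000    # "Pool page" from !pool
data_start = 0xFFFFF9800B5FE800   # "data:" from !pool
size = 0x800                      # "size:" from !pool, in hex (2,048 bytes)
fault_address = 0xFFFFF9800B5FF000  # Arg1 of the bugcheck


def allocation_end(start, length):
    """Return the address of the first byte after the allocation."""
    return start + length


if __name__ == "__main__":
    end = allocation_end(data_start, size)
    print(f"allocation ends at {end:#x}")
    print("ends at page boundary:", end == page_base + PAGE_SIZE)
    print("fault is first byte past the buffer:", end == fault_address)
```

Both comparisons hold for the values in this dump, so the faulting write was to the byte immediately after the buffer.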

     

    Special pool is an effective mechanism to track down buffer overflow pool corruption.  It can also be used to catch other types of pool corruption which we will discuss in future articles.
