• Ntdebugging Blog

    How to have a colorful relationship with your dump files

    • 3 Comments

    Hello NTDebuggers…  I look at a lot of dump files every day.  This being the case I like to take full advantage of the customizable look and feel of windbg.   I actually have an association setup between DMP files and a CMD file that loads my customized COLOR workspace each time I double click on a dump file.  I like a black background with colorized source and debug command window output.   

    Here is a snapshot of my typical debug session.

    clip_image001[12]

    Here is how you set it up.

    1) Create the following CMD file and put it in your path.  It’s called D.CMD on my system..

    echo off
    Title kd.exe -z %1
    C:
    CD\debuggers
    start C:\Debuggers\windbg.exe -z %1 -W color

    2) Load windbg and under view \ options configure your color options.  Note: you will want to set every option in the list box.

    clip_image002 clip_image003

    3) Save the workspace by selecting FILE \ Save Workspace, I saved mine as COLOR

    4) If you don’t have a dump file create one by opening a process in the debugger and doing a .dump /ma C:\test\test.dmp (I’ll assume you have a directory called C:\test)

    5) Select the C:\test\test.dmp file in explorer and right click on the properties. (note I used badwindow.dmp)

    clip_image005

    6) Change the dump file association via drilling down to Change \ Browse \ (Select D.CMD)

    clip_image006

    7) That’s all there is to it.  Now you can simply double click on any DMP file and windbg will start and load with your COLOR workspace.

    Good luck and happy debugging.

    Jeff Dailey-

  • Ntdebugging Blog

    Where the rubber meets the road, or in this case the hardware meets the probe.

    • 1 Comments

    Hi my name is Bob, I’m an Escalation engineer with the Microsoft critical problem resolution team.  We had one of our readers ask how much we deal with hardware.  Well in response we recently worked on an interesting problem I thought I would share with you.  In this case it was interesting because it demonstrated an issue specific to multi-processor machines and something that probably sounded innocuous to the driver writer who caused the  problem.

     

    What was the problem?

     

    The problem was that everything worked except the time in the system was not being updated. The RTC had stopped.   We spent time reviewing the semantics of how the RTC should function and even hooked up an oscilloscope to the RTC on the motherboard and were able to turn it off and on with the debugger by writing out the correct port.  The trace on the scope validated our understanding of what had to be written to the port to turn the clock off.    One we had a clear understanding of this we understood what we were looking for in a driver that might cause the problem.  Note the clock typically fires every 10ms so you do not need a fast scope to do this.

     

     

    Special keyboard driver written

     

    In order to catch a dump in state we had to modify the keyboard driver.  It would cause an “Int 3” in its ISR instead of calling “bug check” for an E2 stop.    Because the RTC was not running the idle thread was not getting quantums and as a result a normal break in would not work.  However the system would respond to ISRs. 

     

    What was found

     

    All RTC interrupts were stopped - the clock was not running.  We checked all the obvious places to see if the RTC was disabled.

     

    We looked at the ICR in the I/O APIC.  This is the interrupt configuration register. There is a register for every interrupt pin on the APIC. These registers are used to tell the APIC what vector to send the processor so the processor can service the interrupt.  It also has configuration information about level and if it edge triggered and a mask bit.  The mask bit was not set.

    Below is a reenactment.

     

    0: kd> ed ffd01000

    ffd01000 00000034 20 ß Select register 20 which is pin 8.

     

     

    0: kd> ed ffd01010

    ffd01010 000008d1    ß Contents ß Vect D1 Bit 16 the interrupt mask bit is not set so it is OK.

     

     

    Next check the RTC status register which are I/O ports 70 and 71.  Port 70 is the address port.  Port 71 is the data port.  This information is from an old BIOS book.

     

    0: kd> ob 70 b       ß ‘B’ is a control register.   

    0: kd> ib 71

    00000071: 42         ß The value 42 means that the RTC is enabled.  Bit 6 is the enable.

     

     

    So what was it?

     

    The way the RTC works is it will interrupt at a certain interval.  When the interrupt is serviced, the status register has to be read to start it again.

     

    We discovered another driver that was reading  the clock, this was done by disassembly various drivers in the dump and looking for the I/O operation to ports 70 or 71.  The lower addresses selected by port 70 will yield the time when read.  That is what the driver was doing.

     

    You would think that simply reading the time in this way would not hurt anything.  However, in a multi-processor system, access has to be serialized.  There is only one set of I/O ports for the system.

     

    Since it takes two accesses to perform an operation on the clock, one address & one data, a collision between two processors can cause undetermined results.

     

    Below is a timing diagram of the issue;

     

     

    Proc 0 running OS RTC handler                         Proc 1 running XYZ driver

     

      T1          Set register select to status register

     

      T2                                                                                          Set register select to read time

     

      T3          Read status register to restart clock

     

     

    So at T3 the OS RTC handler reads the wrong register so the clock does not start. 

     

     

    Conclusion

     

    I thought this was an interesting problem that illustrates the need for serialization.  And it demonstrates what to look out for in a multi-proc environment.  You always have to think “What happens if the other processor does…….”

     

    For more information consult any BIOS developer manuals you may have lying around or this link we found http://www.geocities.com/SiliconValley/Campus/1671/docs/rtc.htm

    See the “Status Register C” section, “All future interrupts are disabled until this register is read - your interrupt handler *must* do it.”

  • Ntdebugging Blog

    WMI Nugget: How to Gather the Provider Binary from a WMI Class Name

    • 0 Comments

    It's Venkatesh with a WMI nugget. While troubleshooting or debugging WMI issues you may come across WMI queries wherein you don’t know which provider implemented the WMI class used in the query. You may want to know the binary and the product that implemented the provider so you can contact the vendor or upgrade the binary to see if it resolves the issue you are investigating.

    To help make the investigation easier, Windows Vista and all post operating systems have the cool new 'Analytic and Debug log' for WMI tracing. To get similar tracing in Windows 2003 and earlier versions, WMI has a “Verbose Logging” option enabled through the WMI control (WmiMgmt.msc). Check out this blog for detailed steps to enable the WMI Analytic and Debug -

    http://blogs.msdn.com/wmi/archive/2009/05/27/is-wmiprvse-a-real-villain.aspx

    After enabling WMI logging you will see events like the ones pictured below. There may be a particular WMI query relevant to the issue you're troubleshooting requiring you to know the provider binary for the WMI class in the query. Apart from the class name (InstProvSamp) in the query you need the WMI namespace to which this class belongs to such as '\\.\root\default' as shown below.

    image

    Fig 1: Event ID 1 in the Windows Vista’s Event Viewer\Windows Logs\Applications and Services Logs\Microsoft\Windows\WMI-Activity\Trace log

    Now we have the WMI Namespace and the class name so let’s find the provider for it.

    1. Get the provider name from the Class Name and Namespace:

    Every WMI class has a qualifier “provider” containing the name of the provider to which it belongs. Using wbemtest.exe connect to the namespace we mentioned above(\\.\root\default). Then get the class information for our class, InstProvSamp using the “Open Class…” button. After clicking "OK" the Object editor for InstrProvSamp displays the class information for InstProvSamp including the name of the provider it belongs to (CIM_STRING qualifier “provider” as shown below).

    image

    image

    2. Get the class identifier (CLSID) of the provider from the provider name:

    For every WMI provider WMI maintains an instance of system class __Win32Provider which contains information about the physical implementation of the provider like CLSID which contains the GUID that represents the class identifier (CLSID) of the provider COM object. Now query the repository to get the provider information for “InstProvSamp” provider as I did below.

    image

    image 

    image

    3. Get the provider binary from the provider’s CLSID:

    During registration every COM object adds an entry under HKLM\SOFTWARE\Classes\CLSID. Using regedit.exe you can browse to provider’s COM registration information using the CLSID obtained from Step 2 as shown below.

    image

    This should help you connect the dots while debugging your WMI issue. Let me know if you have additional questions!

    Share this post :
  • Ntdebugging Blog

    How to identify a driver that calls a Windows API leading to a pool leak on behalf of NT Kernel?

    • 1 Comments

    Hello my name is Gurpreet Singh Jutla and I would like to share information on how we can trace the caller which ends up allocating “Se  “ Pool tag. When we use the Windows debugger and investigate the pool allocation and the binary associated with this pool tag, we see NT Kernel responsible for the allocations. But is the NT Kernel really responsible for a pool leak associated with this pool tag?

     

    Issue at hand

    • On windows 2003 x86 we see that the paged pool has depleted and we are running into event id 333.
    • We can see the same behavior on later versions of the OS and even on x64
    • We see that the leaking pool tag is “Se  “ which is the pool tag for security objects. Is Microsoft component at fault or something is calling an API and using security objects on a large scale?
    • This is windows 2003 x86 and we have limited options to root cause the issue. We need to really find out why we end up having so many allocations for this tag.
    • We could enable Pool Tracking on the NT Kernel, but would that help?

     

    Step 1

    !vm 1  -> This tells us if there were any page pool allocation failures

     

    0: kd> !vm 1

     

    *** Virtual Memory Usage ***

          Physical Memory:     2096922 (   8387688 Kb)

          Page File: \??\D:\pagefile.sys

            Current:  16779264 Kb  Free Space:  16552492 Kb

            Minimum:  16779264 Kb  Maximum:     16779264 Kb

          Available Pages:     1607242 (   6428968 Kb)

          ResAvail Pages:      1991659 (   7966636 Kb)

          Locked IO Pages:         656 (      2624 Kb)

          Free System PTEs:     163671 (    654684 Kb)

          Free NP PTEs:          32766 (    131064 Kb)

          Free Special NP:           0 (         0 Kb)

          Modified Pages:        10775 (     43100 Kb)

          Modified PF Pages:     10728 (     42912 Kb)

          NonPagedPool Usage:     7881 (     31524 Kb)

          NonPagedPool Max:      65279 (    261116 Kb)

          PagedPool 0 Usage:     67074 (    268296 Kb)

          PagedPool 1 Usage:      3266 (     13064 Kb)

          PagedPool 2 Usage:      3282 (     13128 Kb)

          PagedPool 3 Usage:      3268 (     13072 Kb)

          PagedPool 4 Usage:      3214 (     12856 Kb)

          PagedPool Usage:       80104 (    320416 Kb)

          PagedPool Maximum:    134144 (    536576 Kb)

          Session Commit:        14832 (     59328 Kb)

          Shared Commit:         19969 (     79876 Kb)

          Special Pool:              0 (         0 Kb)

          Shared Process:        19362 (     77448 Kb)

          Pages For MDLs:          146 (       584 Kb)

          PagedPool Commit:      80140 (    320560 Kb)

          Driver Commit:          1602 (      6408 Kb)

          Committed pages:      520485 (   2081940 Kb)

          Commit limit:        6241313 (  24965252 Kb)

     

    !poolused /t10 4  -> This will list the top consumers of paged pool. It is “Se  “ in our case.

     

    0: kd> !poolused /t10 4

    ....

    Sorting by Paged Pool Consumed

     

                   NonPaged                  Paged

    Tag     Allocs         Used     Allocs         Used

     

    Se           0            0     172204    232720312  General security allocations , Binary: nt!se

    MmSt         0            0      15231     31835696  Mm section object prototype ptes , Binary: nt!mm

    Ntff         9         1872      10434      8514144  FCB_DATA , Binary: ntfs.sys

    WD         384      1251376         27      5591040  UNKNOWN pooltag 'WD  ', please update pooltag.txt

    UlHT         0            0          1      4198400  Hash Table , Binary: http.sys

    NtfF         0            0       3259      3050424  FCB_INDEX , Binary: ntfs.sys

    Toke         0            0        966      3034216  Token objects , Binary: nt!se

    NtFs     13717       551176      19678      1758008  StrucSup.c , Binary: ntfs.sys

    IoNm         0            0      12034      1737360  Io parsing names , Binary: nt!io

    FSim         0            0      11336      1451008  File System Run Time Mcb Initial Mapping

    CM16         0            0        293      1437696  Internal Configuration manager allocations

    Wmit         6        11392         23      1376912  Wmi Trace

    NtFU         0            0       8719      1237232  usnsup.c , Binary: ntfs.sys

    Obtb         0            0        397       995600  object tables via EX handle.c , Binary: nt!ob

    Key          0            0       7245       753432  Key objects

    MmSm         0            0      11393       729152  segments used to map data files , Binary: nt!mm

     

    TOTAL    192495     33751032     347461    310172416

     

    Step 2

    Once we know the pool tag we need to run the following command. Remember the pool tag has to be a 4 char case sensitive string so I would run the following command which tells me in which all modules we have the Pool tag “Se  “ used. Please note that I have added two spaces in the string during my search. This is because the !poolused shows Se tag has caused the issue. Not providing the spaces in the search string will give different results. The following command searches each module that is loaded on the system.

     

    !for_each_module s -a @#Base @#End "Se  "

     

    The result should be something like:

     

    0: kd> !for_each_module s -a @#Base @#End "Se  "

    8096a1ae  53 65 20 20 6a 0c 6a 01-e8 bd a9 f2 ff 3b c7 89  Se  j.j......;..

    8096a5e3  53 65 20 20 6a 0c 6a 01-e8 88 a5 f2 ff 3b c7 89  Se  j.j......;..

    8096af9e  53 65 20 20 50 6a 01 89-45 fc e8 cb 9b f2 ff 8b  Se  Pj..E.......

    8096c909  53 65 20 20 8d 1c 9d 10-00 00 00 53 6a 01 e8 5c  Se  .......Sj..\

    8096c9e3  53 65 20 20 53 6a 01 e8-89 81 f2 ff 8b f8 85 ff  Se  Sj..........

    8096ca3e  53 65 20 20 8d 04 85 10-00 00 00 50 6a 01 e8 27  Se  .......Pj..'

    8096caf2  53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 73  Se  .......Sj..s

    8096cb3e  53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 27  Se  .......Sj..'

    8096cb9d  53 65 20 20 56 6a 01 e8-cf 7f f2 ff 8b f0 3b f7  Se  Vj........;.

    8096cc44  53 65 20 20 6a 10 6a 01-e8 27 7f f2 ff 85 c0 0f  Se  j.j..'......

    8096cc65  53 65 20 20 6a 04 6a 01-e8 06 7f f2 ff 85 c0 0f  Se  j.j.........

    8096cc96  53 65 20 20 6a 04 6a 01-e8 d5 7e f2 ff 85 c0 0f  Se  j.j...~.....

    8096ccbe  53 65 20 20 6a 38 6a 01-e8 ad 7e f2 ff 8b f8 85  Se  j8j...~.....

    809718ac  53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8  Se  tB.Fh;.t.SP.

    80971b48  53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8  Se  tB.Fh;.t.SP.

    8097372b  53 65 20 20 6a 0c 6a 01-e8 40 14 f2 ff 85 c0 75  Se  j.j..@.....u

    80976c38  53 65 20 20 6a 0c 6a 01-e8 33 df f1 ff 3b c6 89  Se  j.j..3...;..

    80a20d1b  53 65 20 20 bf 00 01 00-00 57 56 e8 4d 3e e7 ff  Se  .....WV.M>..

    80a22698  53 65 20 20 c7 45 80 4f-62 43 6c c7 45 84 43 63  Se  .E.ObCl.E.Cc

     

    The above steps are also explained in detail at http://blogs.msdn.com/b/ntdebugging/archive/2012/08/31/troubleshooting-pool-leaks-part-3-debugging.aspx

     

    Step 3

    Now the tough part begins. We need to run “ln” command on each of the addresses shown by the output above. Example

     

    0: kd> ln 8096c909 

    (8096c8cc)   nt!SeQueryInformationToken+0x3d   |  (8096cdc0)   nt!SeCaptureObjectTypeList

     

    See the highlighted above. It is the nearest API to that address. Search on MSDN for all API’s listed. The ones that do not have an MSDN article can likely be ignored. SEQueryInformationToken is publically available and hence callable by any loaded driver. Example http://msdn.microsoft.com/en-in/library/windows/hardware/ff556690(v=vs.85).aspx

     

    At this point we will make the assumption that it is a 3rd party driver making the calls to this Windows API.  It is, of course, possible that a Microsoft module is doing this, but if such a problem exists in our code it usually becomes apparent very quickly when we get flooded by support calls.

     

    Step 4

    Method 1

    We can use the “!for_each_module” debugger command to search the address space of each loaded module for the entry point of the function found in step 3.

     

    0: kd> !for_each_module s-d @#Base @#End 8096c8cc

    b94b5938  8096c8cc 8096000a 80960182 80894b78  ............xK..

    ba7d50f4  8096c8cc 8081e5e0 8096b42e 8096b40c  ................

    ba83624c  8096c8cc 8081757e 8082b4f2 8094d438  ....~u......8...

    bab8103c  8096c8cc 80960122 80960158 8096010a  ...."...X.......

    f74e2950  8096c8cc 8089e180 8096091a 80960638  ............8...

    f74f0b68  8096c8cc 8089e180 80959546 80884150  ........F...PA..

    f75cf094  8096c8cc 80960122 80960158 8096010a  ...."...X.......

    f797c110  8096c8cc 808847f0 80966c60 80894b78  .....G..`l..xK..

    f7b643c4  8096c8cc 80960054 809e437c 808eadc6  ....T...|C......

     

    The first column in the above output can then be passed to the “ln” command to get the nearest function to the address listed.

     

    0: kd> ln b94b5938

    *** ERROR: Module load completed but symbols could not be loaded for ABCMiniFilter.sys

    0: kd> ln f797c110

    *** ERROR: Module load completed but symbols could not be loaded for XVhdBusxxx.sys

     

    Method 2

    This is a bit more complicated. Run the lm command to dump all modules. You will get the output as follows

     

    0: kd> lm

    start    end        module name

    80800000 80a5d000   nt        

    80a5d000 80a89000   hal       

    b476c000 b4775000   asyncmac   (deferred)            

    b7f55000 b7fb6000   ABCD   (deferred)            

    b817b000 b81da000   eeCtrl     (deferred)            

    b82b6000 b82b9300   xyz   (deferred)            

    b87a2000 b87cc000   Fastfat   (deferred)

    b8f04000 b8f0de80   ABCMiniFilter   (deferred)                       

    <snip>

     

    Ignore the unloaded modules and the modules you know are Microsoft binaries. For the rest of them you need to run the following command on each module. Remember the first column is the start of the module and the second column is the end of module.

     

    Use the !dh command for each header one by one and you will see the header information for the module.  I just picked one module for reference you will need to run the command for each third party module.

     

    0: kd> !dh b8f04000 -f

     

    File Type: EXECUTABLE IMAGE

    FILE HEADER VALUES

         14C machine (i386)

           7 number of sections

    51F7D252 time date stamp Tue Jul 30 20:18:50 2013

     

           0 file pointer to symbol table

           0 number of symbols

          E0 size of optional header

         102 characteristics

               Executable

               32 bit word machine

     

    OPTIONAL HEADER VALUES

         10B magic #

        9.00 linker version

        8800 size of code

        1200 size of initialized data

           0 size of uninitialized data

        86B2 address of entry point

         480 base of code

             ----- new -----

    00010000 image base

          80 section alignment

          80 file alignment

           1 subsystem (Native)

        6.01 operating system version

        6.01 image version

        5.01 subsystem version

        9E80 size of image

         480 size of headers

       11E27 checksum

    00040000 size of stack reserve

    00001000 size of stack commit

    00100000 size of heap reserve

    00001000 size of heap commit

           0  DLL characteristics

           0 [       0] address [size] of Export Directory

        8A28 [      50] address [size] of Import Directory

        9180 [     498] address [size] of Resource Directory

           0 [       0] address [size] of Exception Directory

        9E80 [    1FD0] address [size] of Security Directory

        9680 [     6C0] address [size] of Base Relocation Directory

        7990 [      1C] address [size] of Debug Directory

           0 [       0] address [size] of Description Directory

           0 [       0] address [size] of Special Directory

           0 [       0] address [size] of Thread Storage Directory

        7A68 [      40] address [size] of Load Configuration Directory

           0 [       0] address [size] of Bound Import Directory

        7880 [     108] address [size] of Import Address Table Directory

           0 [       0] address [size] of Delay Import Directory

           0 [       0] address [size] of COR20 Header Directory

           0 [       0] address [size] of Reserved Directory

     

    The highlighted line is important for us. Remember this is the offset of the start of Import Address Table directory and the end of the same. In our case 7880 offset is start and 108 offset is the end. We run the following command to dump the import table. Running the dps command from the start offset in the module till the end of the import table will list all the functions it imports

     

    0: kd> dps b8f04000+7880 b8f04000+7880+108

    b8f0b880  f7873892 fltMgr!FltGetStreamHandleContext

    b8f0b884  f7873988 fltMgr!FltAllocateContext

    b8f0b888  f787459c fltMgr!FltSetStreamHandleContext

    <snip>

    b8f0b938  8096c8cc nt!SeQueryInformationToken

    b8f0b93c  8096000a nt!RtlValidSid

    <snip>

     

    You see the module we searched in will have the API in its import table. Running the above command on each module, dumping the import tables will help us identify which modules have the API in its import table. However it cannot be confirmed if the driver calls the API and causes the leak. This is really tedious and time consuming but will help us identify each binary that calls the above API which leads to pool tag “Se  “ allocation on behalf of the NT Kernel.

     

    Conclusion

    We can disable the binaries that call the SEQueryInformationToken API one by one and see if the issue still persists. Please note that the boot/System drivers should not be disabled as it can lead to a no boot situation. You can use Autoruns from Sysinternals to disable these drivers and if you run into a no boot situation, boot back into the “last known good configuration” to help recover the box.

  • Ntdebugging Blog

    Breaking down the "Cl" in !irp

    • 2 Comments

    Hey there NTDEBUGGERS my name is Randy Monteleone and today we are going to talk about IRPs. In the past we have talked about the IRP structure in passing and showed a field here and there that can be pulled out and used to find answers to stalled IO. I was recently working on a debugger extension and found something interesting in the IRP I was looking at. I had been looking at a !irp output much like the one pictured below. I found that I was asking myself what do the "Success Error Cancel" fields mean?

     

    After doing some digging and working with a few of my co-workers we found the mystery to the meaning behind these words and why we see them in our output. Lets break this IRP down starting with the ">" marker that indicates the current stack frame. In the output below you see this marker is indicating that we are working on something in partmgr.

     

    1: kd> !irp fffffa809a1f3af0

    Irp is active with 9 stacks 5 is current (= 0xfffffa809a1f3ce0)

    Mdl=fffffa814e9c4f40: No System Buffer: Thread fffffa80d05b67e0:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

       

                   Args: 00000000 00000000 00000000 00000000

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                   Args: 00000000 00000000 00000000 00000000

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                   Args: 00000000 00000000 00000000 00000000

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                   Args: 00000000 00000000 00000000 00000000

    >[  4,34]  1c e0 fffffa80920c0060 00000000 fffff880010301b0-00000000 Success Error Cancel <-- What are you trying to tell me?

           \Driver\Disk           partmgr!PmReadWriteCompletion

                  Args: 00001000 00000000 14625000 00000000

    [  4, 0]  1c e0 fffffa80920c0b90 00000000 fffff88001063010-fffffa80b01eec00 Success Error Cancel

           \Driver\partmgr        volmgr!VmpReadWriteCompletionRoutine

                  Args: 3993d69568d 00000000 14625000 00000000

    [  4, 0]   c e0 fffffa80b01eeab0 00000000 fffff88001d59150-fffffa80d05b2180 Success Error Cancel

           \Driver\volmgr         volsnap!VspRefCountCompletionRoutine

                  Args: 00001000 00000000 3993d69568a 00000000

    [  4, 0]   c e1 fffffa80d05b2030 00000000 fffff88001845344-fffff8800d52bb38 Success Error Cancel pending

           \Driver\volsnap        Ntfs!NtfsMasterIrpSyncCompletionRoutine

                   Args: 00001000 00000000 14615000 00000000

    [  4, 0]   0  0 fffffa80c2829030 fffffa80d04f18a0 00000000-00000000   

           \FileSystem\Ntfs

                   Args: 00001000 00000000 00002000 00000000

       

    In the example below we see the “Success Error Cancel” fields and in this case we can also see that the pending flag has been set. The pending field indicates that STATUS_PENDING was returned to the caller. This is used so that I/O completion can determine whether or not to fully complete the I/O operation requested by the packet. Drivers can do this by calling IoMarkIrpPending.

     

    Now look at the "cl" column as it holds the key to unlocking what "Success Error Cancel pending" really means.

     

    1: kd> !irp 0xfffffa80`920c2340

    Irp is active with 3 stacks 3 is current (= 0xfffffa80920c24a0)   <--- _IO_STACK_LOCATION

    Mdl=fffffa814e9c4f40: No System Buffer: Thread 00000000: Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                   Args: 00000000 00000000 00000000 00000000

    [  f, 0]  1c  0 fffffa809209a060 00000000 fffff880010061a0-fffffa80920bfcc0   

           \Driver\mpio        mpio!MPIOPdoCompletion

                   Args: fffffa80920c22b0 00000000 00000000 fffffa80920bfcc0

    >[  f, 0] 1c  e1 fffffa809209a060 00000000 fffff88001a01a00-fffffa80920c2190 Success Error Cancel pending

           \Driver\mpio        CLASSPNP!TransferPktComplete

    Args: fffffa80920c22b0 00000000 00000000 fffffa80920bfcc0

     

    Focusing on the cl column we see that our active stack frame is working in MPIO, but what is it trying to tell us about its status and intent? Well to figure that out we need to break down the value "e1" we see listed in our cl or Control column. This field is being used as a flag where the high and low parts represent two different values. We get this value in !irp from an _IO_STACK_LOCATION Control member. This can be found by dumping the stack location displayed by !irp.

     

    1: kd> dt_IO_STACK_LOCATION 0xfffffa80920c24a0

    nt!_IO_STACK_LOCATION

      +0x000 MajorFunction    : 0xf ''

      +0x001 MinorFunction    : 0 ''

      +0x002 Flags            : 0x1c ''

      +0x003 Control          : 0xe1 ''  < -- Control flag

      +0x008 Parameters       : <unnamed-tag>

      +0x028 DeviceObject     : 0xfffffa80`9209a060 _DEVICE_OBJECT

      +0x030 FileObject       : (null)

      +0x038 CompletionRoutine : 0xfffff880`01a01a00     long  CLASSPNP!TransferPktComplete+0

      +0x040 Context          : 0xfffffa80`920c2190 Void

     

    So now that we know where we get this value we still have to decode what it means and how this value results in what we see in our !IRP output. Let's split our number into two, our high and low parts. In our example above we had the value e1. Taking our value apart gives us an E and a 1. The first bit of our value will indicates the last returned value as our IRP was processed. In our case this is a 1, meaning that our IRP is Pending. Thus we see the word pending at the end of current stack frame in !irp. There are three possible values for this bit, 0, 1 or 2. 0 being nothing, 1 being pending and 2 being Error Returned.

     

    Moving to our high bits we are left with the "e". Lets take this number and convert it to binary. We end up with 1110.  This upper number indicates which invoke types were requested for the completion routine for the driver listed on that stack frame. If we look at MSDN we see that IoSetCompletionRoutine takes in three bool values to set this flag. These options specify whether the completion routine is called if the IRP is completed with that corresponding status.

     

    VOID IoSetCompletionRoutine(

      _In_      PIRP Irp,

      _In_opt_  PIO_COMPLETION_ROUTINE CompletionRoutine,

      _In_opt_  PVOID Context,

      _In_      BOOLEAN InvokeOnSuccess,

      _In_      BOOLEAN InvokeOnError,

      _In_      BOOLEAN InvokeOnCancel

    );

     

    Doing some source review I was able to pin together how these values translate to the !irp output. If we look back at our binary value of "e" we see that we have a set of bits that get set based on the what the driver wanted to do when an IRP was completed with one of our defined status values.

     

    Cancel   = 2

    Success  = 4

    Error    = 8

     

    Add each of these values up and the sum is 14 or e in hex. Going back to our binary "1110" we see that our first bit is set to zero or off. This would be the pending and Error Returned values that I mentioned above. The next three bits represent the corresponding Cancel, Success and Error bool values passed to the driver at the time IoSetCompletionRoutine was called for this stack location.

     

    Example : IoSetCompletionRoutine(pirp, pCompletionRoutine,pContex, True,False,True); would yield a value of 6.

     

    Remember the important thing here is not that the !irp output is trying to tell us that one of these things happened. It's telling us that this driver would like to be notified if one of those things does happen. This area also provides us with information if pending or or an error is returned.

     

    Well that's all I have for now, for more documentation for the _IRP and _IO_STACK_LOCATION structures please see the following links to MSDN.

     

    More on IRP: http://msdn.microsoft.com/en-us/library/windows/hardware/ff550694(v=vs.85).aspx

    More on _IO_STACK_LOCATION: http://msdn.microsoft.com/en-us/library/windows/hardware/ff550659(v=vs.85).aspx

  • Ntdebugging Blog

    ResAvail Pages and Working Sets

    • 0 Comments

    Hello everyone, I'm Ray and I'm here to talk a bit about a dump I recently looked at and a little-referenced memory counter called ResAvail Pages (resident available pages).

     

    The problem statement was:  The server hangs after a while.

     

    Not terribly informative, but that's where we start with many cases. First some good housekeeping:

     

    0: kd> vertarget

    Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64

    Product: Server, suite: TerminalServer SingleUserTS

    Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533

    Machine Name: "ASDFASDF1234"

    Kernel base = 0xfffff800`01665000 PsLoadedModuleList = 0xfffff800`018a8670

    Debug session time: Thu Aug  8 09:39:26.992 2013 (UTC - 4:00)

    System Uptime: 9 days 1:08:39.307

     

    Of course Windows 7 Server == Server 2008 R2.

     

    One of the basic things I check at the beginning of these hang dumps with vague problem statements is the memory information.

     

    0: kd> !vm 21

     

    *** Virtual Memory Usage ***

    Physical Memory:     2097038 (   8388152 Kb)

    Page File: \??\C:\pagefile.sys

      Current:  12582912 Kb  Free Space:  12539700 Kb

      Minimum:  12582912 Kb  Maximum:     12582912 Kb

    Available Pages:      286693 (   1146772 Kb)

    ResAvail Pages:          135 (       540 Kb)

     

    ********** Running out of physical memory **********

     

    Locked IO Pages:           0 (         0 Kb)

    Free System PTEs:   33526408 ( 134105632 Kb)

     

    ******* 12 system cache map requests have failed ******

     

    Modified Pages:         4017 (     16068 Kb)

    Modified PF Pages:      4017 (     16068 Kb)

    NonPagedPool Usage:   113241 (    452964 Kb)

    NonPagedPool Max:    1561592 (   6246368 Kb)

    PagedPool 0 Usage:     35325 (    141300 Kb)

    PagedPool 1 Usage:     28162 (    112648 Kb)

    PagedPool 2 Usage:     24351 (     97404 Kb)

    PagedPool 3 Usage:     24350 (     97400 Kb)

    PagedPool 4 Usage:     24516 (     98064 Kb)

    PagedPool Usage:      136704 (    546816 Kb)

    PagedPool Maximum:  33554432 ( 134217728 Kb)

     

    ********** 222 pool allocations have failed **********

     

    Session Commit:         6013 (     24052 Kb)

    Shared Commit:          6150 (     24600 Kb)

    Special Pool:              0 (         0 Kb)

    Shared Process:      1214088 (   4856352 Kb)

    Pages For MDLs:           67 (       268 Kb)

    PagedPool Commit:     136768 (    547072 Kb)

    Driver Commit:         15548 (     62192 Kb)

    Committed pages:     1648790 (   6595160 Kb)

    Commit limit:        5242301 (  20969204 Kb)

     

    So we're failing to allocate pool, but we aren't out of virtual memory for paged pool or nonpaged pool.  Let's look at the breakdown:

     

    0: kd> dd nt!MmPoolFailures l?9

    fffff800`01892160  000001be 00000000 00000000 00000002

    fffff800`01892170  00000000 00000000 00000000 00000000

    fffff800`01892180  00000000

     

    Where:

        yellow   = Nonpaged high/medium/low priority failures

        green    = Paged high/medium/low priority failures

        cyan      = Session paged high/medium/low priority failures

     

    So we actually failed both nonpaged AND paged pool allocations in this case.  Why?  We're "Running out of physical memory", obviously.  So where does this running out of physical memory message come from?  In the above example this is from the ResAvail Pages counter.

     

    ResAvail Pages is the amount of physical memory there would be if every working set was at its minimum size and only what needs to be resident in RAM was present (e.g. PFN database, system PTEs, driver images, kernel thread stacks, nonpaged pool, etc).

     

    Where did this memory go then?  We have plenty of Available Pages (Free + Zero + Standby) for use.  So something is claiming memory it isn't actually using.  In this type of situation one of the things I immediately suspect is process working set minimums. Working set basically means the physical memory used by a process.

     

    So let's check.

     

    0: kd> !process 0 1

     

    <a lot of processes in this output>.

     

    PROCESS fffffa8008f76060

        SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

        DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:   0.

        Image: cscript.exe

        VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

        DeviceMap fffff8a000008a70

        Token                             fffff8a0046f9c50

        ElapsedTime                       9 Days 01:08:00.134

        UserTime                          00:00:00.000

        KernelTime                        00:00:00.015

        QuotaPoolUsage[PagedPool]         0

        QuotaPoolUsage[NonPagedPool]      0

        Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

        PeakWorkingSetSize                1454

        VirtualSize                       65 Mb

        PeakVirtualSize                   84 Mb

        PageFaultCount                    1628

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      0

     

    I have only shown one example process above for brevity's sake, but there were thousands returned.  241,423 to be precise.  None had abnormally high process working set minimums, but cumulatively their usage adds up.

     

    The “now” process working set is lower than the minimum working set.  How is that possible?  Well, the minimum and maximum are not hard limits, but suggested limits.  For example, the minimum working set is honored unless there is memory pressure, in which case it can be trimmed below this value.  There is a way to set the min and/or max as hard limits on specific processes by using the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag via SetProcessWorkingSetSize.

     

    You can view if the minimum and maximum working set values are configured in the _EPROCESS->Vm->Flags structure.  Note these numbers are from another system as this structure was already torn down for the processes we were looking at.

     

    0: kd> dt _EPROCESS fffffa8008f76060 Vm

    nt!_EPROCESS

       +0x398 Vm : _MMSUPPORT

    0: kd> dt _MMSUPPORT fffffa8008f76060+0x398

    nt!_MMSUPPORT

       +0x000 WorkingSetMutex  : _EX_PUSH_LOCK

       +0x008 ExitGate         : 0xfffff880`00961000 _KGATE

       +0x010 AccessLog        : (null)

       +0x018 WorkingSetExpansionLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0xfffffa80`08f3c410 ]

       +0x028 AgeDistribution  : [7] 0

       +0x044 MinimumWorkingSetSize : 0x32

       +0x048 WorkingSetSize   : 5

       +0x04c WorkingSetPrivateSize : 5

       +0x050 MaximumWorkingSetSize : 0x159

       +0x054 ChargedWslePages : 0

       +0x058 ActualWslePages  : 0

       +0x05c WorkingSetSizeOverhead : 0

       +0x060 PeakWorkingSetSize : 0x5ae

       +0x064 HardFaultCount   : 0x41

       +0x068 VmWorkingSetList : 0xfffff700`01080000 _MMWSL

       +0x070 NextPageColor    : 0x2dac

       +0x072 LastTrimStamp    : 0

       +0x074 PageFaultCount   : 0x65c

       +0x078 RepurposeCount   : 0x1e1

       +0x07c Spare            : [2] 0

       +0x084 Flags            : _MMSUPPORT_FLAGS

    0: kd> dt _MMSUPPORT_FLAGS fffffa8008f76060+0x398+0x84

    nt!_MMSUPPORT_FLAGS

       +0x000 WorkingSetType   : 0y000

       +0x000 ModwriterAttached : 0y0

       +0x000 TrimHard         : 0y0

       +0x000 MaximumWorkingSetHard : 0y0

       +0x000 ForceTrim        : 0y0

       +0x000 MinimumWorkingSetHard : 0y0

       +0x001 SessionMaster    : 0y0

       +0x001 TrimmerState     : 0y00

       +0x001 Reserved         : 0y0

       +0x001 PageStealers     : 0y0000

       +0x002 MemoryPriority   : 0y00000000 (0)

       +0x003 WsleDeleted      : 0y1

       +0x003 VmExiting        : 0y1

       +0x003 ExpansionFailed  : 0y0

       +0x003 Available        : 0y00000 (0)

     

    How about some more detail?

     

    0: kd> !process fffffa8008f76060

    PROCESS fffffa8008f76060

        SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

        DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:  0.

        Image: cscript.exe

        VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

        DeviceMap fffff8a000008a70

        Token                             fffff8a0046f9c50

        ElapsedTime                       9 Days 01:08:00.134

        UserTime                          00:00:00.000

        KernelTime                        00:00:00.015

        QuotaPoolUsage[PagedPool]         0

        QuotaPoolUsage[NonPagedPool]      0

        Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

        PeakWorkingSetSize                1454

        VirtualSize                       65 Mb

        PeakVirtualSize                   84 Mb

        PageFaultCount                    1628

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      0

     

    No active threads

     

    0: kd> !object fffffa8008f76060

    Object: fffffa8008f76060  Type: (fffffa8006cccc90) Process

        ObjectHeader: fffffa8008f76030 (new version)

        HandleCount: 0  PointerCount: 1

     

    The highlighted information shows us that this process has no active threads left but the process object itself (and its 20KB working set use) were still hanging around because a kernel driver had a reference to the object that it never released.  Sampling other entries shows the server had been leaking process objects since it was booted.

     

    Unfortunately trying to directly track down pointer leaks on process objects is difficult and requires an instrumented kernel, so we tried to check the easy stuff first before going that route.  We know it has to be a kernel driver doing this (since it is a pointer and not a handle leak) so we looked at the list of 3rd party drivers installed.  Note: The driver names have been redacted.

     

    0: kd> lm

    start             end                 module name

    <snip>

    fffff880`04112000 fffff880`04121e00   driver1    (no symbols)   <-- no symbols usually means 3rd party       

    fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

    <snip>

     

    0: kd> lmvm driver1   

    Browse full module list

    start             end                 module name

    fffff880`04112000 fffff880`04121e00   driver1    (no symbols)          

        Loaded symbol image file: driver1.sys

        Image path: \SystemRoot\system32\DRIVERS\driver1.sys

        Image name: driver1.sys

        Browse all global symbols  functions  data

        Timestamp:        Wed Dec 13 12:09:32 2006 (458033CC)

        CheckSum:         0001669E

        ImageSize:        0000FE00

        Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

    0: kd> lmvm driver2

    Browse full module list

    start             end                 module name

    fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

        Loaded symbol image file: driver2.sys

        Image path: \??\C:\Windows\system32\drivers\driver2.sys

        Image name: driver2.sys

        Browse all global symbols  functions  data

        Timestamp:        Thu Nov 30 12:12:07 2006 (456F10E7)

        CheckSum:         0004FE8E

        ImageSize:        0004CC00

        Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

     

    Fortunately for both the customer and us we turned up a pair of drivers that predated Windows Vista (meaning they were designed for XP/2003) that raised an eyebrow.  Of course we need a more solid evidence link than just "it's an old driver", so I did a quick search of our internal KB.  This turned up several other customers who had these same drivers installed, experienced the same problem, then removed them and the problem went away.  That sounds like a pretty good evidence link. We implemented the same plan for this customer successfully.

  • Ntdebugging Blog

    Debugging Without a Debugger

    • 3 Comments

    Hi, my name is Todd Webb, I am an Escalation Engineer from the Global Escalation Services OEM Team.

     

    I recently worked a case where we were trying to troubleshoot an issue and the system would hang with USB debugging enabled. This was a problem since USB debugging was the only kernel debugging option available on this particular hardware setup. I needed a quick and easy way to find the source of the problem without using many of the standard methods.  Basically I had to debug a debugger issue with no access to a debugger.  I knew the problem was likely USB related, but had no other information.  The method I used to troubleshoot this issue was to write some macros that would make a call to HalMakeBeep to generate audible indications of where we were in the code.  The primary reason for using this method was I could quickly and easily move the location in the code where I made the calls as I narrowed down where the problem occurred. 

     

    The alternative would have been to add a bunch of logging and find some way to force a dump.  My initial assessment was this would be more time consuming and less flexible then the beep code method.  Others may have other methods to do this, but this just gives people another method of debugging driver start-up problems on systems that are not debuggable.

     

    The main beep macro allows you to pass in a frequency and duration in milliseconds for a single beep.  Then based on the example below you can extend the single beep macro to create a series of beeps.

     

    Here is an example of the macros I used:

     

    //

    // beep macros for troubleshooting without debugger

    //

     

    #define EHCI_BEEP1(f, d)                                              \

        {                                                                 \

            LARGE_INTEGER duration;                                       \

            duration.QuadPart = Int32x32To64(d, -10000);                  \

            HalMakeBeep(f);                                               \

            KeDelayExecutionThread(KernelMode, FALSE, &duration);         \

            HalMakeBeep(0);                                               \

            duration.QuadPart = Int32x32To64(500, -10000);                \

            KeDelayExecutionThread(KernelMode, FALSE, &duration);         \

        }

     

    #define EHCI_BEEP2(f, d)                                              \

        {                                                                 \

            EHCI_BEEP1(f, d);                                             \

            EHCI_BEEP1(f, d);                                             \

        }

         

    #define EHCI_BEEP3(f, d)                                              \

        {                                                                 \

            EHCI_BEEP2(f, d);                                             \

            EHCI_BEEP1(f, d);                                             \

        }

     

    Here is an example of a long beep followed by three short beeps:

     

        //

        // debug - 1 long : 3 short

        //

        EHCI_BEEP1(600, 1000);

        EHCI_BEEP3(600, 500);

     

    Share this post :
  • Ntdebugging Blog

    Wanted: Windows Internals subject matter experts

    • 3 Comments

    Microsoft is looking for five Windows Internals subject matter experts to come work on a very special five to eight day project on the Redmond campus during the month of May 2008.  Candidates must have good communications skills, be non Microsoft employees,  have 5+ years experience with windows, be familiar with the Windows Internals book, have kernel and user mode debugging experience, and be C literate.  

     

    We prefer candidates strongly represent the Windows IT Professional or Windows development industry.    These typically would be 2nd or 3rd tier escalation resources or Windows developers (developing Win32 code, or device drivers) that are working for Gold Partners, ISVs or are independent consultants or MVPs.

     

    A Microsoft NDA (Non Disclosure Agreement) will be required.

     

    Please contact Jeff Dailey at jeffda@microsoft.com with the subject line “Windows Internals Project”

    Please include a copy of your resume and details explaining your Windows background, along with your contact info, and your location.

     

    We will cover air travel, lodging and food for non local candidates attending from within the US.

     

    Thank you.

     

    More Info:

    The Widows Internals experts selected for this project will be brought to Redmond to aid in developing a new certification program based on the Windows Internals.  This is not a hiring effort or product testing session.  Though the candidates should have C literacy they will not be writing or reviewing any Microsoft source code. 

     

  • Ntdebugging Blog

    Windows Troubleshooting – Special Pool

    • 0 Comments

    The Windows Support team has a new YouTube channel, “Windows Troubleshooting”.  The first set of videos cover debugging blue screens.

    In this video, Bob Golding, Senior Escalation Engineer, describes how the Special Pool Windows diagnostics tool catches drivers that corrupt memory. Bob also introduces how memory is organized in the system for allocating memory for drivers.

  • Ntdebugging Blog

    Debug Fundamentals Exercise 4: The GAMEOVER exercise

    • 8 Comments

     

    We are back with another addition to the debugging fundamentals series here in 2009! Regardless of your debugging experience we hope you enjoy this one. Don’t worry too much if you can’t answer them all but give them your best shot.

     

    The problem:

     

    While debugging an application, a developer observes a crash and is puzzled. The developer saves a memory dump of the process using the command “.dump /ma dbgex4.dmp” and sends it to you for debug assistance.

     

    0:000> r

    eax=53454750 ebx=00000000 ecx=00000100 edx=000001b0 esi=001bfe20 edi=00ac36bc

    eip=47414d45 esp=001bfe00 ebp=4f564552 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246

    47414d45 ??              ???

     

    The attached zip, dbgex4.zip, includes three files. The program itself, the program’s symbols file and the saved memory dump file. (NOTE: You do not need to execute the program for the exercise however you may want to in order to solve the bonus’. You will need to rename the file dbgex4.ex_ to dbgex4.exe in order to run it).

     

    Using the accompanying files, answer the following questions:

     

    Part One – Debugging

     

    1.       What caused the access violation?

    2.       Examine the registers at the time of the crash.  Is there anything interesting about the contents?

    3.       How did the registers get into this state?

    4.       Find the offending data structure which caused this state to occur. What are its contents?

    5.       Are there security concerns with this access violation? Why?

    6.        Why is this class of crashes not seen in the wild much anymore?

    Part Two – Reverse Engineering

     

    Examine the functions main(), snap(), crack(), pop(), and boom().

    1.       Describe with a sentence or two what each one is doing.  (There is no need to comment on every assembly instruction or re-write the code in C here unless you really feel you need to.)

    Bonus:

    1.       If this access violation occurred while the program was running without a debugger present, would there be anything different about the crashing register state or the data structure which caused it? If so what is it and why? (HINT: If you would like to observe the crashing state like this, you can register windbg.exe to be the Just-In-Time debugger using the –I switch like so “windbg.exe -I”. When you run the dbgex4.exe program, it will crash and windbg.exe will launch automatically and attach to it.)

    2.       The functions from “Part Two – Reverse Engineering” perform operations on a set of data. During this manipulation some data is lost. Find this data and decode its “secret” message.

     

    Share this post :
Page 21 of 24 (239 items) «1920212223»