Hello NTDebuggers… I look at a lot of dump files every day. This being the case I like to take full advantage of the customizable look and feel of windbg. I actually have an association setup between DMP files and a CMD file that loads my customized COLOR workspace each time I double click on a dump file. I like a black background with colorized source and debug command window output.
Here is a snapshot of my typical debug session.
Here is how you set it up.
1) Create the following CMD file and put it in your path. It’s called D.CMD on my system..
echo off Title kd.exe -z %1 C: CD\debuggers start C:\Debuggers\windbg.exe -z %1 -W color
2) Load windbg and under view \ options configure your color options. Note: you will want to set every option in the list box.
3) Save the workspace by selecting FILE \ Save Workspace, I saved mine as COLOR
4) If you don’t have a dump file create one by opening a process in the debugger and doing a .dump /ma C:\test\test.dmp (I’ll assume you have a directory called C:\test)
5) Select the C:\test\test.dmp file in explorer and right click on the properties. (note I used badwindow.dmp)
6) Change the dump file association via drilling down to Change \ Browse \ (Select D.CMD)
7) That’s all there is to it. Now you can simply double click on any DMP file and windbg will start and load with your COLOR workspace.
Good luck and happy debugging.
Hi my name is Bob, I’m an Escalation engineer with the Microsoft critical problem resolution team. We had one of our readers ask how much we deal with hardware. Well in response we recently worked on an interesting problem I thought I would share with you. In this case it was interesting because it demonstrated an issue specific to multi-processor machines and something that probably sounded innocuous to the driver writer who caused the problem.
What was the problem?
The problem was that everything worked except the time in the system was not being updated. The RTC had stopped. We spent time reviewing the semantics of how the RTC should function and even hooked up an oscilloscope to the RTC on the motherboard and were able to turn it off and on with the debugger by writing out the correct port. The trace on the scope validated our understanding of what had to be written to the port to turn the clock off. One we had a clear understanding of this we understood what we were looking for in a driver that might cause the problem. Note the clock typically fires every 10ms so you do not need a fast scope to do this.
Special keyboard driver written
In order to catch a dump in state we had to modify the keyboard driver. It would cause an “Int 3” in its ISR instead of calling “bug check” for an E2 stop. Because the RTC was not running the idle thread was not getting quantums and as a result a normal break in would not work. However the system would respond to ISRs.
What was found
All RTC interrupts were stopped - the clock was not running. We checked all the obvious places to see if the RTC was disabled.
We looked at the ICR in the I/O APIC. This is the interrupt configuration register. There is a register for every interrupt pin on the APIC. These registers are used to tell the APIC what vector to send the processor so the processor can service the interrupt. It also has configuration information about level and if it edge triggered and a mask bit. The mask bit was not set.
Below is a reenactment.
0: kd> ed ffd01000
ffd01000 00000034 20 ß Select register 20 which is pin 8.
0: kd> ed ffd01010
ffd01010 000008d1 ß Contents ß Vect D1 Bit 16 the interrupt mask bit is not set so it is OK.
Next check the RTC status register which are I/O ports 70 and 71. Port 70 is the address port. Port 71 is the data port. This information is from an old BIOS book.
0: kd> ob 70 b ß ‘B’ is a control register.
0: kd> ib 71
00000071: 42 ß The value 42 means that the RTC is enabled. Bit 6 is the enable.
So what was it?
The way the RTC works is it will interrupt at a certain interval. When the interrupt is serviced, the status register has to be read to start it again.
We discovered another driver that was reading the clock, this was done by disassembly various drivers in the dump and looking for the I/O operation to ports 70 or 71. The lower addresses selected by port 70 will yield the time when read. That is what the driver was doing.
You would think that simply reading the time in this way would not hurt anything. However, in a multi-processor system, access has to be serialized. There is only one set of I/O ports for the system.
Since it takes two accesses to perform an operation on the clock, one address & one data, a collision between two processors can cause undetermined results.
Below is a timing diagram of the issue;
Proc 0 running OS RTC handler Proc 1 running XYZ driver
T1 Set register select to status register
T2 Set register select to read time
T3 Read status register to restart clock
So at T3 the OS RTC handler reads the wrong register so the clock does not start.
I thought this was an interesting problem that illustrates the need for serialization. And it demonstrates what to look out for in a multi-proc environment. You always have to think “What happens if the other processor does…….”
For more information consult any BIOS developer manuals you may have lying around or this link we found http://www.geocities.com/SiliconValley/Campus/1671/docs/rtc.htm
See the “Status Register C” section, “All future interrupts are disabled until this register is read - your interrupt handler *must* do it.”
It's Venkatesh with a WMI nugget. While troubleshooting or debugging WMI issues you may come across WMI queries wherein you don’t know which provider implemented the WMI class used in the query. You may want to know the binary and the product that implemented the provider so you can contact the vendor or upgrade the binary to see if it resolves the issue you are investigating.
To help make the investigation easier, Windows Vista and all post operating systems have the cool new 'Analytic and Debug log' for WMI tracing. To get similar tracing in Windows 2003 and earlier versions, WMI has a “Verbose Logging” option enabled through the WMI control (WmiMgmt.msc). Check out this blog for detailed steps to enable the WMI Analytic and Debug -
After enabling WMI logging you will see events like the ones pictured below. There may be a particular WMI query relevant to the issue you're troubleshooting requiring you to know the provider binary for the WMI class in the query. Apart from the class name (InstProvSamp) in the query you need the WMI namespace to which this class belongs to such as '\\.\root\default' as shown below.
Fig 1: Event ID 1 in the Windows Vista’s Event Viewer\Windows Logs\Applications and Services Logs\Microsoft\Windows\WMI-Activity\Trace log
Now we have the WMI Namespace and the class name so let’s find the provider for it.
1. Get the provider name from the Class Name and Namespace:
Every WMI class has a qualifier “provider” containing the name of the provider to which it belongs. Using wbemtest.exe connect to the namespace we mentioned above(\\.\root\default). Then get the class information for our class, InstProvSamp using the “Open Class…” button. After clicking "OK" the Object editor for InstrProvSamp displays the class information for InstProvSamp including the name of the provider it belongs to (CIM_STRING qualifier “provider” as shown below).
2. Get the class identifier (CLSID) of the provider from the provider name:
For every WMI provider WMI maintains an instance of system class __Win32Provider which contains information about the physical implementation of the provider like CLSID which contains the GUID that represents the class identifier (CLSID) of the provider COM object. Now query the repository to get the provider information for “InstProvSamp” provider as I did below.
3. Get the provider binary from the provider’s CLSID:
During registration every COM object adds an entry under HKLM\SOFTWARE\Classes\CLSID. Using regedit.exe you can browse to provider’s COM registration information using the CLSID obtained from Step 2 as shown below.
This should help you connect the dots while debugging your WMI issue. Let me know if you have additional questions!
Hello my name is Gurpreet Singh Jutla and I would like to share information on how we can trace the caller which ends up allocating “Se “ Pool tag. When we use the Windows debugger and investigate the pool allocation and the binary associated with this pool tag, we see NT Kernel responsible for the allocations. But is the NT Kernel really responsible for a pool leak associated with this pool tag?
Issue at hand
!vm 1 -> This tells us if there were any page pool allocation failures
0: kd> !vm 1
*** Virtual Memory Usage ***
Physical Memory: 2096922 ( 8387688 Kb)
Page File: \??\D:\pagefile.sys
Current: 16779264 Kb Free Space: 16552492 Kb
Minimum: 16779264 Kb Maximum: 16779264 Kb
Available Pages: 1607242 ( 6428968 Kb)
ResAvail Pages: 1991659 ( 7966636 Kb)
Locked IO Pages: 656 ( 2624 Kb)
Free System PTEs: 163671 ( 654684 Kb)
Free NP PTEs: 32766 ( 131064 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 10775 ( 43100 Kb)
Modified PF Pages: 10728 ( 42912 Kb)
NonPagedPool Usage: 7881 ( 31524 Kb)
NonPagedPool Max: 65279 ( 261116 Kb)
PagedPool 0 Usage: 67074 ( 268296 Kb)
PagedPool 1 Usage: 3266 ( 13064 Kb)
PagedPool 2 Usage: 3282 ( 13128 Kb)
PagedPool 3 Usage: 3268 ( 13072 Kb)
PagedPool 4 Usage: 3214 ( 12856 Kb)
PagedPool Usage: 80104 ( 320416 Kb)
PagedPool Maximum: 134144 ( 536576 Kb)
Session Commit: 14832 ( 59328 Kb)
Shared Commit: 19969 ( 79876 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 19362 ( 77448 Kb)
Pages For MDLs: 146 ( 584 Kb)
PagedPool Commit: 80140 ( 320560 Kb)
Driver Commit: 1602 ( 6408 Kb)
Committed pages: 520485 ( 2081940 Kb)
Commit limit: 6241313 ( 24965252 Kb)
!poolused /t10 4 -> This will list the top consumers of paged pool. It is “Se “ in our case.
0: kd> !poolused /t10 4
Sorting by Paged Pool Consumed
Tag Allocs Used Allocs Used
Se 0 0 172204 232720312 General security allocations , Binary: nt!se
MmSt 0 0 15231 31835696 Mm section object prototype ptes , Binary: nt!mm
Ntff 9 1872 10434 8514144 FCB_DATA , Binary: ntfs.sys
WD 384 1251376 27 5591040 UNKNOWN pooltag 'WD ', please update pooltag.txt
UlHT 0 0 1 4198400 Hash Table , Binary: http.sys
NtfF 0 0 3259 3050424 FCB_INDEX , Binary: ntfs.sys
Toke 0 0 966 3034216 Token objects , Binary: nt!se
NtFs 13717 551176 19678 1758008 StrucSup.c , Binary: ntfs.sys
IoNm 0 0 12034 1737360 Io parsing names , Binary: nt!io
FSim 0 0 11336 1451008 File System Run Time Mcb Initial Mapping
CM16 0 0 293 1437696 Internal Configuration manager allocations
Wmit 6 11392 23 1376912 Wmi Trace
NtFU 0 0 8719 1237232 usnsup.c , Binary: ntfs.sys
Obtb 0 0 397 995600 object tables via EX handle.c , Binary: nt!ob
Key 0 0 7245 753432 Key objects
MmSm 0 0 11393 729152 segments used to map data files , Binary: nt!mm
TOTAL 192495 33751032 347461 310172416
Once we know the pool tag we need to run the following command. Remember the pool tag has to be a 4 char case sensitive string so I would run the following command which tells me in which all modules we have the Pool tag “Se “ used. Please note that I have added two spaces in the string during my search. This is because the !poolused shows Se tag has caused the issue. Not providing the spaces in the search string will give different results. The following command searches each module that is loaded on the system.
!for_each_module s -a @#Base @#End "Se "
The result should be something like:
0: kd> !for_each_module s -a @#Base @#End "Se "
8096a1ae 53 65 20 20 6a 0c 6a 01-e8 bd a9 f2 ff 3b c7 89 Se j.j......;..
8096a5e3 53 65 20 20 6a 0c 6a 01-e8 88 a5 f2 ff 3b c7 89 Se j.j......;..
8096af9e 53 65 20 20 50 6a 01 89-45 fc e8 cb 9b f2 ff 8b Se Pj..E.......
8096c909 53 65 20 20 8d 1c 9d 10-00 00 00 53 6a 01 e8 5c Se .......Sj..\
8096c9e3 53 65 20 20 53 6a 01 e8-89 81 f2 ff 8b f8 85 ff Se Sj..........
8096ca3e 53 65 20 20 8d 04 85 10-00 00 00 50 6a 01 e8 27 Se .......Pj..'
8096caf2 53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 73 Se .......Sj..s
8096cb3e 53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 27 Se .......Sj..'
8096cb9d 53 65 20 20 56 6a 01 e8-cf 7f f2 ff 8b f0 3b f7 Se Vj........;.
8096cc44 53 65 20 20 6a 10 6a 01-e8 27 7f f2 ff 85 c0 0f Se j.j..'......
8096cc65 53 65 20 20 6a 04 6a 01-e8 06 7f f2 ff 85 c0 0f Se j.j.........
8096cc96 53 65 20 20 6a 04 6a 01-e8 d5 7e f2 ff 85 c0 0f Se j.j...~.....
8096ccbe 53 65 20 20 6a 38 6a 01-e8 ad 7e f2 ff 8b f8 85 Se j8j...~.....
809718ac 53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8 Se tB.Fh;.t.SP.
80971b48 53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8 Se tB.Fh;.t.SP.
8097372b 53 65 20 20 6a 0c 6a 01-e8 40 14 f2 ff 85 c0 75 Se j.j..@.....u
80976c38 53 65 20 20 6a 0c 6a 01-e8 33 df f1 ff 3b c6 89 Se j.j..3...;..
80a20d1b 53 65 20 20 bf 00 01 00-00 57 56 e8 4d 3e e7 ff Se .....WV.M>..
80a22698 53 65 20 20 c7 45 80 4f-62 43 6c c7 45 84 43 63 Se .E.ObCl.E.Cc
The above steps are also explained in detail at http://blogs.msdn.com/b/ntdebugging/archive/2012/08/31/troubleshooting-pool-leaks-part-3-debugging.aspx
Now the tough part begins. We need to run “ln” command on each of the addresses shown by the output above. Example
0: kd> ln 8096c909
(8096c8cc) nt!SeQueryInformationToken+0x3d | (8096cdc0) nt!SeCaptureObjectTypeList
See the highlighted above. It is the nearest API to that address. Search on MSDN for all API’s listed. The ones that do not have an MSDN article can likely be ignored. SEQueryInformationToken is publically available and hence callable by any loaded driver. Example http://msdn.microsoft.com/en-in/library/windows/hardware/ff556690(v=vs.85).aspx
At this point we will make the assumption that it is a 3rd party driver making the calls to this Windows API. It is, of course, possible that a Microsoft module is doing this, but if such a problem exists in our code it usually becomes apparent very quickly when we get flooded by support calls.
We can use the “!for_each_module” debugger command to search the address space of each loaded module for the entry point of the function found in step 3.
0: kd> !for_each_module s-d @#Base @#End 8096c8cc
b94b5938 8096c8cc 8096000a 80960182 80894b78 ............xK..
ba7d50f4 8096c8cc 8081e5e0 8096b42e 8096b40c ................
ba83624c 8096c8cc 8081757e 8082b4f2 8094d438 ....~u......8...
bab8103c 8096c8cc 80960122 80960158 8096010a ...."...X.......
f74e2950 8096c8cc 8089e180 8096091a 80960638 ............8...
f74f0b68 8096c8cc 8089e180 80959546 80884150 ........F...PA..
f75cf094 8096c8cc 80960122 80960158 8096010a ...."...X.......
f797c110 8096c8cc 808847f0 80966c60 80894b78 .....G..`l..xK..
f7b643c4 8096c8cc 80960054 809e437c 808eadc6 ....T...|C......
The first column in the above output can then be passed to the “ln” command to get the nearest function to the address listed.
0: kd> ln b94b5938
*** ERROR: Module load completed but symbols could not be loaded for ABCMiniFilter.sys
0: kd> ln f797c110
*** ERROR: Module load completed but symbols could not be loaded for XVhdBusxxx.sys
This is a bit more complicated. Run the lm command to dump all modules. You will get the output as follows
0: kd> lm
start end module name
80800000 80a5d000 nt
80a5d000 80a89000 hal
b476c000 b4775000 asyncmac (deferred)
b7f55000 b7fb6000 ABCD (deferred)
b817b000 b81da000 eeCtrl (deferred)
b82b6000 b82b9300 xyz (deferred)
b87a2000 b87cc000 Fastfat (deferred)
b8f04000 b8f0de80 ABCMiniFilter (deferred)
Ignore the unloaded modules and the modules you know are Microsoft binaries. For the rest of them you need to run the following command on each module. Remember the first column is the start of the module and the second column is the end of module.
Use the !dh command for each header one by one and you will see the header information for the module. I just picked one module for reference you will need to run the command for each third party module.
0: kd> !dh b8f04000 -f
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
14C machine (i386)
7 number of sections
51F7D252 time date stamp Tue Jul 30 20:18:50 2013
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
32 bit word machine
OPTIONAL HEADER VALUES
10B magic #
9.00 linker version
8800 size of code
1200 size of initialized data
0 size of uninitialized data
86B2 address of entry point
480 base of code
----- new -----
00010000 image base
80 section alignment
80 file alignment
1 subsystem (Native)
6.01 operating system version
6.01 image version
5.01 subsystem version
9E80 size of image
480 size of headers
00040000 size of stack reserve
00001000 size of stack commit
00100000 size of heap reserve
00001000 size of heap commit
0 DLL characteristics
0 [ 0] address [size] of Export Directory
8A28 [ 50] address [size] of Import Directory
9180 [ 498] address [size] of Resource Directory
0 [ 0] address [size] of Exception Directory
9E80 [ 1FD0] address [size] of Security Directory
9680 [ 6C0] address [size] of Base Relocation Directory
7990 [ 1C] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
0 [ 0] address [size] of Thread Storage Directory
7A68 [ 40] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
7880 [ 108] address [size] of Import Address Table Directory
0 [ 0] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory
The highlighted line is important for us. Remember this is the offset of the start of Import Address Table directory and the end of the same. In our case 7880 offset is start and 108 offset is the end. We run the following command to dump the import table. Running the dps command from the start offset in the module till the end of the import table will list all the functions it imports
0: kd> dps b8f04000+7880 b8f04000+7880+108
b8f0b880 f7873892 fltMgr!FltGetStreamHandleContext
b8f0b884 f7873988 fltMgr!FltAllocateContext
b8f0b888 f787459c fltMgr!FltSetStreamHandleContext
b8f0b938 8096c8cc nt!SeQueryInformationToken
b8f0b93c 8096000a nt!RtlValidSid
You see the module we searched in will have the API in its import table. Running the above command on each module, dumping the import tables will help us identify which modules have the API in its import table. However it cannot be confirmed if the driver calls the API and causes the leak. This is really tedious and time consuming but will help us identify each binary that calls the above API which leads to pool tag “Se “ allocation on behalf of the NT Kernel.
We can disable the binaries that call the SEQueryInformationToken API one by one and see if the issue still persists. Please note that the boot/System drivers should not be disabled as it can lead to a no boot situation. You can use Autoruns from Sysinternals to disable these drivers and if you run into a no boot situation, boot back into the “last known good configuration” to help recover the box.
Hey there NTDEBUGGERS my name is Randy Monteleone and today we are going to talk about IRPs. In the past we have talked about the IRP structure in passing and showed a field here and there that can be pulled out and used to find answers to stalled IO. I was recently working on a debugger extension and found something interesting in the IRP I was looking at. I had been looking at a !irp output much like the one pictured below. I found that I was asking myself what do the "Success Error Cancel" fields mean?
After doing some digging and working with a few of my co-workers we found the mystery to the meaning behind these words and why we see them in our output. Lets break this IRP down starting with the ">" marker that indicates the current stack frame. In the output below you see this marker is indicating that we are working on something in partmgr.
1: kd> !irp fffffa809a1f3af0
Irp is active with 9 stacks 5 is current (= 0xfffffa809a1f3ce0)
Mdl=fffffa814e9c4f40: No System Buffer: Thread fffffa80d05b67e0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 4,34] 1c e0 fffffa80920c0060 00000000 fffff880010301b0-00000000 Success Error Cancel <-- What are you trying to tell me?
Args: 00001000 00000000 14625000 00000000
[ 4, 0] 1c e0 fffffa80920c0b90 00000000 fffff88001063010-fffffa80b01eec00 Success Error Cancel
Args: 3993d69568d 00000000 14625000 00000000
[ 4, 0] c e0 fffffa80b01eeab0 00000000 fffff88001d59150-fffffa80d05b2180 Success Error Cancel
Args: 00001000 00000000 3993d69568a 00000000
[ 4, 0] c e1 fffffa80d05b2030 00000000 fffff88001845344-fffff8800d52bb38 Success Error Cancel pending
Args: 00001000 00000000 14615000 00000000
[ 4, 0] 0 0 fffffa80c2829030 fffffa80d04f18a0 00000000-00000000
Args: 00001000 00000000 00002000 00000000
In the example below we see the “Success Error Cancel” fields and in this case we can also see that the pending flag has been set. The pending field indicates that STATUS_PENDING was returned to the caller. This is used so that I/O completion can determine whether or not to fully complete the I/O operation requested by the packet. Drivers can do this by calling IoMarkIrpPending.
Now look at the "cl" column as it holds the key to unlocking what "Success Error Cancel pending" really means.
1: kd> !irp 0xfffffa80`920c2340
Irp is active with 3 stacks 3 is current (= 0xfffffa80920c24a0) <--- _IO_STACK_LOCATION
Mdl=fffffa814e9c4f40: No System Buffer: Thread 00000000: Irp stack trace.
[ f, 0] 1c 0 fffffa809209a060 00000000 fffff880010061a0-fffffa80920bfcc0
Args: fffffa80920c22b0 00000000 00000000 fffffa80920bfcc0
>[ f, 0] 1c e1 fffffa809209a060 00000000 fffff88001a01a00-fffffa80920c2190 Success Error Cancel pending
Args: fffffa80920c22b0 00000000 00000000 fffffa80920bfcc0
Focusing on the cl column we see that our active stack frame is working in MPIO, but what is it trying to tell us about its status and intent? Well to figure that out we need to break down the value "e1" we see listed in our cl or Control column. This field is being used as a flag where the high and low parts represent two different values. We get this value in !irp from an _IO_STACK_LOCATION Control member. This can be found by dumping the stack location displayed by !irp.
1: kd> dt_IO_STACK_LOCATION 0xfffffa80920c24a0
+0x000 MajorFunction : 0xf ''
+0x001 MinorFunction : 0 ''
+0x002 Flags : 0x1c ''
+0x003 Control : 0xe1 '' < -- Control flag
+0x008 Parameters : <unnamed-tag>
+0x028 DeviceObject : 0xfffffa80`9209a060 _DEVICE_OBJECT
+0x030 FileObject : (null)
+0x038 CompletionRoutine : 0xfffff880`01a01a00 long CLASSPNP!TransferPktComplete+0
+0x040 Context : 0xfffffa80`920c2190 Void
So now that we know where we get this value we still have to decode what it means and how this value results in what we see in our !IRP output. Let's split our number into two, our high and low parts. In our example above we had the value e1. Taking our value apart gives us an E and a 1. The first bit of our value will indicates the last returned value as our IRP was processed. In our case this is a 1, meaning that our IRP is Pending. Thus we see the word pending at the end of current stack frame in !irp. There are three possible values for this bit, 0, 1 or 2. 0 being nothing, 1 being pending and 2 being Error Returned.
Moving to our high bits we are left with the "e". Lets take this number and convert it to binary. We end up with 1110. This upper number indicates which invoke types were requested for the completion routine for the driver listed on that stack frame. If we look at MSDN we see that IoSetCompletionRoutine takes in three bool values to set this flag. These options specify whether the completion routine is called if the IRP is completed with that corresponding status.
_In_ PIRP Irp,
_In_opt_ PIO_COMPLETION_ROUTINE CompletionRoutine,
_In_opt_ PVOID Context,
_In_ BOOLEAN InvokeOnSuccess,
_In_ BOOLEAN InvokeOnError,
_In_ BOOLEAN InvokeOnCancel
Doing some source review I was able to pin together how these values translate to the !irp output. If we look back at our binary value of "e" we see that we have a set of bits that get set based on the what the driver wanted to do when an IRP was completed with one of our defined status values.
Cancel = 2
Success = 4
Error = 8
Add each of these values up and the sum is 14 or e in hex. Going back to our binary "1110" we see that our first bit is set to zero or off. This would be the pending and Error Returned values that I mentioned above. The next three bits represent the corresponding Cancel, Success and Error bool values passed to the driver at the time IoSetCompletionRoutine was called for this stack location.
Example : IoSetCompletionRoutine(pirp, pCompletionRoutine,pContex, True,False,True); would yield a value of 6.
Remember the important thing here is not that the !irp output is trying to tell us that one of these things happened. It's telling us that this driver would like to be notified if one of those things does happen. This area also provides us with information if pending or or an error is returned.
Well that's all I have for now, for more documentation for the _IRP and _IO_STACK_LOCATION structures please see the following links to MSDN.
More on IRP: http://msdn.microsoft.com/en-us/library/windows/hardware/ff550694(v=vs.85).aspx
More on _IO_STACK_LOCATION: http://msdn.microsoft.com/en-us/library/windows/hardware/ff550659(v=vs.85).aspx
Hello everyone, I'm Ray and I'm here to talk a bit about a dump I recently looked at and a little-referenced memory counter called ResAvail Pages (resident available pages).
The problem statement was: The server hangs after a while.
Not terribly informative, but that's where we start with many cases. First some good housekeeping:
0: kd> vertarget
Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533
Machine Name: "ASDFASDF1234"
Kernel base = 0xfffff800`01665000 PsLoadedModuleList = 0xfffff800`018a8670
Debug session time: Thu Aug 8 09:39:26.992 2013 (UTC - 4:00)
System Uptime: 9 days 1:08:39.307
Of course Windows 7 Server == Server 2008 R2.
One of the basic things I check at the beginning of these hang dumps with vague problem statements is the memory information.
0: kd> !vm 21
Physical Memory: 2097038 ( 8388152 Kb)
Page File: \??\C:\pagefile.sys
Current: 12582912 Kb Free Space: 12539700 Kb
Minimum: 12582912 Kb Maximum: 12582912 Kb
Available Pages: 286693 ( 1146772 Kb)
ResAvail Pages: 135 ( 540 Kb)
********** Running out of physical memory **********
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33526408 ( 134105632 Kb)
******* 12 system cache map requests have failed ******
Modified Pages: 4017 ( 16068 Kb)
Modified PF Pages: 4017 ( 16068 Kb)
NonPagedPool Usage: 113241 ( 452964 Kb)
NonPagedPool Max: 1561592 ( 6246368 Kb)
PagedPool 0 Usage: 35325 ( 141300 Kb)
PagedPool 1 Usage: 28162 ( 112648 Kb)
PagedPool 2 Usage: 24351 ( 97404 Kb)
PagedPool 3 Usage: 24350 ( 97400 Kb)
PagedPool 4 Usage: 24516 ( 98064 Kb)
PagedPool Usage: 136704 ( 546816 Kb)
PagedPool Maximum: 33554432 ( 134217728 Kb)
********** 222 pool allocations have failed **********
Session Commit: 6013 ( 24052 Kb)
Shared Commit: 6150 ( 24600 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 1214088 ( 4856352 Kb)
Pages For MDLs: 67 ( 268 Kb)
PagedPool Commit: 136768 ( 547072 Kb)
Driver Commit: 15548 ( 62192 Kb)
Committed pages: 1648790 ( 6595160 Kb)
Commit limit: 5242301 ( 20969204 Kb)
So we're failing to allocate pool, but we aren't out of virtual memory for paged pool or nonpaged pool. Let's look at the breakdown:
0: kd> dd nt!MmPoolFailures l?9
fffff800`01892160 000001be 00000000 00000000 00000002
fffff800`01892170 00000000 00000000 00000000 00000000
yellow = Nonpaged high/medium/low priority failures
green = Paged high/medium/low priority failures
cyan = Session paged high/medium/low priority failures
So we actually failed both nonpaged AND paged pool allocations in this case. Why? We're "Running out of physical memory", obviously. So where does this running out of physical memory message come from? In the above example this is from the ResAvail Pages counter.
ResAvail Pages is the amount of physical memory there would be if every working set was at its minimum size and only what needs to be resident in RAM was present (e.g. PFN database, system PTEs, driver images, kernel thread stacks, nonpaged pool, etc).
Where did this memory go then? We have plenty of Available Pages (Free + Zero + Standby) for use. So something is claiming memory it isn't actually using. In this type of situation one of the things I immediately suspect is process working set minimums. Working set basically means the physical memory used by a process.
So let's check.
0: kd> !process 0 1
<a lot of processes in this output>.
SessionId: 0 Cid: 0adc Peb: 7fffffda000 ParentCid: 0678
DirBase: 204ac9000 ObjectTable: 00000000 HandleCount: 0.
VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.
ElapsedTime 9 Days 01:08:00.134
Working Set Sizes (now,min,max) (5, 50, 345) (20KB, 200KB, 1380KB)
VirtualSize 65 Mb
PeakVirtualSize 84 Mb
I have only shown one example process above for brevity's sake, but there were thousands returned. 241,423 to be precise. None had abnormally high process working set minimums, but cumulatively their usage adds up.
The “now” process working set is lower than the minimum working set. How is that possible? Well, the minimum and maximum are not hard limits, but suggested limits. For example, the minimum working set is honored unless there is memory pressure, in which case it can be trimmed below this value. There is a way to set the min and/or max as hard limits on specific processes by using the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag via SetProcessWorkingSetSize.
You can view if the minimum and maximum working set values are configured in the _EPROCESS->Vm->Flags structure. Note these numbers are from another system as this structure was already torn down for the processes we were looking at.
0: kd> dt _EPROCESS fffffa8008f76060 Vm
+0x398 Vm : _MMSUPPORT
0: kd> dt _MMSUPPORT fffffa8008f76060+0x398
+0x000 WorkingSetMutex : _EX_PUSH_LOCK
+0x008 ExitGate : 0xfffff880`00961000 _KGATE
+0x010 AccessLog : (null)
+0x018 WorkingSetExpansionLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0xfffffa80`08f3c410 ]
+0x028 AgeDistribution :  0
+0x044 MinimumWorkingSetSize : 0x32
+0x048 WorkingSetSize : 5
+0x04c WorkingSetPrivateSize : 5
+0x050 MaximumWorkingSetSize : 0x159
+0x054 ChargedWslePages : 0
+0x058 ActualWslePages : 0
+0x05c WorkingSetSizeOverhead : 0
+0x060 PeakWorkingSetSize : 0x5ae
+0x064 HardFaultCount : 0x41
+0x068 VmWorkingSetList : 0xfffff700`01080000 _MMWSL
+0x070 NextPageColor : 0x2dac
+0x072 LastTrimStamp : 0
+0x074 PageFaultCount : 0x65c
+0x078 RepurposeCount : 0x1e1
+0x07c Spare :  0
+0x084 Flags : _MMSUPPORT_FLAGS
0: kd> dt _MMSUPPORT_FLAGS fffffa8008f76060+0x398+0x84
+0x000 WorkingSetType : 0y000
+0x000 ModwriterAttached : 0y0
+0x000 TrimHard : 0y0
+0x000 MaximumWorkingSetHard : 0y0
+0x000 ForceTrim : 0y0
+0x000 MinimumWorkingSetHard : 0y0
+0x001 SessionMaster : 0y0
+0x001 TrimmerState : 0y00
+0x001 Reserved : 0y0
+0x001 PageStealers : 0y0000
+0x002 MemoryPriority : 0y00000000 (0)
+0x003 WsleDeleted : 0y1
+0x003 VmExiting : 0y1
+0x003 ExpansionFailed : 0y0
+0x003 Available : 0y00000 (0)
How about some more detail?
0: kd> !process fffffa8008f76060
DirBase: 204ac9000 ObjectTable: 00000000 HandleCount: 0.
No active threads
0: kd> !object fffffa8008f76060
Object: fffffa8008f76060 Type: (fffffa8006cccc90) Process
ObjectHeader: fffffa8008f76030 (new version)
HandleCount: 0 PointerCount: 1
The highlighted information shows us that this process has no active threads left but the process object itself (and its 20KB working set use) were still hanging around because a kernel driver had a reference to the object that it never released. Sampling other entries shows the server had been leaking process objects since it was booted.
Unfortunately trying to directly track down pointer leaks on process objects is difficult and requires an instrumented kernel, so we tried to check the easy stuff first before going that route. We know it has to be a kernel driver doing this (since it is a pointer and not a handle leak) so we looked at the list of 3rd party drivers installed. Note: The driver names have been redacted.
start end module name
fffff880`04112000 fffff880`04121e00 driver1 (no symbols) <-- no symbols usually means 3rd party
fffff880`04158000 fffff880`041a4c00 driver2 (no symbols)
0: kd> lmvm driver1
Browse full module list
fffff880`04112000 fffff880`04121e00 driver1 (no symbols)
Loaded symbol image file: driver1.sys
Image path: \SystemRoot\system32\DRIVERS\driver1.sys
Image name: driver1.sys
Browse all global symbols functions data
Timestamp: Wed Dec 13 12:09:32 2006 (458033CC)
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
0: kd> lmvm driver2
Loaded symbol image file: driver2.sys
Image path: \??\C:\Windows\system32\drivers\driver2.sys
Image name: driver2.sys
Timestamp: Thu Nov 30 12:12:07 2006 (456F10E7)
Fortunately for both the customer and us we turned up a pair of drivers that predated Windows Vista (meaning they were designed for XP/2003) that raised an eyebrow. Of course we need a more solid evidence link than just "it's an old driver", so I did a quick search of our internal KB. This turned up several other customers who had these same drivers installed, experienced the same problem, then removed them and the problem went away. That sounds like a pretty good evidence link. We implemented the same plan for this customer successfully.
Hi, my name is Todd Webb, I am an Escalation Engineer from the Global Escalation Services OEM Team.
I recently worked a case where we were trying to troubleshoot an issue and the system would hang with USB debugging enabled. This was a problem since USB debugging was the only kernel debugging option available on this particular hardware setup. I needed a quick and easy way to find the source of the problem without using many of the standard methods. Basically I had to debug a debugger issue with no access to a debugger. I knew the problem was likely USB related, but had no other information. The method I used to troubleshoot this issue was to write some macros that would make a call to HalMakeBeep to generate audible indications of where we were in the code. The primary reason for using this method was I could quickly and easily move the location in the code where I made the calls as I narrowed down where the problem occurred.
The alternative would have been to add a bunch of logging and find some way to force a dump. My initial assessment was this would be more time consuming and less flexible then the beep code method. Others may have other methods to do this, but this just gives people another method of debugging driver start-up problems on systems that are not debuggable.
The main beep macro allows you to pass in a frequency and duration in milliseconds for a single beep. Then based on the example below you can extend the single beep macro to create a series of beeps.
Here is an example of the macros I used:
// beep macros for troubleshooting without debugger
#define EHCI_BEEP1(f, d) \
LARGE_INTEGER duration; \
duration.QuadPart = Int32x32To64(d, -10000); \
KeDelayExecutionThread(KernelMode, FALSE, &duration); \
duration.QuadPart = Int32x32To64(500, -10000); \
#define EHCI_BEEP2(f, d) \
EHCI_BEEP1(f, d); \
EHCI_BEEP1(f, d); \
#define EHCI_BEEP3(f, d) \
EHCI_BEEP2(f, d); \
Here is an example of a long beep followed by three short beeps:
// debug - 1 long : 3 short
Microsoft is looking for five Windows Internals subject matter experts to come work on a very special five to eight day project on the Redmond campus during the month of May 2008. Candidates must have good communications skills, be non Microsoft employees, have 5+ years experience with windows, be familiar with the Windows Internals book, have kernel and user mode debugging experience, and be C literate.
We prefer candidates strongly represent the Windows IT Professional or Windows development industry. These typically would be 2nd or 3rd tier escalation resources or Windows developers (developing Win32 code, or device drivers) that are working for Gold Partners, ISVs or are independent consultants or MVPs.
A Microsoft NDA (Non Disclosure Agreement) will be required.
Please contact Jeff Dailey at email@example.com with the subject line “Windows Internals Project”
Please include a copy of your resume and details explaining your Windows background, along with your contact info, and your location.
We will cover air travel, lodging and food for non local candidates attending from within the US.
The Widows Internals experts selected for this project will be brought to Redmond to aid in developing a new certification program based on the Windows Internals. This is not a hiring effort or product testing session. Though the candidates should have C literacy they will not be writing or reviewing any Microsoft source code.
The Windows Support team has a new YouTube channel, “Windows Troubleshooting”. The first set of videos cover debugging blue screens.
In this video, Bob Golding, Senior Escalation Engineer, describes how the Special Pool Windows diagnostics tool catches drivers that corrupt memory. Bob also introduces how memory is organized in the system for allocating memory for drivers.
We are back with another addition to the debugging fundamentals series here in 2009! Regardless of your debugging experience we hope you enjoy this one. Don’t worry too much if you can’t answer them all but give them your best shot.
While debugging an application, a developer observes a crash and is puzzled. The developer saves a memory dump of the process using the command “.dump /ma dbgex4.dmp” and sends it to you for debug assistance.
eax=53454750 ebx=00000000 ecx=00000100 edx=000001b0 esi=001bfe20 edi=00ac36bc
eip=47414d45 esp=001bfe00 ebp=4f564552 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
47414d45 ?? ???
The attached zip, dbgex4.zip, includes three files. The program itself, the program’s symbols file and the saved memory dump file. (NOTE: You do not need to execute the program for the exercise however you may want to in order to solve the bonus’. You will need to rename the file dbgex4.ex_ to dbgex4.exe in order to run it).
Using the accompanying files, answer the following questions:
Part One – Debugging
1. What caused the access violation?
2. Examine the registers at the time of the crash. Is there anything interesting about the contents?
3. How did the registers get into this state?
4. Find the offending data structure which caused this state to occur. What are its contents?
5. Are there security concerns with this access violation? Why?
6. Why is this class of crashes not seen in the wild much anymore?
Part Two – Reverse Engineering
Examine the functions main(), snap(), crack(), pop(), and boom().
1. Describe with a sentence or two what each one is doing. (There is no need to comment on every assembly instruction or re-write the code in C here unless you really feel you need to.)
1. If this access violation occurred while the program was running without a debugger present, would there be anything different about the crashing register state or the data structure which caused it? If so what is it and why? (HINT: If you would like to observe the crashing state like this, you can register windbg.exe to be the Just-In-Time debugger using the –I switch like so “windbg.exe -I”. When you run the dbgex4.exe program, it will crash and windbg.exe will launch automatically and attach to it.)
2. The functions from “Part Two – Reverse Engineering” perform operations on a set of data. During this manipulation some data is lost. Find this data and decode its “secret” message.