Written by Ron Stock
I recently collaborated with a third-party vendor which required me to track down a problem during the Vista upgrade process. The issue came down to catching a Microsoft component opening a specific registry service key during the setup process. Issues like this frequently call for live debugging to catch registry access in real time. I could have set a breakpoint on RegOpenKeyExW() and examined every requested key passed into the function, but this method is very time consuming considering RegOpenKeyExW() is a highly used code path. This methodology involves setting a break point on the function, waiting for the break in, checking the second parameter (lpSubKey) for a match to the desired registry key (string), and hitting “g” if the key doesn’t match. I could have potentially repeated these steps a hundred times before finding the call containing the key I needed. I wanted some way to set a ‘break on access’ only when my key was touched. We ran into a similar issue requiring the ability to catch a file system function working with a specific file when possibly hundreds of files were in use. So how do you gain the automation needed for this type of debugging? The answer is debugger scripts.
Let’s run through an example with Windows Explorer.exe using this method. In this scenario we’ll attempt to catch Explorer.exe opening the HKEY_LOCAL_MACHINE\SYSTEM\Setup key. Windows Explorer opens several keys per second, so the focus is to ‘break in’ when this specific key is passed to RegOpenKeyEx() without having to manually wade through hundreds of possible keys passed into the function.
1. Identify the function in question.
Per MSDN, RegOpenKeyEx is defined as the following, and we’re interested in the second parameter because it contains the name of the key.
LONG WINAPI RegOpenKeyEx(
__in HKEY hKey,
__in_opt LPCTSTR lpSubKey,
__reserved DWORD ulOptions,
__in REGSAM samDesired,
__out PHKEY phkResult
);
2. Create the debugger script
The script logic is straight forward. Our debugger script will break in every time RegOpenKeyEx() is called. It then checks the second parameter for the string SYSTEM\Setup. If it finds a match it stays broken in, otherwise it returns command to the running thread. Our script will look like this -
r$t0=poi(esp+8)
.if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }
.if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }
r$t0 is considered a pseudo-register and is used as a variable in this script. Since the second parameter of RegOpenKeyEx() is passed via esp+8, our debugger command program will dereference esp+8 and move it into r$t0. Our pseudo-register now holds the memory address of lpSubKey. The second line of the program simply checks to see if r$t0 is NULL, and will setup our alias named _str at the address held in r$t0 if r$t0 is not NULL. The third line uses the MASM string operator, $spat, looking for a string pattern which matches system\setup. If it doesn’t find a match it prints the string, and returns command back to the executing thread. It’s the equivalent of typing ‘g’ in the debugger. If the string pattern system\setup is found, the string is printed out on the screen, and the debugger remains broken in. Then we can debug the underlying problem in the appropriate context. Now save them into a txt located at c:\debuggers\strings.txt so we can reference it from our breakpoint.
3. Set the break point.
Next we break into the Explorer process from the kernel debugger
kd> !process 0 0 Explorer.exe
PROCESS 8340d6a8 SessionId: 1 Cid: 0270 Peb: 7ffd5000 ParentCid: 01b0
DirBase: 2d48f000 ObjectTable: 90171348 HandleCount: 559.
Image: explorer.exe
kd> !bpid 0270
Finding wininit.exe (1)...
Finding winlogon.exe (1)...
Waiting for winlogon.exe to break. This can take a couple of minutes...
Break instruction exception - code 80000003 (first chance)
Break into process 270 set. The next break should be in the desired process.
ntdll!DbgBreakPoint:
77f17dfe cc int 3
Then we set a breakpoint on ADVAPI32!RegOpenKeyExW instructing it to use our script to evaluate whether SYSTEM\Setup is the key passed in at the second parameter. If it finds a match to our key it stays broken in otherwise execution is passed back to the running Explorer thread.
kd> bp ADVAPI32!RegOpenKeyExW "$$<c:\\debuggers\\strings.txt"
kd> bl
0 e 76e8f09d 0001 (0001) ADVAPI32!RegOpenKeyExW "$$<c:\debuggers\strings.txt"
This is the output we would expect to see for all of the non-matches.
kd> g
kd> r$t0=poi(esp+8)
kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }
kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }
System\CurrentControlSet\Services\DnsCache\Parameters
Software\Policies\Microsoft\Windows NT\DnsClient
(Several entries have been removed here to conserve space in the blog)
System\CurrentControlSet\Services\DNS
System\Setup
ADVAPI32!RegOpenKeyExW:
76e8f09d 8bff mov edi,edi
Bingo! It detected that System\Setup was the string passed in and remained broken in allowing us to start any necessary debugging.
kd> kv
ChildEBP RetAddr Args to Child
077aeae0 760dfb75 80000002 760dfb94 00000000 ADVAPI32!RegOpenKeyExW (FPO: [Non-Fpo])
WARNING: Frame IP not in any known module. Following frames may be wrong.
077aed18 760dfbb7 00000004 00000000 0000003a 0x760dfb75
077aee20 7678c1b2 00000000 00000000 00000000 0x760dfbb7
077aee24 00000000 00000000 00000000 00000000 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
kd> dc 760dfb94
760dfb94 00790053 00740073 006d0065 0053005c S.y.s.t.e.m.\.S.
760dfba4 00740065 00700075 ff530000 4ce81075 e.t.u.p...S.u..L
760dfbb4 8bffffff 0ffb3bf8 0032c284 89fb3b00 .....;....2..;..
760dfbc4 840ff47d 0000012b f875ff56 013ee857 }...+...V.u.W.>.
760dfbd4 f08b0000 840ff33b 000000f2 0000c3e9 ....;...........
760dfbe4 90909000 ff8b9090 83ec8b55 458b1cec ........U......E
760dfbf4 db335308 fc458956 e8f45d89 fffff583 .S3.V.E..]......
760dfc04 ff18758b 1e891475 fff5a9e8 89c33bff .u..u........;..
Debugger scripts are another cool tool to add to your debugging arsenal. I’ve personally benefited from this type of scripting many times. Again check the debugger help files for more information on the available commands.
Hello all, Scott Olson here again to share another interesting issue I worked on a while back. The issue was that after upgrading to Windows XP Service Pack 2 the system would experience random bug checks with memory corruption. Interestingly, there was a very specific pattern to the corruption - it looked like a PFN address and flags were randomly placed into the page table page in several places in the process. The memory manager would never do this type of thing and I suspected that a driver was editing user page table pages, which should never be done.
Let's take a look at the stack:
kd> kb *** Stack trace for last set context - .thread/.cxr resets it ChildEBP RetAddr Args to Child f15b1308 80523096 c00862d8 10c5b000 00000000 nt!MiDeletePte+0x198 f15b13d0 80519776 000001d8 10d20fff 00000000 nt!MiDeleteVirtualAddresses+0x164 f15b13ec 805b1d74 10c20000 10d20fff f15b14a4 nt!MiDeleteFreeVm+0x20 f15b148c 8054060c ffffffff 049c6aa8 049c6ab0 nt!NtFreeVirtualMemory+0x42e f15b148c 7c90eb94 ffffffff 049c6aa8 049c6ab0 nt!KiFastCallEntry+0xfc 03e4a398 7c90da54 7c8209b3 ffffffff 049c6aa8 ntdll!KiFastSystemCallRet 03e4a39c 7c8209b3 ffffffff 049c6aa8 049c6ab0 ntdll!NtFreeVirtualMemory+0xc
Here is the page table entry for the virtual address:
kd> !pte 10c5b000 VA 10c5b000 PDE at 00000000C0600430 PTE at 00000000C00862D8 contains 000000003FC6F867 contains 0000000015E0086F pfn 3fc6f ---DA--UWEV pfn 15e00 ---DA-TUWEV
This shows that the value 15e0086f is incorrectly put into the page table pages. This bad value corresponds to a write-through mapping to a range allocated via a call to MmAllocatePagesForMdl.
c00862d0 00000000 00000000 15e0086f 00000000 c00862e0 00000000 00000000 00000000 00000000 c00862f0 00000000 00000000 00000000 00000000 c0086300 00000000 00000000 00000000 00000000 c0086310 00000000 00000000 00000000 00000000 c0086320 00000000 00000000 00000000 00000000 c0086330 00000000 00000000 00000000 00000000 c0086340 00000000 00000000 00000000 00000000 c0086350 00000000 00000000 15e0086f 00000000 c0086360 00000000 00000000 00000000 00000000 c0086370 00000000 00000000 00000000 00000000 c0086380 00000000 00000000 00000000 00000000 c0086390 00000000 00000000 00000000 00000000 c00863a0 00000000 00000000 00000000 00000000 c00863b0 00000000 00000000 00000000 00000000 c00863c0 00000000 00000000 00000000 00000000 c00863d0 00000000 00000000 00000000 00000000 c00863e0 00000000 00000000 00000000 00000000 c00863f0 00000000 00000000 00000000 00000000 c0086400 00000000 00000000 00000000 00000000 c0086410 00000000 00000000 00000000 00000000 c0086420 00000000 00000000 00000000 00000000 c0086430 00000000 00000000 00000000 00000000 c0086440 00000000 00000000 00000000 00000000 c0086450 00000000 00000000 00000000 00000000 c0086460 15e0086f 00000000 00000000 00000000 c0086470 00000000 00000000 00000000 00000000 c0086480 00000000 00000000 00000000 00000000 c0086490 00000000 00000000 00000000 00000000 c00864a0 00000000 00000000 00000000 00000000 c00864b0 00000000 00000000 00000000 00000000 c00864c0 00000000 00000000 00000000 00000000 c00864d0 00000000 00000000 00000000 00000000 c00864e0 15e0086f 00000000 00000000 00000000 c00864f0 00000000 00000000 00000000 00000000 c0086500 00000000 00000000 00000000 00000000 c0086510 00000000 00000000 00000000 00000000 c0086520 00000000 00000000 00000000 00000000 c0086530 00000000 00000000 00000000 00000000 c0086540 00000000 00000000 00000000 00000000 c0086550 00000000 00000000 00000000 00000000 c0086560 15e0086f 00000000 00000000 00000000 c0086570 00000000 00000000 00000000 00000000 c0086580 00000000 00000000 00000000 00000000 c0086590 00000000 00000000 00000000 00000000 c00865a0 00000000 00000000 00000000 00000000 c00865b0 00000000 00000000 00000000 00000000 c00865c0 00000000 00000000 00000000 00000000 c00865d0 00000000 00000000 00000000 00000000 c00865e0 00000000 00000000 00000000 00000000 c00865f0 00000000 00000000 00000000 00000000 c0086600 00000000 00000000 00000000 00000000 c0086610 00000000 00000000 00000000 00000000 c0086620 00000000 00000000 00000000 00000000 c0086630 00000000 00000000 00000000 00000000 c0086640 00000000 00000000 00000000 00000000 c0086650 00000000 00000000 00000000 00000000 c0086660 00000000 00000000 15e0086f 00000000 c0086670 00000000 00000000 00000000 00000000 c0086680 00000000 00000000 00000000 00000000 c0086690 00000000 00000000 00000000 00000000 c00866a0 00000000 00000000 00000000 00000000 c00866b0 00000000 00000000 00000000 00000000 c00866c0 00000000 00000000 00000000 00000000 c00866d0 00000000 00000000 00000000 00000000 c00866e0 00000000 00000000 15e0086f 00000000 c00866f0 00000000 00000000 00000000 00000000 c0086700 00000000 00000000 00000000 00000000 c0086710 00000000 00000000 00000000 00000000 c0086720 00000000 00000000 00000000 00000000 c0086730 00000000 00000000 00000000 00000000 c0086740 00000000 00000000 00000000 00000000 c0086750 00000000 00000000 00000000 00000000 c0086760 00000000 00000000 15e0086f 00000000 c0086770 00000000 00000000 00000000 00000000 c0086780 00000000 00000000 00000000 00000000 c0086790 00000000 00000000 00000000 00000000 c00867a0 00000000 00000000 00000000 00000000 c00867b0 00000000 00000000 00000000 00000000 c00867c0 00000000 00000000 00000000 00000000 c00867d0 00000000 00000000 00000000 00000000 c00867e0 00000000 00000000 00000000 00000000
kd> !pfn 15e00 PFN 00015E00 at address 81BCA800 flink 00000000 blink / share count 00000001 pteaddress 000AF001 reference count 0002 Cached color 0 restore pte 00000080 containing page FFEDCB Active RW ReadInProgress WriteInProgress
The driver also has an outstanding call MmProbeAndLockPages call on the pages indicated by the reference count of 2. Thinking that this pfn value is incorrect I decided to search for this value and see what I could find.
kd> s -d 80000000 l?7fffffff 00015e00 8022d534 00015e00 0001f190 00041d50 0001f140 .^......P...@... 86cacbf4 00015e00 0000cd1c 0000cc27 0000cc08 .^......'....... 86e25cdc 00015e00 0a130005 e56c6946 00000000 .^......Fil.....
I found a few entries but the middle one looks like it could be an MDL allocation. So I verified this:
kd> !pool 86cacbf4 2 Pool page 86cacbf4 region is Nonpaged pool *86cacbd0 size: 80 previous size: 28 (Allocated) *Mdl Pooltag Mdl : Io, Mdls
Yes this is an MDL, let's inspect it:
kd> dt nt!_MDL 86cacbd8 +0x000 Next : (null) +0x004 Size : 32 +0x006 MdlFlags : 138 +0x008 Process : (null) +0x00c MappedSystemVa : 0x00004000 +0x010 StartVa : 0xf7baa000 +0x014 ByteCount : 0xfff +0x018 ByteOffset : 0
Notice that the page 15e00 is in the MDL’s page list.
kd> dd 86cacbd8+1c 86cacbf4 00015e00 0000cd1c 0000cc27 0000cc08 86cacc04 0000cc09 0000cc0a 0000cc0b 0000cbec 86cacc14 0000cbed 0000cbee 0000cbef 0000cbd0 86cacc24 0000cbd1 0000cbd2 0000cbd3 0000cbd4 86cacc34 0000cbd5 0000cbd6 00000000 00000000 86cacc44 00000000 00000000 00000000 00010010
Next I wanted to see if I could find a driver that may have references to this MDL and I found two:
kd> s -d 80000000 l?7fffffff 86cacbd8 86f9c6a0 86cacbd8 0000003d 00000000 0000636a ....=.......jc.. 86fc7e68 86cacbd8 00000001 00000001 00000000 ................
Now let's see who owns these
kd> !pool 86f9c6a0 2 Pool page 86f9c6a0 region is Nonpaged pool *86f9c618 size: d8 previous size: 30 (Allocated) *Crpt Pooltag Crpt : Memory corruption driver kd> !pool 86fc7e68 2 Pool page 86fc7e68 region is Nonpaged pool *86fc7e00 size: 98 previous size: 40 (Allocated) *Crpt Pooltag Crpt : Memory corruption driver
kd> !pool 86f9c6a0 2 Pool page 86f9c6a0 region is Nonpaged pool *86f9c618 size: d8 previous size: 30 (Allocated) *Crpt Pooltag Crpt : Memory corruption driver
kd> !pool 86fc7e68 2 Pool page 86fc7e68 region is Nonpaged pool *86fc7e00 size: 98 previous size: 40 (Allocated) *Crpt Pooltag Crpt : Memory corruption driver
This gives us a pretty convincing probability that this driver is at fault. So now you may ask, "Why did this problem only start after applying Service Pack 2?" By default when you install Server Pack 2, Data Execution Prevention (DEP) is enabled on systems that support it. The support for DEP is in the PAE kernel which uses extra bits to describe the page table entries. In this crash the solution was to disable DEP until the driver could be corrected. The driver was incorrectly using the memory mappings by ignoring the extra bits in the page number and causing the memory corruption by writing to the wrong page. For more information on default DEP settings and enabling/disabling it in Windows see the following article.
899298 The "Understanding Data Execution Prevention" help topic incorrectly states the default setting for DEP in Windows Server 2003 Service Pack 1 http://support.microsoft.com/default.aspx?scid=kb;EN-US;899298
As our group’s file systems expert, I see most of the common problems associated with the use/misuse of NTFS compression. Before you check the “Compress contents to save disk space” checkbox, it might be good to understand how this affects the happy little bits, bytes, & nibbles running around on the disk.
Often backup applications receive ERROR_DISK_FULL errors attempting to back up compressed files and this causes quite a bit of confusion when there are still several gigabytes of free space on the drive. Other issues may also occur when copying compressed files. The goal of this blog is to give the reader a more thorough understanding of what really happens when you compress NTFS files.
Compression Units…
NTFS uses a parameter called the “compression unit” to define the granularity and alignment of compressed byte ranges within a data stream. The size of the compression unit is based entirely on NTFS cluster size (refer to the table below for details). In the descriptions below, the abbreviation "CU" is used to describe a Compression Unit and/or its size.
The default size of the CU is 16 clusters, although the actual size of the CU really depends on the cluster size of the disk. Below is a chart showing the CU sizes that correspond to each of the valid NTFS cluster sizes.
Native NTFS compression does not function on volumes where the cluster size is greater than 4KB, but sparse file compression can still be used.
NTFS Sparse Files…
The Sparse Files features of NTFS allow applications to create very large files consisting mostly of zeroed ranges without actually allocating LCNs (logical clusters) for the zeroed ranges.
For the code-heads in the audience, this can be done by calling DeviceIoControl with the FSCTL_SET_SPARSE IO control code as shown below.
BOOL SetSparse(HANDLE hFile) { DWORD Bytes; return DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &Bytes, NULL); }
To specify a zeroed range, your application must then call the DeviceIoControl with the FSCTL_SET_ZERO_DATA IO control code.
BOOL ZeroRange(HANDLE hFile, LARGE_INTEGER RangeStart, LONGLONG RangeLength) { FILE_ZERO_DATA_INFORMATION FileZeroData; DWORD Bytes; FileZeroData.FileOffset.QuadPart = RangeStart.QuadPart; FileZeroData.BeyondFinalZero.QuadPart = RangeStart.QuadPart + RangeLength + 1; return DeviceIoControl( hFile, FSCTL_SET_ZERO_DATA, &FileZeroData, sizeof(FILE_ZERO_DATA_INFORMATION), NULL, 0, &Bytes, NULL); }
Because sparse files don't actually allocate space for zeroed ranges, a sparse file can be larger than the parent volume. To do this, NTFS creates a placeholder VCN (virtual cluster number) range with no logical clusters mapped to them.
Any attempt to access a “sparsed” range would result in NTFS returning a buffer full of zeroes. Accessing an allocated range would result in a normal read of the allocated range. When data is written to a sparse file, an allocated range is created which is exactly aligned with the compression unit boundaries containing the byte(s) written. Refer to the example below. If a single byte write occurs for virtual cluster number VCN 0x3a, then all of Compression Unit 3 (VCN 0x30 - 0x3f) would become an allocated LCN (logical cluster number) range. The allocated LCN range would be filled with zeroes and the single byte would be written to the appropriate byte offset within the target LCN.
[...] - ALLOCATED(,,,) - Compressed{ } - Sparse (or free) range / 00000000000000000000000000000000000000000000000000000000000000000000000000000000VCN 00000000000000001111111111111111222222222222222233333333333333334444444444444444 \ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef CU0 CU1 CU2 CU3 CU4 |++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++| { }[..............]{ }Extents { VCN = 0x0 LEN = 0x30 CU0 - CU2 VCN = 0x30 LEN = 0x10: LCN = 0x2a21f CU3 VCN = 0x10 LEN = 0x10 CU4}
Below is a screen shot of a 2GB file that was created using the sparse file API’s.
Notice that this volume is only 76.9 MB, yet it has a 2 GB file in the root folder.
If we attempt to duplicate this sparse file with the COPY command, it fails. This is because COPY doesn’t know how to duplicate the sparse attributes on the file, so it attempts to create a real 2GB file in the root of D:. This can occur in production when attempting to move a large database file from one volume to another. If you have a database application that uses sparse attributes, then it is a good practice to use the database software’s backup / restore features when moving the database to a different volume.
Now let’s look at the file’s properties with the COMPACT utility. Notice that the file shows up as compressed, and it has a huge compression ratio.
If you go back and look at the file properties from EXPLORER, you will notice that there is no compression checkbox (or any other indication that the file is compressed). This is because the shell does not check for the sparse bit on the file.
In short, use caution when moving sparse files from one location to another. Applications tell the file system the offsets of zeroed ranges, so you should always leave the management of sparse files to the application that created them. Moving or copying the sparse files manually may cause unexpected results.
NTFS Compression…
Now that we have discussed sparse files, we will move on to conventional NTFS compression.
NTFS compresses files by dividing the data stream into CU’s (this is similar to how sparse files work). When the stream contents are created or changed, each CU in the data stream is compressed individually. If the compression results in a reduction by one or more clusters, the compressed unit will be written to disk in its compressed format. Then a sparse VCN range is tacked to the end of the compressed VCN range for alignment purposes (as shown in the example below). If the data does not compress enough to reduce the size by one cluster, then the entire CU is written to disk in its uncompressed form.
This design makes random access very fast since only one CU needs to be decompressed in order to access any single VCN in the file. Unfortunately, large sequential access will be relatively slower since decompression of many CU’s is required to do sequential operations (such as backups).
In the example below, the compressed file consists of six sets of mapping pairs (encoded file extents). Three allocated ranges co-exist with three sparse ranges. The purpose of the sparse ranges is to maintain VCN alignment on compression unit boundaries. This prevents NTFS from having to decompress the entire file if a user wants to read a small byte range within the file. The first compression unit (CU0) is compressed by 12.5% (which makes the allocated range smaller by 2 VCNs). An additional free VCN range is added to the file extents to act as a placeholder for the freed LCNs at the tail of the CU. The second allocated compression unit (CU1) is similar to the first except that the CU compressed by roughly 50%.
NTFS was unable to compress CU2 and CU3, but part of CU4 was compressible by 69%. For this reason, CU2 & CU3 are left uncompressed while CU4 is compressed from VCNs 0x40 to 0x44. Thus, CU2, CU3, and CU4 are a single run, but the run contains a mixture of compressed & uncompressed VCNs.
NOTE: Each set of brackets represents a contiguous run of allocated or free space. One set of NTFS mapping pairs describes each set of brackets.
[...] - ALLOCATED(,,,) - Compressed{ } - Sparse (or free) range / 00000000000000000000000000000000000000000000000000000000000000000000000000000000VCN 00000000000000001111111111111111222222222222222233333333333333334444444444444444 \ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef CU0 CU1 CU2 CU3 CU4 |++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++| (,,,,,,,,,,,,){}(,,,,,,){ }[...............................,,,,){ }Extents { VCN = 0x0 LEN = 0xe : LCN = 0x29e32d CU0 VCN = 0xe LEN = 0x2 CU0 VCN = 0x10 LEN = 0x8 : LCN = 0x2a291f CU1 VCN = 0x18 LEN = 0x8 CU1 VCN = 0x20 LEN = 0x25 : LCN = 0x39dd49 CU2 - CU4 VCN = 0x45 LEN = 0xb CU4}
Now we come to the part where we describe limitations of this design. Below are some examples of what happens when things go wrong while reading / writing compressed files.
Disk full error during a backup read operation or file copy…
Failure to copy a large file to a compressed folder…
This is the most common problem seen with compression, and currently the solution is to educate users about limitations. NTFS compression creates approximately one file fragment for every 16 clusters of data. Because the max cluster size allowed for standard compression is 4K, the largest compression unit allowed is 64KB. In order to compress a 100 GB file into 64KB sections, you could potentially end up with 1,638,400 fragments. Encoding 1,638,400 fragments becomes problematic for the file system and can cause a failure to create the compressed file. On Vista and later, the file copy will fail with STATUS_FILE_SYSTEM_LIMITATION. On earlier releases, the status code would be STATUS_INSUFFICIENT_RESOURCES. If possible, avoid using compression on files that are large, or critical to system performance.
I received feedback from the NTFS Principal Development Lead about this blog. Fortunately, most of the feedback was good, but he asked that I add a maximum size recommendation. According to our development team’s research, 50-60 GB is a “reasonable size” for a compressed file on a volume with a 4KB cluster size. This “reasonable size” goes down sharply for volumes with smaller cluster sizes.
Compressing .VHD (Virtual Hard Disk) files causes slow virtual machine performance…
A popular hammer company received a call from a customer complaining “It sure hurts when I hit my thumb with your hammers!” Likewise, if you compress .VHD files, it is going to be a painful experience, so please compress responsibly.
Best Regards,
Dennis Middleton “The NTFS Doctor”
Hi NTDebuggers, this week’s puzzler just so happens to match its number: 0x000000006 = ERROR_INVALID_HANDLE. That said, let me give you a scenario and the challenge will be to provide the best action plan to isolate the problem. This should include an explanation of what types of code problems cause invalid handles.
Scenario 1 : You have a customer or client that is getting invalid handle errors. This is causing unhandled exceptions and the application is crashing. What action plan do you give the customer? In scenario 1, you don’t have access to the remote machine. Your customer will execute any action plan you give them, gather the data, and provide you with the results.
Scenario 2: You have full access to the process and you can debug it live on your own personal machine; you even have source access. What do you do?
Good luck and happy debugging.
[Update: our answer. Posted 5/27/2008]
Hi NTDebuggers, if you’ve been debugging for a while I’m sure you’ve run into an Invalid Handle Error 6 (aka ERROR_INVALID_HANDLE or “The handle is invalid”). If you see it in your debugger output it indicates you have tried to pass an invalid handle to a call that is either expecting a different handle type or a the value for the handle its self is not valid. To understand this error in more detail you have to understand what handles are and where they live in the operating system. Each process has a handle table in the kernel. Think of this as an array or table of objects in kernel mode. The handle values are actually multiples of four. Review http://blogs.msdn.com/oldnewthing/archive/2005/01/21/358109.aspx to see for what the bottom 3 bits are used for. For this illustration we will be numbering them 1,2,3 etc.
When you open a file you will get back a simple pointer-sized value that in the handle table points to a file object. In this case let’s say you opened a file and you received a handle value of 2. If you envision the handle table you might see something like this.
------------- USER ADDRESS SPACE FOR YOUR PROCESS -------------
Your handle for your open file is 2
------------- KERNEL ADDRESS SPACE FOR YOUR PROCESS -------------
Handle table
Handle [1]=Pointer to ->Reg
Handle [2]=Pointer to ->File <<<<<< Your handle is 2 It points to a file object
Handle [3]=Pointer to ->File
Handle [4]=Pointer to ->Reg
Handle [5]=Pointer to ->Process
Handle [6]=Pointer to ->Thread
Handle [7]=Pointer to ->Thread
Handle [8]=Pointer to ->File
Handle [9]=Pointer to ->File
In the debugger, you can see the handle table is referenced from the EPROCESS object for each process. In the !process output it is called the ObjectTable.
0: kd> !process 85df8da0
PROCESS 85df8da0 SessionId: 0 Cid: 0114 Peb: 7ffda000 ParentCid: 0268
DirBase: 0acc07e0 ObjectTable: e4b6a4c8 HandleCount: 49
Before we get into troubleshooting let’s look at a couple of scenarios for bad handles.
In the first bad handle scenario, let’s say your application closes handle 2 using CloseHandle(), however your application keeps the value 2 in a variable somewhere. In this scenario we will assume no other code has come along and opened a new object that is occupying element 2 of the handle table. If you go to make a file I/O call using handle 2 the object manager will catch this and your app will get an invalid handle exception. This is somewhat harmless but your app my crash if you don’t catch the exception and your I/O will obviously not complete.
In the second scenario we’ll say that your file I/O code closes the handle, and holds on to that handle value. Meanwhile another part of your application opens a registry key or some other object that is not a file object. Now your file I/O code goes to use handle 2 again. The problem is it’s not a handle to a file now. When the registry code opened the registry key element 2 was available so we now have a registry object in that location. If you go to do a file I/O on handle 2 you will get an invalid handle message.
Now in our final scenario, we close a file handle in our file I/O code and keep the handle around. In this case though we have different code that is also doing file I/O, It opens a file and gets handle value 2. Now you have big trouble because if the first file I/O code now writes to handle 2, and it should not be because it closed it! This handle now belongs to a different file. This means you have code basically doing a write into an unknown region for that code’s context and file format. This will result in file corruption and no invalid handle error.
So this really comes down to best practice. If you’re writing code and close a handle NULL IT OUT! That way you guarantee it will not be reused accidently at some other point.
CloseHandle(MyHandle);
MyHandle = NULL;
On to debugging: So how do you track this down? For the remote scenario you could give your customer the following action plan:
1. Start gflags and under the image file tab, enter your application name in the image edit control.
2. Check Create User mode stack trace database.
3. Start the process under windbg. “WinDBG processname.exe”
4. In windbg Run !htrace -enable
5. Do a sxe ch. This will cause windbg to break on an invalid handle exception.
6. Run sx to confirm “ch - Invalid handle – break” should be set.
7. Enter g for go.
8. Let the process run until you get an invalid handle message. The debugger should break in.
9. Now all you have to do is run !htrace
Htrace will dump out all the call stacks for each handle that was opened or closed.
You need to take the invalid handle value and search backward to see where it was last opened, and where it was closed before that. It’s likely that the close before the last open is your culprit code path. Make sure that suspect close nulls that handle value out so it does not reuse it.
In the live debug scenario following the same procedure. In this case you have the luxury of going back and setting a breakpoint in the offending call stack that freed the handle before the reuse. You can then figure out why it was not zeroed out.
Written by Jeff Dailey.
Hello NTDebuggers, I’d like to talk about a common issue we deal with on a regular basis. We are often tasked with finding what functions are using CPU within a user mode process / application. Typically a user will find an application that is using more CPU then they expect it to, and this can affect overall system performance and responsiveness.
For this exercise I’ve written some contrived sample code called EATCPU. It’s included at the bottom of the blog post. The following image of task manager shows EATCPU consuming 41% CPU time. A customer or user may tell you this does not “normally” happen. It’s always good to ask what “normal” is. In this case we will say normal is ~10%.
The best case scenario is a live debug on the process that is running at high CPU levels. If you’re lucky enough to have a customer / user that will let you do a remote debug, and the problem reproduces on demand, you can take the following action.
You need to install the debugging tools for Windows, and set your symbol path. If at all possible acquire the symbols for the application you are debugging. We’ll assume that you are the expert that supports said program. If it’s written in house, get the symbols from the developer. If it’s from a third party, that vendor may be willing to provide you with symbols for their product. Microsoft has most of the symbols for our products available on our symbol server on the internet (!sympath srv*DownstreamStore*http://msdl.microsoft.com/download/symbols.)
The next thing is to attach windbg.exe to the process in question.
From the debuggers directory, type TLIST, this will list your process. Get the process id and then run WinDBG.EXE –p PROCESSID, or if your debugging a program like EATCPU, you can run WINDBG C:\program\EATCPU.EXE.
After the debugger is attached or the process is started in the debugger, reproduce the problem.
Microsoft (R) Windows Debugger Version 6.8.0001.0
Copyright (c) Microsoft Corporation. All rights reserved.
***** WARNING: Your debugger is probably out-of-date.
***** Check http://dbg for updates.
CommandLine: eatcpu.exe
Symbol search path is: srv*C:\symbols*\\symbols\symbols
Executable search path is:
ModLoad: 00400000 0041a000 eatcpu.exe
ModLoad: 779b0000 77b00000 ntdll.dll
ModLoad: 76780000 76890000 C:\Windows\syswow64\kernel32.dll
ModLoad: 62bb0000 62cd1000 C:\Windows\WinSxS\x86_microsoft.vc80.debugcrt_1fc8b3b9a1e18e3b_8.0.50727.762_none_24c8a196583ff03b\MSVCR80D.dll
ModLoad: 75cb0000 75d5a000 C:\Windows\syswow64\msvcrt.dll
(1090.164): Break instruction exception - code 80000003 (first chance)
eax=00000000 ebx=00000000 ecx=712b0000 edx=00000000 esi=fffffffe edi=77a90094
eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
779c0004 cc int 3
0:000> g
(1090.11d4): Break instruction exception - code 80000003 (first chance)
eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000
eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0 nv up ei pl zr na pe nc
0:007> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
0:007> g
(1090.17d4): Break instruction exception - code 80000003 (first chance)
Once the problem has started run the !runaway command.
0:007> !runaway
User Mode Time
Thread Time
2:c04 0 days 0:01:08.827 ß Note this thread, thread 2:c04 is using more CPU than any other.
7:17d4 0 days 0:00:00.000 ß Note the other threads are using very little if any CPU.
6:1a4c 0 days 0:00:00.000
5:d20 0 days 0:00:00.000
4:157c 0 days 0:00:00.000
3:1b28 0 days 0:00:00.000
1:1134 0 days 0:00:00.000
0:164 0 days 0:00:00.000
0:007> ~2s ß Using the thread number 2, set the thread context with the ~Ns command.
*** WARNING: Unable to verify checksum for eatcpu.exe
eax=cccccccc ebx=00b93c48 ecx=0000002b edx=00b937a8 esi=00000000 edi=00d9fcf0
eip=0041169c esp=00d9fcd0 ebp=00d9fd9c iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
eatcpu!checkSomething+0x1c:
0041169c f3ab rep stos dword ptr es:[edi] es:002b:00d9fcf0=cccccccc
0:002> k ß Dump the call stack using k.
If you look at the following call stack, the applications code in this thread starts where you see EATCPU, the code before that is C Runtime code for begin thread. I want to trace all the code that is running under being thread. The assumption here is that I’ll find something looping and eating CPU. To do this I will use the WT command. However first I need to specify a beginning address for WT to start trace at.
Let's use UF to Unassemble the Function that started our code by taking the return address of eatcpu!myThreadFunction.
ChildEBP RetAddr
00d9fd9c 00411657 eatcpu!checkSomething+0x1c [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]
00d9fe74 004115a8 eatcpu!trySomething+0x27 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]
00d9ff58 62bb4601 eatcpu!myThreadFunction+0x38 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]
00d9ff94 62bb459c MSVCR80D!_beginthread+0x221
00d9ffa0 768019f1 MSVCR80D!_beginthread+0x1bc
00d9ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe
00d9ffec 00000000 ntdll!_RtlUserThreadStart+0x23 [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]
0:002> uf 004115a8 ß This command will unassemble the function at this address beginning to end.
0:007> uf 004115a8
eatcpu!myThreadFunction [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 30]:
30 00411570 55 push ebp
30 00411571 8bec mov ebp,esp
30 00411573 81eccc000000 sub esp,0CCh
30 00411579 53 push ebx
30 0041157a 56 push esi
30 0041157b 57 push edi
30 0041157c 8dbd34ffffff lea edi,[ebp-0CCh]
30 00411582 b933000000 mov ecx,33h
30 00411587 b8cccccccc mov eax,0CCCCCCCCh
30 0041158c f3ab rep stos dword ptr es:[edi]
31 0041158e 8b4508 mov eax,dword ptr [ebp+8]
31 00411591 8945f8 mov dword ptr [ebp-8],eax
eatcpu!myThreadFunction+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 33]:
33 00411594 b801000000 mov eax,1
33 00411599 85c0 test eax,eax
33 0041159b 7410 je eatcpu!myThreadFunction+0x3d (004115ad)
eatcpu!myThreadFunction+0x2d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]:
35 0041159d 8b45f8 mov eax,dword ptr [ebp-8]
35 004115a0 8b08 mov ecx,dword ptr [eax]
35 004115a2 51 push ecx
35 004115a3 e880faffff call eatcpu!ILT+35(?trySomethingYAHHZ) (00411028)
35 004115a8 83c404 add esp,4
36 004115ab ebe7 jmp eatcpu!myThreadFunction+0x24 (00411594)
eatcpu!myThreadFunction+0x3d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 37]:
37 004115ad 5f pop edi
37 004115ae 5e pop esi
37 004115af 5b pop ebx
37 004115b0 81c4cc000000 add esp,0CCh
37 004115b6 3bec cmp ebp,esp
37 004115b8 e8a1fbffff call eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)
37 004115bd 8be5 mov esp,ebp
37 004115bf 5d pop ebp
37 004115c0 c3 ret
0:002> wt -or 00411570 ß We will use WT to Watch Trace this code. I’ve selected the starting address of the myThreadFunction function. I’ve also specified –or to print the return value of each function. Wt produces very visual output. It allows you to quickly identify patterns in the way the code executes that would be much more difficult just using the T (TRACE) command.
8 0 [ 0] ntdll!RtlSetLastWin32Error eax = 0
>> No match on ret
8 0 [ 0] ntdll!RtlSetLastWin32Error
2 0 [ 0] eatcpu!checkSomething
1 0 [ 1] eatcpu!ILT+345(__RTC_CheckEsp)
2 0 [ 1] eatcpu!_RTC_CheckEsp eax = 0
9 3 [ 0] eatcpu!checkSomething
12 6 [ 0] eatcpu!checkSomething eax = 0
12 6 [ 0] eatcpu!checkSomething
7 0 [ 0] eatcpu!trySomething
10 3 [ 0] eatcpu!trySomething eax = 0
10 3 [ 0] eatcpu!trySomething
9 0 [ 0] eatcpu!myThreadFunction ß I see a pattern, a loop. Beginning of loop.
1 0 [ 1] eatcpu!ILT+35(?trySomethingYAHHZ)
60 0 [ 1] eatcpu!trySomething
1 0 [ 2] eatcpu!ILT+180(?checkSomethingYAHHZ)
62 0 [ 2] eatcpu!checkSomething
5 0 [ 3] kernel32!SetLastError
16 0 [ 3] ntdll!RtlSetLastWin32Error eax = 0
64 21 [ 2] eatcpu!checkSomething
1 0 [ 3] eatcpu!ILT+345(__RTC_CheckEsp)
2 0 [ 3] eatcpu!_RTC_CheckEsp eax = 0
71 24 [ 2] eatcpu!checkSomething
74 27 [ 2] eatcpu!checkSomething eax = 0
67 102 [ 1] eatcpu!trySomething
1 0 [ 2] eatcpu!ILT+345(__RTC_CheckEsp)
2 0 [ 2] eatcpu!_RTC_CheckEsp eax = 0
70 105 [ 1] eatcpu!trySomething eax = 0
18 176 [ 0] eatcpu!myThreadFunction ß End of loop / beginning of loop
5 0 [ 3] kernel32!SetLastError ß Always look for what might be going wrong! Last error can give you a clue. We are setting last error at the low level of the loop
74 27 [ 2] eatcpu!checkSomething eax = 0 ß Also note checkSomething is returning ZERO, this might indicate a problem. You need to look at the code or assembler.
27 352 [ 0] eatcpu!myThreadFunction ß End of loop / beginning of loop
36 528 [ 0] eatcpu!myThreadFunction ß End of loop / beginning of loop
12930 instructions were executed in 12929 events (0 from other threads)
Function Name Invocations MinInst MaxInst AvgInst
eatcpu!ILT+180(?checkSomethingYAHHZ) 69 1 1 1
eatcpu!ILT+345(__RTC_CheckEsp) 210 1 1 1
eatcpu!ILT+35(?trySomethingYAHHZ) 70 1 1 1
eatcpu!_RTC_CheckEsp 210 2 2 2
eatcpu!checkSomething 70 60 74 73
eatcpu!myThreadFunction 1 630 630 630
eatcpu!trySomething 71 10 70 68
kernel32!SetLastError 70 5 5 5
ntdll!RtlSetLastWin32Error 70 16 16 16
0 system calls were executed
eax=cccccccc ebx=00b93c48 ecx=00000002 edx=00b937a8 esi=00000000 edi=00d9fe6c
eip=0041164c esp=00d9fda8 ebp=00d9fe74 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
eatcpu!trySomething+0x1c:
0041164c f3ab rep stos dword ptr es:[edi] es:002b:00d9fe6c=cccccccc
0:002> !gle ß Now that we have broken in let's check and see what the last error is using !GLE (Get Last Error) This dumps out the last error from the TEB.
LastErrorValue: (Win32) 0x57 (87) - The parameter is incorrect.
LastStatusValue: (NTSTATUS) 0xc000000d - An invalid parameter was passed to a service or function.
0:007> bp kernel32!SetLastError ß Lets set a breakpoint on last error to examine what is going on in the function calling it.
Breakpoint 1 hit
eax=cccccccc ebx=00b93c48 ecx=00000000 edx=00b937a8 esi=00d9fcd0 edi=00d9fd9c
eip=767913dd esp=00d9fcc8 ebp=00d9fd9c iopl=0 nv up ei pl zr na pe nc
kernel32!SetLastError:
767913dd 8bff mov edi,edi
0:002> kv ß Get the call stack
0:002> kv
00d9fcc4 004116cb 00000057 00d9fe74 00000000 kernel32!SetLastError (FPO: [Non-Fpo]) ß 0x57 Invalid parameter error, Why?
00d9fd9c 00411657 00000000 00d9ff58 00000000 eatcpu!checkSomething+0x4b (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]
00d9fe74 004115a8 00000000 00000000 00000000 eatcpu!trySomething+0x27 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]
00d9ff58 62bb4601 0017ff34 4f9f12e9 00000000 eatcpu!myThreadFunction+0x38 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]
00d9ff94 62bb459c 00b937a8 00d9ffac 768019f1 MSVCR80D!_beginthread+0x221 (FPO: [Non-Fpo])
00d9ffa0 768019f1 00b937a8 00d9ffec 77a2d109 MSVCR80D!_beginthread+0x1bc (FPO: [Non-Fpo])
00d9ffac 77a2d109 00b93c48 00d926a6 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
00d9ffec 00000000 62bb4520 00b93c48 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo]) (CONV: stdcall) [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]
0:002> !error 00000057 ß double check, using !error, this will decode the error into a human readable string.
Error code: (Win32) 0x57 (87) - The parameter is incorrect.
0:002> uf checkSomething ß lets disassemble the function calling SetLastError.
eatcpu!checkSomething [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]:
49 00411680 55 push ebp
49 00411681 8bec mov ebp,esp
49 00411683 81ecc0000000 sub esp,0C0h
49 00411689 53 push ebx
49 0041168a 56 push esi
49 0041168b 57 push edi
49 0041168c 8dbd40ffffff lea edi,[ebp-0C0h]
49 00411692 b930000000 mov ecx,30h
49 00411697 b8cccccccc mov eax,0CCCCCCCCh
49 0041169c f3ab rep stos dword ptr es:[edi]
50 0041169e 837d0800 cmp dword ptr [ebp+8],0 ß Check what our first parameter is on the stack, EBP+8 remember PLUS == Parameters. Note looking at the stack it’s 00000000
50 004116a2 741d je eatcpu!checkSomething+0x41 (004116c1) ß if parameter 1 ones 0 we are going to jump to this addres, else we execute the following code. (WE JUMP) eatcpu!checkSomething+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 52]:
52 004116a4 8bf4 mov esi,esp ß The green would have been the good code path, non errant.
52 004116a6 68fa000000 push 0FAh
52 004116ab ff15a8814100 call dword ptr [eatcpu!_imp__Sleep (004181a8)] ß Note we sleep or do some work other then eat CPU here if we are passed non ZERO
52 004116b1 3bf4 cmp esi,esp
52 004116b3 e8a6faffff call eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)
53 004116b8 b801000000 mov eax,1 ß We are setting EAX to 1, this means we have succeded
53 004116bd eb15 jmp eatcpu!checkSomething+0x54 (004116d4) ß Now we jump to the clean up code for the fucntionn
eatcpu!checkSomething+0x41 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]:
57 004116c1 8bf4 mov esi,esp ß This appears to be a failure case. We did not get an expected parameter so we report an error and return zero.
57 004116c3 6a57 push 57h ß Pusing error 0x57 on the stack, invalid parameter.
57 004116c5 ff15a4814100 call dword ptr [eatcpu!_imp__SetLastError (004181a4)] ß Our call to setlasterror
57 004116cb 3bf4 cmp esi,esp
57 004116cd e88cfaffff call eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)
58 004116d2 33c0 xor eax,eax ß XORing eax with eax will make EAX Zero. This is an error condition.
eatcpu!checkSomething+0x54 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 60]:
60 004116d4 5f pop edi
60 004116d5 5e pop esi
60 004116d6 5b pop ebx
60 004116d7 81c4c0000000 add esp,0C0h
60 004116dd 3bec cmp ebp,esp
60 004116df e87afaffff call eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)
60 004116e4 8be5 mov esp,ebp
60 004116e6 5d pop ebp
60 004116e7 c3 ret
The key thing that should be observed by this scenario is that when dealing with a high CPU condition there is often some problem at the lower level of some loop condition that prevents the proper execution of code from happening. If you’re lucky the problem is reported by some error facility in the OS or the application. In either case you can use the above technique for isolation.
The following is the sample code for EATCPU.
// eatcpu.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <process.h>
void myThreadFunction(void *);
int doSomething(int);
int trySomething(int);
int checkSomething(int);
int _tmain(int argc, _TCHAR* argv[])
{
int truevalue = 1;
int falsevalue = 0;
// let's start some threads. These should represent worker threads in a process. Some of them will do “valid work” one will fail to do so.
_beginthread(myThreadFunction,12000,(void *)&truevalue);
_beginthread(myThreadFunction,12000,(void *)&falsevalue );
Sleep(10*60000);
return 0;
}
void myThreadFunction(void *value)
int *localvalue = (int *)value;
while(1)
trySomething(*localvalue);
int doSomething(int value)
return trySomething(value);
int trySomething(int value)
return checkSomething(value);
int checkSomething(int value)
if(value)
{ // Make sure we have have some valid input parameter.
//We will pretend we are doing work, this could be anything , file I/O etc.
Sleep(250);
return TRUE;
else
{ // This is an error condition, this function expects an NON Zero parameter and will report ZERO as an invalid parameter.
SetLastError(ERROR_INVALID_PARAMETER);
return FALSE;
Hello, my name is Ron Stock and I’m an Escalation Engineer on the Microsoft Platforms Global Escalation Services Team. Today I’m going to talk about pool corruption which manifests itself in various ways. It’s usually hard to track down because the culprit is long gone when the machine crashes. Tools such as Special Pool make our debug lives easier; however tracking down corruption doesn’t always have to make you pull your hair out. In some cases simply re-tracing the steps of the crash can reveal a smoking gun.
Let’s take a look at a real world example. First we need to be in the right context so we set the trap frame to give us the register context when the machine crashed.
2: kd> .trap 0xfffffffff470662c
ErrCode = 00000002
eax=35303132 ebx=fd24d640 ecx=fd24d78c edx=fd24d784 esi=fd24d598 edi=fd24d610
eip=e083f7a5 esp=f47066a0 ebp=f47066e0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202
nt!KeWaitForSingleObject+0x25b:
e083f7a5 ff4818 dec dword ptr [eax+18h] ds:0023:3530314a=????????
From the register output we can tell that the system crashed while attempting to dereference a pointer at memory location [eax+18h]. The value stored in register eax is probably the address of a structure given that the code is attempting to dereference offset 18 from the base of eax. Currently eax is pointing to 0x35303132 which is clearly not a valid kernel mode address. Most kernel mode addresses on 32-bit systems will be above the 0x80000000 range assuming the machine is not using something like the /3GB switch. Our mission now is to determine how eax was set.
First we’ll unassemble the failing function using the UF command.
2: kd> uf nt!KeWaitForSingleObject
…..
……
e083f7a5 ff4818 dec dword ptr [eax+18h]
e083f7a8 8b4818 mov ecx,dword ptr [eax+18h]
e083f7ab 3b481c cmp ecx,dword ptr [eax+1Ch]
e083f7ae 0f836ef9ffff jae nt!KeWaitForSingleObject+0x2a3 (e083f122)
I truncated the results of the UF output to conserve space in this blog. Instruction e083f7a5 is the line of code that generated the fault so our focus is to determine how the value of eax was set prior to running instruction e083f7a5. Based on the UF output, instruction e083f11c could have jumped to e083f7a5. Let’s investigate how eax is set before instruction e083f11c jumped to the failing line.
nt!KeWaitForSingleObject+0x244:
e083f107 8d4208 lea eax,[edx+8]
e083f10a 8b4804 mov ecx,dword ptr [eax+4]
e083f10d 8903 mov dword ptr [ebx],eax
e083f10f 894b04 mov dword ptr [ebx+4],ecx
e083f112 8919 mov dword ptr [ecx],ebx
e083f114 895804 mov dword ptr [eax+4],ebx
e083f117 8b4668 mov eax,dword ptr [esi+68h]
e083f11a 85c0 test eax,eax
e083f11c 0f8583060000 jne nt!KeWaitForSingleObject+0x25b (e083f7a5) ß--Jump
Instruction e083f117 moves a value into eax so I’m dumping the value here.
2: kd> dd esi+68h l1
fd24d600 35303132
Bingo! There’s our bad value of 35303132 which is the value of the eax register too, so we probably took this code path. Just to confirm the current value of eax, I’m dumping the register which should mirror the results for eax when using the “r” command to get the full register set.
2: kd> r eax
Last set context:
eax=35303132
Now our focus moves to why dword ptr [esi+68h] points to the bad value? Without source code this can be challenging to narrow down, however the !pool command comes in handy for cases like this.
2: kd> ? esi+68h
Evaluate expression: -47917568 = fd24d600
Let’s examine fd24d600 a little more in detail using the !pool command. The !pool command neatly displays an entire page of 4k kernel memory listing all of the allocations contained on the page. From the output we can determine that our address is allocated from NonPaged pool and holds some sort of thread data, evidenced by the Thre tag next to our allocation. Notice the asterisk next to fd24d578 indicating the start of our pool. Virtual address fd24d578 is the beginning of an 8 byte pool header, and the header is followed by the actual data blob. Be aware that not all memory is allocated from the pool so the !pool command is not always useful. I have more information on !pool later in the blog.
2: kd> !pool fd24d600
Pool page fd24d600 region is Nonpaged pool
fd24d000 size: 270 previous size: 0 (Allocated) Thre (Protected)
fd24d270 size: 10 previous size: 270 (Free) `.lk
fd24d280 size: 40 previous size: 10 (Allocated) Ntfr
fd24d2c0 size: 20 previous size: 40 (Free) CcSc
fd24d2e0 size: 128 previous size: 20 (Allocated) PTrk
fd24d408 size: 128 previous size: 128 (Allocated) PTrk
fd24d530 size: 8 previous size: 128 (Free) Mdl
fd24d538 size: 28 previous size: 8 (Allocated) Ntfn
fd24d560 size: 18 previous size: 28 (Free) Muta
*fd24d578 size: 270 previous size: 18 (Allocated) *Thre (Protected) ß-our pool
fd24d7e8 size: 428 previous size: 270 (Allocated) Mdl
fd24dc10 size: 30 previous size: 428 (Allocated) Even (Protected)
fd24dc40 size: 30 previous size: 30 (Allocated) TCPc
fd24dc70 size: 18 previous size: 30 (Free) SeTd
fd24dc88 size: 28 previous size: 18 (Allocated) Ntfn
fd24dcb0 size: 128 previous size: 28 (Allocated) PTrk
fd24ddd8 size: 228 previous size: 128 (Allocated) tdLL
I’ll dump out the contents of the allocation using the dc command starting at the pool header for this block of memory. Remember, we expect to move a value from [esi+68] into eax. Later the code dereferences [eax+18] which leads me to believe that eax is the base of a structure. So we expect a valid Kernel mode value to be moved into eax rather than something like a string, otherwise the code wouldn’t dereference an offset.
2: kd> dc fd24d578
fd24d578 0a4e0003 e5726854 00000003 00000002 ..N.Thr.........
fd24d588 eb10ee70 20000000 e08b5c60 eb136f96 p...... `\...o..
fd24d598 006e0006 00000000 fd24d5a0 fd24d5a0 ..n.......$...$.
fd24d5a8 fd24d5a8 fd24d5a8 f4707000 f4704000 ..$...$..pp..@p.
fd24d5b8 f4706d48 00000000 fd24d700 fd24d700 Hmp.......$...$.
fd24d5c8 fd24d5c8 fd24d5c8 fd270290 01000100 ..$...$...'.....
fd24d5d8 00000002 00000000 00000001 01000a02 ................
fd24d5e8 00000000 fd24d640 32110000 0200009f ....@.$....2....
2: kd> dc
fd24d5f8 00000000 20202020 32313532 000a6953 .... 25125Si.. <-- appears to be a string.
fd24d608 20202020 20202020 20202020 5c4e4556 VEN\
fd24d618 32313532 20202020 20202020 20202020 2512
fd24d628 00000000 00000000 00000000 00000000 ................
fd24d638 00000000 00000000 fd24d78c fd24d78c ..........$...$.
fd24d648 00000000 fd24d784 fd24d640 30010000 ......$.@.$....0
fd24d658 00343033 00000000 00000000 00000000 304.............
fd24d668 00000000 01000000 00000000 00000000 ................
fd24d678 fd24d598 00000000 00000000 00000000 ..$.............
fd24d688 fd24d618 fd24d618 fd24d598 fd24d610 ..$...$...$...$.
fd24d698 00000000 00010102 00000000 00000000 ................
fd24d6a8 00000000 00000000 e08aeee0 00000000 ................
fd24d6b8 00000801 0000000f fd270290 0000000f ..........'.....
fd24d6c8 fd24d5c0 fd24d6d0 00000000 00000000 ..$...$.........
fd24d6d8 00000000 00000000 00000000 00000000 ................
fd24d6e8 00000000 00000000 f4707000 06300612 .........pp...0.
Examining the memory contents above you can clearly see a string overwrite starting around 0xfd24d5f8. The memory we dereferenced, fd24d600 or [esi+68], is right in the middle of the string. The string appears to be a vendor number for a piece of hardware. After examining the setupapi.log and the OEM**.inf files in the Windows\inf directory we found a similar string and narrowed it down to a third party.
A little more on the !pool command is important to mention. The memory address of interest may not always be allocated from the pool in which case you would encounter a message similar to this.
0: kd> !pool 80000ae5
Pool page 80000ae5 region is Unknown
80000000 is not a valid large pool allocation, checking large session pool...
80000000 is freed (or corrupt) pool
Bad allocation size @80000000, too large
***
*** An error (or corruption) in the pool was detected;
*** Pool Region unknown (0xFFFFFFFF80000000)
*** Use !poolval 80000000 for more details.
If this had been the case I would have enabled Special Pool to narrow down the culprit.
A trend that I’ve noticed recently are cases involving paged pool depletion with high MmSt tag usage that remains after trying KB304101 (PoolUsageMaximum). These pool allocations are used by the memory manager for section object prototype PTEs. There are generally only two options when this happens: 1) upgrade to a 64-bit platform, or 2) reduce the size of the volumes. But we may want to know what mapped files are using this memory. Here is how it can be done. Start with !memusage.
5: kd> !memusage
loading PFN database
loading (100% complete)
Compiling memory usage data (99% Complete).
Zeroed: 19073 ( 76292 kb)
Free: 0 ( 0 kb)
Standby: 1468824 (5875296 kb)
Modified: 368 ( 1472 kb)
ModifiedNoWrite: 1927 ( 7708 kb)
Active/Valid: 605772 (2423088 kb)
Transition: 0 ( 0 kb)
Bad: 0 ( 0 kb)
Unknown: 0 ( 0 kb)
TOTAL: 2095964 (8383856 kb)
Building kernel map
Finished building kernel map
Scanning PFN database - (100% complete)
Following this you will see the list of mapped files and their control areas.
Usage Summary (in Kb):
Control Valid Standby Dirty Shared Locked PageTables name
…
8c62a638 1108 945868 3064 0 0 0 mapped_file( $Mft )
The control area is the address at the far left and has a Segment field that contains the total number of PTEs.
5: kd> dt 8c62a638 _CONTROL_AREA Segment->TotalNumberOfPtes
nt!_CONTROL_AREA
+0x000 Segment :
+0x004 TotalNumberOfPtes : 0x1e8b00
The MmSt allocations contain these PTEs so all we need to do is multiply this by the size of a PTE to get the total size of the MmSt allocations for this control area. Note that there may be multiple allocations for this control area, but this number will reflect the total size all these allocations.
5: kd> ?? 0x1e8b00 * sizeof(nt!_MMPTE)
unsigned int 0xf45800
So now we know the MmSt size in bytes for a single control area, or mapped file. What if we would like to see the totals for all mapped files from the !memusage output? First, place the !memusage output in a text file and remove all header information. You will also need to remove all tail information including the page file and process summaries. Every line should look like these.
8b06ac18 516 0 0 0 0 0 No Name for File
We want to include the “No Name for File” entries since those are valid mapped files even though the name could not be located. Next strip out everything but the control area address. You can use Excel or any other tool that allows you to select and delete columns in a text file. Now we have a file with a single column of all the control areas on the system. The following debugger command script can be used to process this file.
$$ countptes.txt script
r $t2 = 0;
$$ Replace the memusage.txt file name with your file name.
.foreach /f (ca "memusage.txt") {
r $t1 = @@c++(((nt!_CONTROL_AREA *)(0x${ca}))->Segment->TotalNumberOfPtes);
.printf "Control Area %p : %d\n", ${ca}, @$t1;
r $t2 = @$t2 + @$t1;
.printf "Total PTEs : %d\n", @$t2;
.printf "MmSt size : %d bytes\n", (@$t2 * @@c++(sizeof(nt!_MMPTE)));
The following command will execute the script.
5: kd> $$><countptes.txt
This will show the number of PTEs for each control area, followed by a summary.
Total PTEs : 62790244
MmSt size : 502321952 bytes
A common high user of MmSt allocations is $Mft. The cache manager will hold the MmSt allocations for these file system metadata files at a cost of up to 4 files per PTE. This technique can be used to determine how much $Mft is using MmSt pool memory by first using findstr at a command prompt to isolate just those values from the !memusage output.
C:\Projects>findstr /c:"$Mft" memusage.txt >mftusage.txt
After stripping out the control area addresses with Excel and running the command script you’ll have the size of the MmSt allocations for just the $Mft files. If this is consuming most of the MmSt bytes then you are limited to the options mentioned at the beginning of this article. There may be other options if something else is the primary user but it will likely involve reducing some heavy load on the system.
-Bryan
Introduction
Hi everyone, Bob here again with a description of Work Queues and Dispatcher Headers. For those of you that look at dumps, you may have noticed that there are always threads waiting at KeRemoveQueue. You may have wondered what this function does. Well, I’m glad you asked… J
What are those threads doing?
Those threads waiting on the Remove Queue are worker threads. Worker threads are used when a system task cannot or does not want to do a particular task. For example, a thread running a DPC cannot pend and wait for a task to be done, so it sends the work to a worker thread.
How does this mechanism work?
The worker thread and the entities that are going to give the worker thread work, each know of a KQUEUE structure. The KQUEUE structure is initialized and, since the queue has an embedded dispatcher object, the worker thread pends on it waiting to be signaled. That is what you see on one of these waiting stacks.
Below is a KQUEUE:
typedef struct _KQUEUE { DISPATCHER_HEADER Header; LIST_ENTRY EntryListHead; ULONG CurrentCount; ULONG MaximumCount; LIST_ENTRY ThreadListHead;} KQUEUE, *PKQUEUE, *RESTRICTED_POINTER PRKQUEUE;
Below is an example of a waiter:
Priority 9 BasePriority 9 PriorityDecrement 0
Child-SP RetAddr Call Site
fffffadc`b053dab0 fffff800`01027752 nt!KiSwapContext+0x85
fffffadc`b053dc30 fffff800`01024ef0 nt!KiSwapThread+0x3c9 ß Waits on the dispatcher object
fffffadc`b053dc90 fffffadc`b9a380b0 nt!KeRemoveQueue+0x656
fffffadc`b053dd10 fffff800`0124b972 srv!WorkerThread+0xb0
fffffadc`b053dd70 fffff800`010202d6 nt!PspSystemThreadStartup+0x3e
fffffadc`b053ddd0 00000000`00000000 nt!KxStartSystemThread+0x16
What is a dispatcher object?
A dispatcher object can be passed into kernel routines such as KeWaitForSingleObject. This object is a synchronization object. This means that a thread can wait on this object until another thread “signals” it. The function KeRemoveQueue is waiting for its dispatcher object to be “signaled”.
Below is a dispatcher object. Basically threads are queued on this object until the object is “signaled”. Once that happens the waiting thread is readied for execution.
nt!_DISPATCHER_HEADER
+0x000 Type : UChar
+0x001 Absolute : UChar
+0x001 NpxIrql : UChar
+0x002 Size : UChar
+0x002 Hand : UChar
+0x003 Inserted : UChar
+0x003 DebugActive : UChar
+0x000 Lock : Int4B
+0x004 SignalState : Int4B ß Set when the object is signaled.
+0x008 WaitListHead : _LIST_ENTRY ß List of waiters on this object.
Below is an actual dispatcher object for a queue:
5: kd> dt nt!_dispatcher_header fffffadcdb3ed368
+0x000 Type : 0x4 ''
+0x001 Absolute : 0 ''
+0x001 NpxIrql : 0 ''
+0x002 Size : 0x10 ''
+0x002 Hand : 0x10 ''
+0x003 Inserted : 0 ''
+0x003 DebugActive : 0 ''
+0x000 Lock : 1048580
+0x004 SignalState : 0
+0x008 WaitListHead : _LIST_ENTRY [ 0xfffffadc`db3f4ce8 - 0xfffffadc`da74dce8 ] ß List of threads waiting for this object
Each thread has a Wait List entry for each object it is waiting for:
5: kd> dt nt!_KWAIT_BLOCK 0xfffffadc`db3f4ce8
+0x000 WaitListEntry : _LIST_ENTRY [ 0xfffffadc`da74dce8 - 0xfffffadc`db3ed370 ] ß Next thread waiting for this object
+0x010 Thread : 0xfffffadc`db3f4bf0 _KTHREAD ß The thread waiting for the object
+0x018 Object : 0xfffffadc`db3ed368 ß The object the thread is waiting for (queue object)
+0x020 NextWaitBlock : 0xfffffadc`db3f4ce8 _KWAIT_BLOCK ß Next object this thread is waiting for (thread 0xfffffadc`db3f4bf0) if any.
+0x028 WaitKey : 0
+0x02a WaitType : 0x1 ''
+0x02b SpareByte : 0 ''
+0x02c SpareLong : 1533340
What wakes up or signals the thread?
When the thread is waiting, an entity can call KeInsertQueue to insert elements in the work queue. When that event happens the thread is woken up and the system will remove the entry from the work queue and the call from KeRemoveQueue will return with the element. If the thread is not waiting when the call is made, the dispatcher object is put in the queue and the next call to KeRemoveQueue will not pend.
What about synchronization objects?
When one thread wants to synchronize with another, a synchronization object (such as an event) is used. When a thread waits for an event, another thread will signal the event when a job is done such as I/O. The dispatcher objects above are used for all the synchronization objects. As you can see by how the structures are designed, one thread can wait for many objects.
Below this thread is waiting for a synchronization object.
THREAD fffffadff752b040 Cid 0004.2858
fffffadcbe1c3768 NotificationEvent ß Object thread is waiting for.
Not impersonating
DeviceMap fffffa80000840f0
Owning Process fffffadce06e15a0 Image: System
Wait Start TickCount 49664324 Ticks: 247591 (0:01:04:28.609)
Context Switch Count 1
UserTime 00:00:00.000
KernelTime 00:00:00.000
Start Address EmcpBase (0xfffffadcbe22d810)
Stack Init fffffadcb870be00 Current fffffadcb870b940
Base fffffadcb870c000 Limit fffffadcb8706000 Call 0
Priority 8 BasePriority 8 PriorityDecrement 0
Child-SP RetAddr : Args to Child : Call Site
fffffadc`b870b980 fffff800`01027752 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x85
fffffadc`b870bb00 fffff800`0102835e : 00000000`00000000 00000000`00000000 fffffadf`f752b0d8 fffffadf`f752b040 : nt!KiSwapThread+0x3c9
fffffadc`b870bb60 fffffadc`be21832b : 00000000`00000000 fffff800`00000000 00000000`00000000 fffffadc`be88b100 : nt!KeWaitForSingleObject+0x5a6
fffffadc`b870bbe0 fffffadc`be1bd0da : 00000000`00000004 00000000`00000000 fffffadc`be239c40 00000000`00000000 : EmcpBase+0xb32b
fffffadc`b870bc20 fffffadc`be22c9a1 : 00000000`00000000 fffffadc`b870bd08 fffffadc`be239c40 fffffadc`e06f6fe0 : EmcpMPAA+0xd0da
fffffadc`b870bc70 fffffadc`be22d90b : fffffadc`da2338c0 00000000`00000001 fffffadc`d9eb3c10 fffffadc`b870bd08 : EmcpBase+0x1f9a1
fffffadc`b870bce0 fffff800`0124b972 : fffffadc`d9f85780 fffffadf`f752b040 00000000`00000080 fffffadf`f752b040 : EmcpBase+0x2090b
fffffadc`b870bd70 fffff800`010202d6 : fffff800`011b1180 fffffadf`f752b040 fffff800`011b5500 00000000`00000000 : nt!PspSystemThreadStartup+0x3e
fffffadc`b870bdd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16
Dispatcher header from address above:
5: kd> dt nt!_dispatcher_header fffffadcbe1c3768
+0x000 Type : 0 ''
+0x002 Size : 0x6 ''
+0x002 Hand : 0x6 ''
+0x000 Lock : 393216
+0x008 WaitListHead : _LIST_ENTRY [ 0xfffffadf`f752b138 - 0xfffffadf`f752b138 ]
Wait block for this thread:
5: kd> dt 0xfffffadf`f752b138 _KWAIT_BLOCK
nt!_KWAIT_BLOCK
+0x000 WaitListEntry : _LIST_ENTRY [ 0xfffffadc`be1c3770 - 0xfffffadc`be1c3770 ]
+0x010 Thread : 0xfffffadf`f752b040 _KTHREAD
+0x018 Object : 0xfffffadc`be1c3768
+0x020 NextWaitBlock : 0xfffffadf`f752b138 _KWAIT_BLOCK
+0x02c SpareLong : 1
Conclusion
I hope this gives a better understanding of Work Queues and Dispatcher Headers. More detailed information can be found here: http://msdn2.microsoft.com/en-us/library/ms810047.aspx and here: http://www.microsoft.com/whdc/driver/kernel/locks.mspx.
Hello NTDebuggers, from time to time we see the following problem. It’s another access violation, and the debug notes below are from a minidump.
Here is what we need to know…
· Generally speaking what happened to cause this AV?
· What method you would use to isolate root cause of the failure?
There are a lot of ways to do this. We look forward to hearing your approach.
We will post our methods and answer at the end of the week. If you need anything please let us know.
-------------------------------------------
Loading Dump File [D:\test123.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available
0:000> k 123
0017f93c 75e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
0017f9d8 75e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
0017f9f4 75ef2084 kernel32!WaitForMultipleObjects+0x18
0017fa60 75ef22b1 kernel32!WerpReportFaultInternal+0x16c
0017fa74 75ebbe60 kernel32!WerpReportFault+0x70
0017fb00 7732d15a kernel32!UnhandledExceptionFilter+0x1c1
0017fb08 773000c4 ntdll!_RtlUserThreadStart+0x6f
0017fb1c 77361d05 ntdll!_EH4_CallFilterFunc+0x12
0017fb44 772eb6d1 ntdll!_except_handler4+0x8e
0017fb68 772eb6a3 ntdll!ExecuteHandler2+0x26
0017fc10 772cee57 ntdll!ExecuteHandler+0x24
0017fc10 10011127 ntdll!KiUserExceptionDispatcher+0xf
*** ERROR: Module load completed but symbols could not be loaded for crash3.exe
0017ff40 0040104a 0x10011127
0017ffa0 75eb19f1 crash3+0x104a
0017ffac 7732d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
0:000> lm
start end module name
00400000 0040d000 crash3 (no symbols)
6c250000 6c288000 odbcint (deferred)
6c290000 6c2f5000 odbc32 (deferred)
72a00000 72a86000 comctl32 (deferred)
74820000 749b4000 comctl32_74820000 (deferred)
75240000 75251000 samlib (deferred)
75260000 75281000 ntmarta (deferred)
754b0000 75510000 secur32 (deferred)
75510000 75570000 imm32 (deferred)
75700000 75790000 gdi32 (deferred)
757a0000 75870000 user32 (deferred)
758a0000 758a6000 nsi (deferred)
758b0000 759f4000 ole32 (deferred)
75a00000 75aaa000 msvcrt (deferred)
75ab0000 75ba0000 rpcrt4 (deferred)
75ba0000 75c1d000 usp10 (deferred)
75c20000 75c75000 shlwapi (deferred)
75d60000 75e27000 msctf (deferred)
75e30000 75f40000 kernel32 (pdb symbols)
76140000 76189000 Wldap32 (deferred)
76190000 7624f000 advapi32 (deferred)
76250000 76d1e000 shell32 (deferred)
76d20000 76d94000 comdlg32 (deferred)
76da0000 76dcd000 ws2_32 (deferred)
77280000 77287000 psapi (deferred)
77290000 77299000 lpk (deferred)
772b0000 77400000 ntdll (pdb symbols)
Jeff-
[Update: our answer. Posted 5/13/2008]
We enjoyed seeing different people’s approaches on this week’s puzzler. This was a simple module unload. We loaded a lib, did a GetProcAddress, freed the lib, and called the function. The dump was a mini dump created via .dump /m C:\dump file. There are various ways this type of scenario may arise. Obviously someone could unload a lib, but why? In most cases I’ve seen, it was due to a ref count problem in a com object. Poor accounting leading to one too many decrements, and the dll will get unloaded causing a simple crash footprint.
There are quite a few ways to track this down. First of all, if you had the debugger attached and got a full dump or /ma dump you would have seen the loaded module list. This would have been a dead giveaway and part of why we did the .dump /m. There are other options you can enable that make tracking of module loads easy under the debugger. I personally like “loader snaps” if I’m trying to track down module load shenanigans. To enable this, just go into the image section of the gflags tool and enable loader snaps for the exe in question. Now attach a debugger and watch the mode load and GetProcAddress details scroll by.
Yet another popular approach is to use process monitor. This tool is not only easy to set up, but it also gives you great logs with call stacks and other details such as registry accesses.
This puzzler provided the bare minumum data required. We did not give you much to go on because sometimes in real debugging scenarios you have to work with a lack of data. I really liked how many people questioned the source of the dump file. It really shows how familiar you all are with the various dump types.
Great work!
Hello NTDebuggers… I look at a lot of dump files every day. This being the case I like to take full advantage of the customizable look and feel of windbg. I actually have an association setup between DMP files and a CMD file that loads my customized COLOR workspace each time I double click on a dump file. I like a black background with colorized source and debug command window output.
Here is a snapshot of my typical debug session.
Here is how you set it up.
1) Create the following CMD file and put it in your path. It’s called D.CMD on my system..
echo off Title kd.exe -z %1 C: CD\debuggers start C:\Debuggers\windbg.exe -z %1 -W color
2) Load windbg and under view \ options configure your color options. Note: you will want to set every option in the list box.
3) Save the workspace by selecting FILE \ Save Workspace, I saved mine as COLOR
4) If you don’t have a dump file create one by opening a process in the debugger and doing a .dump /ma C:\test\test.dmp (I’ll assume you have a directory called C:\test)
5) Select the C:\test\test.dmp file in explorer and right click on the properties. (note I used badwindow.dmp)
6) Change the dump file association via drilling down to Change \ Browse \ (Select D.CMD)
7) That’s all there is to it. Now you can simply double click on any DMP file and windbg will start and load with your COLOR workspace.
Jeff Dailey-