Hi NTDebuggers, something rarely talked about are the odds of a problem being in one piece of code vs. another. From time to time we see some very strange debugs or symptoms reported by customers. The problems can be associated with anything from an internally written application, a Microsoft product running on Windows, or an application written by a 3rd party vendor. In fact we are often engaged to assist one of our customers or vendors with troubleshooting or debugging their applications.
One of the first things we do is assess the situation. We ask questions like:
· Where is the program crashing?
· What binaries comprise the program?
· How often are those various binaries used worldwide?
Let’s use the following pseudo call stack and binaries as an example.
NTDLL!VeryCommonFunction << Crash happens in this function.
ADVAPI32!RatherCommonFunction
MYCustomApp!RarelyUsedCode
MyCustomApp!Main
If I see a crash in NTDLL!VeryCommonFunction I’m going to make some assumptions as I assess the domain of the problem. This holds true for any operating system, product, or software in general. The code that runs more than any other code is, by its nature, effectively tested more because it runs more. Therefore it is less likely to be the root cause of the fault, and in some cases it is simply the victim. This applies to all operating systems: UNIX, Mac OS, Windows... core code tends to be less buggy.
Let’s look at a real world example of some very common code in Windows. NTDLL!RtlAllocateHeap and NTDLL!RtlFreeHeap. For those of you not familiar with NTDLL, it’s loaded in just about every process on every machine running a modern copy of Windows, worldwide. The average machine has ~40-200+ process (applications, and miscellaneous services running), and there are hundreds of millions of PCs worldwide running Windows, so that gives us ~billions of processes running NTDLL, give or take a few billion. Collectively, those processes are going to call RtlFreeHeap or RtlAllocateHeap millions of times in the next second.
So what are the odds? Is it likely that this core API used by billions of processes is crashing because of a bug in the core API? Or is it more likely that a smaller vertical market or custom application running on ~500 machines worldwide did something to destabilize one of the process heaps?
Typically when an application is crashing in a heap function inside of NTDLL, support engineers become suspicious of activity in the process space, and in this case it’s more likely to be a problem with heap corruption. It is likely that code running in the host process that has NTDLL loaded has corrupted one of the heaps by overwriting a buffer, doing a double free, or some other problem. Then when a call is made into the Microsoft heap API, NTDLL has to traverse the heap structures that are corrupted by the host application, so the process crashes. And yes, the crash is in NTDLL. In this case, I typically ask the customer to enable full page heap via gflags (this puts an additional page marked with the PAGE_NOACCESS attribute at the end of each allocation). We then wait for the next crash and analyze it. Enabling full page heap helps you catch the corruptor with “their hand in the cookie jar”.
The same scenario holds true for other core functionality such as kernel pool allocations, invalid handles, leaks etc. Again, core code tends to be rock solid because of sheer volume of use and exposure to a variety of environments. This being the case, it also tends to change less over time. Of course there is code in the OS or other components that is not used as much, which is more likely to have problems. We always take that into consideration when scoping an issue.
The good news is we are always happy to dig in and help our customers isolate these types of problems.
Please feel free to chime in and share your stories.
Good Luck and happy debugging.
Jeff-
Hello, my name is Graham, and I’m an escalation engineer on the Platforms Global Escalation Team. I recently worked a case where a group of Windows XP machines were hitting a bugcheck on boot, error 0xC000021A. This error occurs when a critical usermode process such as winlogon or csrss crashes. I had access to a failing machine, so I attached the kernel debugger to find out why winlogon was crashing. I found the cause, and a little bit more about Data Execution Prevention (DEP) in the process.
The initial debugger spew gave me this information:
*** An Access Violation occurred in winlogon.exe:
The instruction at 10030F90 tried to write to an invalid address, 10030F90
*** enter .exr 0006F4AC for the exception record
*** enter .cxr 0006F4C8 for the context
*** then kb to get the faulting stack
So I followed its cue and got the exception record and context record:
1: kd> .exr 0006F4AC
ExceptionAddress: 10030f90
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000008
Parameter[1]: 10030f90
Attempt to execute non-executable address 10030f90
Ahh, OK, so we know this is a DEP crash now.
1: kd> .cxr 0006F4C8
eax=00000400 ebx=00000000 ecx=00000000 edx=00010000 esi=00000000 edi=00084370
eip=10030f90 esp=0006f794 ebp=0006f81c iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
001b:10030f90 33c0 xor eax,eax
Let's check out the crashing stack to see what's going on:
1: kd> kb
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0006f81c 010297c1 00084370 01010ab4 00000000 3rdparty!nosymbols
0006fcfc 010312a6 00072364 7c80b6a1 00000000 winlogon!ExecSystemProcesses+0x14d
0006ff50 0103d4d0 01000000 00000000 00072364 winlogon!WinMain+0x2b6
0006fff4 00000000 7ffd7000 000000c8 000001ec winlogon!WinMainCRTStartup+0x174
The first thing I decided to look for was how we got to this address. To begin, I unassembled the code right before the return address to winlogon!ExecSystemProcesses.
kd> ub 010297c1
winlogon!ExecSystemProcesses+0x12e
010297a2 6a02 push 2
010297a4 ffb594fbffff push dword ptr [ebp-46Ch]
010297aa 6880000000 push 80h
010297af 56 push esi
010297b0 56 push esi
010297b1 68b40a0101 push offset winlogon!`string' (01010ab4)
010297b6 ffb5a0fbffff push dword ptr [ebp-460h]
010297bc e891fcffff call winlogon!StartSystemProcess (01029452)
According to the stack, winlogon!ExecSystemProcesses didn't call the function currently running. So, I suspected some hooking was going on. Using !chkimg, I verified this was the case. Note that chkimg requires a valid copy of the binary in the symbol path.
1: kd> !chkimg -db kernel32
10 errors : kernel32 (7c802332-7c80236b)
7c802330 90 90 *e9 *59 *ec *82 *93 6a 00 ff 75 2c ff 75 28 ff ...Y...j..u,.u(.
...
7c802360 28 00 90 90 90 90 90 *e9 *d4 *eb *82 *93 6a 00 ff 75 (...........j..u
1: kd> u 7c802330
kernel32!WriteProcessMemory+0x10d:
7c802330 90 nop
7c802331 90 nop
kernel32!CreateProcessW
7c802332 e959ec8293 jmp 3rdparty!nosymbols (10030f90)
Aha! Something has hooked CreateProcessW to jump to our current instruction. Now that we know how we got there, let's understand why we crashed. Since DEP fired, that means this address is non-executable. I verified this by dumping out the PTE for the address.
1: kd> !pte 10030F90
VA 10030f90
PDE at 00000000C0600400 PTE at 00000000C0080180
contains 000000004E102867 contains 800000004E021867
pfn 4e102 ---DA--UWEV pfn 4e021 ---DA--UW-V
Notice that in the protection flags for the PTE, the 'E' bit isn't set, saying this page isn't executable. So, where is this address we were trying to execute? Many times with DEP crashes this will be in stack or heap memory. But not this time. In this case, the address is actually in a module's memory mapped address space, as shown by the 'lm' command
1: kd> lm m 3rdparty
10000000 1003c000 3rdparty C (export symbols) 3rdparty.dll
Hmm... So the address falls in this module. Why isn't it executable? Usually when I think of image files, I think of running code. But, remembering back to how the PE images are laid out, a module is broken into subsections, with different types of data in each one, and different protection levels. There's a place in the image for code, and for data, such as global variables and static data. So, let's dump the image header and find which section offset 0x30F90 is in.
1: kd>!dh 3rdparty
<snip>
SECTION HEADER #3
.data name
1EE3C virtual size
1A000 virtual address // (1A000+1EE3C=0x38e3c so mem range for section is 1A000 to 0x38e3c)
3000 size of raw data
1A000 file pointer to raw data
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
C0000040 flags
Initialized Data
(no align specified)
Read Write // no Execute !
This is our section, since the virtual address starts at 0x1A000 and is 0x1EE3C in size, putting the end of the section at 0x38e3c. Our address of 0x30F90 falls between them.
Sure enough, this section is labeled as "Initialized Data", and the protection flags show Read and Write, but no Execute! So, this address is not in a code section of the module, and DEP will not allow it to run.
Knowing this, I was able to find an update on the 3rd party manufacturer's site that modified their DLL to prevent this from occurring. Mystery solved!
Correction, Windows Internals Beta Exam 71-660
This was previously listed incorrectly as 70-660, the actual exam ID is now 71-660
Update 7-18-2008: The link to the exam preparation information has been updated below.
Update 7-21-2008: When registering, there is no need to use a credit card. Be sure to instead use the following promotional code listed below when registering for the exam: WINT. If you have already been charged for this exam, please contact Prometric and request a refund, and reference ticket number 3711938.
Hello Windows Debugging Community, We are super excited to announce a new certification. With the help of industry partners, including some of our very talented community members, we have developed a new Windows Internals Certification. This Certification targets developers and IT professionals that require knowledge of Windows Internals as part of their profession. Chance are if you are reading this blog post you are part of the target audience. I encourage you to register and please take the Beta exam. Please help us to improve the exam by providing feedback during the beta exam.
Register for Beta Exam 71-660: TS: Windows® Internals
You are invited to take beta exam 70-660: TS: Windows® Internals. If you pass the beta exam, the exam credit will be added to your transcript and you will not need to take the exam in its released form. By participating in beta exams, you have the opportunity to provide the Microsoft Certification program with feedback about exam content, which is integral to development of exams in their released version. We depend on the contributions of experienced IT professionals and developers as we continually improve exam content and maintain the value of Microsoft certifications. Please remember that participation in the beta process is completely voluntary and Microsoft makes no promises or guarantees regarding the beta exam process. You can expect to receive your score on the beta exam within 12 weeks of taking the exam, although in some instances, beta exams may take longer to score and your results may be delayed.70-660: TS: Windows® Internals counts as credit towards the following certification.
· Microsoft Certified Technology Specialist (MCTS): Windows® Internals
Availability
Registration begins: July 17, 2008
Beta exam period runs: July 21, 2008– August 5, 2008
Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately. Beta exams have limited availability and are operated under a first-come-first-served basis. Once all beta slots are filled, no additional seats will be offered.
Testing is held at Prometric testing centers worldwide, although this exam may not be available in all countries (see Regional Restrictions). All testing centers will have the capability to offer this exam in its live version.
Regional Restrictions: India, Pakistan, China
Registration Information
You must register at least 24 hours prior to taking the exam.Please use the following promotional code when registering for the exam: WINTReceiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately.
To register in North America, please call:
•
Prometric: (800) 755-EXAM (800-755-3926)
Outside the U.S./Canada, please contact:
Prometric: http://www.register.prometric.com/ClientInformation.asp
Test Information and Support
You are invited to take this beta exam at no charge.You will be given four hours to complete the beta exam. Please plan accordingly.Find exam preparation information: http://blogs.msdn.com/ntdebugging/pages/exam-preparation-information-for-exam-71-660.aspx
Frequently Asked Questions
For Microsoft Certified Professional (MCP) help and information, you may log in to the MCP Web site at http://www.microsoft.com/learning/mcp/or contact your Regional Service Center: http://www.microsoft.com/learning/support/worldsites.asp.
What is a beta exam?
Where can I learn more about the registration process?
Where can I learn more about the beta exam invitation process?
Where can I learn more about the new structure of Microsoft Certification?
Who do I contact for help with this beta exam or other MCP questions?
Thank you.
Jeff Dailey, Microsoft Global Escalation Services.
Hello, my name is East, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team. Here are some tips for those wanting to review dumps from Cluster Administrator (Cluadmin.exe), the GUI version of cluster.exe that allows management of your cluster environment. Historically there were more Cluadmin hangs in Windows 2000 Cluster server than Windows 2003 Cluster server so the volume of these dumps submitted to my group has dropped significantly. People have submitted complete memory dumps thinking the entire Cluster server was hung when it turned out to be a Cluadmin hang. It’s important to determine what is really hung when troubleshooting these types of problems.
So what causes Cluadmin to hang? In most cases a resource is working to come online, or a remote connection is misbehaving which results in the hang. Cluadmin dumps are not as helpful in revealing why a resource is misbehaving in your Cluster environment, when compared to dumping the cluster service and resrcmon. In most cases the cluster log can help you narrow down a problematic resource that’s holding Cluadmin up. Later I will show how to find the possible resource that caused a Cluster Administrator hang by using the debugger.
To help narrow down the hang you should start by testing that the Cluster environment is healthy.
Let’s say at the time of the hang you were using Cluadmin to view properties of certain resources. At this point it would be useful to examine the status of the cluster by using the Cluster.exe tool. Here are some examples.
Using the cluster.exe command you can check the Groups status with the following command:
C:\>cluster group
Listing status for all available resource groups:
Group Node Status
-------------------- --------------- --------------
Test1 NODE1 Online
Cluster Group JNODE1 Online
You can check the Node status using this command:
C:\>cluster node
Listing status for all available nodes:
Node Node ID Status
-------------- ------- ---------------------
NODE1 1 Up
NODE2 2 Down
And you can check the Resource status with the resource command:
C:\>cluster resource
Listing status for all available resources:
Resource Group Node Status
-------------------- -------------------- --------------- ------
Disk Q: Cluster Group NODE1 Online
tst1 Test1 NODE1 Online
Cluster IP Address Cluster Group NODE1 Online
Cluster Name Cluster Group NODE1 Online
MSDTC Cluster Group NODE1 Online
You can get more granular by using the /priv switch to list out the private properties like this:
C:\>cluster resource /priv
Listing private properties for all resources:
<ouput not shown to save space>
Since hangs can occur during startup of Cluadmin I recommend the following article which discusses different ways to start Cluadmin.
280125 Cluster Administrator Switches for Connecting to a Cluster
http://support.microsoft.com/default.aspx?scid=kb;EN-US;280125
Now let’s take a look at a Cluadmin dump to illustrate what you would see during a hang. Most dumps we gather from customers are snapped when Cluadmin is starting, resulting in limited information contained in the dump. Here’s an example of the resource that Cluster was working with when the process hung. The thread below is making an RPC call to the local cluster service. This can be the culprit causing Cluadmin to be in the hung state:
0:000> kb
0006eddc 7c59a0a2 00000154 00000001 0006edfc NTDLL!ZwWaitForSingleObject+0xb
0006ee04 77d7f41d 00000154 000007d0 00000001 KERNEL32!WaitForSingleObjectEx+0x71
0006ee28 77d5fadf 00000000 0006ee58 0006ee50 rpcrt4!DG_ReceivePacket+0xd3
0006ee68 77d5f998 0006f0c8 0006f07c 0006ee88 rpcrt4!DG_CCALL::ReceiveSinglePacket+0x5e
0006ee7c 77d3b96c 0006f07c 0006f274 77d6a686 rpcrt4!DG_CCALL::SendReceive+0xd6
0006ee88 77d6a686 0006f07c 00000009 7393270e rpcrt4!I_RpcSendReceive+0x2c
0006ee9c 77d93b64 0006f0c8 000a425c 0006eeac rpcrt4!NdrSendReceive+0x31
0006f274 73939b6d 73931778 739326c0 0006f28c rpcrt4!NdrClientCall2+0x512
0006f284 73934bd5 00091ed8 01000059 00000000 CLUSAPI!ApiResourceControl+0x14
0006f2c4 0102705f 000921e8 00000000 01000059 CLUSAPI!ClusterResourceControl+0xd0
0006f2f8 01027b00 000921e8 01000059 00000000 CLUADMIN!CClusPropList::ScGetResourceProperties+0x3e
0006f374 010129ae 00c988a0 00000080 00239770 CLUADMIN!CResource::ReadItem+0xb9
0006f3b4 0101141d 002397c8 00239770 00000001 CLUADMIN!CClusterDoc::InitResources+0x9a
0006f3ec 01010f45 00000001 00239770 00000000 CLUADMIN!CClusterDoc::CollectClusterItems+0xd5
0006f410 01010e11 002352d0 00234960 00239770 CLUADMIN!CClusterDoc::OnOpenDocumentWorker+0x80
0006f444 76fbaea3 002352d0 00234960 01044c28 CLUADMIN!CClusterDoc::OnOpenDocument+0xa2
0006f46c 0101039f 002352d0 00000001 01044c28 mfc42u!CMultiDocTemplate::OpenDocumentFile+0xb5
0006fc88 0100fff6 002352d0 002373a0 00000000 CLUADMIN!CClusterAdminApp::OpenDocumentFile+0xbf
0006fcc0 01021c9e 00000001 00000000 76fb2089 CLUADMIN!CClusterAdminApp::OnRestoreDesktop+0xc4
0006fccc 76fb2089 00000001 00000000 002373a0 CLUADMIN!CMainFrame::OnRestoreDesktop+0x17
The first parameter passed to CLUSAPI!ClusterResourceControl in this Windows 2000 dump reveals the resource in question.
Here I’m dumping out the raw data of the parameter which yields the name of the resource, Joseph.
0:000> dc 000921e8+28
00092210 006F004A 00650073 00680070 00000000 J.o.s.e.p.h.....
00092220 00030007 000c0100 00091d40 00092780 ........@....'..
00092230 000923e8 000923e8 00092238 00092238 .#...#..8"..8"..
00092240 00000000 000805e8 00000000 00000000 ................
00092250 00020019 00000000 00070005 00080100 ................
00092260 00000000 00092378 00092b48 000925f0 ....x#..H+...%..
00092270 00000000 00000000 d3cc9553 00000000 ........S.......
00092280 00050005 00080100 000922b0 000922d8 ........."..."..
0:000> du 000921e8+28
00092210 "Joseph"
On a Windows 2003 stack I again examined the first parameter to CLUSAPI!ClusterResourceControl...
lpBytesReturned = 0x0006f2a8
0006f278 0102bb53 000a03c8 00000000 01000059 CLUSAPI!ClusterResourceControl
0006f2ac 0102ec09 000a03c8 01000059 00000000 cluadmin!CClusPropList::ScGetResourceProperties+0x78
0006f328 010142c7 00000000 0026d6b8 01078468 cluadmin!CResource::ReadItem+0xcb
0006f36c 010161dc 00268508 01078468 00000001 cluadmin!CClusterDoc::InitResources+0xa2
0006f3a8 01016647 01078468 00000000 01078a48 cluadmin!CClusterDoc::CollectClusterItems+0xe9
0006f3c8 010167a3 002667b8 00268508 01078468 cluadmin!CClusterDoc::OnOpenDocumentWorker+0x90
0006f400 7f062793 002667b8 00268508 002667b8 cluadmin!CClusterDoc::OnOpenDocument+0x9c
0006f428 01010d97 002667b8 00000001 00264da0 MFC42u!CMultiDocTemplate::OpenDocumentFile+0x103
0006fc50 01010a9b 002667b8 00264da0 00268680 cluadmin!CClusterAdminApp::OpenDocumentFile+0xdc
0006fc64 7f03babe 00099c48 00264da0 00264da0 cluadmin!CClusterAdminApp::OnClusterConnectionOpened+0x27
0006fce8 7f038edc 0000040e 00099c48 01006d70 MFC42u!CWnd::OnWndMsg+0x62e
0006fd10 7f03af27 0000040e 00099c48 00264da0 MFC42u!CWnd::WindowProc+0x2c
0006fd70 7f03b06e 00268680 00000000 0000040e MFC42u!AfxCallWndProc+0xa7
0006fd94 7f0e6d8d 000501c6 0000040e 00099c48 MFC42u!AfxWndProc+0x3e
0006fdc4 7739b6e3 000501c6 0000040e 00099c48 MFC42u!AfxWndProcBase+0x4d
0006fdf0 7739b874 7f0e6d40 000501c6 0000040e USER32!InternalCallWinProc+0x28
0006fe68 7739ba92 00000000 7f0e6d40 000501c6 USER32!UserCallWinProcCheckWow+0x151
0006fed0 7739bad0 01049b24 00000000 0006ff08 USER32!DispatchMessageWorker+0x327
0006fee0 7f073000 01049b24 01049b24 01049af0 USER32!DispatchMessageW+0xf
0006fef0 7f072dda 01049af0 01049af0 ffffffff MFC42u!CWinThread::PumpMessage+0x40
0:000> dc 000a03c8+b8
000a0480 00650052 006f0073 00720075 00650063 R.e.s.o.u.r.c.e.
000a0490 005c0073 00330031 00350034 00310030 s.\.1.3.4.5.0.1.
000a04a0 00370030 0031002d 00340065 002d0034 0.7.-.1.e.4.4.-.
000a04b0 00660034 00310035 0038002d 00370061 4.f.5.1.-.8.a.7.
000a04c0 002d0030 00610035 00300034 00630034 0.-.5.a.4.0.4.c.
000a04d0 00350032 00620031 00650036 abab0000 2.5.1.b.6.e.....
000a04e0 abababab feeeabab 00000000 00000000 ................
000a04f0 000f0007 001c0731 000a0530 fedcba98 ....1...0.......
0:000> du 000a03c8+b8
000a0480 "Resources\13450107-1e44-4f51-8a7"
000a04c0 "0-5a404c251b6e"
You will also notice that Windows 2003 shows the resource guid for the resource.
NOTE: You can speculate that the problem resource is the one in the bottom view of the Cluadmin GUI, however using that method can lead to an improper conclusion because the next resource may actually be causing the problem. The resource may have hung the GUI before getting displayed in Cluadmin.
Ok now I know the resource because I dumped it out so what can I do with it now? You can take the resource offline to see if Cluster exhibits an issue bringing that resource off or online. Additionally, the cluster log may show that this resource is having an issue. Your best bet from here would be to review the cluster log for any additional information about the specific resource or get a Resource Monitor (Resrcmon.exe) dump to investigate any further issues. I also did not show all the threads of the Cluadmin.exe dump since the rest are normally RPC calls and irrelevant to this discussion. If the connection was to another machine we would do a network trace to reveal additional problems.
I have recently seen a number of issues where customers called in to report a significant difference between the “Size on disk” for the root of a volume, and the reported amount of “Used space” in the volume properties. While considering this, I noticed that my own C: drive had a mysterious 20GB difference.
Below is the story of how I found the answer.
Before we begin, there are two methods used for calculating disk usage…
Method 1 – Volume Bitmap Analysis
The % used and %free indication shown below with the pie chart is based on volume bitmap analysis. The hidden $Bitmap:$Data:”” stream is read from the volume via the FSCTL_GET_VOLUME_BITMAP IO control code, and the ones and zeroes are counted. A binary “1” represents an allocated cluster, while a “0” indicates a free cluster. The total number of set bits indicates the total number of allocated clusters on the volume. This is the fastest and most reliable method of calculating allocated and free space.
Method 2 – File Scanning with FindFirstFileEx / FindNextFile
Navigating to the “C:\” folder, we select all files/folders, then right click, and then select properties. A dialog will appear as the shell enumerates all files that it can find in the root and subfolders. The “Size” and “Size on disk” fields will eventually display the total size of all primary data streams for all files that are found by Explorer. In my case, this method only finds 80.25 GB of 102 GB (leaving a 21.75 GB difference).
Why such a big difference?
The short answer is - “If you can’t see it with Explorer, it doesn’t get added to the total size.”
How to find the hidden disk usage…
Step 1:
First, check to see if your files are all actually visible with the current user account and Explorer settings. Uncheck the “Hide protected operating system files” and select “Show hidden files and folders”. When this is completed, look at the size on disk again.
Does it match the pie chart within a few percent? If not, continue to Step 2.
Step 2:
Run CHKDSK and save the output. The summary information at the end is what we want to see.
Based on CHKDSK output, we can calculate the total metadata usage by adding up the following…
KB
GB
Description
57036
.05 GB
Space used by 22677 indexes.
0
0 GB
Space used by $Badclus file.
1352680
1.29 GB
Space used by $MFT.
65536
.06 GB
Space used by $Loffile.
1475252
1.4 GB
Metadata Total
In this example, metadata usage accounts for only 1.4 GB. If the metadata number were a high percentage of the usage, then we need to take closer a look at how the volume is being used instead of simply looking for hidden files. High metadata usage will be the subject for part 2 of this blog.
We can see from the CHKDSK output that the total space occupied by our user files is 106,095,080 KB (101.18 GB). This is a large percentage of the total usage, so we should be looking at the user files to see why we can’t see these files with Explorer.
Step 3:
Do you have permissions to see all files/folders the volume?
a. If this is a system drive, look through the “\Documents and Settings”, or “\users” folder and see if you can browse all of the profile folders. If not, you may need contact the owner(s) to check their folder size and see what they have saved in their user folder. If they are like me, they will have 5 years worth of memory dumps, MP3’s, VHD’s, PDF’s, etc.
b. Try “CACLS.EXE x:\ /T”, or just browse through the folders on the drive looking for an “Access Denied” error. Once this happens, give yourself permissions to that part of the subtree (if permitted by your administrative privileges) or have someone who has access to that folder enumerate the contents for you. Check the number of files and the size on disk for all files in that folder and add it to the total.
c. Continue this process until you have a total of all files in all folders.
d. Does the total make sense? If not, then continue to the next step.
In the case of my mysterious 20 GB difference, I found an old user profile from a previous installation. Sure enough, I received an access denied error when browsing to it. To access the folder, I acquired ownership. This user was in the habit of collecting memory dumps and virtual machines in a subfolder on his desktop. I recovered quite a bit of free space by using the delete key. I rescanned the volume, but to my disappointment, there was still a significant difference.
Step 4.
Are there alternate streams? Try using STREAMS.EXE from (http://technet.microsoft.com/en-us/sysinternals/default.aspx). Streams will recurse subdirectories and report space usage by alternate named streams. Pipe this to a text file and analyze it with a text editor or spreadsheet program.
Step 5.
Hard links can be problematic when calculating the total usage via the file scan method. Hard links are not extensively used, but this is worth mentioning. A hard link is an extra index entry that points to an existing file. These are created via the CreateHardLink function. Although the file actually exists in one location, each hard link has its own size field. Because of this, a single file can be added twice to the running total. Another thing to know is that hard links are not synchronized, so it is possible that only one link will show the actual file size (see the example below).
Unfortunately, there are few options available for detecting hard link paradoxes, but it is something to consider when the file scan shows more usage than the bitmap. Since we have the opposite situation here, hard links are not a significant factor.
Step 6.
Is Volume Shadow Copy Service maintaining diff area files for snapshots? Use VSSADMIN LIST SHADOWSTORAGE to find out. Add shadow storage to the running total.
VSSVC pre-allocates space for making volume snapshots. In order to support this feature, diff area files are kept in the “\System Volume Information” folder. This pre-allocated space is used to maintain point-in-time information for the “Previous Versions” feature and for the “System Restore” application. If you are the type of user who prefers to minimize the impact of such features, then you can resize your shadow storage with VSSADMIN so it has less impact on disk usage. I prefer to leave these features at their default settings (and just make a note of how much disk space it is using).
Step 7.
If your numbers still don’t make sense, then it’s time to check for actively hidden files / folders. There are many different rootkit scanners that can help you identify the presence of actively hidden files & folders. Try using a scanner like Rootkit Revealer. If you suspect that your machine has been compromised by a rootkit, refer to http://www.microsoft.com/uk/business/security/security-malware.mspx.
Seeing that I did not have a significant amount of mysterious usage, I was satisfied that there was no malicious logic hiding my files. After taking into account all of the factors, there was only a 1.57GB difference remaining (which was accounted for by the contents of a user’s recycle bin).
Whenever you see a difference between the “Used space”, and “Size on disk”, always ask yourself “What am I not seeing?”
Best regards,
Dennis Middleton “The NTFS Doctor”
Hello, my name is Louis Shanks and my story starts off with a tricky little deadlock issue that I worked on recently. Let me set the stage by explaining the surface level symptom, and then we will get to the good stuff. Randomly when users running Vista would attempt to launch Internet Explorer nothing would happen. So they would again double-click the icon, and still nothing happened. So the users would reboot and go back to their web activities.
After some basic trouble shooting it was discovered that Iexplore.exe was running multiple times and every attempt to open the browser would result in another Iexplore.exe instance. Since no UI was ever presented I went ahead and collected a few Iexplore.exe process dumps along with a complete system dump.
The process dumps didn’t prove to be terribly useful so I decided to move over to the complete memory dump.
Since I knew we had multiple IExplore.exe processes running I wanted to investigate them first.
0: kd>!process 0 0
PROCESS 87289d90 SessionId: 5 Cid: 0fdc Peb: 7ffdc000 ParentCid: 0968
DirBase: 7d66c520 ObjectTable: a203a178 HandleCount: 205.
Image: iexplore.exe
PROCESS 861a9d90 SessionId: 5 Cid: 0d34 Peb: 7ffd9000 ParentCid: 0fdc
DirBase: 7d66c7a0 ObjectTable: a5651a98 HandleCount: 170.
Image: ieuser.exe
PROCESS 86c175b0 SessionId: 5 Cid: 1250 Peb: 7ffd9000 ParentCid: 0968
DirBase: 7d66c940 ObjectTable: a0853f10 HandleCount: 91.
PROCESS 861ac7a8 SessionId: 5 Cid: 1024 Peb: 7ffde000 ParentCid: 0968
DirBase: 7d66c620 ObjectTable: a83f7898 HandleCount: 91.
Image: iexplore.exe</snip>
I decided to just start with the first one in the list (87289d90)
0: kd> !process 87289d90 Image: iexplore.exe
ElapsedTime 00:24:06.666
It appears this process has been around a while, so I wanted to look at the first thread in this process as it is responsible for the UI in Internet Explorer. I am looking at this thread because the surface level problem illustrates that when we launch Iexplore.exe the UI is never displayed. Here I used the !thread command to display the specific thread.
THREAD 870b6398 Cid 0fdc.16a4 Teb: 7ffdf000 Win32Thread: fd850b78 WAIT: (UserRequest) UserMode Non-Alertable
85e86cf0 SynchronizationEvent
85cb35e0 SynchronizationEvent
Not impersonating
Attached Process 87289d90 Image: iexplore.exe
Wait Start TickCount 10817197 Ticks: 69571 (0:00:18:05.314)
UserTime 00:00:00.015
KernelTime 00:00:00.093
Based on the tick count this thread has been waiting for 18 minutes.
0: kd> kvn
# ChildEBP RetAddr Args to Child
00 931208d8 81a940a1 870b6398 803d2120 870b6420 nt!KiSwapContext+0x26
01 9312091c 81a8dacc 870b6398 00000000 00000002 nt!KiSwapThread+0x44f
02 93120970 81bedb4f 00000002 93120aa8 00000001 nt!KeWaitForMultipleObjects+0x53d
03 93120bfc 81bee2e3 00000002 00000001 00000000 nt!ObpWaitForMultipleObjects+0x256
04 93120d48 81a30b3a 00000002 0027e9c4 00000001 nt!NtWaitForMultipleObjects+0xcc
05 93120d48 77568254 00000002 0027e9c4 00000001 nt!KiFastCallEntry+0x12a
06 0027e970 775679fc 7708c973 00000002 0027e9c4 ntdll!KiFastSystemCallRet
07 0027e974 7708c973 00000002 0027e9c4 00000001 ntdll!NtWaitForMultipleObjects+0xc
08 0027ea10 7608ff95 0027e9c4 0027ea38 00000000 kernel32!WaitForMultipleObjectsEx+0x11d
09 0027ea64 773d3a46 00000028 0027eaac ffffffff USER32!RealMsgWaitForMultipleObjectsEx+0x13c
0a 0027ea8c 773d3b7f 0027eaac ffffffff 0027eabc ole32!CCliModalLoop::BlockFn+0x97
0b 0027eab4 774e75be ffffffff 003ddf90 0027ebc0 ole32!ModalLoop+0x5b
0c 0027ead0 774e9294 00000000 0027ebd4 00000000 ole32!ThreadSendReceive+0x12c
0d 0027eaf8 774e9100 0027ebc0 003d3538 0027ec1c ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x194
0e 0027ebd8 773d3c25 003d3538 0027ecfc 0027ece4 ole32!CRpcChannelBuffer::SendReceive2+0xef
0f 0027ebf4 773d3bd7 0027ecfc 0027ece4 003d3538 ole32!CCliModalLoop::SendReceive+0x1e
10 0027ec6c 77403efb 003d3538 0027ecfc 0027ece4 ole32!CAptRpcChnl::SendReceive+0x73
11 0027ecc0 76052ff4 003d3538 0027ecfc 0027ece4 ole32!CCtxComChnl::SendReceive+0x1c5
12 0027ecd8 76053055 003e50a4 0027ede4 76052e0d RPCRT4!NdrProxySendReceive+0x43
13 0027ece4 76052e0d 09ea0d59 0027f1fc 070001f3 RPCRT4!NdrpProxySendReceive+0xc
Based on the call stack it appears this thread made an OLE call and is now waiting for some objects to signal.
What is IExplore doing in this thread? It is waiting on another process to service this OLE request. We can find the PID of the other process by digging into the ole32 SendReceive call.
I looked at the first argument:
0: kd> dd 003d3538
003d3538 77403a50 773f57b4 00000003 00000002
003d3548 00000000 00000000 003d2ef0 003d8840
003d3558 0038b1c0 003e05f8 77403980 00070005
This gives us a pointer to another data structure that contains the information we are attempting to locate.
0: kd> dd 003d2ef0
003d2ef0 774ee600 003d2e70 00000d34 00000000
003d2f00 862941fa c94cde09 55a88424 801a2601
003d2f10 55a88424 801a2601 0000c400 00000d34
Ah there it is! 0d34 is the process id servicing our OLE request.
I went back and reviewed the !process 0 0 output to locate the process containing PID 0D34. I’m dumping it here.
Ok, so the process handing this request is IEUSER.exe, which is the broker process used by Internet Explorer in Vista when running in protect mode. This is new in Vista.
More on IEUser and Protected mode can be seen at the following location:http://msdn.microsoft.com/en-us/library/bb250462.aspx
Now I needed to dig into the IEUser.exe process to find out what it was working on causing Iexplore to wait.
0: kd> !process 861a9d90
I discovered the thread in the ieuser.exe process working on our call from Iexplore.exe. However this thread is also waiting on access to something before it can complete.
0: kd> kvn+200
00 a41b0c50 81a940a1 85ecbac0 85ecbb48 81af9920 nt!KiSwapContext+0x26
01 a41b0c94 81a2ddc8 85ecbac0 00000000 85e18510 nt!KiSwapThread+0x44f
02 a41b0ce8 81bf0d9f 85e18510 00000006 00000001 nt!KeWaitForSingleObject+0x492
03 a41b0d50 81a30b3a 00000224 00000000 00000000 nt!NtWaitForSingleObject+0xbe
04 a41b0d50 77568254 00000224 00000000 00000000 nt!KiFastCallEntry+0x12a
05 0163e8a4 77567a0c 7708c7c4 00000224 00000000 ntdll!KiFastSystemCallRet
06 0163e8a8 7708c7c4 00000224 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
07 0163e918 7708c732 00000224 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xbe
08 0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
09 0163e948 747e5fe0 0163ec48 00497b38 00000000 RASAPI32!ReadPhonebookFileEx+0x1a4
0a 0163e968 747e4e9d 0163ec48 00000000 00000000 RASAPI32!ReadPhonebookFile+0x1b
0b 0163e9a8 747e52d3 0163ec48 0048f660 0163e9e0 RASAPI32!DwEnumEntriesFromPhonebook+0x35
0c 0163ee58 747e4fd4 0163ee9c 00000001 0048f660 RASAPI32!DwEnumEntriesInDir+0x1cc
0d 0163f0ac 747e677e 00000001 0048f660 0163f0f4 RASAPI32!DwEnumEntriesForPbkMode+0xab
0e 0163f0e0 76e534b9 00000000 00004180 0048f660 RASAPI32!RasEnumEntriesW+0xb0
0f 0163f10c 76e5342a 76ecc2dc 00000000 00000000 WININET!RasEnumHelp::RasEnumHelp+0x55
10 0163f120 76e533a3 00000001 00000000 00000001 WININET!DoConnectoidsExist+0x2b
11 0163f14c 76e534ff 76ecf02c 76ecc2dc 00000000 WININET!GetRasConnections+0x34
12 0163f168 76e4196e 00000000 0163f180 6ee3228c WININET!IsDialUpConnection+0xa9
13 0163f184 76e70564 00000001 00000000 00000000 WININET!FixProxySettingsForCurrentConnection+0x31
14 0163f1a8 6ec2db3b 00000000 0163f1bc 0163f3d4 WININET!InternetInitializeAutoProxyDll+0x6c
15 0163f1b8 75fe9142 004884a0 00000202 0047e900 ieframe!CShdocvwBroker::StartAutoProxyDetection+0x4d
16 0163f1d0 76054294 6ec2daf8 0163f3d8 00000001 RPCRT4!Invoke+0x2a
17 0163f5fc 76054a49 004858d0 0046e4e0 0047e900 RPCRT4!NdrStubCall2+0x27b
18 0163f64c 774e86d7 004858d0 0047e900 0046e4e0 RPCRT4!CStdStubBuffer_Invoke+0xa0
19 0163f694 774e877d 0047e900 0047f968 00480b38 ole32!SyncStubInvoke+0x3c
1a 0163f6e0 7740249e 0047e900 0047f860 004858d0 ole32!StubInvoke+0xb9
1b 0163f7bc 774023af 0046e4e0 00000000 004858d0 ole32!CCtxComChnl::ContextInvoke+0xfa
1c 0163f7d8 774e85df 0047e900 00000001 004858d0 ole32!MTAInvoke+0x1a
1d 0163f808 774e8a18 d0908070 0046e4e0 004858d0 ole32!AppInvoke+0xaa
1e 0163f8e4 774e85ef 0047e8a8 00474118 0047ebc0 ole32!ComInvokeWithLockAndIPID+0x32c
1f 0163f930 75fe94b0 0047ebc0 08ac70b7 0047ebc0 ole32!ThreadInvoke+0x2fd
I’m dumping the first parameter passed to KeWaitForSingleObject to determine the object type.
0: kd> !object 85e18510
Object: 85e18510 Type: (85484d40) Mutant
ObjectHeader: 85e184f8 (old version)
HandleCount: 3 PointerCount: 5
Directory Object: 881f3030 Name: RasPbFile
Ok, we are waiting on a Mutex named RasPbFile. The big question in your mind now is, “who owns this lock”, right?
To determine this I’ll dump the structure for the mutex object:
0: kd> dt _KMUTANT 85e18510
nt!_KMUTANT
+0x000 Header : _DISPATCHER_HEADER
+0x010 MutantListEntry : _LIST_ENTRY [ 0x870b6564 - 0x870b6564 ]
+0x018 OwnerThread : 0x870b6398 _KTHREAD
+0x01c Abandoned : 0 ''
+0x01d ApcDisable : 0 ''
The owner of the mutex is 870b6398. The bad news is this is the same thread I started with in the Iexplore.exe process so we’re looking at a classic deadlock scenario. Thread 870b6398 running under Iexplore.exe has made an OLE call over to the IEuser.exe broker process. The IEuser process needs the mutex owned by the thread that called it in the first place.This turned out to be a bug in the RasApi code, not the browser or IEuser process. The RasAPI code was written in such a way that it would wait for an infinite period of time to gain the Mutex.
0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
The good news is that we have created hotfix 951738 to address this issue. The official KB article is still pending.