Hello All, Ron Riddle here again to share another mystery relating to the XP Mode Auto Publish feature for Windows 7 that had an unexpected root cause. As you may know, the Auto Publish feature allows applications installed within the guest to be automagically available from the start menu of the Windows 7 host. Of course, one requirement for this to work properly is that the feature must be enabled; however, I have encountered two such instances where the feature is listed as ‘Not Available’ within the Virtual PC Settings UI once the guest is started.
Debugging the Issue
The Virtual PC Integration Components Services Application (1-vmsrvc) service, which runs within a vmsrvc.exe instance on the guest, decides whether the Auto Publish feature should be enabled. For non-Windows 7 guests, the first thing 1-vmsrvc does is issue a WMI query to determine whether KB961742 (for an XP guest) or KB961741 (for a Vista guest), which provide Remote Applications Integrated Locally (RAIL) support, have been applied.
Root Cause Analysis
Before we issue the WMI query, we must first activate the CLSID_WbemLevel1Login component. Here’s an excerpt below:
0:009> k
ChildEBP RetAddr
00b1f730 74ef186e wbemprox!CDCOMTrans::DoActualCCI+0x3d
00b1f774 74ef15db wbemprox!CDCOMTrans::DoCCI+0x12d
00b1f830 74ef17e4 wbemprox!CDCOMTrans::DoActualConnection+0x25c
00b1f85c 74ef1ee1 wbemprox!CDCOMTrans::DoConnection+0x25
00b1f89c 01018283 wbemprox!CLocator::ConnectServer+0x7c
00b1fae8 010182da vmsrvc!VPCRAILUpdates::Connect+0xa8
00b1faf4 0101842b vmsrvc!VPCRAILUpdates::QueryInstalledFixes+0xb
00b1fe34 0100b61f vmsrvc!VPCRAILUpdates::CheckIfUpdatesArePresent+0x91
00b1ffb4 7c80b729 vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc+0x90
00b1ffec 00000000 kernel32!BaseThreadStart+0x37
However, I noticed that the activation attempt failed with WBEM_E_CRITICAL_ERROR(0x8004100a).
0:009> r eax
eax=8004100a
So, I proceeded to debug the activation attempt from within the Windows Management Instrumentation(winmgmt) service, since it provides the class factory for this component. Notice the call to LoadLibraryExW passing a relative path for the lpFileName parameter. This means that a search strategy must be applied which leverages the PATH environment variable.
0:002> k
0086f438 594976e2 kernel32!LoadLibraryExW
0086f494 7751d8a7 wmisvc!CForwardFactory::CreateInstance+0xf8
0086f4b8 7751daac ole32!GetInstanceHelperMulti+0x20
0086f578 77e799f4 ole32!CObjServer::CreateInstance+0x251
0086f59c 77ef421a RPCRT4!Invoke+0x30
0086f9a8 77ef4bf3 RPCRT4!NdrStubCall2+0x297
0086fa00 77600c15 RPCRT4!CStdStubBuffer_Invoke+0xc6
0086fa40 77600bbf ole32!SyncStubInvoke+0x33
0086fa88 7752ad31 ole32!StubInvoke+0xa7
0086fb60 7752ac56 ole32!CCtxComChnl::ContextInvoke+0xe3
0086fb7c 776007f5 ole32!MTAInvoke+0x1a
0086fbac 77602df3 ole32!AppInvoke+0x9c
0086fc80 77600715 ole32!ComInvokeWithLockAndIPID+0x2c2
0086fccc 77e794bd ole32!ThreadInvoke+0x1cd
0086fd00 77e79422 RPCRT4!DispatchToStubInC+0x38
0086fd54 77e7934e RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x113
0086fd78 77e8a384 RPCRT4!RPC_INTERFACE::DispatchToStub+0x84
0086fdb8 77e8a3c5 RPCRT4!RPC_INTERFACE::DispatchToStubWithObject+0xc0
0086fdf8 77e7bcc1 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x2cd
0086fe1c 77e7bc05 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d
0:002> du poi(esp+4)
59491668 "wbemcore.dll"
Surprisingly, I found that the requested library could not be found using the standard search strategy.
0:002> !gle
LastErrorValue: (Win32) 0x7e (126) - The specified module could not be found.
LastStatusValue: (NTSTATUS) 0xc0000135 - {Unable To Locate Component} This application has failed to start because %hs was not found. Re-installing the application may fix this problem.
I then decided to enable boot logging within the Process Monitor tool to catch a glimpse into why the load for wbemcore.dll was failing. Here’s an excerpt from the Process Monitor log:
8:18:11.1652951 PM svchost2.exe 1628 QueryOpen C:\WINDOWS\system32\%SystemRoot%\system32\wbemcore.dll PATH NOT FOUND
8:18:11.1653627 PM svchost2.exe 1628 QueryOpen C:\WINDOWS\system32\%SystemRoot%\wbemcore.dll PATH NOT FOUND
8:18:11.1654161 PM svchost2.exe 1628 QueryOpen C:\WINDOWS\system32\%SystemRoot%\System32\Wbem\wbemcore.dll PATH NOT FOUND
At this point, it became clear to me that something must be wrong with the PATH environment variable configuration because the SystemRoot environment variable was not being properly expanded. Sure enough, the registry showed that the Path value was of type REG_SZ rather than REG_EXPAND_SZ.
Once I saved off the data for the Path value and recreated it specifying the proper type(REG_EXPAND_SZ), the issue was resolved!
Sidenote on the Debugging Strategy
I chose to debug the services by attaching ntsd.exe and redirecting the session to the kernel debugger. Alternatively, I could have chosen to use a remote debugging session since it’s much more natural than redirecting to the kernel debugger, in my opinion. However, this issue was further complicated by the fact that introducing a debugger in the mix had potential to change the timing enough such that I struggled to reproduce the issue using a remote session. The reason for this is that if we slow down the 1-vmsrvc execution enough, the winmgmt service itself will attempt to load wbemcore.dll, albeit through an activation request where an absolute path is specified, thereby avoiding application of a search strategy, which will succeed. Thus, when 1-vmsrvc attempts to activate CLSID_WbemLevel1Login, it will now succeed because wbemcore.dll is already loaded and the Auto Publish feature will now be ‘Enabled’!
Configuration
I leveraged the Image File Execution Options key, creating a new entry for vmsrvc.exe and configuring the Debugger value with the following command line:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\vmsrvc.exe]
"Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\vmsrvc -c \"bu vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc;bu wbemprox!CDCOMTrans::DoActualCCI;g\""
I also isolated the winmgmt service into its own svchost.exe, copied %systemroot%\system32\svchost.exe to %systemroot%\system32\svchost2.exe, and then created a new key for svchost2.exe with the following command line:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\svchost2.exe]
"Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\winmgmt -c \"bu wmisvc!CForwardFactory::CreateInstance;g\""
Although redirecting ntsd.exe to the kernel debugger was cumbersome from a usability perspective, I found the ability to debug both processes from a central facility very appealing and worthwhile.
Conclusion
For issues like these that ultimately resolve to a misconfiguration of the OS, I can’t help but think how unfortunate it was that I didn’t stumble onto root cause sooner via the routine task of launching some executable from a command shell. Surely this would have been a red flag and could have saved me a lot of time debugging!
While this was a rather extreme example of how a misconfiguration of the OS can affect other seemingly unrelated parts such as the Auto Publish feature of XP Mode, the take-away here is when you detect that the Auto Publish feature is ‘Not Available’, you should begin with standard WMI troubleshooting; and, as we’ve just seen here, a quick sanity check of the environment might not be a bad idea either! :)
Until next time, happy debugging!
Hello, my name is Venkatesh Ganga, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team. Recently, I worked on an issue where registry reflection was not happening while installing 32bit Office under the system account. This required looking into both the 32bit Office code and the Wow64 code where the registry reflection is implemented. When attaching to the Wow64 process using the 32bit debugger it’s like debugging a 32bit process on the 32bit machine; there are no 64bit binaries in the process. However, we needed to debug Wow64 to debug the registry reflection code. To do this we attached to the Wow64 process using the 64bit debugger which allows you to see the Wow64 binaries.
Ø lm
Base TimeStamp Module
400000 42435b2a Mar 24 18:28:26 2005 C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE
77ec0000 45d6cc72 Feb 17 03:35:46 2007 C:\WINDOWS\system32\ntdll.dll
6b000000 45d6943d Feb 16 23:35:57 2007 C:\WINDOWS\system32\wow64.dll
6b280000 45d695f3 Feb 16 23:43:15 2007 C:\WINDOWS\system32\wow64win.dll
78b80000 42438b7a Mar 24 21:54:34 2005 C:\WINDOWS\system32\wow64cpu.dll
There are 2 options for debugging Wow64 applications.
1. Using the 64bit debugger and Wow64 debugger extension (Wow64exts.dll)
2. Using the 32bit debugger
Using 64bit debugger and Wow64 debugger extension (Wow64exts.dll)
I ran the 32bit version of Internet Explorer on a 64bit machine and attached to it using the 64bit debugger. Here is the thread 0 call stack when viewed from 64bit debugger.
0:000> kL
Child-SP RetAddr Call Site
00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9
00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5
00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa
00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed
00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538
00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f
00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15
00000000`0013ffa8 00000000`00000000 0x77d59620
00000000`0013ffb0 00000000`00000000 0x0
00000000`0013ffb8 00000000`00000000 0x0
00000000`0013ffc0 00000000`00000000 0x0
00000000`0013ffc8 00000000`00000000 0x0
00000000`0013ffd0 00000000`00000000 0x0
00000000`0013ffd8 00000000`00000000 0x0
00000000`0013ffe0 00000000`00000000 0x0
00000000`0013ffe8 00000000`00000000 0x0
00000000`0013fff0 00000000`00000000 0x0
00000000`0013fff8 00000000`00000000 0x0
00000000`00140000 00000020`78746341 0x0
00000000`00140008 00005370`00000001 0x20`78746341
The above stack only shows the 64 bit calls; we cannot see what the 32 bit calls are doing. To get the 32bit stack you must use one of the below methods.
Ø Option 1 : Run “!wow64exts.k”
0:000> !wow64exts.k
Walking 64bit Stack…
Walking 32bit Stack...
002ded98 75ec1c83 USER32!NtUserWaitMessage+0x15
002dee24 75ec61ef BROWSEUI!BrowserProtectedThreadProc+0x44
002dfea8 779ba3a6 BROWSEUI!SHOpenFolderWindow+0x22c
002dfec8 0040243d SHDOCVW!IEWinMain+0x129
002dff1c 00402748 IEXPLORE!WinMain+0x316
002dffc0 7d4e7d2a IEXPLORE!WinMainCRTStartup+0x186
002dfff0 00000000 KERNEL32!BaseProcessStart+0x28
Ø Option 2 : Switch to x86 mode (using “!wow64exts.sw”) and do KB.
0:000> !wow64exts.sw
Switched to 32bit mode
0:000:x86> kb
ChildEBP RetAddr Args to Child
002ded98 75ec1c83 002f1be8 002dee50 002f1be8 USER32!NtUserWaitMessage+0x15
002dee24 75ec61ef 002f1be8 002f1be8 00000000 BROWSEUI!BrowserProtectedThreadProc+0x44
002dfea8 779ba3a6 002f1be8 00000001 00000000 BROWSEUI!SHOpenFolderWindow+0x22c
002dfec8 0040243d 002e2508 00000001 ffffffff SHDOCVW!IEWinMain+0x129
002dff1c 00402748 00400000 00000000 002e2508 IEXPLORE!WinMain+0x316
002dffc0 7d4e7d2a 00000000 00000000 7efdf000 IEXPLORE!WinMainCRTStartup+0x186
002dfff0 00000000 004025c2 00000000 000000c8 KERNEL32!BaseProcessStart+0x28
The easiest way to see all of the 32bit call stacks is by switching to 32bit mode (!wow64exts.sw) and doing ~*k. In addition, you can set breakpoints in 32bit or 64 bit binaries using the 64bit debugger. Also note “!peb” will show both the 64bit and 32bit PEB.
Using the 32bit debugger
As mentioned earlier there is nothing wrong with using the 32 bit debugger. If you just need to debug the application’s 32bit code, using it is probably the simplest approach. However, if you need to view Wow64 code or binaries, you must use the 64bit debugger. Note that these techniques apply to debugging Wow64 dumps and live processes.
You can find more information about WoW64 applications at http://msdn.microsoft.com/en-us/library/aa384249(VS.85).aspx
Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.
A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.
This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.
Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.
Let's see what we can tell from the dump!
The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:
kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 8738e300, The address that the exception occurred at
Arg3: b9b3dc7c, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
FAULTING_IP:
+ffffffff8738e300
8738e300 0000 add [eax],al
TRAP_FRAME: b9b3dc7c -- (.trap ffffffffb9b3dc7c)
ErrCode = 00000002
eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64
eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0 nv up ei pl zr na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
8738e300 0000 add [eax],al ds:0023:00000001=??
Resetting default scope
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x8E
LAST_CONTROL_TRANSFER: from 8051d6a7 to 8053331e
STACK_TEXT:
WARNING: Frame IP not in any known module. Following frames may be wrong.
b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet
0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc
77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93
...
From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.
What’s at 0x8738e300?
kd> dc 8738e300
8738e300 00000000 00000001 00000000 87360350 ............P.6.
8738e310 00000000 00000001 f71af9fe f71b0030 ............0...
8738e320 f71afb0e f71afbb4 f71b0098 f71b0214 ................
8738e330 f71afef6 f71aff8e 07fef800 00000000 ................
8738e340 f71afffc 00000000 0a0e000a 644c6d4d ............MmLd
8738e350 8732ea58 870303e0 ffffffff ffffffff X.2.............
8738e360 00000012 00000000 f797f000 f7989905 ................
8738e370 0000c000 00500050 e1971458 00160016 ....P.P.X.......
kd> !pool 8738e300 2
Pool page 8738e300 region is Nonpaged pool
*8738e2f8 size: 50 previous size: 8 (Allocated) *NV
Owning component : Unknown (update pooltag.txt)
Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).
So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.
kd> kv L8
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090
The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:
kd> ub bf801619
win32k!ValidateHwnd+0x2f:
bf8015ff 8d1c82 lea ebx,[edx+eax*4]
bf801602 8bc1 mov eax,ecx
bf801604 c1e810 shr eax,0x10
bf801607 663b430a cmp ax,[ebx+0xa]
bf80160b 75ad jnz win32k!ValidateHwnd+0x3d (bf8015ba)
bf80160d 807b0801 cmp byte ptr [ebx+0x8],0x1
bf801611 7573 jnz win32k!ValidateHwnd+0xff (bf801686)
bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]
OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.
kd> dd bf98b2e0 L1
bf98b2e0 804e4a15
Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?
kd> ln 804e4a15
(804e4a15) nt!PsGetCurrentThread | (804e4a51) nt!KeSetEventBoostPriority
Exact matches:
nt!PsGetCurrentThread = <no type information>
Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.
So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.
At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)
Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:
Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!
The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!
Written by Jeff Dailey
Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR Platforms team. This blog entry is a follow on for how to detect a hung window. This process and training lab is right out of our CPR Training curriculum. In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post. You will also need to install the debugging tools for windows.
Debugging tools: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx
Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx
After you have both of these installed we can get started. We are going to debug and figure out why the window stops repainting and does not respond.
Step 1 start badwindow.exe
Step 2 run dumphungwindow.exe
Step 3 select Hang \ Hang Type 1 from the BadWindow.exe menu.
You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process
************ OUTPUT *************
C:\source\dumphungwindow\release>dumphungwindow.exeDumps will be saved in C:\Users\jeffda\AppData\Local\Temp\scanning for hung windows
**
Hung Window found dumping process (7924) badwindow.exeDumping unresponsive processC:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp
Dump complete
Hung Window found dumping process (7924) badwindow.exeDumping unresponsive processC:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_46_Pid7924_badwindow.exe.dmp
Hung Window found dumping process (7924) badwindow.exeDumping unresponsive processC:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp
Dump complete*
Step 4 create a local symbol directory at C:\websymbols
Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.
Step 6 start windbg select file\open crash dump and select the first dump file.
Your initial output should look like this.
Microsoft (R) Windows Debugger Version 6.7.0001.0
Copyright (c) Microsoft Corporation. All rights reserved.
***** WARNING: Your debugger is probably out-of-date.
***** Check http://dbg for updates.
Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]
User Mini Dump File with Full Memory: Only application data is available
Symbol search path is: Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Vista Version 6000 MP (2 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)
System Uptime: 11 days 18:41:43.089
Process Uptime: 0 days 0:00:32.000
....................................
Loading unloaded module list
.
eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=004e3750 edi=004d9c60
eip=777fb2b2 esp=0017ee18 ebp=0017ee44 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
ntdll!NtAlpcSendWaitReceivePort+0x12:
777fb2b2 c22000 ret 20h
Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it. Type ~* k
You will most likely see output similar to this.
. 0 Id: 1ef4.17a8 Suspend: 0 Teb: 7efdd000 Unfrozen
ChildEBP RetAddr 0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x120017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x480017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x250017eeb4 76f5763a rpcrt4!NdrSendReceive+0x310017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x90017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x190017f34c 756e5d34 dnsapi!Query_PrivateExW+0x2570017f388 756e6025 dnsapi!Query_Shim+0xbb0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x200017f3e0 74f4611c mswsock!SaBlob_Query+0x2d0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x1290017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x170017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa40017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x740017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52
1 Id: 1ef4.f80 Suspend: 0 Teb: 7efda000 Unfrozen
ChildEBP RetAddr 0275fef0 76e41220 ntdll!ZwWaitForSingleObject+0x150275ff60 76e41188 kernel32!WaitForSingleObjectEx+0xbe*** ERROR: Symbol file could not be found. Defaulted to export symbols for FwcWsp.dll - 0275ff74 55620611 kernel32!WaitForSingleObject+0x12WARNING: Stack unwind information not available. Following frames may be wrong.0275ff88 5561f10c FwcWsp!RatDumpStack+0x1bd0275ffa0 76eb19f1 FwcWsp!RatDbgPrint+0x1560275ffac 7782d109 kernel32!BaseThreadInitThunk+0xe0275ffec 00000000 ntdll!_RtlUserThreadStart+0x23
2 Id: 1ef4.1d04 Suspend: 0 Teb: 7efd7000 UnfrozenChildEBP RetAddr 028cfe90 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15028cff2c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d028cff48 5560c841 kernel32!WaitForMultipleObjects+0x18WARNING: Stack unwind information not available. Following frames may be wrong.028cff84 5560c9c0 FwcWsp!WspDisable+0x1b47028cffa0 76eb19f1 FwcWsp!WspDisable+0x1cc6028cffac 7782d109 kernel32!BaseThreadInitThunk+0xe028cffec 00000000 ntdll!_RtlUserThreadStart+0x23
Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 1ef4.1d04 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efd7000 Unfrozen
Each of these threads represents a call stack. The most recent call is at the TOP of the stack. As each call is made the stack grows larger. In this case thread 0 is not showing us all it’s stack. We can get more of the stack be specifying the number of frames we would like to see.
In this case my current thread is 0 so I will issue a k 123 command asking for as many 123 calls to be displayed.
0:000> k 123
0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12
0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f
0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48
0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c
0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25
0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31
0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9
0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f
0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19
0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257
0017f388 756e6025 dnsapi!Query_Shim+0xbb
0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20
0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d
0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129
0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c
0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17
0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c
0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4
0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74
0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52
0017f7e0 76d7dba3 ws2_32!getxyDataEnt+0xbd
0017fa1c 0040147d ws2_32!gethostbyname+0xb4
0017fbc8 004012fc badwindow!hangtype1+0x3d [c:\source\badwindow\badwindow\badwindow.cpp @ 310]
0017fc24 772a87af badwindow!WndProc+0xdc [c:\source\badwindow\badwindow\badwindow.cpp @ 274]
0017fc50 772a8936 user32!InternalCallWinProc+0x23
0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109
0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380
0017fd3c 004010fb user32!DispatchMessageW+0xf
0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]
0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]
0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
As you can see our current thread is the WinMain thread. Note this from the call badwindow!wWinMain, this winmain is processing messages, you can see it calling DispatchMessageW this goes back into User32, and user32 has to rout this message for the proper WinProc that is registered for the associated Window.
Looking further up the stack you see our winproc being called badwindow!WndProc. Remember it’s critical that your WinProc stays available to process incoming messages. If it makes a blocking call or runs in a loop you will not be able to process incoming messages and your window will not repaint and your application will appear to be hung.
If you look deeper yet into the call stack you can see that the WndProc calls hangtype1, this calls gethostbyname. http://msdn2.microsoft.com/en-us/library/ms738524.aspx
Looking at the docs, Gethostbyname takes one parameter being the server name. Looks dump it out and see what we are looking for.
0:000> kv 123
0017ee14 76f584c5 00000424 40020000 004d9c60 ntdll!NtAlpcSendWaitReceivePort+0x12 (FPO: [8,0,0])
0017ee44 76f5899d 00020000 004d9c60 004e3638 rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f (FPO: [Non-Fpo])
0017ee78 76f58514 0017eed8 0017ef04 0017f2e8 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48 (FPO: [Non-Fpo])
0017ee90 76f57785 0017eed8 0017ef04 0017eeb4 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c (FPO: [Non-Fpo])
0017eea0 76f5766b 0017eed8 756e54f0 0017ef04 rpcrt4!I_RpcSendReceive+0x25 (FPO: [Non-Fpo])
0017eeb4 76f5763a 0017ef04 004d9d18 76ff0146 rpcrt4!NdrSendReceive+0x31
0017eec0 76ff0146 59a28201 00000000 00000000 rpcrt4!NdrpSendReceive+0x9 (FPO: [0,0,0])
0017f2c8 756e5ff3 756e54f0 756e5aec 0017f2e8 rpcrt4!NdrClientCall2+0x18f
0017f2e0 756e5ef9 00000000 0279091c 00000001 dnsapi!R_ResolverQuery+0x19 (FPO: [Non-Fpo])
0017f34c 756e5d34 0279091c 00000001 14000000 dnsapi!Query_PrivateExW+0x257 (FPO: [Non-Fpo])
0017f388 756e6025 00000001 0279091c 00000001 dnsapi!Query_Shim+0xbb (FPO: [Non-Fpo])
0017f3b0 74f461b4 0279091c 00000001 14000000 dnsapi!DnsQuery_W+0x20 (FPO: [Non-Fpo])
0017f3e0 74f4611c 0279091c 00000001 14000000 mswsock!SaBlob_Query+0x2d (FPO: [Non-Fpo])
0017f424 74f459db 0279091c 0031c2e0 00307118 mswsock!Rnr_DoDnsLookup+0x129 (FPO: [Non-Fpo])
0017f6c0 76d75a72 027908b8 00000000 0017f794 mswsock!Dns_NSPLookupServiceNext+0x25c (FPO: [Non-Fpo])
0017f6d8 76d75a52 00307270 027908b8 00000000 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17 (FPO: [Non-Fpo])
0017f6f4 76d75994 0031bc18 00000000 0017f794 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c (FPO: [Non-Fpo])
0017f720 76d7590a 00307130 00000000 0017f794 ws2_32!NSQUERY::LookupServiceNext+0xa4 (FPO: [Non-Fpo])
0017f764 76d7b11f 00307118 00000000 0017f794 ws2_32!WSALookupServiceNextW+0x74 (FPO: [Non-Fpo])
0017f7b4 76d7ab77 00307118 00000000 0017f7ec ws2_32!WSALookupServiceNextA+0x52 (FPO: [Non-Fpo])
0017f7e0 76d7dba3 00307118 0000013c 0017fa38 ws2_32!getxyDataEnt+0xbd (FPO: [Non-Fpo])
0017fa1c 0040147d 0017fa38 00000000 00070520 ws2_32!gethostbyname+0xb4 (FPO: [Non-Fpo])
0017fbc8 004012fc 00401220 0017fbfc 00401220 badwindow!hangtype1+0x3d (FPO: [Uses EBP] [0,101,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 310]
0017fc24 772a87af 00070520 00000111 00008003 badwindow!WndProc+0xdc (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]
0017fc50 772a8936 00401220 00070520 00000111 user32!InternalCallWinProc+0x23
0017fcc8 772a8a7d 00000000 00401220 00070520 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])
0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])
0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])
0017ff0c 00401817 00400000 00000000 004a0f84 badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]
0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]
0017ffac 7782d109 7efde000 0017206b 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])
0:000> da 0017fa38
0017fa38 "AServerThatDoesNotExist14"
Generally speaking it’s not a good idea to make any call that could possibly block in your winproc.
Let’s look deeper into what the caller (hangtype1) is really doing, Feel free to look at the C code as it was included with the project. I like to look at the assembler, it’s nice practice for those times you don’t have the source. We can take the return address from the gethostbyname call 0040147d to figure out how it was called. A neat trick for disassembling the entire function based on an address within that functions code range is to use the unassembled function or UF command. In the following sample output I’ll simply UF the return address of the gethostbyname call. I’ll then walk trough and annotate the disassembly. Then we will look at the C source.
For reference on intel cpu programming see http://developer.intel.com/design/pentium/manuals/24319001.pdf
0:000> uf 0040147d
We are making run for locals on the stack, 402 dec bytes. If you look at the source you will see this is a sznamebuff[400], int I, and pointer to a host structure.badwindow!hangtype1 [c:\source\badwindow\badwindow\badwindow.cpp @ 298]:298 00401440 81ec94010000 sub esp,194h
This is done for stack checking to ensure none is trying to exploit our stack. 298 00401446 a100304000 mov eax,dword ptr [badwindow!__security_cookie (00403000)]298 0040144b 33c4 xor eax,esp298 0040144d 89842490010000 mov dword ptr [esp+190h],eax
Save ebx register state on the stack298 00401454 53 push ebx
Get the function pointer for sprint from the import table.299 00401455 8b1d7c204000 mov ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]Save the stack base / frame pointer.299 0040145b 55 push ebpGet the address of gethostbyname from the import table and save it in EBP.299 0040145c 8b2d20214000 mov ebp,dword ptr [badwindow!_imp__gethostbyname (00402120)]Save ESI and EDI on the stack.299 00401462 56 push esi299 00401463 57 push ediZero out ESI299 00401464 33f6 xor esi,esibadwindow!hangtype1+0x26 [c:\source\badwindow\badwindow\badwindow.cpp @ 306]:Push ESI on the stack and the first pass it will be zero == 0, this is the 3rd parameter to the call we are about to make.306 00401466 56 push esi We are loading the effective address of something on the stack, we will see what it is used for when we use EAX, remember that.306 00401467 8d442414 lea eax,[esp+14h]Hum what is this?. badwindow!`string' (004021ac) Well when in doubt dump it out.0:000> db 004021ac004021ac 41 53 65 72 76 65 72 54-68 61 74 44 6f 65 73 4e AServerThatDoesN004021bc 6f 74 45 78 69 73 74 25-64 00 00 00 57 65 20 61 otExist%d...We a004021cc 72 65 20 69 6e 20 68 61-6e 67 74 79 70 65 32 74 re in hangtype2t004021dc 68 72 65 61 64 61 00 00-57 65 20 61 72 65 20 69 hreada..We are i004021ec 6e 20 68 61 6e 67 74 79-70 65 32 74 68 72 65 61 n hangtype2threa004021fc 64 62 00 00 48 00 00 00-00 00 00 00 00 00 00 00 db..H...........0040220c 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................0040221c 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................Ok so we are pushing our print formatting string for sprintf this is the second parameter to the call we are about to make.306 0040146b 68ac214000 push offset badwindow!`string' (004021ac)
Ok here we are pushing EAX, remember eax is just an address on our stack, and in this case the 1st parameter passed to sprint, it’s our BUFFER. szNameBuffer to be exact. 306 00401470 50 push eax
Call sprint (Remember we stored it’s address in ebx)306 00401471 ffd3 call ebx
Stacks grown down or negative as they get larger, so adding in this case is cleaning up the stack from our sprintf call 306 00401473 83c40c add esp,0Ch
Get the address of the szNameBuffer307 00401476 8d4c2410 lea ecx,[esp+10h]
Push the address of the name buffer on the stack.307 0040147a 51 push ecx
Call gethostbyname (Rember we stored it’s address in ebp the) NOTE when you do a return the value is stored in EAX (REMBER THIS)307 0040147b ffd5 call ebp
Push 250 dec FA hex on the stack. (This is our return address from get host by name we see on the stack) 310 0040147d 68fa000000 push 0FAh
Store EAX in EDI We need to save it because EAX will be over written when we come back from Sleep and we may want to use it later. Eax is the return value from gethostbyname.310 00401482 8bf8 mov edi,eax
Call sleep it takes 1 parameter and that values was just pushed on the stack 250ms ¼ second.310 00401484 ff150c204000 call dword ptr [badwindow!_imp__Sleep (0040200c)]Increment ESI this is where we are storing the value int i. It was zero the first time through now it’s ESI+1 if we loop it will grow311 0040148a 83c601 add esi,1
Compare ESI we know this is in I to 0x64h or 100dec.312 0040148d 83fe64 cmp esi,64hWhere we have a jump if greater then, that is to say if ESI > 64h or 100dec jump to 00401496312 00401490 7f04 jg badwindow!hangtype1+0x56 (00401496)Now we test EDI, this is the return value from gethostbyname. TEST is a logical compare. So if you test EDI and EDI and EDI = 0304 00401492 85ff test edi,ediIf EDI is ZERO 0 jump to this address je is jump if ZERO this take us to the TOP of the loop.304 00401494 74d0 je badwindow!hangtype1+0x26 (00401466)
The rest of this is us cleaning up the stack and getting ready to drop out of the function. We can stop here.318 00401496 8b8c24a0010000 mov ecx,dword ptr [esp+1A0h]318 0040149d 5f pop edi318 0040149e 5e pop esi318 0040149f 5d pop ebp318 004014a0 5b pop ebx318 004014a1 33cc xor ecx,esp318 004014a3 e8c5010000 call badwindow!__security_check_cookie (0040166d)318 004014a8 81c494010000 add esp,194h318 004014ae c3 ret
Here is the C code snipped for hangtype1
void hangtype1(void)
{
int i =0;
char szNameBuff[400];
struct hostent * myhost=NULL;
// Lets see if any of my favorte machines are on the network.
while (!myhost)
sprintf(szNameBuff,"AServerThatDoesNotExist%d",i);
myhost = gethostbyname (szNameBuff);
// We are hitting network resource in a loop.
// Let's not hit the server with to many request at once. We will pause 250ms.
Sleep(250);
i++;
if(i>100)
break;
}
So what is happening, if you recall we saw AServerThatDoesNotExist14 on the stack. We know from this loop and the format string passed to sprint that this loop will run to 100 and will sleep 250ms with each pass. That’s 25 seconds total just in sleeps not to mention the network stall on a bad server name. I realize this code is somewhat contrived but it’s based on the types of things we actually see and troubleshoot every day.
So how do we get this work done without stalling the application?. The answer is rather simple. We go multithreaded. You can look for hangtype1fix in the project. Instread of calling hangtype1 from our windproc directly we declare a threadproc called hangtype1fixthread
void __cdecl hangtype1fixthread(void *)
// our blocking function.
hangtype1();
Then from our winproc instead of calling hangtype1 we call another function called hangtype1fix that creates a new thread that in turn executes out hangtype1 code in another thread context so our winproc can continue to execute and process incoming messages. The being thread call is asynchronously . The thread is create immediately and your thread continues to run.
void hangtype1fix(void)
_beginthread(hangtype1fixthread, 12000, NULL);
I hope you found this blog informative and I welcome your feedback.
Thanks Jeff-
Hi my name is Jason. I’m an escalation engineer on the Microsoft critical problem resolutions platform team. We recently dealt with an interesting issue that I would like to share with you and show how we went about discovering the cause.
The customer stated that when his server (Windows Server 2003 SP1) boots up the DNS Server service does not become availabe/responsive for ~5 minutes. Customer also stated he has other DC's in this domain that are the same model server (same specs too) that do not have the problem. So I ask, "What do you mean not responsive?". During this 5 minute period, the administrator cannot use the DNS MMC to administer the server, and DNS lookups to this server fail with a timeout error. The customer can: Ping the server, connect to file shares, etc.
Troubleshooting So my initial question was "What is DNS doing for 5 minutes?". In order to determine where dns.exe was hung up or spinning his wheels, I replaced dns.exe with the checked build.
Walkthrough:
1. Downloaded the checked build of SP1
2. Extracted and expanded the checked build of dns.exe (What is a checked build?)
3. Worked through the process of replacing the free build of DNS.exe (Booting into safe mode in order to bypass WFP (Windows file protection))
4. Created the dnsdebug file (a text file with no extension under %windir%\system32\dns). The checked build of DNS looks for this file, and reads the contents to determine the level of output.
5. In the file I set the debug flags for DNS_DEBUG_INIT (0x00000010 ) & DNS_DEBUG_INIT2 (0x01000000) and set the output to go to the debugger (0x00000002), net result (0x01000012). The image below shows contents of the file.
6. Instead of using the debugger I decided to use DebugView from Sysinternals to view the output. I could have attached a debugger to DNS.exe to see the output, but in this case the DebugView was sufficient.
With all this in place we rebooted. We launched DebugView and did a "net start dns". Here is a screen scrape of what the output looked like:
I have highlighted the debug flag I have set.
From this output we were able to determine that the Server was going through the following process:
1. DNS gets started by Service Control Manger (a.k.a. SCM) and then DNS communicates back to SCM (across the NtControlPipe)
2. DNS loads the zones from the registry (HKLM\Software\Microsoft\Windows NT\Current Version\DNS Server\Zones
3. DNS pulls this zones from the Active Directory (as this is an AD integrated zone)
During these phases the ability to respond to administrative requests and name lookup requests is blocked (this is by design). So based on this I needed to figure out where we were spending the most of our 5 minutes.
1. For the SCM portion, I used Filemon monitoring only Named Pipes for dns.exe and services.exe.
2. For the registry portion, I used Regmon with a filter for dns.exe
3. For the AD portion, I used perfmon with the NTDS>LDAP Searches/Sec counter With these running we were able to isolate the delay to the point where DNS was pulling the zones from AD. Here is a scrape of me running this test on my VM:
You can see (top left) the service communicating back to SCM across the NtControlPipe. Lets take a moment here and have a short discussion how a service starting works.
1. SCM goes to a registry value that will be used as a seed value to build a named pipe. (HKLM\System\CurrentControlSet\Control\ServiceCurrent\(default) = <a number>2. The result is incremented by 1 and then returned back to SCM 3. SCM then creates a named pipe using NtControlPipe as the prefix followed by the numerical value from the registry. So when starting the first service, the pipe name would be NtControlPipe1 (To get a peek at these pipes download PipeList from Sysinternals)4. The SCM then starts the service process5. The service then goes to the same registry value and uses this seed to know which named pipe he will need to talk to in order to convey progress back to SCM.
We can then see (top right) the service pulling in the dns zone information from the registry. Finally we see (bottom) the small spike in LDAP searches/sec (highlighted) where the service was pulling the zone information from the AD. Until the pull of the zones from the registry and AD is complete, the DNS service does not respond to MMC administration or DNS requests.
The majority of the 5 minute delta fell into the last bucket (the LDAP queries to pull the zone data). So begins our next discussion, What causes LDAP to be slow?
So LDAP lives in LSASS.EXE. There are two LDAP interfaces in AD. The entire directory is available of TCP port 389, and then the subset of AD that is published (a.k.a. Global Catalog) over TCP port 3268. The DNS services is pulling the DNS zones from the "Entire Directory" interface.
Lsass memory usage on domain controllers has two components:
In Windows Server 2003, there is no limit to how large the database buffer cache can grow (outside of the limits of the 2GB of virtual address space that can be modified via the /3GB switch). This helps explain the behavior of lsass.exe normally being the largest value in the "Mem Usage" category in task manager on a DC. Q. What can affect the amount of Database buffer cache that LSASS is maintaining "In-RAM? A. Memory pressure. This could be caused by configuration (pagefile settings, etc.) or another application(s) using and paging a lot of memory on the machine. lsass.exe on the affected machine was using 300mb in the "Mem Usage" column in task manager. The machine that was not affected was closer to 450mb. This means there was 150mb more data in the Database buffer cache, and due to this, the process is able to respond a lot faster to LDAP requests since he can pull from the RAM cache more and this limits the amount of time spent pulling this data into the cache from the DIT file on the disk.
The solution is to tune the pagefile settings to match the better performing server.
By simply increasing the pagefile from ~1 gig to ~3gig the DNS Server can be available within ~30 seconds
So what are we (Microsoft) doing about this?
Excerpt from the "Book of Longhorn"
What new DNS functionality is provided by this feature in Windows Server "Longhorn"?
The DNS Server service in Windows Server "Longhorn" includes a number of new and enhanced features compared to the DNS Server service that was available in the Microsoft Windows NT® Server, Windows 2000 Server, and Windows Server® 2003 operating systems. The following sections describe these features.
Background zone loading Very large organizations with extremely large zones that store their DNS data in AD DS sometimes discover that restarting a DNS server can take an hour or more while the DNS data is retrieved from the directory service. The result is that the DNS server is effectively unavailable to service client requests for the entire time that it takes to load AD DS-based zones.
A DNS server running Windows Server "Longhorn" now loads zone data from AD DS in the background while it restarts so that it can respond to requests for data from other zones. When the DNS server starts, it:
Because the task of loading zones is performed by separate threads, the DNS server is able to respond to queries while zone loading is in progress. If a DNS client requests data for a host in a zone that has already been loaded, the DNS server responds with the dat (or, if appropriate, a negative response) as expected. If the request is for a node that has not yet been loaded into memory, the DNS server reads the node's data from AD DS and updates the node's record list accordingly.
Why is this functionality important? The DNS server can use background zone loading to begin responding to queries almost immediately when it restarts, instead of waiting until its zones are fully loaded. The DNS server can respond to queries for the nodes that it has loaded or that can be retrieved from AD DS. This functionality also provides another advantage when zone data is stored in AD DS rather than in a file: AD DS can be accessed asynchronously and immediately when a query is received, while file-based zone data can be accessed only through a sequential read of the file.
I hope you find this information helpful.
Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing; making it much more practical and useful.
ETW is useful from a variety of scenarios, including:
-User & Admin: Being in control of your system and knowing what is going on. -User & Admin: Troubleshooting performance, hardware and OS issues. -Enthusiast: Learning further about the OS and the low level guts of the OS. -Software Developer/ISV/OEM: Investigating issues with your software's interaction with Microsoft OS & technologies -Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including kernel, driver subsystems, up to the user stack.
ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.
A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.
Here is graphical overview of the ETW infrastructure that covers how provider(s) log to high performance in memory buffers, which can be kept in memory in a circular buffer, or written to disk in a sequential or circular fashion.
Today on Windows Vista, Windows 7, and sparingly on earlier OSes; ETW is used by Microsoft Development and Microsoft Support, as well as others; to help troubleshoot issues, find root-cause of bugs, analyze performance; and a large variety of other tasks.
As it stands today, ETW has two major thrusts:
1. Exposing time-based event data similar in practicality to plain old text based logged (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.
First off, shown below is Windows Event Viewer enabling viewing of analytic and debug logs (which show you many more logs than the default view):
A variety of logs can be viewed graphically using the Event Viewer or via the built-in command line OS utility – wevtutil. There are a quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.
Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.
ETW is used to monitor for network events from the kernel which are then added to Process Monitor, a utility from Microsoft Sysinternals. You should also see tools such netsh and NetMon being able to capture and decode ETW traces.
There are many other methods and utilities to view the ETW tracing providers available, which may be found in the OS, via the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools, such as Office Communication System Logging.
The second major thrust of ETW is exposing performance data.
2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).
Performance data can be used in a hybrid fashion both for troubleshooting and performance reasons. Take for example a built in OS tool that uses ETW tracing to provide rich data - Resource Monitor.
Resource Monitor is available from the Windows Task Manager - Performance Tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself; and thus providing a practical exposure of ETW technology for Administrators to troubleshoot their systems. This goes beyond the data that Task Manager provides, which often tends to be based on performance counters which you cannot drill down into further.
Resource Monitor is useful from both a performance standpoint, as well as troubleshooting. For example, you can view top cpu users, top disk users (why your disk is thrashing), or search for a locked file handle that is open.
For a deeper drill-down into performance data XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).
Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.
There have been several posts on the NTDebugging Blog regarding XPerf, other blog posts, as well as the Windows Performance Analysis Dev Center.
In Windows 7 and Windows Server 2008 R2, ETW tracing has even been further improved with many more providers including 600+ inbox providers’ registered on the box. This number increases when add-on features or products are installed (such as Hyper-V).
The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.
Share this post :
Hi, I'm Manish from Global Escalation Services. I would like to present a multiple random bug check issue, which was caused by malicious code (trojan) running on the machine. This is the walkthrough of how we found the virus on the server.
In this particular dump, the machine crashed with Bugcheck 0xA (IRQL_NOT_LESS_OR_EQUAL) because we got a Page Fault at dispatch level, also known as DPC Level (IRQL 2). Windows system architecture governs that we cannot have a page fault at dispatch level because paging requires I/O, I/O requires a wait, and we cannot wait while the IRQL is above dispatch level. So when this anomaly happens Windows will intentionally crash the machine. We trapped in CcMapData because we touched the address c226d800 which was not valid because it’s paged out. If we look at implementation of CcMapData on MSDN http://msdn.microsoft.com/en-us/library/windows/hardware/ff539155(v=vs.85).aspx we see that this function can only be called below Dispatch Level, but the current CPU IRQL is 2.
So how did this happen? The most likely possibility is that some driver on the stack raised the IRQL by calling KeRaiseIrql and then forgot to lower it by calling KeLowerIrql. There are many drivers on this stack and anyone could be the culprit. It is difficult to track when one of these may have done it as it’s long gone.
We could have enabled IRQL Checking using Driver Verifier to find the culprit but we have to enable it on all drivers listed in the stack (and possibly others), which could cause some performance issues. Also this was not the only crash we were seeing; there were various stop codes. So before going the verifier route I decided to dig more into this dump. I noticed there is an address on the stack for which the module name is not getting resolved. This looked odd (suspicious) as most of the time the debugger does a great job of finding the module. I started investigating what is this module. It turned out to be a Trojan “TrojanDropper:Win32/Sirefef.B”. http://www.microsoft.com/security/portal/Threat/Encyclopedia/Entry.aspx?Name=TrojanDropper%3AWin32%2FSirefef.B
0: kd> !analyze -v
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arg1: c226d800, memory referenced
Arg2: d0000002, IRQL
Arg3: 00000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 808b64a6, address which referenced memory
0: kd> kv
f78ae41c 808b64a6 badb0d00 00000000 00000001 nt!KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78ae41c)
f78ae4cc f71a6f2d 8b22d520 f78ae4fc 00000400 nt!CcMapData+0x8c (FPO: [Non-Fpo])
f78ae4ec f71a4494 f78ae7ec 8b64c150 01c6d800 Ntfs!NtfsMapStream+0x4b (FPO: [Non-Fpo])
f78ae560 f71a6df0 f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadMftRecord+0x86 (FPO: [Non-Fpo])
f78ae598 f71a6fac f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadFileRecord+0x7a (FPO: [Non-Fpo])
f78ae5d0 f718e46d f78ae7ec e7c50cd8 e7c50ce0 Ntfs!NtfsLookupInFileRecord+0x37 (FPO: [Non-Fpo])
f78ae67c f718e541 f78ae7ec e7c50cd8 f718e2cb Ntfs!NtfsWalkUpTree+0xbe (FPO: [Non-Fpo])
f78ae6d8 f718e263 f78ae7ec e7c50cd8 00000000 Ntfs!NtfsBuildNormalizedName+0x44 (FPO: [Non-Fpo])
f78ae704 f7196c2e f78ae7ec 8801d600 e7c50da0 Ntfs!NtfsQueryNameInfo+0x4b (FPO: [Non-Fpo])
f78ae774 f71a0ff6 f78ae7ec 88081cb0 8b5f9260 Ntfs!NtfsCommonQueryInformation+0x291 (FPO: [Non-Fpo])
f78ae7d8 f71a102f f78ae7ec 88081cb0 00000001 Ntfs!NtfsFsdDispatchSwitch+0x12a (FPO: [Non-Fpo])
f78ae8f4 8081df85 8b3a7020 88081cb0 88081cb0 Ntfs!NtfsFsdDispatchWait+0x1c (FPO: [Non-Fpo])
f78ae908 f721fd28 8b2fd220 8b60d308 8b54b020 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae934 8081df85 8b5f9260 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])
f78ae948 f721fd28 88081ed0 8b60d308 c000000d nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae974 8081df85 8b54b020 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])
f78ae988 f5c62bec 88081cb0 8b54a980 88081cb0 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
f78ae99c f5c5e3ee 8b03b690 00000000 f78ae9c0 CtxSbx+0x5bec
f78ae9ac 8081df85 8b03b690 88081cb0 8b0528e0 CtxSbx+0x13ee
f78ae9c0 f61074e1 8b0528e0 8801d600 f78aea00nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae9e4 f61075d0 8b0a9b80 00081cb0 f78aea08 CtxAltStr+0x44e1
f78ae9f4 8081df85 8b0a9b80 88081cb0 88081ed0 CtxAltStr+0x45d0
f78aea08 8b5cfc89 00000000 87ef0000 87ef9002 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aec90 8081df85 8b5ef610 8b08c110 8b08c2a8 0x8b5cfc89 <------------ This looks odd what is this module?
f78aeca4 f7241607 8b08c2a8 00000000 f78aece8nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aecb4 f72412b2 8b08c2a8 8b392b70 87f11974 CLASSPNP!SubmitTransferPacket+0xbb (FPO: [Non-Fpo])
f78aece8 f7241533 00000000 00000e00 87f11808 CLASSPNP!ServiceTransferRequest+0x1e4 (FPO: [Non-Fpo])
f78aed0c 8081df85 8b392ab8 00000000 8b2a2670 CLASSPNP!ClassReadWrite+0x159 (FPO: [Non-Fpo])
f78aed20 f74c80cf 8b2b7d80 87f11998 f78aed44 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed30 8081df85 8b6e4020 87f11808 87f119bc PartMgr!PmReadWrite+0x95 (FPO: [Non-Fpo])
f78aed44 f7317053 87f119d8 8b6e8148 882b9888 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed60 8081df85 8b2b7cc8 87f11808 87f119fc ftdisk!FtDiskReadWrite+0x1a9 (FPO: [Non-Fpo])
f78aed74 f72cf4f5 885360cc 88536098 87f26810 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed90 f72d1517 87f11808 885360cc 8b60c470 volsnap!VspDecrementIrpRefCount+0x14f (FPO: [Non-Fpo])
f78aeda8 f72c0398 87f26810 87f26810 88536098 volsnap!VspWriteVolumePhase22+0x3d (FPO: [Non-Fpo])
f78aee30 f72d15d3 00f26810 885360cc 00000000 volsnap!VspAcquireNonPagedResource+0xc6 (FPO: [Non-Fpo])
f78aee50 8081e123 00000000 87f215e8 88536098 volsnap!VspWriteVolumePhase2+0x59 (FPO: [Non-Fpo])
f78aee80 f7241829 f78aeeb0 f72413ec 8b392ab8 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])
f78aee88 f72413ec 8b392ab8 87f215e8 00000001 CLASSPNP!ClassCompleteRequest+0x11 (FPO: [Non-Fpo])
f78aeeb0 8081e123 00000000 8b08c428 8b08c5c0 CLASSPNP!TransferPktComplete+0x1fd (FPO: [Non-Fpo])
f78aeee0 f7266545 8b6e30e8 8b08c428 f78aef24 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])
f78aeef0 f7265a8a 87f856a8 00000001 00000000 SCSIPORT!SpCompleteRequest+0x5e (FPO: [Non-Fpo])
f78aef24 f7265130 8b6e30e8 87f856a8 f78aef9b SCSIPORT!SpProcessCompletedRequest+0x6a7 (FPO: [Non-Fpo])
f78aef9c 8083211c 8b6e30a4 8b6e3030 00000000 SCSIPORT!ScsiPortCompletionDpc+0x2bd (FPO: [Non-Fpo])
f78aeff4 8088dba7 f3f4f92c 00000000 00000000 nt!KiRetireDpcList+0xca (FPO: [Non-Fpo])
f3f4f948 80a603d9 ffdffa02 f3f4f980 f3f4f980 nt!KiDispatchInterrupt+0x37 (FPO: [Uses EBP] [0,0,1])
f3f4f964 80a60577 8b6c226c f3f4f980 8088d91d hal!HalpCheckForSoftwareInterrupt+0x81 (FPO: [Non-Fpo])
f3f4f970 8088d91d 8b304c00 000001a3 f3f4fa04 hal!HalEndSystemInterrupt+0x67 (FPO: [Non-Fpo])
f3f4f970 809395e7 8b304c00 000001a3 f3f4fa04 nt!KiInterruptDispatch+0x5d (FPO: [0,2] TrapFrame @ f3f4f980)
f3f4fa04 80939c37 88139178 88028301 00000008 nt!ObpAllocateObject+0x199 (FPO: [Non-Fpo])
f3f4fa38 808f8d28 00000000 8b76aad0 f3f4fa70 nt!ObCreateObject+0x129 (FPO: [Non-Fpo])
f3f4fb44 80937a40 8b6afd10 00000000 88028398 nt!IopParseDevice+0x710 (FPO: [Non-Fpo])
f3f4fbc4 80933b74 00000000 f3f4fc04 00000042 nt!ObpLookupObjectName+0x5b0 (FPO: [Non-Fpo])
f3f4fc18 808eaee7 00000000 00000000 dffa7c01 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo])
f3f4fc94 808ec181 077ef198 c0140000 077ef14c nt!IopCreateFile+0x447 (FPO: [Non-Fpo])
f3f4fcf0 808eec10 077ef198 c0140000 077ef14c nt!IoCreateFile+0xa3 (FPO: [Non-Fpo])
f3f4fd30 808897ec 077ef198 c0140000 077ef14c nt!NtCreateFile+0x30 (FPO: [Non-Fpo])
f3f4fd30 7c82847c 077ef198 c0140000 077ef14c nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f3f4fd64)
077ef22c 00000000 00000000 00000000 00000000 0x7c82847c
The current CPU IRQL is 2, at which pagefaults cannot be serviced, hence windows crashed the machine.
0: kd> !irql
Debugger saved IRQL for processor 0x0 -- 2 (DISPATCH_LEVEL)
Un-assembling from the return address of this unknown driver, to try to confirm that this is really driver code and that it really does belong in this call stack:
0: kd> ub 8b5cfc89
8b5cfc62 8bd7 mov edx,edi
8b5cfc64 c60605 mov byte ptr [esi],5
8b5cfc67 895e18 mov dword ptr [esi+18h],ebx
8b5cfc6a c7460400020000 mov dword ptr [esi+4],200h
8b5cfc71 c7460809000000 mov dword ptr [esi+8],9
8b5cfc78 c7461c70fb5c8b mov dword ptr [esi+1Ch],8B5CFB70h
8b5cfc7f c64603e0 mov byte ptr [esi+3],0E0h
8b5cfc83 ff158c805d8b call dword ptr ds:[8B5D808Ch]
Sure does call IofCallDriver
0: kd> dps 8B5D808Ch l1
8b5d808c 8081df40 nt!IofCallDriver
So what is this driver? In Windows every image starts with a PE header which starts with letters “MZ”. So I started looking for PE header. I took the offset of the module on the stack and went back to its page boundary, then I started moving 1 page back at a time looking for the PE header.
0: kd> dc 8b5cf000
8b5cf000 0010b9f2 f02b0000 3b063c8b 83137538 ......+..<.;8u..
8b5cf010 c08304e9 04f98304 b85fee73 00000001 ........s._.....
8b5cf020 9cb8c35e b98b5d86 00000010 ff8bd02b ^....]......+...
8b5cf030 3b02348b 83137530 c08304e9 04f98304 .4.;0u..........
8b5cf040 b85fee73 00000001 335fc35e ccc35ec0 s._.....^._3.^..
8b5cf050 83ec8b55 78a04cec 538b5da4 c0b60f56 U....L.x.].SV...
8b5cf060 827ae857 ff330000 33f46589 f845c7db W.z...3..e.3..E.
8b5cf070 00000400 8b084d8b 52510c55 50b4458d .....M..U.QR.E.P
0: kd> dc 8b5cf000-1000
8b5ce000 01c73024 00000000 c70cc483 00800002 $0..............
8b5ce010 845e5f00 b10874db 0815ff01 8b8b5d80 ._^..t.......]..
8b5ce020 6a0c2444 15ff5000 8b5d8028 14c25b5d D$.j.P..(.].][..
8b5ce030 cccccc00 cccccccc cccccccc cccccccc ................
8b5ce040 6a306a56 2415ff00 8b8b5d80 74f685f0 Vj0j...$.].....t
8b5ce050 6a006a53 68026a01 8b5d0892 5d08ae68 Sj.j.j.h..].h..]
8b5ce060 ff006a8b 5d80ac15 ff56508b 5d80a815 .j.....].PV....]
8b5ce070 000d8b8b ff8b5da2 5d80b015 6a006a8b .....].....].j.j
0: kd> dc 8b5cf000-1000*2
8b5cd000 0689c033 89044689 46890846 1046890c 3....F..F..F..F.
8b5cd010 89144689 ec831846 1c46891c 57204689 .F..F.....F..F W
8b5cd020 85244689 8d0574f6 02eb284e c033c933 .F$..t..N(..3.3.
8b5cd030 41890189 08418904 8b34468b 4e891048 ...A..A..F4.H..N
8b5cd040 14508b10 8b145689 4e891848 1c508b18 ..P..V..H..N..P.
8b5cd050 8b1c5689 4e891848 1c508b20 2424448d .V..H..N .P..D$$
8b5cd060 244c8d50 186a510c 8b245689 6a302454 P.L$.Qj..V$.T$0j
8b5cd070 46c65205 46c70005 00000008 0c46c700 .R.F...F......F.
Finally I found the PE header for this image.
0: kd> dc 8b5cf000-1000*3
8b5cc000 00905a4d 00000003 00000004 0000ffff MZ..............
8b5cc010 000000b8 00000000 00000040 00000000 ........@.......
8b5cc020 00000000 00000000 00000000 00000000 ................
8b5cc030 00000000 00000000 00000000 000000d0 ................
8b5cc040 0eba1f0e cd09b400 4c01b821 685421cd ........!..L.!Th
8b5cc050 70207369 72676f72 63206d61 6f6e6e61 is program canno
8b5cc060 65622074 6e757220 206e6920 20534f44 t be run in DOS
8b5cc070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......
Using the built-in debugger extension !dh I dumped the header of this image to find the name. Unfortunately there is no name for this image. This address 8b5cc000 is not in the loaded module list, which raised further suspicion. Either it is hiding its load address or this driver was not loaded by standard loading mechanism. This module date shows it’s pretty recent build.
0: kd> !dh 8b5cc000
File Type: DLL
FILE HEADER VALUES
14C machine (i386)
4 number of sections
4EA3461E time date stamp Sun Oct 23 04:09:26 2011
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
2102 characteristics
Executable
32 bit word machine
DLL
OPTIONAL HEADER VALUES
10B magic #
9.00 linker version
A400 size of code
2000 size of initialized data
0 size of uninitialized data
46C0 address of entry point
1000 base of code
----- new -----
10000000 image base
1000 section alignment
200 file alignment
1 subsystem (Native)
5.00 operating system version
0.00 image version
5.00 subsystem version
10000 size of image
400 size of headers
DF1F checksum
00100000 size of stack reserve
00001000 size of stack commit
00100000 size of heap reserve
00001000 size of heap commit
0 DLL characteristics
0 [ 0] address [size] of Export Directory
C91C [ 3C] address [size] of Import Directory
0 [ 0] address [size] of Resource Directory
0 [ 0] address [size] of Exception Directory
0 [ 0] address [size] of Security Directory
F000 [ 3FC] address [size] of Base Relocation Directory
0 [ 0] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
0 [ 0] address [size] of Thread Storage Directory
0 [ 0] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
C000 [ 1A8] address [size] of Import Address Table Directory
0 [ 0] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory
SECTION HEADER #1
.text name
A354 virtual size
1000 virtual address
A400 size of raw data
400 file pointer to raw data
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
60000020 flags
Code
(no align specified)
Execute Read
SECTION HEADER #2
.rdata name
13AC virtual size
C000 virtual address
1400 size of raw data
A800 file pointer to raw data
40000040 flags
Initialized Data
Read Only
SECTION HEADER #3
.data name
4B0 virtual size
E000 virtual address
200 size of raw data
BC00 file pointer to raw data
C0000040 flags
Read Write
SECTION HEADER #4
.reloc name
576 virtual size
F000 virtual address
600 size of raw data
BE00 file pointer to raw data
42000040 flags
Discardable
0: kd> !lmi 8b5cc000
Loaded Module Info: [8b5cc000]
ffffffff8b5cc000 is not a valid address
0: kd> lmvm 8b5cc000
start end module name
Checking the import table it does have calls to raise and lower the IRQL. We get to the import table by taking the base address plus the offset to the Import Address Table Directory. The below output is just a snippet of the whole table.
0: kd> dps 8b5cc000+c000
8b5d8000 80a603f4 hal!KfLowerIrql
8b5d8004 80a5ff00 hal!KeGetCurrentIrql
8b5d8008 80a600b4 hal!KfRaiseIrql
Dumping the entire image contents in memory to find more clues about this driver. Again, I am only showing snippets of the whole output.
0: kd> dc 8b5cc000 L?10000/4
8b5d5e20 76000000 66697265 77252079 00000a5a ...verify %wZ...
8b5d5e30 31000000 35343332 39383736 33323130 ...1234567890123
8b5d5e40 37363534 31303938 35343332 39383736 4567890123456789
8b5d5e50 33323130 37363534 31303938 35343332 0123456789012345
8b5d5e60 39383736 33323130 37363534 31303938 6789012345678901
8b5d5e70 00343332 66000000 646e756f 67697320 234....found sig
8b5d5e80 7574616e 3d206572 0a752520 b4000000 nature = %u.....
Interestingly this image has other images (modules) embedded in it. We can see other PE headers, which again have no name. This looks highly suspicious and resembles behavior used by malicious software.
8b5d0970 f775c085 4848c78b 4dc35f5e 0300905a ..u...HH^_.MZ...
8b5d0980 04000000 ff000000 b80000ff 00000000 ................
8b5d0990 40000000 00000000 00000000 00000000 ...@............
8b5d09a0 00000000 00000000 00000000 00000000 ................
8b5d09b0 00000000 c8000000 0e000000 000eba1f ................
8b5d09c0 21cd09b4 cd4c01b8 69685421 72702073 ...!..L.!This pr
8b5d09d0 6172676f 6163206d 746f6e6e 20656220 ogram cannot be
8b5d09e0 206e7572 44206e69 6d20534f 2e65646f run in DOS mode. 8b5d2200 4d00004e 0300905a 04000000 ff000000 N..MZ...........
8b5d2210 b80000ff 00000000 40000000 00000000 ...........@....
8b5d2220 00000000 00000000 00000000 00000000 ................
8b5d2230 00000000 00000000 00000000 f0000000 ................
8b5d2240 0e000000 000eba1f 21cd09b4 cd4c01b8 ...........!..L.
8b5d2250 69685421 72702073 6172676f 6163206d !This program ca
8b5d2260 746f6e6e 20656220 206e7572 44206e69 nnot be run in D
8b5d2270 6d20534f 2e65646f 240a0d0d 00000000 OS mode....$....
After ensuring my AV definitions were up to date, I decided to dump this memory contents into a file onto my machine. The moment contents were written my AV Microsoft Forefront Endpoint Protection caught a Trojan in this file.
0: kd> .writemem c:\temp\drv.sys 8b5cc000 L?10000 Writing 10000 bytes................................
Hello. It’s Ryan Mangipano again (Ryanman). Today’s blog will be the first in a multi-part post designed to help you understand the output of the !PTE debuger comand along with the basics of hardware Virtual Addressing. To better understand Virtual Addressing, we will use the debugger to manually translate a 4-KByte Page Table PAE virtual address into the actual physical addresses in order to understand what !PTE is displaying. I’ll provide relevant information about Virtual Addresses and Virtual Memory along the way.
We’ll start by translating a non-prototype valid hardware VA from an x86 PAE system
The actual process of manually decoding a virtual addresses is going to vary according to the architecture (x86, x64), size of the page, whether or not the virtual address is a large page, whether the page is marked as valid, whether it’s a hardware or software PTE, and whether PAE is enabled. For simplicity, we will not be going over the table entry (PTE/PDE) flags until part two of the blog. For my first example, I am going to demonstrate how to use the information in the processor manuals together with the debugger to decode a valid non-prototype virtual address into the physical memory that it references. You can then try this on your own using windbg and the !pte command to validate your findings.
Finding an address to translate
To start, we'll need to locate a virtual address that maps to a valid PTE. I am going to use the following highlighted virtual address which I found in a memory dump.
f9a12d0c ff155400a1f9 call dword ptr [sfilter+0x4054 (f9a10054)]
Get out your processor manuals
The AMD and Intel manuals both contain helpful reference material on this subject. PDFs of these manuals are available online. Since my CPU is an Intel, I’m going to refer to the Intel manuals.
1: kd> !cpuinfo
CP F/M/S Manufacturer MHz PRCB Signature MSR 8B Signature Features
0 15,6,4 GenuineIntel 3192 0000000400000000 a0073fff
My Intel manuals arrived on CD via snail mail a few days after placing my free order:
http://www.intel.com/products/processor/manuals/order.htm
Is PAE in use?
On this Intel x86 system, we first need to determine if we are using PAE. In the Intel “System Programming section 3.6.1 Paging Options” I found that the PAE (Physical Address Extension) flag can be found in bit 5 of the CR4 register. The PG (Paging) flag in CR0 which enables paging must of course also be set. Register bit numbering starts at zero so it’s the fifth bit from the right. Let’s examine cr4 and convert the value it contains into binary:
1: kd> r cr4
cr4=000006f9
1: kd> .formats 000006f9
Binary: 00000000 00000000 00000110 11111001
You can use the debugger to check for other flags in the same manner
You can use the above method to check for other flags that you find documented in the processor manuals. For example, you can see that bit 4 is also set in the cr4 register output above. This is the Page Size Extensions (PSE) bit which enables large page sizes.
Terminology
Paging is simply a method of dividing up the linear address space into chunks. Pages are simply the name that we give to the chunks that result. The size of these sections is referred to as the Page Size. On x86 systems, the standard page size is 4-KBytes. A Large Page means that the page is larger than the standard size (2MB on PAE x86 or 4MB on non-PAE x86).
Keep the last three
On a standard 4-KByte page size virtual address, the address f9a10054 can be thought of as being split up as follows:
Information needed to locate the base of the page in physical memory f9a10
Offset into that physical page once it is found 054
This means that the last three hexadecimal digits of our physical address will also be 0x054. Once we find the base of our physical page(which will always end in three zeros), we can simply add it to the offset 0x54 and we will have our physical address.
Are there any commands that work with physical addresses?
We will also need a way to work with these physical addresses. Most of us are familiar with using the dd command with virtual addresses. There is also a command that accepts a physical address instead of a virtual address: !dd (notice the ! before the command). There are also variants of the !dd command, such as !db , !du , !dq.
What data do we need to obtain to perform this conversion?
We need to determine how to use the f9a10 portion of the virtual address to find the physical page base. The Memory Management System sets up and maintains a hierarchy of tables that keep track of the mappings between virtual addresses and physical addresses. You will need the following information to traverse the tables yourself and convert this address.
1. Starting point. This will be in the form of a pointer to the base of the first table that you need to check
2. The number of table levels in use on the system and the size of each entry in the tables on our system
3. The offsets into the different tables and the bases of each table.
Once you have all the above information, you can use the debugger to traverse these tables.
Let’s get our plan together for how we are going to obtain these three pieces of information
1. Starting point. This can be obtained from the PDBR. The CR3 register is known as the Page Directory Base Register (PDBR) which points to the physical address of the base of the first table.
2. The number of table levels in use on the system and the size of each entry in the tables on our system. This information can also be obtained from the Intel processor manuals. As previously mentioned, the number of tables, the size of the entries in the tables, the flags, where the bits are split up, and the names of the tables vary according to platform and if you are using PAE or not. All of these, however, use the same basic concepts:
Register pointing to a àTable with entries pointing to a table à with more entries pointing to à another tableà pointing to the physical address of the page in memory.
Table1: By reading the Intel Processor Manual “System Programming: Section 3.8, Physical Address Extension”, I was able to determine that the first table used for x86 PAE virtual addressing is called the Page-Directory-Pointer Table. It’s a table with only 4 Entries that are 64 bits (which is 8 bytes) wide. Each entry is referred to as a Page Directory Pointer Entry and abbreviated as PPE, PDP or PDPE depending on the source. These entries provide the index into the page directory, which is known as the Page Directory Index (PDI). One of these four pointers will lead you to the physical address of the base of the next table that you need to visit in the x86 PAE hierarchy. Just like the pointer that we used from the CR3 register, some of the bits in these table entries are not used as part of the index (referred to as a pointer in some documentation). We will grab the relevant bits and add the appropriate number of zeros to the index to obtain our physical address pointer. I will cover what these other bits are used for in part two of this blog. We must substitute zeros for these bits.
Table2: The table at the second level of tables in the x86 PAE hierarchy (which is referenced by the pointers in Table Level One) is called the Page Directory Table. Don’t confuse this with the Page-Directory-Pointer Table. Each Page Directory table can hold 512 entries which are 64 bits in size. The entries in this table are called Page Directory Entries (PDE) and they provide Page Table Index (PTI). Just like the last table, these entries contain indexes (which we convert to a pointer by simply adding zeros) to the base of the next table in the hierarchy.
Table3: The last table is referred to as the Page Table. Each Entry in the page table is called a Page Table Entry (PTE) and provides the Page Offset. Just like the last table, not all the bits are used for pointers. Each page table contains up to 512 entries which are also 64 bits in size. Each 64 bits entry in the table contains a pointer to the base of the page in physical memory.
In summary there are 3 levels of tables when using x86 PAE. These are the Page Directory Pointer Table, Page Directory Tables, and the Page Tables.
1. The offsets into the different tables and the bases of each table.
Each table above is a listing of indexes that will be used to locate the base of the next table. However, once we arrive at each table, we will need to know the index or offset into the table in order to know which table to get to next. These offsets into the tables can be obtained from the virtual address itself. Let’s review our virtual address again. However, this time we will break the address down in binary:
Virtual Address: f9a10054
1: kd> .formats 0xf9a10054
Binary: 11111001 10100001 00000000 01010100
Page Directory Pointer Index(PDPI) 11 Index into 1st table(Page Directory Pointer Table)
Page Directory Index(PDI) 111001 101 Index into 2nd table(Page Directory Table)
Page Table Index(PTI) 00001 0000 Index into 3rd table(Page Table)
Byte Index 0000 01010100 0x054, the offset into the physical memory page
So as you can see, the virtual address is nothing more than a bunch of indexes/offsets.
Putting all this data together to find the physical address
Now that we have all the required data, let’s proceed to locate our physical address
1. Obtain our base pointer to the first table. As we discussed earlier, we can obtain this value from the cr3 register. Bits 0-4 of this register are not used for the pointer to the table base. This means that in order to get the base pointer, we will need to replace these 5 least-significant bits with zero. This will result in the table base being located on a physical address that is always aligned to a 32-bit boundary.
Keep in mind that CR3 will have a different value here for each process. You must make sure that you are in the appropriate process context before proceeding. This is because user mode tables are specific to a particular process. Notice that I said processes, not threads. CR3 will not be changed when swapping threads, since each thread in a given process shares the same address space. Tables relating the system address space (kernel mode) are shared between all processes.
We’ll need to dump out the CR3 register (PDBR) in a format where we can view the last 5 bits. As you can see in the .formats output below, the 5 least significant bits are already set to zero. This means that the hexadecimal value located in the cr3 register is a pointer to the base of the first table. So we now have our starting point. Physical address 023406e0 is the base of the first table.
1: kd> .formats @cr3
Hex: 023406e0
Binary: 00000010 00110100 00000110 11100000
The proper way to display the pointer out the value would be to & against the following mask
1: kd> ? 0y11111111111111111111111111100000
Evaluate expression: -32 = ffffffe0
1: kd> .formats (@cr3 & ffffffe0)
You can use the!process command to get and/or verify this value.
1: kd> !process
PROCESS ff981a58 SessionId: 0 Cid: 0d54 Peb: 7ffde000 ParentCid: 0550
DirBase: 023406e0 ObjectTable: e1541510 HandleCount: 30.
You can also obtain this information from the EPROCESS structure.
1: kd> dt nt!_EPROCESS ff981a58 Pcb.DirectoryTableBase
+0x000 Pcb :
+0x018 DirectoryTableBase : [2] 0x23406e0
2. Obtain our base to the second table by finding our index into the first table. The first table (Page Directory Pointer Table) only has 4 entries and each entry is 64 bits wide. We know from the first two bits of the Virtual Address above that our offset into this table is 0y11 (The y tells the debugger the value is binary, instead of eleven). Eleven would be represented in the debugger as 0n11. We can simply multiply the offset (0y11) by the size of each entry and add the result to the base of the table to get our entry. The entries in the table are 8 bytes wide. We shall use sizeof() as shown below to obtain this value. We can pass this math to the !dq command to dump the data at these physical addresses.
1: kd> !dq (@cr3 & 0xffffffe0)+(0y11*@@(sizeof(nt!_MMPTE))) L1
# 23406f8 00000000`05503801
The processor manuals indicate that the first 12 bits of the entry above are not part of the pointer and must be discarded. This will cause alignment on 4KB boundaries. Since each Hex digit above represents 4 bits, this means that we need to change 801 to 000. That gives us our physical address of the base of our second table, 05503000. We will accomplish this in the next step by ANDing our PDE against a mask.
3. Obtain our base to the third table by finding our index into the second table. The Second Table (Page Directory Table) works the same way, except that this table contains up to 512 entries (on PAE systems). Keep in mind that there can be more than one Page Directory Tables for each process on a PAE system, however we are only concerned with the one that contains the data relating to our virtual address. We know from the virtual address that the offset into this table is 0y111001101. Calculate the address in the same manner as before. We will also need to set the last 12 bits to zero, just like before.
1: kd> !dq (0x05503801 & 0xFFFFFF000) +( 0y111001101*@@(sizeof(nt! _MMPTE))) L1
# 5503e68 00000000`0102d963
The last three hexadecimal digits must be changed to zero, since they are not part of the pointer. This gives us the base address of our last table, the page table, 0x0102d000
4. Find the base of the physical page our memory resides in by finding our index into the third table. Using the base of the page table from the previous step, let’s add the index into this table that we obtained from the virtual address, 0y000010000. The last three digits need to be set to zero.
1: kd> !dq (0x102d963&0xFFFFFF000)+(0y000010000*@@(sizeof(nt! _MMPTE))) L1
# 102d080 00000000`02010121
5. So now we have the physical address of the base of the page, 2010000 Add the base of the page to our offset from the virtual address, 0x054.
1: kd> ? (2010121 & 0xFFFFFF000) +0x054
Evaluate expression: 33620052 = 02010054
A shortcut is to simply change the 0x121 to 0x054 in the previous step.
6. Now, let’s dump out the data at our physical address and the virtual address
1: kd> !dd (2010121 & 0xFFFFFF000) +0x054 L2
# 2010054 804ef09c 804ef12c
1: kd> dd 0xf9a10054 L2
f9a10054 804ef09c 804ef12c
You can see that the data displayed by the two commands is the same.
7. Now let’s dump out the virtual address using the !PTE extension. Notice the values that it provides you with. Look above and compare the values displayed above to what is displayed below. You should now understand what the highlighted and bolded fields mean.
1: kd> !pte 0xf9a10054
VA f9a10054
PDE at 00000000C0603E68 PTE at 00000000C07CD080
contains 000000000102D963 contains 0000000002010121
pfn 102d -G-DA--KWEV pfn 2010 -G--A—KREV
The virtual addresses in italics represent the virtual address of the PDE (Page Directory Entry) and the PTE (Page Table Entry). Also, please note that PFN represents Page Frame Number. PFN is the term used to describe what I referred to as “pointers to the base of the next table” in the hierarchy. This is because it really isn’t a pointer; it’s an index into the table.
Hopefully, the output of !PTE makes a lot of sense to you now. In part two of this blog, I’ll discuss what the PDE/PTE flags (-G-DA—KWEV) represent and provide an example of manual conversion of x86 PAE Large Page Virtual Addresses to Physical.
In our previous articles we discussed identifying a pool leak with perfmon, and narrowing the source of the leak with poolmon. These tools are often preferred because they are easy to use, provide verbose information, and can be run on a system without forcing downtime. However, it is not always possible to get perfmon and poolmon data. If a system is experiencing poor performance you may have a business need to get the system up and running as quickly as possible without allowing time to troubleshoot. It is also possible to completely exhaust memory through a pool leak, leaving the system in a state where tools such as perfmon and poolmon will not work. In these scenarios it may be possible to troubleshoot the poor performance by forcing a bugcheck, gathering a memory dump, and performing a post mortem analysis.
Although a dump is not the ideal data to troubleshoot a leak, it can be done. I say less than ideal because a dump is a snapshot of the system memory, and does not provide the historical data which perfmon would provide. The lack of historical data makes it difficult to differentiate between high memory usage due to load and high memory usage due to a leak. It is up to you, as the troubleshooter, to determine if the dump is sufficient evidence of a leak. Sometimes identifying the tag and the driver will help you identify a known issue that causes a leak, or your knowledge of the driver architecture may allow you to determine if the memory usage is normal or not. In some scenarios you may decide to start monitoring with perfmon and collect additional data for a future occurrence.
The first step to debug a pool leak using a dump is to load the dump in windbg, set the symbol path, and reload symbols.
1: kd> .symfix c:\symbols
1: kd> .reload
Loading Kernel Symbols
...............................................................
................................................................
.....
The !vm command will show memory utilization, the 1 flag will limit the verbosity of this command. For the scenario of a pool leak, the significant values are “NonPagedPool Usage:” and “PagedPool Usage:”. If the debugger identifies a value that is out of the normal range it will flag it, and we can see here that the debugger has flagged excessive nonpaged pool usage. This is similar to the information we obtained in Part 1 using perfmon, but unlike perfmon we do not have trend data to indicate if this is temporary high pool usage due to load or if this is a leak.
1: kd> !vm 1
*** Virtual Memory Usage ***
Physical Memory: 403854 ( 1615416 Kb)
Page File: \??\C:\pagefile.sys
Current: 1048576 Kb Free Space: 1015644 Kb
Minimum: 1048576 Kb Maximum: 4194304 Kb
Available Pages: 106778 ( 427112 Kb)
ResAvail Pages: 225678 ( 902712 Kb)
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33533355 ( 134133420 Kb)
Modified Pages: 4844 ( 19376 Kb)
Modified PF Pages: 4838 ( 19352 Kb)
NonPagedPool Usage: 155371 ( 621484 Kb)
NonPagedPool Max: 191078 ( 764312 Kb)
********** Excessive NonPaged Pool Usage *****
PagedPool 0 Usage: 27618 ( 110472 Kb)
PagedPool 1 Usage: 3848 ( 15392 Kb)
PagedPool 2 Usage: 299 ( 1196 Kb)
PagedPool 3 Usage: 283 ( 1132 Kb)
PagedPool 4 Usage: 344 ( 1376 Kb)
PagedPool Usage: 32392 ( 129568 Kb)
PagedPool Maximum: 33554432 ( 134217728 Kb)
Session Commit: 7764 ( 31056 Kb)
Shared Commit: 6371 ( 25484 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 5471 ( 21884 Kb)
PagedPool Commit: 32394 ( 129576 Kb)
Driver Commit: 2458 ( 9832 Kb)
Committed pages: 326464 ( 1305856 Kb)
Commit limit: 665998 ( 2663992 Kb)
The debugger can parse the pool tagging database and present similar information as poolmon. The !poolused command will do this, the /t5 option will limit output to the top 5 consumers, the 2 flag will sort by nonpaged pool usage (use the 4 flag if your leak is in paged pool).
1: kd> !poolused /t5 2
..
Sorting by NonPaged Pool Consumed
NonPaged Paged
Tag Allocs Used Allocs Used
Leak 601 615424000 0 0 UNKNOWN pooltag 'Leak', please update pooltag.txt
Pool 6 1717840 0 0 Pool tables, etc.
nVsC 664 1531552 0 0 UNKNOWN pooltag 'nVsC', please update pooltag.txt
netv 4369 1172224 1 144 UNKNOWN pooltag 'netv', please update pooltag.txt
Thre 607 774048 0 0 Thread objects , Binary: nt!ps
TOTAL 43424 634209952 63565 126487760
The above output shows that the tag “Leak” is associated with almost all of the nonpaged pool usage. This is the same information we obtained in Part 2 using poolmon.
Now we must identify what drivers use the pool tag “Leak”. Because we have a snapshot of the system memory we can search the dump for this tag. We can match each address to a module using the command lm a.
1: kd> !for_each_module s -a @#Base @#End "Leak"
fffff880`044b63aa 4c 65 61 6b 3b c1 0f 42-c1 41 8d 49 fd 8b d0 ff Leak;..B.A.I....
fffff880`044b6621 4c 65 61 6b 3b c1 0f 42-c1 33 c9 8b d0 ff 15 cc Leak;..B.3......
1: kd> lm a fffff880`044b63aa
fffff880`044b5000 fffff880`044bc000 myfault (no symbols)
1: kd> lm a fffff880`044b6621
The tag and driver name can be used to search the internet for known problems. If a known issue is found a driver update may be available, and installing this update may prevent a future memory leak.
If there are no updates available for the driver, or if this is your driver and you need to identify the cause of the leak, don’t panic. In future articles we will show techniques for getting call stacks of pool allocations, these call stacks be used to identify under what conditions the driver leaks memory.
Trey Nash here again, and I would like to discuss a scenario we are all too familiar with. You’ve worked your tail off for the past year, and for the past couple of months you even worked evenings and weekends. Management rewarded your team with two weeks off in appreciation of your efforts. But now that you’re back in the office, you’re hearing rumors percolating up from your tech support department that there are some cases where your application is crashing in the field for mysterious reasons. What do you do?
The application happens to have been built with an unhandled exception filter registered via the AppDomain.UnhandledException event. Therefore, you at least know that the application is failing with an InvalidCastException, but you cannot imagine why this is happening.
Wouldn’t it be nice if you could live debug on the affected system? Unless you work on-site for your customer or your software is on a laptop and your customer is willing to send it to you, then I doubt you will get this opportunity. What you need is a tool to capture the state of your application while it is failing. Then the customer could capture this information and send it to you.
Enter ADPlus. ADPlus is a free tool in the Debugging Tools for Windows package that scripts the CDB debugger allowing you to capture dumps for one or multiple processes on a system. It also offers the following advantages:
· ADPlus can monitor desktop applications, services, etc.
· ADPlus can monitor multiple processes on the system. When it collects a dump of those processes, it freezes and dumps them simultaneously. This is essential for tracking down problems with inter-process communications.
· ADPlus supports xcopy deployment meaning the customer does not need to install anything via Windows Installer, etc. This minimizes configuration changes on the machine, and that is music to customers’ ears.
Note: Although ADPlus is xcopy installable, you still have to install the Debugging Tools for Windows package via Windows Installer as that is the only way Microsoft distributes it. However, once you have installed Debugging Tools for Windows once, you can xcopy deploy ADPlus or the entire Debugging Tools for Windows package to another machine. In fact, during development, I find it extremely handy to check in the development tools into the source repository. Debugging Tools for Windows supports this by virtue of the fact that it is xcopy installable.For those of you that are savvy with Windows Installer, you can perform an admin install using the msi for Debugging Tools for Windows and that will allow you to extract the files without actually installing the package on the machine, for example:msiexec /a dbg_x86_6.11.1.404.msi
With all of that said, let’s see how ADPlus can help you diagnose problems with .NET applications.
In the rest of this post, I will reference the C# 3.0 sample application that I have put together to illustrate using ADPlus to capture an unhandled exception in a .NET application. The code is listed below:
using System;
using System.Linq;
using System.Runtime.Serialization;
class A
public void SaySomething() {
Console.WriteLine( "Yeah, Peter...." );
throw new BadDesignException();
class B : A
class C
class EntryPoint
static void Main() {
DoSomething();
static void DoSomething() {
Func<int, object> generatorFunc = (x) => {
if( x == 7 ) {
return new C();
} else {
return new B();
};
var collection = from i in Enumerable.Range( 0, 10 )
select generatorFunc(i);
// Let's iterate over each of the items in the collection
//
// ASSUMING THEY ARE ALL DERIVED FROM A !!!!
foreach( var item in collection ) {
A a = (A) item;
try {
a.SaySomething();
catch( BadDesignException ) {
// Swallow these here. The programmer chose to
// use exceptions for normal control flow which
// is *very* poor design.
public class BadDesignException : Exception
public BadDesignException() { }
public BadDesignException( string msg ) : base( msg ) { }
public BadDesignException( string msg, Exception x ) : base( msg, x ) { }
protected BadDesignException( SerializationInfo si,
StreamingContext ctx )
:base( si, ctx ) { }
You can compile this sample code easily by putting it in a file, such as test.cs, and then from either a Visual Studio Command Prompt or from a Windows SDK Command Shell, execute the following:
csc /debug+ test.cs
Note: The code is contrived and there are many bad things about this code from a design/coding standpoint, but that is intentional for the sake of illustration. For example, one may want to re-think introducing a collection that contains references to System.Object.The code above also uses features that are new to C# 3.0 (for brevity) including lambda expressions and LINQ. If you would like to become more familiar with them, visit the C# website on MSDN or reference one of the excellent books on C# such as Pro LINQ or Accelerated C# 2008.
The notable section of the code to focus on is the foreach loop within the EntryPoint.Main() method. In that foreach loop, we are iterating over a collection of objects where we are assuming that they all derive from type A. Thus the code attempts to cast all instances to a reference of type A and since I have intentionally put an instance of type C within that collection, it will fail at some point with an exception of type System.InvalidCastException.
Using ADPlus in crash mode, we can capture the exception in the customer’s environment. To illustrate this in action, let’s launch ADPlus in crash mode and monitor the test.exe example application built above by using the following command line:
adplus -crash -o c:\temp\test -FullOnFirst -sc c:\temp\test\test.exe
Note: I built and tested this code in a directory named c:\temp\test on my machine as I wrote this. Therefore, you will see references to it throughout this post. Incidentally, I have found ADPlus is a lot easier to use if you feed it fully qualified paths. If you get mysterious errors or behavior, and you are using relative paths with ADPlus, try fully qualifying the paths before you beat your head against the wall for too long trying to figure out what could be going wrong.
ADPlus is easier to launch if you add the directory where the Debugging Tools for Windows was installed to your PATH environment variable. The command line above assumes this has been done. Now, let me explain the command line options that I used above. I highly recommend that you become familiar with all of the ADPlus command line options.
· -crash launches ADPlus in crash mode. This is the mode you want to use if your application is failing because of an unhandled exception.
· -o c:\temp\test tells ADPlus that I want the output to be placed in c:\temp\test, which is the directory in which I built the test.exe application.
· -FullOnFirst is very important for managed applications.. This tells ADPlus to grab a full process dump on first chance exceptions. It is essential that you capture full dumps for managed applications, otherwise, all of the necessary data regarding the execution engine and the managed heap will be absent from the dump making it impossible to debug effectively.
· -sc c:\temp\test.exe is only one of the ways you can point ADPlus to an application to monitor. In this case, we’re instructing ADPlus to tell the debugger to explicitly launch the application. Had it been a service, or if the application you want to monitor is already running, we would probably have used –p to attach to its PID or –pn to attach to the process by name. Notice that I provided the full path to the application.
After you launch ADPlus, you are presented with the following dialog unless you use the –quiet option.
Once the application is finished executing, go to the directory that you specified in the ADPlus –o command line option and you should see a subdirectory named similarly to what you see in the previous dialog snapshot. For the instance of test.exe I just executed, that directory is named Crash_Mode__Date_05-11-2009__Time_21-12-54PM. Under that directory, there are quite a few files that I have listed below:
C:\temp\test\Crash_Mode__Date_05-11-2009__Time_21-12-54PM>dir /bADPlus_report.txtCDBScriptsPID-0__Spawned0__1st_chance_CPlusPlusEH__full_15ac_2009-05-11_21-13-01-332_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-55-482_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-56-527_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-57-370_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-103_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-867_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-59-663_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-00-505_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-02-689_0eac.dmpPID-0__Spawned0__1st_chance_Process_Shut_Down__full_15ac_2009-05-11_21-13-27-743_0eac.dmpPID-0__Spawned0__2nd_chance_NET_CLR__full_15ac_2009-05-11_21-13-04-140_0eac.dmpPID-0__Spawned0__Date_05-11-2009__Time_21-12-54PM.logProcess_List.txt
For each first chance exception, a dump file (.dmp) has been collected. Notice that the dump files can tend to be very large because of the –FullOnFirst option. You can subsequently load these dump files into either Windbg (or its variants) or the Visual Studio Debugger. Personally, I prefer Windbg because I can then load the SOS extension along with the SOSEX extension and dig into the state of the application and the CLR.
Using the ADPlus default configuration as we have above, you can see that ADPlus generated dumps for one first-chance C++ exception, eight first-chance CLR exceptions, one second-chance CLR exception and one dump collected during process shut down.
If you look in your Debugging Tools for Windows directory, you’ll notice that ADPlus is really just a very complex VBScript. It generates quite a bit of useful information along with any problem dump files. ADPlus_report.txt reports the configuration for ADPlus for this run. This is handy if you need to know what it will do for a specific type of exception.
Process_list.txt is generated from executing tlist.exe, another tool that comes with Debugging Tools for Windows.
And finally, the CDBScripts subdirectory contains a .cfg file which is the exact debugger script ADPlus generated and subsequently fed to CDB to get the job done. On my machine, when running against the sample application using the command line from the previous section, this file is named PID‑0__Spawned0.cfg. If you’re ever curious or need to know exactly what ADPlus instructed the debugger to do, this file is the source.
Note: The reason the name contains a zero PID is because we used the –sc option to launch the application. Had we used the –p or –pn option, the PID in the filename would not be zero.
I don’t recommend executing this debugger script in a live debugger as doing so could overwrite the data that you already have collected. Instead, if you need to couple ADPlus with live debugging, you should use the –gs option which I will describe shortly.
You’ll notice that in the previous run of ADPlus, quite a few dump files were generated and each one was fairly large. In reality, there are times where a first chance exception does not always indicate a fatal situation. In Accelerated C# 2008, I go into detail regarding how it is poor design practice to implement any sort of control flow or otherwise reasonably expected behavior using exceptions. This is because exceptions are supposed to be truly exceptional events! For more juicy details on how expensive exceptions are, I invite you to read an excellent blog post by Tess Ferrandez on the topic. At any rate, you may encounter situations where you get a lot of dumps for first chance exceptions you are not interested in as this example shows..
To alleviate this situation, you can create an ADPlus configuration file coupled with the !StopOnException command provided by the SOS extension to instruct ADPlus to filter out only the exceptions you’re interested in. To do this, I created a configuration file named filter.config with the following contents:
<ADPlus>
<!-- Configuring ADPlus to log only exceptions we're interested in -->
<Exceptions>
<Config>
<!-- This is for the CLR exception -->
<Code> clr </Code>
<Actions1> Log </Actions1>
<CustomActions1> .loadby sos mscorwks; !StopOnException System.InvalidCastException 1; j ($t1 = 1) '.dump /ma /u c:\dumps\InvalidCastException.dmp; gn' ; 'gn' </CustomActions1>
<ReturnAction1> VOID </ReturnAction1>
<Actions2> Log </Actions2>
</Config>
</Exceptions>
</ADPlus>
Note: The <CustomActions1> element above is meant to be in one line in the config file.
The <CustomActions1> element is the element of interest in this configuration file. This element allows you to specify which commands the debugger should execute on first chance exceptions. Within this element, you can put any valid debugger commands (except windbg GUI-related commands). If you need to execute multiple commands, as I have above, you simply separate them with semicolons. In the <CustomActions1> element shown above, I first load the SOS extension using the .loadby command. Then, I use the !StopOnException command in predicate mode to set the $t1 pseudo-register to 1 if the exception is of type System.InvalidCastException and 0 otherwise. Then, the following j command is used to create a full dump if $t1 is 1 and do nothing otherwise. The gn command in both paths of the j command tells the debugger to go unhandled so that the exception is propagated up rather than swallowed by the debugger. If you were to go handled, thus swallowing the exception, the exception handlers in the code would never see it and the debugger would alter the behavior of the application. And finally, note that the .dump command used to create the dump indicates the path where the dump will be stored. In this case, I am placing it in the c:\dumps directory on my machine.
Now you can provide this configuration file to ADPlus using the following command:
adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -sc c:\temp\test\test.exe
Notice that there are fewer dumps collected. Along with the dump for the InvalidCastException, it also captures a C++ exception as well as when the application shuts down. If you open the C++ exception dump in the debugger and inspect the stack, it shows that the CLR is in the process of generating the InvalidCastException. The CLR catches the C++ exception and converts it into a managed exception. The C++ exception dump was generated because I left the –FullOnFirst option in the previous command line. You can eliminate the C++ exception dump by removing -FullOnFirst.
I already hinted at the –gs ADPlus command line option earlier in this post. This option allows you to create the debugger scripts ADPlus creates without actually running them in the debugger. For example, if you execute the following command (I executed mine from my c:\temp\test directory):
adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -gs livedebug
You will notice that ADPlus does not actually launch any debugger or your application. Rather, it creates a subdirectory named livedebug. When you go into that directory, you’ll notice it looks similar in layout to the crash directories created in the previous demonstration. On my machine, I end up with the following two files:
C:\temp\test\livedebug\ADPlus_report.txtC:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg
The PID-0__livedebug.cfg file is actually a debugger script file containing debugger commands. All we have to do now is launch our test application in the debugger and then execute this script. Within my c:\temp\test directory, I can launch the debugger using the following command:
windbg test.exe
Once inside the debugger, I can invoke the ADPlus debugger script by executing the following $$< command:
$$<C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg
Once you execute the $$< command, you will notice that the script takes over and performs the same actions as ADPlus does when run conventionally from the command line.
As a further experiment, edit the filter.config file and remove the gn commands. Now the debugger will wait for user input after executing the custom commands rather than continuing execution of the application. This could be handy if you want the opportunity to perform debugging by hand if ADPlus encounters a certain situation.
Throughout this post, I have been focusing on first chance exceptions for the sake of illustration. However, many times, you are only interested in the truly unhandled exceptions. In that case, you would want to capture only second chance exceptions. You certainly would want to capture first chance exceptions if you needed to capture the state at the exact point the exception was thrown and before the OS searches for any handlers. Additionally, if you suspect that the application you are debugging may be handling an exception inappropriately (maybe even blindly swallowing it), then you certainly want to catch first chance exceptions in that case.
In this blog post I have introduced you to ADPlus and the utility it provides when troubleshooting problems in the field. ADPlus lends itself well to capturing exceptional situations in the field since it requires no configuration changes on the affected machine thus making it an easy pill for your customers to swallow. You may also find this very useful when working with your Quality Assurance team during the development phase. For example, how many times have you encountered a situation where a problem only presents itself on a dusty old rarely-used machine in the back corner of a random QA engineer’s office? How many times, in such situations, have you felt like the only way to trouble shoot the problem effectively is to install the Visual Studio debugger and start working on that machine? Furthermore, what if the problem only happens on that dusty old machine about once a week. ADPlus can help you avoid that madness by providing an easy mechanism for capturing full process dumps on the troubled machine so you can then take those dumps to your trusty development machine for further debugging and analysis.
Have fun out there!