Hello - Rob here with the GES team, and I have this nugget to pass on to you. I recently worked an issue where a Windows server rebooted intermittently for no apparent reason. The Windows System Event log did not yield any clues other than this Event ID 6008:
Log Name: System.evt
Date: 25-8-2008 19:06:58
Event ID: 6008
Task Category: None
Description: The previous system shutdown at 6:54:04 PM on 8/25/2008 was unexpected.
There were no other symptoms or patterns to which the unexpected shutdown could be related; the shutdown could occur at any time of day. Eventually we attached a debugger to see if we could catch anything, but this wasn't successful. Next, we looked at the manufacturer's mechanism for logging errors and found this piece of information:
An Unrecoverable System Error has occurred (Error code 0x0000002D, 0x00000000)
Note - each vendor has its own way of handling error codes. We noticed a one-to-one relationship between the vendor error above and the Event ID 6008 messages in the Windows System Event log, so we engaged the hardware vendor, who determined that this code indicated an error on the PCI bus. They also informed us that this kind of error asserts an NMI on the bus.
To narrow down which component was causing the error, we set the NMICrashDump DWORD value in the registry, under the key described in the following Knowledge Base article:
This is described in detail in the article, “927069 How to generate a complete crash dump file or a kernel crash dump file by using an NMI on a Windows-based system”
This registry value causes the machine to bugcheck with a STOP 0x80 (NMI_HARDWARE_FAILURE) when Windows detects an NMI, producing a dump file; or, if a debugger is attached, it breaks into the debugger.
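The article walks through creating the value; the end state looks like the .reg sketch below (the CrashControl path is taken from KB927069 - confirm against the article for the exact guidance for your OS version):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
"NMICrashDump"=dword:00000001
```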
After setting this registry value we hooked up the debugger again and waited... after a while we got lucky, because the debugger intercepted a STOP 0x80!
At that time, I ran “!pci 0x102 ff” to get an overview of the various PCI devices and their respective states. The !pci command showed the following output (VendorID and DeviceID have been removed):
PCI Configuration Space (Segment:0000 Bus:00 Device:1e Function:00)
00: VendorID <vendor>
02: DeviceID <device>
04: Command 0147 IOSpaceEn MemSpaceEn BusInitiate PERREn SERREn
06: Status 4010 CapList SERR
08: RevisionID d9
09: ProgIF 01 Subtractive
0a: SubClass 04 PCI-PCI Bridge
0b: BaseClass 06 Bridge Device
0c: CacheLineSize 0000
0d: LatencyTimer 00
0e: HeaderType 01
0f: BIST 00
10: BAR0 00000000
14: BAR1 00000000
18: PriBusNum 00
19: SecBusNum 01
1a: SubBusNum 01
1b: SecLatencyTmr 20
1c: IOBase 20
1d: IOLimit 30
1e: SecStatus 6280 FB2BCapable InitiatorAbort SERR DEVSELTiming:1
20: MemBase f7e0
22: MemLimit f7f0
24: PrefMemBase d801
26: PrefMemLimit dff1
28: PrefBaseHi 00000000
2c: PrefLimitHi 00000000
30: IOBaseHi 0000
32: IOLimitHi 0000
34: CapPtr 50
38: ROMBAR 00000000
3c: IntLine ff
3d: IntPin 00
3e: BridgeCtrl 000b PERRREnable SERREnable VGAEnable
We couldn't have gone much further without the vendor's assistance. They informed us that the Status register shows SERR, indicating that a PCI System Error had occurred in this PCI-PCI Bridge. At this point I had enough conclusive data to pass my findings to the hardware vendor for full collaboration on the problem. They continued investigating the issue.
It should be noted that a hardware problem is not the only cause of an Event ID 6008. A quick search in the Microsoft Knowledge Base illustrates other things that can cause this Event ID to appear in the Windows System log.
Windows NT Debugging Blog Live Chat
Microsoft Platform Global Escalation Services is hosting our second live group debug chat session for the debugging community on March 17, 2009 at 11 AM Pacific Time. We will be focusing on debugging techniques and any questions you may have about anything we’ve previously blogged about.
Details about the “PGES-Windows NT Debugging Blog Live Chat” can be found here: http://www.microsoft.com/communities/chats/default.mspx
We’re posting the answers to Debug Fundamentals #4 in this blog. Additionally, we posted all of your answers that trickled into the original blog. We deferred posting your answers to avoid spoiling it for the others, but you should find them up there now. Stay tuned for Debug Fundamentals #5!
Part One – Debugging
1. What caused the access violation?
Code tried to execute with an invalid EIP address.
0:000> !address @eip
ProcessParameters 00280f20 in range 00280000 00284000
Environment 00280808 in range 00280000 00284000
00ac5000 : 00ac5000 - 7513b000
Protect 00000001 PAGE_NOACCESS
State 00010000 MEM_FREE
2. Examine the registers at the time of the crash. Is there anything interesting about the contents?
There are ASCII values in 3 of the registers.
0:000> .formats @eax
Chars: SEGP (little endian would be PGES)
0:000> .formats @eip
0:000> .formats @ebp
3. How did the registers get into this state?
The saved EIP and EBP values on the stack were corrupt and the boom() function executed a leave and ret instruction, loading the corrupt values into the registers.
4. Find the offending data structure which caused this state to occur. What are its contents?
“GAMEOVER Happy debugging courtesy of Platforms GES!” backwards
0:000> dc @esp-38 l34/4
001bfdc8 45532180 6d732047 74666f72 20506c61 .!SEG smroftalP
001bfdd8 79206f66 72746573 20636f75 67696e67 fo ysetruoc gnig
001bfde8 65627567 70792064 20486170 4f564552 gubed yppaH REVO
001bfdf8 47414d45 EMAG
5. Are there security concerns with this access violation? Why?
A malicious user could potentially manipulate EIP to execute arbitrary code and compromise the system.
6. Why is this class of crashes not seen in the wild much anymore?
This exercise was explicitly compiled with the /GS- flag. The Microsoft compiler appropriately enables buffer overrun security checks by default using the /GS compiler switch. This places a randomized “cookie” value on the stack during the function prolog, right before the saved EIP and EBP. During the function epilog this value is checked before the ret instruction executes; if it is not intact, the program is terminated.
More info : http://msdn.microsoft.com/en-us/library/8dbf701c.aspx
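As a rough illustration of the mechanism (a hand-rolled sketch, not the compiler's actual code: the real cookie is per-module, randomized at load time, and its failure path reports the corruption rather than returning), the prolog/epilog check behaves like this:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the compiler's randomized per-module cookie value. */
static const uint32_t security_cookie = 0xBB40E64Eu;

/* Models the stack layout /GS protects: a local buffer with the
   cookie sitting between it and the saved EBP/EIP. */
struct frame {
    char     buf[8];
    uint32_t cookie;
};

/* Copies src with no bounds check (like the exercise built with /GS-),
   then performs the epilog check. Returns 1 if the cookie survived,
   0 if the copy ran past buf and clobbered it. */
int checked_copy(const char *src)
{
    struct frame f;
    char *dst = (char *)&f;             /* byte view; buf is at offset 0 */

    f.cookie = security_cookie;         /* prolog: place the cookie      */
    while (*src)
        *dst++ = *src++;                /* unchecked strcpy-style copy   */
    *dst = '\0';

    return f.cookie == security_cookie; /* epilog: verify before ret     */
}
```

A short input leaves the cookie intact; anything longer than eight characters overruns buf into the cookie and the check fails, which is exactly the condition that makes /GS terminate the process.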
Part Two – Reverse Engineering
Examine the functions main(), snap(), crack(), pop(), and boom().
1. Describe with a sentence or two what each one is doing. (There is no need to comment on every assembly instruction or re-write the code in C here unless you really feel you need to.)
main() – Allocates a buffer on the stack and makes a call to snap() passing it the buffer.
// Allocate a buffer on the stack
char buf[sizeof(string) + 13];
return snap ((char *)buf);
snap() – Does a strcpy() like operation from a global buffer into the buffer parameter and passes it to crack().
// Copy static string to stack
snap(char * buf)
for (int i = 0; string[i]; i++)
buf[i] = string[i];
return crack (buf);
crack() – Does a string reverse operation on the buffer parameter then counts the number of characters before finding a special delimiting character (0xC3) and passes the buffer and character count to pop().
// Reverse the string
crack(char * buf)
char * s = buf;
char * p = s;
while (*p) p++;
p--; // step back from the null terminator to the last character
while (p > s)
char t = *s;
*s++ = *p;
*p-- = t;
// strlen of the secret
char * len = buf;
unsigned i = 0;
while (*len != '\xc3')
len++, i++;
return pop (buf, i);
pop() – Does a memmove() like operation removing the character count parameter from the start of the string buffer and passes it to boom().
// Do a memory move on the string, hiding the secret
int pop(char * buf, unsigned size)
volatile char * a = buf;
char * b = buf + size;
while (*b)
*a++ = *b++;
*a = '\0';
return boom (buf);
boom() – Allocates a stack buffer and does a strcpy() operation from the parameter into it. Takes each char of the new stack buffer and swaps its nibbles, so that 0y1101 1001 becomes 0y1001 1101. Then returns ‘SEGP’, or PGES read as a little-endian DWORD.
// Copy string to our local buf, decrypt and boom!!!
boom(char * buf)
char new_buf[sizeof(string) - 0x14];
register unsigned int i;
for (i = 0; buf[i]; i++)
new_buf[i+1] = buf[i];
// Swap the nibbles of each character
for (i = 0; new_buf[i]; i++)
new_buf[i] = new_buf[i] << 4 | new_buf[i] >> 4;
1. If this access violation occurred while the program was running without a debugger present, would there be anything different about the crashing register state or the data structure which caused it? If so what is it and why?
When a debugger is present, an unhandled exception is immediately trapped and the pristine state of the process can be examined. When a debugger is not present, an unhandled exception filter is executed in the context of the original exception before the JIT debug handling takes effect and a debugger is attached. Since the offending corruption was in stack memory that had been popped off the stack before the access violation occurred, the unhandled exception filter reused and overwrote the offending stack memory.
More Info on debugging and unhandled exception filters:
Debugging a custom unhandled exception filter
Don’t perform complicated tasks in your unhandled exception filter
2. The functions from “Part Two – Reverse Engineering” perform operations on a set of data. During this manipulation some data is lost. Find this data and decode its “secret” message.
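As a hint for working on this, note that the nibble swap boom() performs is an involution: applying it twice returns the original byte, so the same operation both “encrypts” and decodes. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Swap the high and low nibbles of a byte, e.g. 0y1101 1001 -> 0y1001 1101
   (0xd9 -> 0x9d). Applying it twice yields the original value. */
uint8_t swap_nibbles(uint8_t b)
{
    return (uint8_t)((b << 4) | (b >> 4));
}
```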
Today we’ll examine a case where a crash is occurring in a Microsoft process, in core Windows code, but the culprit isn’t the crashing code. In fact, the culprit isn’t even running in the process that crashed! But before I get ahead of myself, let’s start by examining a crash dump that shows the problem...
// The crashing process is the Windows Sidebar in Vista...
. 0 id: 1b3c create name: sidebar.exe
// The exception record shows that we have hit an access violation in RtlInitUnicodeString while reading from 00080000
0:001> .exr -1
ExceptionAddress: 77837e8b (ntdll!RtlInitUnicodeString+0x0000001b)
ExceptionCode: c0000005 (Access violation)
Attempt to read from address 00080000
// Here’s the register context at the time of failure.
// Looks like we are trying to find the end of a unicode string
Last set context:
eax=00000000 ebx=00000000 ecx=ffffffff edx=0088fa88 esi=00000000 edi=00080000
eip=77837e8b esp=0088fa60 ebp=0088fabc iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
77837e8b 66f2af repne scas word ptr es:[edi]
// Here’s the call stack that the debugger has tried to assemble for us...
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0088fad0 77154911 00080000 0088fb1c 7782e4b6 ntdll!RtlInitUnicodeString+0x1b
0088fadc 7782e4b6 00080000 77030f7b 00000000 kernel32!BaseThreadInitThunk+0xe
0088fb1c 7782e489 7713361f 00080000 00000000 ntdll!__RtlUserThreadStart+0x23
0088fb34 00000000 7713361f 00080000 00000000 ntdll!_RtlUserThreadStart+0x1b
// The address we are trying to read is freed memory...
0:003> du 00080000
So we crash because RtlInitUnicodeString attempted to dereference an invalid pointer (address 00080000). Where did RtlInitUnicodeString get this value? Let’s unassemble the function and see...
0:012> uf ntdll!RtlInitUnicodeString
77567e70 57 push edi
77567e71 8b7c240c mov edi,dword ptr [esp+0Ch] // edi comes from here
77567e75 8b542408 mov edx,dword ptr [esp+8]
77567e79 c70200000000 mov dword ptr [edx],0
77567e7f 897a04 mov dword ptr [edx+4],edi
77567e82 0bff or edi,edi
77567e84 7422 je ntdll!RtlInitUnicodeString+0x38 (77567ea8)
77567e86 83c9ff or ecx,0FFFFFFFFh
77567e89 33c0 xor eax,eax
77567e8b 66f2af repne scas word ptr es:[edi] // We crash here
We can see from the above assembly that edi came from esp+c, which is the second parameter to this function (MSDN tells us that this is the SourceString parameter). So naturally we’ll want to examine the caller of RtlInitUnicodeString to see if it is at fault. If we are to believe the call stack that the debugger gave us, then the caller of RtlInitUnicodeString is BaseThreadInitThunk... but that doesn’t make sense. BaseThreadInitThunk is normally the function that calls the thread’s start function. Why would anyone make RtlInitUnicodeString the start function for a thread? Let’s assume for a moment that the debugger isn’t showing us the real call stack, and look at the raw stack values...
0:001> dds esp
0088fa60 77837e70 ntdll!RtlInitUnicodeString
0088fa64 7713312c kernel32!LoadLibraryExW+0x6f
0088fab0 7711e289 kernel32!_except_handler4
0088fac0 77133630 kernel32!LoadLibraryW+0x11
0088fad4 77154911 kernel32!BaseThreadInitThunk+0xe
0088fae0 7782e4b6 ntdll!__RtlUserThreadStart+0x23
0088fae8 77030f7b urlmon!g_StaticLiteralTreeCode <PERF> (urlmon+0xe0f7b)
0088fafc 771af389 kernel32!UnhandledExceptionFilter
0088fb00 771af389 kernel32!UnhandledExceptionFilter
0088fb10 777f9834 ntdll!_except_handler4
0088fb20 7782e489 ntdll!_RtlUserThreadStart+0x1b
0088fb24 7713361f kernel32!LoadLibraryW
0088fb3c 7713361f kernel32!LoadLibraryW
<snip - all zeros to base of stack at 00890000>
We can see from the information above that there is more going on here than the “kb” output would lead us to believe. Notice that the second non-zero value on the stack (working from the bottom up) is the address of the LoadLibraryW function. This is not a return address, it is the start of the function. This is a clue that the start address of this thread, specified by some other thread that called CreateThread, is actually LoadLibraryW. This is a parameter to _RtlUserThreadStart, which is the bottommost function on this thread’s call stack. So we can examine the raw data above, and reconstruct the call stack like so...
Does this make sense? Let’s think it through. In Vista, it is typical to see the bottom 3 functions at the start of a thread. Following this is the start address of the thread, which in this case is LoadLibraryW. It also makes sense that LoadLibraryW would call LoadLibraryExW. That leaves us with the question of whether LoadLibraryW actually calls RtlInitUnicodeString. Let’s find out...
0:011> uf kernel32!LoadLibraryExW
76f7311d ff7508 push dword ptr [ebp+8]
76f73120 8d45cc lea eax,[ebp-34h]
76f73123 50 push eax
76f73124 8b3d6c10f576 mov edi,dword ptr [kernel32!_imp__RtlInitUnicodeString (76f5106c)]
76f7312a ffd7 call edi
76f7312c a14cd50177 mov eax,dword ptr [kernel32!BasepExeLdrEntry (7701d54c)]
76f73131 3bc3 cmp eax,ebx
76f73133 0f8498dafeff je kernel32!LoadLibraryExW+0x78 (76f60bd1)
We can see from the above assembly that LoadLibraryExW does indeed call RtlInitUnicodeString, so our re-assembled call stack makes sense.
Now we know how we got to RtlInitUnicodeString, but where did the bad pointer value come from? Note that the bad value, 00080000, is actually the very first non-zero value on the stack, right before the address of LoadLibraryExW. When a call is made to CreateThread, not only is the thread start address specified, but also an optional parameter to the start function. That is what we are seeing here. Both the thread start address and the parameter end up as parameters to _RtlUserThreadStart.
Let’s sum up what we know so far. Some other thread is calling CreateThread with LoadLibraryW as the start function address, and 00080000 as the parameter. This leads LoadLibraryExW to call RtlInitUnicodeString with 00080000 as the SourceString parameter, and when RtlInitUnicodeString attempts to read from this address we crash because 00080000 is the address of freed memory. Incidentally, what should 00080000 point to? It is being used as the parameter to LoadLibrary, so it should be a pointer to a null terminated string that specifies the name of the library file to load.
Why would one want to start a thread with LoadLibraryW as the start address? The most common reason is code injection. Typically a thread outside of the target process will use the CreateRemoteThread function to invoke LoadLibrary against a particular DLL, thus loading their code into the target process. Assuming that this is what is going on, we’ll need to move beyond the crash dump and debug a live system that is having the problem. Fortunately we noticed that the crashing systems we observed all had a particular third party application installed. We set up a test system with the third party application, and we were able to sometimes reproduce the crash. Looks like possibly a timing-related crash, since we didn’t have a 100% repro.
So let’s move forward with the debug of the live system. Note that the memory addresses in the live debug notes will differ from the above. Also, addresses and module names that could identify the application vendor have been changed. First things first: setting a breakpoint on ntdll!NtCreateThreadEx and running through the code showed that the LoadLibraryW thread was starting without any in-process thread starting the thread. It also revealed that the LoadLibraryW thread started without crashing this time, and the library that it was loading was a DLL that belonged to the third party app in question....
0:003> du 00b90000
00b90000 "C:\Program Files\AppVendor\App N"
We now know that the thread that is creating the LoadLibraryW thread isn’t in the sidebar.exe process, and we know that the third party application is definitely using LoadLibraryW to remotely inject their code into sidebar.exe. Sometimes this works, and sometimes it crashes because the DLL name buffer points to freed memory. Next step, let’s debug the third party process to confirm our suspicions, and hopefully figure out why the memory is freed sometimes. We’ll set breakpoints on VirtualAllocEx, WriteProcessMemory, CreateRemoteThread, and VirtualFreeEx in the third party process...
// Third party app calls VirtualAllocEx against the sidebar.exe process
ChildEBP RetAddr Args to Child
0200f8c0 4c9864db 000002b0 00000000 00000090 kernel32!VirtualAllocEx
WARNING: Stack unwind information not available. Following frames may be wrong.
0200f914 4c986af9 000aada0 0200f938 000adc20 AppName!Ordinal1+0x411c0
0200f96c 4c986d1f 00000000 0200f9a0 00000003 AppName!Ordinal1+0x417de
0200f8c0 6c9884db kernel32!VirtualAllocEx(
void * hProcess = 0x000002b0,
void * lpAddress = 0x00000000,
unsigned long dwSize = 0x90,
unsigned long flAllocationType = 0x1000,
unsigned long flProtect = 4)
// PID 2836 == sidebar.exe
0:003> !handle 0x000002b0 f
Object Specific Information
Process Id 2836
Parent Process 2944
Base Priority 8
// 00b90000 is the base address of the allocation in this run
eax=00b90000 ebx=0200f9ac ecx=0200f89c edx=77249a94 esi=0200f904 edi=00000090
eip=4c9844db esp=0200f8dc ebp=0200f914 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
// Now the app calls WriteProcessMemory writing the following string:
// "C:\Program Files\AppVendor\App Name\applib.dll" to the new virtual alloc
Breakpoint 5 hit
eax=00000000 ebx=0200f9ac ecx=0200f904 edx=00b90000 esi=00000090 edi=00000090
eip=77371cc6 esp=0200f8bc ebp=0200f8d8 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
77371cc6 8bff mov edi,edi
0200f8b8 6c9885ba kernel32!WriteProcessMemory(
void * lpBaseAddress = 0x00b90000,
void * lpBuffer = 0x000aada0,
unsigned long nSize = 0x90,
unsigned long * lpNumberOfBytesWritten = 0x0200f8e0)
0200f8d8 4c986895 AppName!Ordinal1+0x4129f
0:003> du 0x000aada0
000aada0 "C:\Program Files\AppVendor\App Name\applib.dll"
// Within the sidebar.exe process, we can see that after WriteProcessMemory in the app returns,
// the address contains the expected string...
// The app then calls CreateRemoteThread against the sidebar.exe process,
// using LoadLibraryW as the start address, and the address of the virtual alloc as the parameter.
Breakpoint 0 hit
eax=000002b0 ebx=77370000 ecx=7739361f edx=77249a94 esi=0200f9ac edi=00000000
eip=773b46ef esp=0200f8a0 ebp=0200f8d0 iopl=0 ov up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000a02
773b46ef 6858010000 push 158h
0200f89c 6c988714 000002b0 00000000 00000000 kernel32!CreateRemoteThread
0200f8d0 4c9688ea 0200f8f8 00000000 6cade78c AppName!Ordinal1+0x413f9
0200f914 4c968af9 000aada0 0200f938 000adc20 AppName!Ordinal1+0x415cf
0:003> kP L1
0200f89c 6c988714 kernel32!CreateRemoteThread(
struct _SECURITY_ATTRIBUTES * lpThreadAttributes = 0x00000000,
unsigned long dwStackSize = 0,
<function> * lpStartAddress = 0x7739361f,
void * lpParameter = 0x00b90000,
unsigned long dwCreationFlags = 0,
unsigned long * lpThreadId = 0x00000000)
0:003> ln 0x7739361f
kernel32!LoadLibraryW (wchar_t *)
// The application then calls VirtualFreeEx on the virtual alloc.
0200f8cc 6c98851c 000002b0 00b90000 00000000 kernel32!VirtualFreeEx
0200f914 4c988af9 000aada0 0200f938 000adc20 AppName!Ordinal1+0x41201
0200f96c 4c988d1f 00000000 0200f9a0 00000003 AppName!Ordinal1+0x417de
// Within the sidebar.exe process, the thread for LoadLibrary starts...
0412f814 7722e489 7739361f 00b90000 00000000 ntdll!__RtlUserThreadStart
0412f82c 00000000 7739361f 00b90000 00000000 ntdll!_RtlUserThreadStart+0x1b
// But the memory has already been freed by the third party application...
// ...and sidebar.exe crashes once the address is read
(b14.4d8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=00000000 ecx=ffffffff edx=0412f780 esi=00000000 edi=00b90000
eip=77237e8b esp=0412f758 ebp=0412f7b4 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
77237e8b 66f2af repne scas word ptr es:[edi]
We’ve now come full circle, back to the original crash, at the same instruction in RtlInitUnicodeString. So now we know that the problem is that the third party vendor frees the memory that it allocated in sidebar.exe, before it is read. Sometimes the call to VirtualFreeEx happens after the memory is read, and in that scenario the crash doesn’t occur. But if VirtualFreeEx completes before RtlInitUnicodeString has a chance to read it, then sidebar.exe crashes.
This information was passed on to the third party vendor, and hopefully they’ll make a change to their code to avoid this problem. There are multiple ways that this could be addressed, but one potential fix would be for the third party app to wait on the thread handle returned from CreateRemoteThread before freeing the memory.
- Matthew Justice
The conference was held virtually via Live Meeting, with Q&A sessions following every presentation.
Each recorded presentation will be accompanied by a link to the corresponding slide deck.
T. Roy, founder of www.codemachine.com
T. Roy presented on instrumentation techniques for developers and engineers who instrument binaries to help resolve complex operating system problems.
Mark discussed new features and tools in the Sysinternals Tools suite.
Members of the Escalation Services team share best practices for engaging Microsoft support. Included in this video are a demo of the new MPSReports tool and a demo of using Hyper-V to package up reproducible problems to send to Microsoft.
Jeff discusses the different locations where you can find great content from the Escalation Services team to help solve common and difficult problems.
Tate Calhoun - Escalation Engineer
Tate walks through the use of the new XPERF tool and how to analyze the data for specific scenarios.
Citrix - Escalation Engineers
Citrix engineers Nicholas Vasile, Dmitry Vostokov, and Kapil Ramlal share tools they have created that take advantage of the ETW tracing infrastructure in Windows, along with debugging scripts and best practices for engaging issues that require multi-vendor support.
Dennis Smeltzer - Executive Director TSANet
Dennis discusses a proposed new TSANet offering which would allow Escalation engineers to contact each other to collaborate on issues that require multi-vendor support.
We are back with another addition to the debugging fundamentals series here in 2009! Regardless of your debugging experience, we hope you enjoy this one. Don’t worry too much if you can’t answer them all; just give them your best shot.
While debugging an application, a developer observes a crash and is puzzled. The developer saves a memory dump of the process using the command “.dump /ma dbgex4.dmp” and sends it to you for debug assistance.
eax=53454750 ebx=00000000 ecx=00000100 edx=000001b0 esi=001bfe20 edi=00ac36bc
eip=47414d45 esp=001bfe00 ebp=4f564552 iopl=0 nv up ei pl zr na pe nc
47414d45 ?? ???
The attached zip, dbgex4.zip, includes three files: the program itself, the program’s symbol file, and the saved memory dump file. (NOTE: You do not need to execute the program for the exercise; however, you may want to in order to solve the bonus questions. You will need to rename the file dbgex4.ex_ to dbgex4.exe in order to run it.)
Using the accompanying files, answer the following questions:
1. If this access violation occurred while the program was running without a debugger present, would there be anything different about the crashing register state or the data structure which caused it? If so what is it and why? (HINT: If you would like to observe the crashing state like this, you can register windbg.exe to be the Just-In-Time debugger using the –I switch like so “windbg.exe -I”. When you run the dbgex4.exe program, it will crash and windbg.exe will launch automatically and attach to it.)
My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today I will be blogging about how I used the SOS .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly; Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue. Instead, I used a process dump to obtain very exact information relating to the root cause, then used Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected.

The issue discussed in this blog is easily reproduced on Windows Server 2008, so you should be able to practice this debug exercise on your non-production Windows 2008 SP1 server for learning purposes if you are interested in doing so.
Issue Reported: The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:
"MMC could not create the snap-in."
MMC could not create the snap-in.
The snap-in might not have been installed correctly
Name: Event Viewer
First Step - Research & Data Gathering: After ensuring I understood the problem reported, I searched for known issues and found that we have seen this error before. It may occur when the following registry key gets deleted or corrupted:
I had the customer export this key and found that it was not corrupted in any way; all data was as expected.
Next, a memory dump of the mmc.exe process was collected. The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built-in Windows 2008 Server "Windows Task Manager" feature "Create Dump File". If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you identify the correct instance.
Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL: http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
Now let's have a look at the debug.
1. First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.
It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.
2. I noticed the following output from the debugger after it loaded the dump file:
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)
3. Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:
0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols
However, when your goal is simply to point to the default symbol server, .symfix is a very nice shortcut that saves you from having to remember the URL. Here’s the syntax:
0:011> .symfix c:\websymbols
4. To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.
PEB at 000007fffffdb000
CommandLine: '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '
You could alternatively dump the CommandLine from the nt!_PEB using the dt command
0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000
+0x020 ProcessParameters :
+0x070 CommandLine : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "
5. Next, I dumped out all of the threads in this process and found that the following thread contained a stack that was raising a .Net Framework exception:
0:011> ~* kL
... (omitted the non-relevant threads)
# 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
0691f03c 7343a91c kernel32!RaiseException+0x58
0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8
*** WARNING: Unable to verify checksum for MMCEx.ni.dll
0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
0691f1e8 69926cf6 MMCEx_ni+0xd0b5a
0691f1f4 6993019f mscorlib_ni+0x216cf6
0691f208 69926c74 mscorlib_ni+0x22019f
0691f220 733d1b4c mscorlib_ni+0x216c74
0691f230 733e21b1 mscorwks!CallDescrWorker+0x33
0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3
0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c
0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f
0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a
0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192
0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a
0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e
6. Out of curiosity, I also ran the !gle (Get Last Error) command:
LastErrorValue: (Win32) 0 (0) - The operation completed successfully.
LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0
7. After this, I ran !analyze -v to see what helpful information the debugger would provide. The debugger did output exception information, but informed me that I needed to use the x86 debugger instead.
0:011> !analyze -v
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)
ExceptionCode: e0434f4d (CLR exception)
Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.
8. I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension. This extension ships in the Operating System along with the .Net Framework. On most occasions, I would have initiated the loading of the extension through the use of the following syntax:
0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll
0:011> .load c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll
However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load, because the version of SOS must match the version of the CLR loaded into that process. Here’s the recommended syntax:
0:011> .loadby sos mscorwks
I always verify that my extensions are loaded properly by using the .chain command.
... Extension DLL chain:
C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007
9. Since sos.dll was now at the top of the .chain output, running !help printed the following helpful information about the SOS extension:
SOS is a debugger extension DLL designed to aid in the debugging of managed
programs. Functions are listed by category, then roughly in order of
importance. Shortcut names for popular functions are listed in parenthesis.
Type "!help <functionname>" for detailed info on that function.
Object Inspection              Examining code and stacks
DumpObj (do)                   Threads
DumpArray (da)                 CLRStack
DumpStackObjects (dso)         IP2MD
PrintException (pe)            COMState
10. Using the exception address displayed by the debugger when opening the dump, along with the !pe command listed above, I obtained more information about the exception:
0:011> !pe 771a42eb
There are nested exceptions on this thread. Run with -nested for details
0:011> !pe -nested 771a42eb
Nested exception -------------------------------------------------------------
Exception object: 040a676c
Exception type: System.Reflection.TargetInvocationException
Message: Exception has been thrown by the target of an invocation.
InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more
SP IP Function
0:011> !PrintException 040a6a20
Exception object: 040a6a20
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more
0:011> !PrintException 040a6cf8
Exception object: 040a6cf8
Exception type: System.Configuration.ConfigurationErrorsException
Message: Configuration system failed to initialize
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more
0:011> !PrintException 040a7174
Exception object: 040a7174
Message: Unrecognized configuration section system.web/myInvalidData
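The chain of InnerException links above follows a general pattern: keep walking the inner exception until you reach the root cause. As a rough analogy only (not part of the original debugging session), here is a Python sketch in which `__cause__` plays the role of InnerException:

```python
def exception_chain(exc):
    """Walk an exception's cause chain (the analogue of following
    InnerException links with !PrintException), outermost first."""
    chain = []
    while exc is not None:
        chain.append(exc)
        exc = exc.__cause__  # the "inner" exception, if any
    return chain

# Rebuild a nesting similar to the one in the dump:
# invocation wrapper -> config initialization failure -> root cause.
try:
    try:
        try:
            raise ValueError("Unrecognized configuration section system.web/myInvalidData")
        except ValueError as root:
            raise RuntimeError("Configuration system failed to initialize") from root
    except RuntimeError as mid:
        raise RuntimeError("Exception has been thrown by the target of an invocation.") from mid
except RuntimeError as outer:
    for e in exception_chain(outer):
        print(f"{type(e).__name__}: {e}")
```

As with !pe -nested, the interesting message is the last one printed, not the outer wrapper.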
11. Based on the exception information above, it appeared that the .NET Framework configuration section system.web contained an invalid section named myInvalidData. I then re-ran !analyze -v against the dump (now that I had loaded the x86 debugger) and found that !analyze -v loads the sos.dll extension and even runs the !pe command automatically, so it displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.
0:011> !analyze -v
EXCEPTION_MESSAGE: Unrecognized configuration section system.web/myInvalidData.
0 Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen
1 Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen
2 Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen
3 Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen
4 Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen
5 Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen
6 Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen
7 Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen
8 Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen
9 Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen
10 Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen
. 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
12. At this point I was interested in identifying the source of this unrecognized configuration section. Instead of engaging our .NET support team, I started with a quick search on www.live.com for
"unrecognized configuration section" system.web site:microsoft.com
This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE
By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications, caused by invalid entries in the various .config files used by the .NET Framework. The posts mentioned a variety of configuration file names and paths.
So I opened the Process Monitor log file to see which configuration files were being read. I added filter criteria to match entries from the mmc.exe process, the TID of the FAULTING_THREAD listed in the exception data, paths containing .config, and a SUCCESS status result. It's best to be as specific as possible.
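The same filter can also be applied offline against a Process Monitor CSV export. This is a hypothetical sketch (not from the original session); it assumes the default export column names plus a manually added TID column:

```python
import csv

def matching_config_reads(csv_path, faulting_tid):
    """Apply the same filter criteria as in the Process Monitor GUI:
    process mmc.exe, the faulting thread's TID, a path containing
    '.config', and a SUCCESS result. Returns the matching paths."""
    hits = []
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            if (row["Process Name"] == "mmc.exe"
                    and row["TID"] == str(faulting_tid)
                    and ".config" in row["Path"].lower()
                    and row["Result"] == "SUCCESS"):
                hits.append(row["Path"])
    return hits
```

Sorting the returned paths by frequency quickly shows which file is being read over and over.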
I found that we were reading a large number of settings, over and over again, from the .NET Framework global configuration file:
C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config
(on x64 this would be C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)
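Conceptually, the "Unrecognized configuration section" error comes from the configuration system comparing each child of <system.web> against the set of sections it knows about. A minimal sketch of that idea in Python (the allow-list here is abbreviated and purely illustrative; the real set is defined by the machine.config schema):

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated allow-list for illustration only.
KNOWN_SYSTEM_WEB_SECTIONS = {
    "authentication", "authorization", "compilation", "httpHandlers",
    "httpModules", "membership", "pages", "processModel", "sessionState",
}

def unrecognized_sections(config_xml):
    """Return child tags of <system.web> that are not in the allow-list,
    mimicking the check behind 'Unrecognized configuration section'."""
    root = ET.fromstring(config_xml)
    system_web = root.find("system.web")
    if system_web is None:
        return []
    return [child.tag for child in system_web
            if child.tag not in KNOWN_SYSTEM_WEB_SECTIONS]

sample = """<configuration>
  <system.web>
    <compilation />
    <myInvalidData />
  </system.web>
</configuration>"""

print(unrecognized_sections(sample))  # prints ['myInvalidData']
```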
Final step - putting it all together, reproducing the issue, and confirming the resolution: Using Notepad, a quick search of the suspect XML file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. I suspected that this section contained an invalid entry that was related to the problem. To verify this (and because I like to break things), I added an invalid configuration setting, <myInvalidData/>, to my own global configuration file and successfully reproduced the issue on my system. I then contacted the customer and asked whether they had, by any chance, added any settings under <system.web> in the configuration file c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.
The customer informed me that, at the request of their ASP.NET developer, they had in fact added settings to that section of the file. By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we determined that those settings should not have been placed inside <system.web>. The developer moved the settings to the proper location, and the issue was resolved.
Here are the steps I used to reproduce the issue, in case you would like to replicate this at home:
A. Using Notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system (please make a backup copy first in case you make a mistake): C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config, or the Framework64 version if that matches the architecture of your platform.
B. Find the <system.web> section in this file (you can use Notepad's Find function).
C. Add the line <myInvalidData/> directly after <system.web>.
D. Save the file, then open eventvwr.msc and verify that the resulting error is displayed in the event log.
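For the adventurous, steps A through D can be scripted. Here is a sketch in Python, to be run only against a non-production machine (the path argument is whichever machine.config matches your platform):

```python
import shutil

def inject_invalid_section(path):
    """Reproduce the issue: back up the file (step A), then insert
    <myInvalidData/> directly after the <system.web> tag (steps B-C).
    Run only on a non-production system."""
    shutil.copyfile(path, path + ".bak")  # keep a backup copy
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Insert the invalid section right after the first <system.web> tag.
    text = text.replace("<system.web>", "<system.web>\n    <myInvalidData/>", 1)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Restoring the .bak copy over the original undoes the repro.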
Hopefully this post has demonstrated how you can use the "create dump file" feature of Windows Server 2008, windbg, and other related tools to gain more specific data when an error message does not reveal the source of the problem. Feel free to post any questions, comments, or concerns.