• Ntdebugging Blog

    Debug puzzler 0x00000002 “Attack of the crazy stack”


    Hi NTDebuggers, I have another puzzler for you.  We started crash2.exe under windbg and it crashed.  Go figure!  Sometimes we have a very limited amount of data available to figure out what went wrong.  That being said, this week’s puzzler only gives you a few clues.    Given this week’s debugger output, what do you suspect the problem is?  What would you do to further isolate the issue or prove your theory?


    If there is more data you need to solve it, post a comment / request and I will provide the data for you.  We will post all comments during the week and provide our answer on Friday.  We look forward to your comments.


    CommandLine: crash2.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76180000 76290000   C:\Windows\syswow64\kernel32.dll

    (15d0.1688): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=cd7b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    779c0004 cc              int     3


    0:000> g


    (15d0.1688): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=7619140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> k 123


    ChildEBP RetAddr 

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017fefc 66692065 0x65732074

    0017ff3c 0041b9a3 0x66692065

    *** WARNING: Unable to verify checksum for crash2.exe

    0017ffa0 762019f1 crash2!_onexit+0x35

    0017ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> lm

    start    end        module name

    00400000 00438000   crash2   C (private pdb symbols)  D:\CPRRAMP\source\crash2\debug\crash2.pdb

    76180000 76290000   kernel32   (private pdb symbols)  C:\symbols\wkernel32.pdb\20F7BB5ED22344A2910B27CA7252AE792\wkernel32.pdb

    779b0000 77b00000   ntdll      (private pdb symbols)  C:\symbols\wntdll.pdb\7099E4B6A6984FD08CBC90A4EDD40FD12\wntdll.pdb


    0:000> db 66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0x65732074

    65732074  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732084  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732094  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320a4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320b4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320c4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320d4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320e4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0041b9a3

    0041b9a3  c3 e8 42 57 ff ff c3 cc-cc cc cc cc cc cc cc cc  ..BW............

    0041b9b3  cc cc cc cc cc cc ff 74-24 04 e8 42 5c ff ff f7  .......t$..B\...

    0041b9c3  d8 1b c0 f7 d8 59 48 c3-cc cc cc cc 8b 44 24 04  .....YH......D$.

    0041b9d3  a3 78 3c 43 00 a3 7c 3c-43 00 a3 80 3c 43 00 a3  .x<C..|<C...<C..

    0041b9e3  84 3c 43 00 c3 cc cc cc-cc cc cc 6a 10 68 58 16  .<C........j.hX.

    0041b9f3  43 00 e8 cc 5d ff ff 33-ff 57 e8 0f 59 ff ff 59  C...]..3.W..Y..Y

    0041ba03  89 7d fc 39 7d 08 75 1c-be 78 3c 43 00 ff 35 78  .}.9}.u..x<C..5x

    0041ba13  3c 43 00 e8 19 5e ff ff-89 45 e4 c7 45 e0 02 00  <C...^...E..E...

    0:000> db 0x66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????


    0:000> dd esp


    0017ff00  66692065 72756f20 61747320 69206b63

    0017ff10  616c2073 72656772 6f6e6520 54686775

    0017ff20  20736968 6d207369 2065726f 61746164

    0017ff30  726f6620 00737520 6f76e0ae fffffffe

    0017ff40  0041b9a3 0041b9c2 00411181 004146d9

    0017ff50  0017ffa0 00412ac4 00000001 002620a0

    0017ff60  002620d8 6f220936 00000000 00000000

    0017ff70  7efde000 0017ff9c 00000000 00000006


    0:000> dds esp


    0017ff00  66692065

    0017ff04  72756f20

    0017ff08  61747320

    0017ff0c  69206b63

    0017ff10  616c2073

    0017ff14  72656772

    0017ff18  6f6e6520

    0017ff1c  54686775

    0017ff20  20736968

    0017ff24  6d207369

    0017ff28  2065726f

    0017ff2c  61746164

    0017ff30  726f6620

    0017ff34  00737520

    0017ff38  6f76e0ae

    0017ff3c  fffffffe

    0017ff40  0041b9a3 crash2!_onexit+0x35

    0017ff44  0041b9c2 crash2!atexit+0x9

    0017ff48  00411181 crash2!ILT+380(__RTC_Terminate)

    0017ff4c  004146d9 crash2!_cinit+0x49

    0017ff50  0017ffa0

    0017ff54  00412ac4 crash2!__tmainCRTStartup+0x15e

    0017ff58  00000001

    0017ff5c  002620a0

    0017ff60  002620d8

    0017ff64  6f220936

    0017ff68  00000000

    0017ff6c  00000000

    0017ff70  7efde000

    0017ff74  0017ff9c

    0017ff78  00000000

    0017ff7c  00000006



    Good luck and happy debugging!



    [Update: more debugger output, per reader request. Posted 4/17/2008]

    Note this cashes without windbg also.


    CommandLine: crash2.exe

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 777b0000 77900000   ntdll.dll

    ModLoad: 75a10000 75b20000   C:\Windows\syswow64\kernel32.dll

    (1324.16d0): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=73fe0000 edx=00000000 esi=fffffffe edi=77890094

    eip=777c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    777c0004 cc              int     3

    0:000> g

    (1324.16d0): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=75a2140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> db esp

    0017ff00  65 20 69 66 20 6f 75 72-20 73 74 61 63 6b 20 69  e if our stack i

    0017ff10  73 20 6c 61 72 67 65 72-20 65 6e 6f 75 67 68 54  s larger enoughT

    0017ff20  68 69 73 20 69 73 20 6d-6f 72 65 20 64 61 74 61  his is more data

    0017ff30  20 66 6f 72 20 75 73 00-9b 7d 6b d3 fe ff ff ff   for us..}k.....

    0017ff40  a3 b9 41 00 c2 b9 41 00-81 11 41 00 d9 46 41 00  ..A...A...A..FA.

    0017ff50  a0 ff 17 00 c4 2a 41 00-01 00 00 00 98 21 03 00  .....*A......!..

    0017ff60  d0 21 03 00 03 94 3f d3-00 00 00 00 00 00 00 00  .!....?.........

    0017ff70  00 e0 fd 7e 9c ff 17 00-00 00 00 00 06 00 00 00  ...~............

    0:000> ln 0042ecc8

    (0042ecac)   crash2!`string'+0x1c   |  (0042eccc)   crash2!`string'

    0:000> .formats 0017ff044

    Evaluate expression:

      Hex:     017ff044

      Decimal: 25161796

      Octal:   00137770104

      Binary:  00000001 01111111 11110000 01000100

      Chars:   ..D

      Time:    Mon Oct 19 01:23:16 1970

      Float:   low 4.70085e-038 high 0

      Double:  1.24316e-316

    0:000> dd 0017ff044

    017ff044  ???????? ???????? ???????? ????????

    017ff054  ???????? ???????? ???????? ????????

    017ff064  ???????? ???????? ???????? ????????

    017ff074  ???????? ???????? ???????? ????????

    017ff084  ???????? ???????? ???????? ????????

    017ff094  ???????? ???????? ???????? ????????

    017ff0a4  ???????? ???????? ???????? ????????

    017ff0b4  ???????? ???????? ???????? ????????

    0:000> dds crash2!__onexitbegin

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000

    004343d0  00000000

    0:000> dds crash2!__onexitend

    00434350  59c94d56

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000


    [Update: our answer. Posted 4/18/2008]

    This week we had a lot of people that realized this was a buffer overrun.   You guys are so good I’m going to make next week’s puzzler a little harder!

    Good work all!   They are not all listed but some of the responses I liked the best were:





    Tal Rosen


    This week’s official response



    Let’s start off with wmain.  You can see here that we push a pointer to a string onto the stack and call fun1 004117a3


    0:000> uf crash2!wmain

    crash2!wmain [d:\cprramp\source\crash2\crash2\crash2.cpp @ 11]:

       11 00412520 55              push    ebp

       11 00412521 8bec            mov     ebp,esp

       11 00412523 83ec40          sub     esp,40h

       11 00412526 53              push    ebx

       11 00412527 56              push    esi

       11 00412528 57              push    edi

       12 00412529 686cec4200      push    offset crash2!`string' (0042ec6c) << Pushing param onto the stack

       12 0041252e e870f2ffff      call    crash2!ILT+1950(?fun1YAXPADZ) (004117a3)  << Making call to fun1

       12 00412533 83c404          add     esp,4

       13 00412536 33c0            xor     eax,eax

       14 00412538 5f              pop     edi

       14 00412539 5e              pop     esi

       14 0041253a 5b              pop     ebx

       14 0041253b 8be5            mov     esp,ebp

       14 0041253d 5d              pop     ebp

       14 0041253e c3              ret


    Lets dump out the value we are passing.


    0:000> da 0042ec6c

    0042ec6c  "This is a test ot see if our sta"

    0042ec8c  "ck is larger enough"


    0:000> uf 004117a3

    crash2!fun1 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 17]:

       17 00412550 55              push    ebp

       17 00412551 8bec            mov     ebp,esp

       17 00412553 83ec4c          sub     esp,4Ch

       17 00412556 53              push    ebx

       17 00412557 56              push    esi

       17 00412558 57              push    edi

       19 00412559 8b4508          mov     eax,dword ptr [ebp+8]  << Here we are moving EBP+8 into eax. This is basically just loading the address of parameter one into eax

       19 0041255c 50              push    eax  << We push it onto the stack to make our call to  strcpy

       19 0041255d 8d4df4          lea     ecx,[ebp-0Ch]  << Now we are loading the address of a local on the stack. 

    Note the local is ebp-c   That means that we can only write 0xC bytes to

    this location before we end up overwriting things on the stack like our base pointer and return address.

       19 00412560 51              push    ecx  << Now we push our local variable address onto the stack for our call to strcpy

       19 00412566 83c408          add     esp,8

       20 00412569 8d45f4          lea     eax,[ebp-0Ch]

       20 0041256c 50              push    eax


       19 00412561 e8c3eeffff      call    crash2!ILT+1060(_strcpy) (00411429)  << This is where things go WRONG, within the call to strcpy we have overwritten our return address with string data.


    Let’s look at the before and after.


    0:000> bp 00412560

    0:000> g

    Breakpoint 2 hit

    eax=0042ec6c ebx=75a2140f ecx=0017feec edx=00000000 esi=00000002 edi=00001770

    eip=00412560 esp=0017fe9c ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216


    00412560 51              push    ecx

    0:000> k 

    ChildEBP RetAddr 

    0017fef8 00412533 crash2!fun1+0x10 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 19] << The return address is ok here!

    0017ff50 00412ac4 crash2!wmain+0x13 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 12]

    0017ffa0 75a919f1 crash2!__tmainCRTStartup+0x15e [f:\sp\vctools\crt_bld\self_x86\crt\src\crt0.c @ 327]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> p  << Lets step over the call to strcpy and look again.

    eax=0017feec ebx=75a2140f ecx=0042eca0 edx=00686775 esi=00000002 edi=00001770

    eip=00412569 esp=0017fea0 ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216


    00412569 8d45f4          lea     eax,[ebp-0Ch]


    0:000> k

    ChildEBP RetAddr 

    0017fef8 65732074 crash2!fun1+0x19 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 20]

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017ff3c 0041b9a3 0x65732074  << This is not going to be pretty when we ret out of fun1.  We will basically return to nowhere.

    0017ffa0 75a919f1 crash2!_onexit+0x35 [f:\sp\vctools\crt_bld\self_x86\crt\src\onexit.c @ 98]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> da 0017fef8   if we do a da on the location of the stack frame we can see what is at that location.

    0017fef8  "st ot see if our stack is larger"  << It’s our string!

    0017ff18  " enough"


    Finally we have the C code.


    #include <windows.h>


    void fun1(char * szData);

    void fun2(char * szData);


    int _tmain(int argc, _TCHAR* argv[])


                    fun1("This is a test ot see if our stack is larger enough");

                    return 0;



    void fun1(char * szdata)


                    char szData1[10];




    void fun2(char *szData)


                    printf("Hello from fun2");

                    strcat(szData,"This is more data for us");




    Have a great weekend, Good luck, and happy debugging!



  • Ntdebugging Blog

    Windows Performance Toolkit - Xperf


    I recently came across a very interesting profiling tool that is available in Vista SP1 and Server 08 called the Windows Performance Analyzer. You can use this tool to profile and diagnose different kinds of symptoms that the machine is experiencing. This tool is built on top off the Event Tracing for Windows (ETW) infrastructure. It uses the ETW providers to record kernel events and then display them in a graphical format.

    Performance Analyzer provides many different graphical views of trace data including:

    1. CPU Sampling
    2. CPU and Disk utilization by process and thread
    3. Interrupt service routine and deferred procedure call
    4. Hard faults
    5. Disk I/O Detail

    What you will need

    Download the latest version of the Windows Performance Tools Kit, and install it on your machine. (http://www.microsoft.com/whdc/system/sysperf/perftools.mspx : Windows Performance Tools Kit, v.4.1.1 (QFE)) You will need to find the toolkit that corresponds to your processor architecture. Currently there are 3 versions available i.e. X86, IA64, X64.

    After installation you should be able to see 2 new tools. The first one is Xperf, which is a command line tool that is used to capture the trace. The second is called XperfView, which graphically interprets the trace that has been collected by Xperf.

    You will need to run the Xperf and XperfView from an elevated command prompt for all functionality.

    Starting a Trace

    For many tasks all you need for effective analysis is a kernel trace.  For this example, we'll use the –on DiagEasy parameter to enable several kernel events including: image loading; disk I/O; process and thread events; hard faults; deferred procedure calls; interrupts; context switches; and, and performance counters.  From an elevated command prompt launch xperf –on DiagEasy.

    This starts the kernel logger in sequential mode to the default file "\kernel.etl"; uses a default buffer size of 64K, with a minimum of 64 and a maximum of 320 buffers. 

    Stopping a Trace

    To stop a trace, type xperf –d <filename>.etl at the command line. This will stop the trace and output the file.

    Viewing a Trace

    There are 2 ways to view the trace. From an Elevated Command prompt, launch xperf <filename>.etl, or
    launch the XperfView tool and open the file manually. When you open the trace file, you should see something similar like this.

    NOTE - While you need to run xperf from an elevated command prompt in order to record a trace you do not need an elevated command prompt in order to *analyze* a trace.  

    Using the Chart Selector tab, you can select all the graphs that you want to look at. To drill down in each chart, you can select the Summary table. For instance, in the CPU Sampling chart, the summary table gets you the summary of the processes that were running, with information like the amount of CPU time, CPU %, stacks (if the stacks were collected in the trace, see below). When looking at the Summary table for the Disk I/O chart, you can see which processes were writing files (the filename too!) to disk, as well as how much time it took.

    You also have the ability to zoom in on a selected area. Another really cool feature is the ability to overlay multiple graphs on one frame. This way you can correlate different pieces of data together very easily.

    Also, you select which counter instances you want to see in each specific chart. On the top right corner of each chart is a drop down box from where you can select the counter instances. For instance on the Disk I/O chart, you can select Disk 0, Disk 1, or a combination as well.

    You can also view detailed information about the system that the trace was taken on. Click on the Trace menu item, and select System Configuration.

    Xperf Kernel Flags and Groups

    In the first sample Xperf command we ran, xperf –on DiagEasy. I am sure many of you were wondering what DiagEasy means. DiagEasy is a group of kernel events that are predefined by the Windows Performance Toolkit. This group includes Process, Threads, Kernel and User Image Load/Unload, Disk I/O, DPCs and Context Switch events.

    When we used the xperf –on DiagEasy command, we did not specify an individual provider, so we enabled the kernel events for all the ETW providers on the system. If you want to enable events for a specific provider, you can the following format xperf -on: (GUID|KnownProviderName)[:Flags[:Level]]. For more information about ETW providers, Kernel Flags and Groups, you can run the xperf –help providers command.

    Stack Tracing

    One of the most powerful features in Performance Analyzer is the ability to visualize stacks. It's important to note that this requires no special instrumentation in the code – only that you have symbols for the binary components you are interested in analyzing.

    When the trace is setup to collect the stacks, Performance Analyzer will display call stack summary information for the events that had stack walking enabled. Here is an example that takes a trace (with stack tracing enabled) of the entire system while running a "find string" utility.. We can use the Stack Tracing feature of Xperf to record a stack when certain events happen, or take sample at regular intervals over time. See xperf –help stackwalk output for more info.

    Below, we will use the Stack Tracking feature of Xperf to take stack samples at regular intervals. With this output, we will be able to determine where the CPU is spending most of its time within a process.

    xperf -on latency -stackwalk Profile

    Latency is the kernel group to enable certain events, including the profile event which records the CPUs' activity every millisecond. The "-stackwalk Profile" flag tells Xperf to record stack walks on every profile event, which makes the profile information much more useful. In other words, in order to get profile information with stack walks you need to turn on the profile event, and turn on stack walking for that event.

    Note that decoding of stacks requires that symbol decoding be configured. However stacks can be recorded without symbols, and can even be viewed without symbols, although they are much less useful without symbols. I only mention this in the event you're trying to record a trace of a problematic machine with little time to mess around with _NT_SYMBOL_PATH.

    To get a trace with the stack information, do the following:

    1. Run the command, xperf -on latency -stackwalk Profile.
    2. Run the application that you wish to analyze, then
    3. You can end the trace by using the regular xperf –d <filename>.etl command.
    4. Load the trace in the viewer and then select the Load Symbols from the Trace drop down menu.
    5. Once the symbols are loaded, select an interesting region on the CPU sampling graph.
    6. Right click and select Summary Table.

    Click on the selector tab to bring up the column chooser list. Then select "Process name", "Process", "Stack", "Weight" and "%Weight". These are the most useful columns when looking at stacks from the sample profile event. You should get a view similar to this.


    At this point I need to mention a few of the restrictions with stack walking coupled with when and how it works.

    ·                Xperf stack walking is not available on XP

    ·                On Vista stack walking is available for x86, and is available for x64 as of Vista SP1.

    ·                On Windows 7 stack walking is available.

    ·                Stack walking on x64 is complicated. You have to set DisablePagingExecutive in the registry, as documented here:


    REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD –f

    Sample Case

    I recently came across a case where the customer was complaining that DPC processing was taking up too much CPU time. We ran Xperf on the machine and drilled down into the DPC activity on the machine.

    From the Xperf graph, I was able to confirm that the customer was actually seeing high DPC usage. I selected the Summary for this chart, and got the list of drivers that were actually taking up CPU time.

    Right off the bat, I could identify the driver that had a lot of DPC activity. I also noted that the average duration for each DPC from that driver was taking 788 microseconds. This is way too high. Each DPC should be taking a maximum of 100 microseconds.


    - Omer



    Performance.Analyzer.QuickStart.xps – This is shipped with the performance toolkit.

    From an elevated command prompt, launch xperf -help




  • Ntdebugging Blog

    Debugging a bluescreen at home



    Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.


    A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.


    This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.


    Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.


    Let's see what we can tell from the dump!


    The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:


    kd> !analyze -v


    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *




    This is a very common bugcheck.  Usually the exception address pinpoints

    the driver/function that caused the problem.  Always note this address

    as well as the link date of the driver/image that contains this address.

    Some common problems are exception code 0x80000003.  This means a hard

    coded breakpoint or assertion was hit, but this system was booted

    /NODEBUG.  This is not supposed to happen as developers should never have

    hardcoded breakpoints in retail code, but ...

    If this happens, make sure a debugger gets connected, and the

    system is booted /DEBUG.  This will let us see why this breakpoint is



    Arg1: c0000005, The exception code that was not handled

    Arg2: 8738e300, The address that the exception occurred at

    Arg3: b9b3dc7c, Trap Frame

    Arg4: 00000000


    Debugging Details:




    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".




    8738e300 0000             add     [eax],al


    TRAP_FRAME:  b9b3dc7c -- (.trap ffffffffb9b3dc7c)

    ErrCode = 00000002

    eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64

    eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0         nv up ei pl zr na po nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

    8738e300 0000             add     [eax],al                ds:0023:00000001=??

    Resetting default scope






    LAST_CONTROL_TRANSFER:  from 8051d6a7 to 8053331e



    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93



    From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.


    What’s at 0x8738e300?


    kd> dc 8738e300

    8738e300  00000000 00000001 00000000 87360350  ............P.6.

    8738e310  00000000 00000001 f71af9fe f71b0030  ............0...

    8738e320  f71afb0e f71afbb4 f71b0098 f71b0214  ................

    8738e330  f71afef6 f71aff8e 07fef800 00000000  ................

    8738e340  f71afffc 00000000 0a0e000a 644c6d4d  ............MmLd

    8738e350  8732ea58 870303e0 ffffffff ffffffff  X.2.............

    8738e360  00000012 00000000 f797f000 f7989905  ................

    8738e370  0000c000 00500050 e1971458 00160016  ....P.P.X.......


    kd> !pool 8738e300 2

    Pool page 8738e300 region is Nonpaged pool

    *8738e2f8 size:   50 previous size:    8  (Allocated) *NV 

                  Owning component : Unknown (update pooltag.txt)


    Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).


    So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.


    kd> kv L8

    ChildEBP RetAddr  Args to Child             

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

    77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090


    The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:


    kd> ub bf801619


    bf8015ff 8d1c82           lea     ebx,[edx+eax*4]

    bf801602 8bc1             mov     eax,ecx

    bf801604 c1e810           shr     eax,0x10

    bf801607 663b430a         cmp     ax,[ebx+0xa]

    bf80160b 75ad             jnz     win32k!ValidateHwnd+0x3d (bf8015ba)

    bf80160d 807b0801         cmp     byte ptr [ebx+0x8],0x1

    bf801611 7573             jnz     win32k!ValidateHwnd+0xff (bf801686)

    bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]


    OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.


    kd> dd bf98b2e0 L1

    bf98b2e0  804e4a15


    Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?


    kd> ln 804e4a15

    (804e4a15)   nt!PsGetCurrentThread   |  (804e4a51)   nt!KeSetEventBoostPriority

    Exact matches:

        nt!PsGetCurrentThread = <no type information>


    Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.


    So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.


    At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)


    Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:


    Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!


    The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!


  • Ntdebugging Blog

    Microsoft Windows Dynamic Cache Service


    Excessive cached read I/O is a growing problem.  For over one year we have been working on this problem with several companies.  You can read more about it in the original blog post:


    On 32 bit systems, the kernel could address at most 2GB of virtual memory.  This address range is shared and divided up for the many resources that the system needs; one of which is the System File Cache's working set.  On 32 bit systems the theoretical limit is almost 1GB for the cache’s working set; however, when a page is removed from the working set it will end up on the standby page list.  Therefore the system can cache more than the 1 GB limit if there is available memory. The working set; however, is just limited to what can be allocated within the Kernel's 2GB virtual address range.  Since most modern systems have more than 1 GB of physical RAM, the System File Cache's working set's size on a 32 bit system typically isn't a problem.

    With 64 bit systems, the kernel virtual address space is very large and is typically larger than physical RAM on most systems.  On these systems the System File Cache's working set can be very large and is typically about equal to the size of physical RAM.  If applications or file sharing performs a lot of sustained cached read I/O, the System File Cache's working set can grow to take over all of physical RAM.  If this happens, then process working sets are paged out and everyone starts fighting for physical pages and performance suffers.

    The only way to mitigate this problem is to use the provided APIs of GetSystemFileCacheSize() and SetSystemFileCacheSize().  The previous blog post "Too Much Cache" contains sample code and a compiled utility that can be used to manually set the System File Cache's working set size.

    The provided APIs, while offering one mitigation strategy, has a couple of limitations:

    1) There is no conflict resolution between multiple applications.  If you have two applications trying to set the System File Cache's working set size, the last one to call SetSystemFileCacheSize() will win.  There is no centralized control of the System File Cache's working set size.

    2) There is no guidance on what to set the System File Cache's working set size to.  There is no one size fits all solution.  A high cache working set size is good for file servers, but bad for large memory application and a low working set size could hurt everyone's I/O performance.  It is essentially up to 3rd party developers or IT administrators to determine what is best for their server and often times, the limits are determined by a best guesstimate backed by some testing.

    We fully understand that while we provide one way to mitigate this problem, the solution is not ideal. We spent a considerable amount of time reviewing and testing other options.  The problem is that there are so many varied scenarios on how users and applications rely on the System File Cache.  Some strategies worked well for the majority of usage scenarios, but ended up negatively impacting others.  We could not release any code change that would knowingly hurt several applications.

    We also investigated changing some memory manager architecture and algorithms to address these issues with a more elegant solution; however the necessary code changes are too extensive.  We are experimenting with these changes in Windows 7 and there is no way that we could back port them to the current operating systems.  If we did, we would be changing the underlying infrastructure that everyone has been accustomed to.  Such a change would require stress tests of all applications that run on Windows. The test matrix and the chance of regression are far too large.

    So that brings us back to the only provided solution - use the provided APIs.  While this isn't an ideal solution, it does work, but with the limitations mentioned above.  In order to help address these limitations, I've updated the SetCache utility to the Microsoft Windows Dynamic Cache Service.  While this service does not completely address the limitations above, it does provide some additional relief.

    The Microsoft Windows Dynamic Cache Service uses the provided APIs and centralizes the management of the System File Cache's working set size.  With this service, you can define a list of processes that you want to prioritize over the System File Cache by monitoring the working set sizes of your defined processes and back off the System File Cache's working set size accordingly.  It is always running in the background monitoring and dynamically adjusting the System File Cache's working set size.  The service provides you with many options such as adding additional slack space for each process' working set or to back off during a low memory event.

    Please note that this service is experimental and includes sample source code and a compiled binary.  Anyone is free to re-use this code in their own solution.  Please note that you may experience some performance side effects while using this service as it cannot possibly address all usage scenarios. There may be some edge usage scenarios that are negatively impacted.  The service only attempts to improve the situation given the current limitations.  Please report any bugs or observations here to this blog post.  While we may not be able to fix every usage problem, we will try to offer a best effort support.

    Side Effects may include:

    Cache page churn - If the System File Cache's working set is too low and there is sustained cached read I/O, the memory manager may not be able to properly age pages.  When forced to remove some pages in order to make room for new cache pages, the memory manager may inadvertently remove the wrong pages.  This could result in cached page churn and decreased disk performance for all applications.

    Version 1.0.0 - Initial Release


    NOTE: The memory management algorithms in Windows 7 and Windows Server 2008 R2 operating systems were updated to address many file caching problems found in previous versions of Windows. There are only certain unique situations when you need to implement the Dynamic Cache service on computers that are running Windows 7 or Windows Server 2008 R2.  For more information on how to determine if you are experiencing this issue and how to resolve it, please see the More Information section of Microsoft Knowledge Base article 976618 - You experience performance issues in applications and services when the system file cache consumes most of the physical RAM.

  • Ntdebugging Blog

    Event ID 157 "Disk # has been surprise removed"


    Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk.  This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.


    What does this error mean?

    The PNP manager does what is called enumerations.  An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices.  The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.


    These enumerations can happen for a number of reasons.  For example, hardware can request an enumeration when it detects a change in configuration.  Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.  


    When an enumeration request is received, the bus driver will rescan the bus for all devices.  It will issue commands to the existing devices as though it was looking for new ones.  If these commands fail on an existing unit, the driver will mark the device as “missing”.  When the device is marked “missing”, it will not be reported back to PNP in the inventory.  When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.


    Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable.  An example of a non-removable disk is a hard drive on a SCSI or IDE bus.  An example of a removable disk is a USB thumb drive.


    Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication.  The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.


    Why does this error happen?

    These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem.  The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running.  An administrator that sees these errors needs to verify the heath of the disk subsystem.


    Event ID 157 Example:


  • Ntdebugging Blog

    NTFS Misreports Free Space?


    I have recently seen a number of issues where customers called in to report a significant difference between the “Size on disk” for the root of a volume, and the reported amount of “Used space” in the volume properties.  While considering this, I noticed that my own C: drive had a mysterious 20GB difference.


    Below is the story of how I found the answer.


    Before we begin, there are two methods used for calculating disk usage…


    Method 1 – Volume Bitmap Analysis

    The % used and %free indication shown below with the pie chart is based on volume bitmap analysis.  The hidden $Bitmap:$Data:”” stream is read from the volume via the FSCTL_GET_VOLUME_BITMAP IO control code, and the ones and zeroes are counted.  A binary “1” represents an allocated cluster, while a “0” indicates a free cluster.  The total number of set bits indicates the total number of allocated clusters on the volume.  This is the fastest and most reliable method of calculating allocated and free space.


    Method 2 – File Scanning with FindFirstFileEx  / FindNextFile

    Navigating to the “C:\” folder, we select all files/folders, then right click, and then select properties.  A dialog will appear as the shell enumerates all files that it can find in the root and subfolders.  The “Size” and “Size on disk” fields will eventually display the total size of all primary data streams for all files that are found by Explorer.  In my case, this method only finds 80.25 GB of 102 GB (leaving a 21.75 GB difference).


    Why such a big difference?


    The short answer is - “If you can’t see it with Explorer, it doesn’t get added to the total size.”


    How to find the hidden disk usage…


    Step 1:


    First, check to see if your files are all actually visible with the current user account and Explorer settings.  Uncheck the “Hide protected operating system files” and select “Show hidden files and folders”.  When this is completed, look at the size on disk again. 

    Does it match the pie chart within a few percent?  If not, continue to Step 2.


    Step 2:

    Run CHKDSK and save the output.  The summary information at the end is what we want to see.

    Based on CHKDSK output, we can calculate the total metadata usage by adding up the following…






    .05 GB

    Space used by 22677 indexes.


    0 GB

    Space used by $Badclus file.


    1.29 GB

    Space used by $MFT.


    .06 GB

    Space used by $Loffile.


    1.4 GB

     Metadata Total


    In this example, metadata usage accounts for only 1.4 GB.  If the metadata number were a high percentage of the usage, then we need to take closer a look at how the volume is being used instead of simply looking for hidden files.  High metadata usage will be the subject for part 2 of this blog.


    We can see from the CHKDSK output that the total space occupied by our user files is 106,095,080 KB (101.18 GB).  This is a large percentage of the total usage, so we should be looking at the user files to see why we can’t see these files with Explorer.


    Step 3:

    Do you have permissions to see all files/folders the volume? 


    a.     If this is a system drive, look through the “\Documents and Settings”, or “\users” folder and see if you can browse all of the profile folders.  If not, you may need contact the owner(s) to check their folder size and see what they have saved in their user folder.  If they are like me, they will have 5 years worth of memory dumps, MP3’s, VHD’s, PDF’s, etc.

    b.    Try “CACLS.EXE x:\ /T”, or just browse through the folders on the drive looking for an “Access Denied” error.  Once this happens, give yourself permissions to that part of the subtree (if permitted by your administrative privileges) or have someone who has access to that folder enumerate the contents for you.  Check the number of files and the size on disk for all files in that folder and add it to the total.

    c.     Continue this process until you have a total of all files in all folders.

    d.    Does the total make sense?  If not, then continue to the next step.


    In the case of my mysterious 20 GB difference, I found an old user profile from a previous installation.  Sure enough, I received an access denied error when browsing to it.  To access the folder, I acquired ownership.  This user was in the habit of collecting memory dumps and virtual machines in a subfolder on his desktop.  I recovered quite a bit of free space by using the delete key.  I rescanned the volume, but to my disappointment, there was still a significant difference.


    Step 4.

    Are there alternate streams?  Try using STREAMS.EXE from (http://technet.microsoft.com/en-us/sysinternals/default.aspx).  Streams will recurse subdirectories and report space usage by alternate named streams.  Pipe this to a text file and analyze it with a text editor or spreadsheet program.


    Step 5.

    Hard links can be problematic when calculating the total usage via the file scan method.  Hard links are not extensively used, but this is worth mentioning.  A hard link is an extra index entry that points to an existing file.  These are created via the CreateHardLink function.  Although the file actually exists in one location, each hard link has its own size field.  Because of this, a single file can be added twice to the running total.  Another thing to know is that hard links are not synchronized, so it is possible that only one link will show the actual file size (see the example below).

    Unfortunately, there are few options available for detecting hard link paradoxes, but it is something to consider when the file scan shows more usage than the bitmap.  Since we have the opposite situation here, hard links are not a significant factor.


    Step 6.

    Is Volume Shadow Copy Service maintaining diff area files for snapshots?  Use VSSADMIN LIST SHADOWSTORAGE to find out.  Add shadow storage to the running total.


    VSSVC pre-allocates space for making volume snapshots.  In order to support this feature, diff area files are kept in the “\System Volume Information” folder.  This pre-allocated space is used to maintain point-in-time information for the “Previous Versions” feature and for the “System Restore” application.  If you are the type of user who prefers to minimize the impact of such features, then you can resize your shadow storage with VSSADMIN so it has less impact on disk usage.  I prefer to leave these features at their default settings (and just make a note of how much disk space it is using).

    Step 7.


    If your numbers still don’t make sense, then it’s time to check for actively hidden files / folders.  There are many different rootkit scanners that can help you identify the presence of actively hidden files & folders.   Try using a scanner like Rootkit Revealer.  If you suspect that your machine has been compromised by a rootkit, refer to http://www.microsoft.com/uk/business/security/security-malware.mspx.


    Seeing that I did not have a significant amount of mysterious usage, I was satisfied that there was no malicious logic hiding my files.  After taking into account all of the factors, there was only a 1.57GB difference remaining (which was accounted for by the contents of a user’s recycle bin).


    Whenever you see a difference between the “Used space”, and “Size on disk”, always ask yourself “What am I not seeing?”



    Best regards,


    Dennis Middleton “The NTFS Doctor”


  • Ntdebugging Blog

    Windows Internals Beta Exam 71-660


    Correction, Windows Internals Beta Exam 71-660

    This was previously listed incorrectly as 70-660, the actual exam ID is now 71-660


    Update 7-18-2008: The link to the exam preparation information has been updated below.


    Update 7-21-2008: When registering, there is no need to use a credit card.  Be sure to instead use the following promotional code listed below when registering for the exam: WINT.  If you have already been charged for this exam, please contact Prometric and request a refund, and reference ticket number 3711938.


    Hello Windows Debugging Community,  We are super excited to announce a new certification.   With the help of industry partners, including some of our very talented community members,  we have developed a new Windows Internals Certification.   This Certification targets developers and IT professionals that require knowledge of Windows Internals as part of their profession.   Chance are if you are reading this blog post you are part of the target audience.  I encourage you to register and please take the Beta exam.  Please help us to improve the exam by providing feedback during the beta exam.



    Register for Beta Exam 71-660: TS: Windows® Internals


    You are invited to take beta exam 70-660: TS: Windows® Internals. If you pass the beta exam, the exam credit will be added to your transcript and you will not need to take the exam in its released form.

    By participating in beta exams, you have the opportunity to provide the Microsoft Certification program with feedback about exam content, which is integral to development of exams in their released version. We depend on the contributions of experienced IT professionals and developers as we continually improve exam content and maintain the value of Microsoft certifications.

    Please remember that participation in the beta process is completely voluntary and Microsoft makes no promises or guarantees regarding the beta exam process. You can expect to receive your score on the beta exam within 12 weeks of taking the exam, although in some instances, beta exams may take longer to score and your results may be delayed.

    70-660: TS: Windows® Internals counts as credit towards the following certification.

    ·          Microsoft Certified Technology Specialist (MCTS): Windows® Internals  


    Registration begins: July 17, 2008

    Beta exam period runs: July 21, 2008August 5, 2008


    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately. Beta exams have limited availability and are operated under a first-come-first-served basis. Once all beta slots are filled, no additional seats will be offered.

    Testing is held at Prometric testing centers worldwide, although this exam may not be available in all countries (see Regional Restrictions).  All testing centers will have the capability to offer this exam in its live version.


    Regional Restrictions: India, Pakistan, China

    Registration Information

    You must register at least 24 hours prior to taking the exam.
    Please use the following promotional code when registering for the exam: WINT
    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately.

    To register in North America, please call:

    Prometric: (800) 755-EXAM (800-755-3926)



    Outside the U.S./Canada, please contact:

    Prometric: http://www.register.prometric.com/ClientInformation.asp



    Test Information and Support

    You are invited to take this beta exam at no charge.
    You will be given four hours to complete the beta exam. Please plan accordingly.
    Find exam preparation information: http://blogs.msdn.com/ntdebugging/pages/exam-preparation-information-for-exam-71-660.aspx


    Frequently Asked Questions

    For Microsoft Certified Professional (MCP) help and information, you may log in to the MCP Web site at http://www.microsoft.com/learning/mcp/or contact your Regional Service Center: http://www.microsoft.com/learning/support/worldsites.asp.

    What is a beta exam?

    Where can I learn more about the registration process?

    Where can I learn more about the beta exam invitation process?

    Where can I learn more about the new structure of Microsoft Certification?

    Who do I contact for help with this beta exam or other MCP questions?


    Thank you.

    Jeff Dailey, Microsoft Global Escalation Services.

  • Ntdebugging Blog

    Easily Resolving an Event Viewer Error using a Process Memory Dump


    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today I will be blogging about how I used the SOS  .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue.  Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.


    Issue Reported:  The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:


                           "MMC could not create the snap-in."

    MMC could not create the snap-in.

    The snap-in might not have been installed correctly

    Name: Event Viewer

                   CLSID: FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}


    First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before.  It may occur when the following registry key gets deleted or corrupted:


                    HKLM\software\Microsoft\MMC\SnapIns\FX:{ b05566ad-fe9c-4363-be05-7a4cbb7cb510}


    I had the customer export this key and found that it was not corrupted in any way. I verified that all data was as expected


    Next,  a memory dump of the mmc.exe process was collected.  The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File" . If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.




    Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL:  http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx



     Now let's have a look at the debug.


    1.          First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.


    It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.



    2.            I noticed the following output from the debugger after it loaded the dump file:


    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)



    3.            Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:


    0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols


    However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL.  Here’s the syntax: 


    0:011> .symfix c:\websymbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols


    4.            To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.


    0:005> !peb

    PEB at 000007fffffdb000


       CommandLine:  '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '


    You could alternatively dump the  CommandLine from the nt!_PEB using the dt command


    0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000


       +0x020 ProcessParameters              :

          +0x070 CommandLine                    : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "


    5.            Next, I then dumped out all of the threads in this process and found the following thread contained a stack that was raising a .Net Framework exception


    0:011> ~* kL

    ... (ommitted the non-relevent threads)


    # 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

    ChildEBP RetAddr 

    0691f03c 7343a91c kernel32!RaiseException+0x58

    0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8

    *** WARNING: Unable to verify checksum for MMCEx.ni.dll

    0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf

    *** WARNING: Unable to verify checksum for mscorlib.ni.dll

    0691f1e8 69926cf6 MMCEx_ni+0xd0b5a

    0691f1f4 6993019f mscorlib_ni+0x216cf6

    0691f208 69926c74 mscorlib_ni+0x22019f

    0691f220 733d1b4c mscorlib_ni+0x216c74

    0691f230 733e21b1 mscorwks!CallDescrWorker+0x33

    0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3

    0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c

    0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f

    0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a

    0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192

    0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a

    0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3

    0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a

    0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e


    6.            Out of curiosity, I also ran the Get Last Error command


    0:011> !gle

    LastErrorValue: (Win32) 0 (0) - The operation completed successfully.

    LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0


    7.            After this, I ran analyze -v to see what helpful information the debugger would provide. The debugger did output exception information but informed me that I needed to use the x86 debugger instead.


    0:011> !analyze -v


                            EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)

                            ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)

                              ExceptionCode: e0434f4d (CLR exception)

                              ExceptionFlags: 00000001

                            NumberParameters: 1

                            Parameter[0]: 80131604


                                                    Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.

                            FAULTING_THREAD:  00000a2c

                                                    PRIMARY_PROBLEM_CLASS:  CLR_EXCEPTION


    8.            I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension.  This extension ships in the Operating System along with the .Net Framework.  On most occasions, I would have initiated the loading of the extension through the use of the following syntax:


     0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll




    0:011> .load  c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll


    However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load.  This is due to the fact that the version of SOS must match the version of the CLR loaded into that process.  Here’s the recommended syntax: 


                            0:011>.loadby sos mscorwks


    I always verify that my extensions are loaded properly by using the  .chain command.


    0:011> .chain

    ...   Extension DLL chain:

                            C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007


    9.            Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:


    0:011> !help


    SOS is a debugger extension DLL designed to aid in the debugging of managed

    programs. Functions are listed by category, then roughly in order of

    importance. Shortcut names for popular functions are listed in parenthesis.

    Type "!help <functionname>" for detailed info on that function.


    Object Inspection                  Examining code and stacks

    -----------------------------      -----------------------------

    DumpObj (do)                       Threads

    DumpArray (da)                     CLRStack

    DumpStackObjects (dso)             IP2MD

    DumpHeap                           U

    DumpVC                             DumpStack

    GCRoot                             EEStack

    ObjSize                            GCInfo

    FinalizeQueue                      EHInfo

    PrintException (pe)                COMState

    TraverseHeap                       BPMD


    10.            Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:


    0:011> !pe 771a42eb

    Invalid object

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !pe -nested 771a42eb

    Invalid object


    Nested exception -------------------------------------------------------------

    Exception object: 040a676c

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more

    StackTrace (generated):

        SP       IP       Function


    StackTraceString: <none>

    HResult: 80131604


    0:011> !PrintException 040a6a20

    Exception object: 040a6a20

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131604

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !PrintException 040a6cf8

    Exception object: 040a6cf8

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Configuration system failed to initialize

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details


    0:011> !PrintException 040a7174

    Exception object: 040a7174

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Unrecognized configuration section system.web/myInvalidData  

    InnerException: <none>

    StackTrace (generated):


    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details


    11.            Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData  inside of it. I then re-ran !analyze -v against the dump again (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically. It then automatically displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.


                0:011> !analyze -v


                            EXCEPTION_MESSAGE:  Unrecognized configuration section system.web/myInvalidData.

                                                    MANAGED_OBJECT_NAME:  System.Configuration.ConfigurationErrorsException

                                                    FAULTING_THREAD:  00000a2c


    0:011> ~

       0  Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen

       1  Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen

       2  Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen

       3  Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen

       4  Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen

       5  Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen

       6  Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen

       7  Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen

       8  Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen

       9  Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen

      10  Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen

    . 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen


    12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for     


                "unrecognized configuration section" system.web site:microsoft.com


    This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE


    By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications.  This is due to invalid entries in the various .config  files used in .Net. Looking through the posts, different configuration file names and paths were observed.


    So, I opened up the process monitor logfile to see which configuration files we were reading data from. I added filter criterion to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config,  and a successful status result. It's best to be as specific as possible.



    I found that we were reading in a large amount of settings over and over again from the .net Framework global configuration file:             


                    (on x64 this would be     C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)


    Final Step- Putting it all together, Reproducing the issue, & confirming resolution : Using notepad, a quick search of the suspect xml file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid section which may have been related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file. Doing so, I successfully reproduced the issue on my system. I then contacted the customer and asked if they had by any chance added any settings under the <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.


    The customer informed me that, per the request of their ASP.net developer, they had in fact added settings to that section of the file.  By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings that were present in this file should not have been present inside of <system.web> . The settings were moved to the proper location per the developer and the issue was resolved.


    Here are the steps I used to reproduce the issue in case you are attempting to replicate this at home-


    A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:

         (please make a backup copy first in case you make a mistake)  


         OR   (Open the version that matches the architecture of your platform )




    B. Find the section <system.web> in this file (you can use the find function in notepad):


    C.  Add the following line directly after <system.web> as shown in the example below:



    D. Save the file and then open eventvwr.msc and verify that the following error is displayed:



    Hopefully this blog has demonstrated an example of how you can use the "create dump file" feature of Windows 2008, windbg, and other related tools in an attempt to gain more specific data when your error message is not revealing the source of the problem. Feel free to post any questions, comments, or concerns.


  • Ntdebugging Blog

    Windows Hotfixes and Updates - How do they work?


    Today I would like to talk about some of the work the Windows Serviceability (WinSE) team does regarding servicing Windows and releasing updates.

    The operating system is divided into multiple components. Each component can consist of one or more files, registry keys, configuration settings, etc.  WinSE releases updates based on components rather than the entire operating system. This reduces a lot of overhead with having to install updates to components that have not changed. Depending on the severity and applicability of the problem, there are different kinds of release mechanisms. Keep in mind, though, the actual fix still remains the same.

    1.       Updates and Security Updates

    These Updates are typically available on Windows Update. They frequently contain security fixes, and from time to time also contain reliability rollup packages. These updates are thoroughly tested and Microsoft highly recommends that you update your computer with these releases. In fact, most are automatically downloaded to your machine if you have Windows Update turned on. In most cases, Update releases are also available as standalone downloads from the download center.


    2.       Hotfixes

    When an individual customer reports a bug to Microsoft for a specific scenario, the WinSE team releases Hotfixes to address these problems. Hotfixes are not meant to be widely distributed and go through a limited amount of testing due to the customer's need for an urgent fix.  Hotfixes are developed in a separate environment than the regular Updates.  This allows Microsoft to release Updates that do not include the Hotfix files, thereby minimizing risk for the customer.

    Once the Hotfix is ready and packaged by WinSE, a KB article is written describing the problem, with instructions on how to obtain the Hotfix.  Microsoft recommends that only customers experiencing the particular problem install the Hotfix for that problem.

    Note: Hotfixes are also sometimes referred to as LDRs, or QFE's (Quick Fix Engineering). The term QFE is an old term that is mostly no longer used in reference to current versions of Windows.


    3.       SP  - Service Pack

    The service pack is a major update in the life of an OS. It contains a wide variety of fixes as well as all the GDR and LDR fixes that were released since the previous service pack was shipped. This is a thoroughly tested release and highly recommended by Microsoft for installation. This is usually available as a standalone release, and is then released through Windows Update as well.



    GDR vs. LDR branches

    Now that we have described the different kinds of updates, let's take a deeper look into how these fixes are built. When a new OS or service pack is released, 2 branches are created from the release code tree -a GDR (general distribution release) branch and a LDR (limited distribution release) branch. Hotfixes are built solely from the LDR branch, while Updates for broad release are built from the GDR branch.

    Service Packs are built from a third branch that contains all Updates , Hotfixes and additional fixes.  This way the new service pack is shipped with all the fixes from both branches.

    Note – Once the new service pack is shipped, the code branches from the previous release are still active and serviced as necessary.

    Installing a Hotfix

    By default, all components on Windows systems start on the GDR branch following each major release. When you install updates from Windows Update for a GDR component, it gets upgraded with the GDR version.

    When you install a specific Hotfix, the files and components in the Hotfix package are migrated to the LDR branch. At this point, that particular component is marked as a LDR component. If you install a newer Update over this component, the Windows servicing technology will automatically install the appropriate latest version from the LDR branch for you. This is possible because each Update package ships with both the GDR and LDR versions of the component.

    Once a component is marked as a LDR component, the only way to move back to the GDR branch is to uninstall all Hotfixes for that component, or move to the next available service pack.


    What would happen if a user installed a Hotfix, and then sometime later installed the next service pack? Well, in that case it depends on the Hotfix and when it was built.

    1.       If the Hotfix was built before the service pack, then the component will be moved to the GDR version contained in the service pack.

    2.       If the Hotfix was built after the service pack, the component will be migrated to the post-service pack version of the component, and will stay on the same branch that it was originally on.


    In order to make this work, these packages contain both the RTM GDR version, the RTM Hotfix branch, and the SP1 Hotfix and GDR version of each binary.


    All fixes built for Windows are cumulative in nature by branch, i.e. a new update will contain the new fix, as well as all the previous fixes for that branch. Referencing the chart above, installing fix #4 can get you fixes #2 and #4 on the GDR branch. If the component is on the LDR branch, then the user would get fixes #1-4.


    Finally, the servicing technology has to handle the case where you need the functionality of an older Hotfix (e.g. “Fix #1” in the diagram above) but you may already have installed “Fix #4” which might be a critical security update.  What happens is that when the GDR branch of a fix is installed, it also places a copy of the Hotfix version of the same fix on the system.  When you run the installer for Hotfix #1, it detects that a newer version of the file is already installed, but it also detects that it needs to migrate it to the Hotfix version of the binary that was previously stored on the system. The result is that you end up with the Hotfix binary for Fix #4, which has both the Hotfix you need plus the cumulative set of security fixes.


    Stay tuned for more, in the next blog entry, I will talk about the staging mechanism that Windows uses to install Updates and Hotfixes as well as the uninstall process. Also, I will talk about how to determine the branch a file is built from.


    - Omer 


    More Information

    Description of the standard terminology that is used to describe Microsoft software updates

    Description of the contents of Windows XP Service Pack 2 and Windows Server 2003 software update packages


  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000004: This didn’t puzzle the Debug Ninja, how about you?


    Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far.  We invited the Debug Ninja to come up with a real challenge for this week.

    This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL.  The challenge this week is to tell us, as specifically as possible, why it crashed.  We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.

    1: kd> .bugcheck

    Bugcheck code 000000D1

    Arguments e074281d 00000002 00000001 ba502493


    1: kd> kv

    ChildEBP RetAddr  Args to Child

    f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)

    f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)


    1: kd> .trap f78b6544

    ErrCode = 00000002

    eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

    eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246


    ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??


    The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzlerPost your responses to this page (and not the complete debug page), so we can keep all the comments together.  Also, we won’t post any of your responses until Friday.

    [Update, posted 4/30/2008]

    Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem.  When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available.  In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.

    Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler.  Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge.  Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.


    1: kd> dd @ebx+5441F815

    e074281d  ???????? ???????? ???????? ????????

    e074282d  ???????? ???????? ???????? ????????

    e074283d  ???????? ???????? ???????? ????????

    e074284d  ???????? ???????? ???????? ????????

    e074285d  ???????? ???????? ???????? ????????

    e074286d  ???????? ???????? ???????? ????????

    e074287d  ???????? ???????? ???????? ????????

    e074288d  ???????? ???????? ???????? ????????

    1: kd> !pte e074281d

                   VA e074281d

    PDE at 00000000C0603818    PTE at 00000000C0703A10

    contains 000000021B980963  contains E154FC1000000400

    pfn 21b980 -G-DA--KWEV                           not valid

                           Proto: 00000000E154FC10


    1: kd> ub ba50d9d8


    ba50d9c4 51              push    ecx

    ba50d9c5 50              push    eax

    ba50d9c6 53              push    ebx

    ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

    ba50d9ca ff75f8          push    dword ptr [ebp-8]

    ba50d9cd ff75ec          push    dword ptr [ebp-14h]

    ba50d9d0 ff7508          push    dword ptr [ebp+8]

    ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)


    1: kd> uf tcpip!DeliverToUser



    ba50ded9 8b45f0          mov     eax,dword ptr [ebp-10h]

    ba50dedc 29451c          sub     dword ptr [ebp+1Ch],eax

    ba50dedf 8b471c          mov     eax,dword ptr [edi+1Ch]

    ba50dee2 ff7508          push    dword ptr [ebp+8]

    ba50dee5 2b4510          sub     eax,dword ptr [ebp+10h]

    ba50dee8 57              push    edi

    ba50dee9 8945fc          mov     dword ptr [ebp-4],eax

    ba50deec e817ffffff      call    tcpip!UpdateIPSecRcvBuf (ba50de08)



    ba50def1 807e0906        cmp     byte ptr [esi+9],6

    ba50def5 753e            jne     tcpip!DeliverToUser+0x12f (ba50df35)



    ba50def7 8b430c          mov     eax,dword ptr [ebx+0Ch]

    ba50defa 83b88c00000007  cmp     dword ptr [eax+8Ch],7

    ba50df01 0f857a510000    jne     tcpip!DeliverToUser+0x124 (ba513081)



    ba50df07 837f1000        cmp     dword ptr [edi+10h],0

    ba50df0b 741a            je      tcpip!DeliverToUser+0x114 (ba50df27)



    ba50df0d 640fb61551000000 movzx   edx,byte ptr fs:[51h]

    ba50df15 8b8880010000    mov     ecx,dword ptr [eax+180h]

    ba50df1b 3bca            cmp     ecx,edx

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)



    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    ba50249a 0fbe12          movsx   edx,byte ptr [edx]

    ba50249d 3bca            cmp     ecx,edx

    ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)



    ba50df23 c645f402        mov     byte ptr [ebp-0Ch],2



    ba50df27 640fb60d51000000 movzx   ecx,byte ptr fs:[51h]

    ba50df2f 898880010000    mov     dword ptr [eax+180h],ecx



    ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)



    [Update: our answer, posted 5/2/2008]

    In response:  I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response.  I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected.  If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.

    Here is the answer…

    Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take.  An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.

    1: kd> .trap f78b6544

    ErrCode = 00000002

    eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

    eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246


    ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??


    Why is this instruction telling us to go to a crazy offset?  One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys.  Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.

    Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType.  Why is our current instruction in GetAddrType instead of DeliverToUser?  It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.

    1: kd> ub ba50d9d8


    ba50d9c4 51              push    ecx

    ba50d9c5 50              push    eax

    ba50d9c6 53              push    ebx

    ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

    ba50d9ca ff75f8          push    dword ptr [ebp-8]

    ba50d9cd ff75ec          push    dword ptr [ebp-14h]

    ba50d9d0 ff7508          push    dword ptr [ebp+8]

    ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)


    The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser.  What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)


    Why would eip point to ba50d9d3 instead of ba50d9d4?  That is the challenging question in this exercise, because the answer is not obvious.  Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault.  However, we don’t really have any proof at this time that the hardware has actually done something wrong.  Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value.  We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.

    To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d.  I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction.  On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset.  The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too.  This is our opcode:

    0f 85 ff ff 45 71

    Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction).  If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.

    1: kd> ?ba50df23+ffff4571

    Evaluate expression: -1169152876 = ba502494


    If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead.  Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.

    1: kd> ?ba50df23+ffff4570

    Evaluate expression: -1169152877 = ba502493


    We still have some unanswered questions.  The most common question I get when I’ve found a bit flip is what piece of hardware is broken?  Unfortunately I can’t answer this question with the data I have.  In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed.  Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened.  It is unlikely that hardware diagnostics will find this type of failure because it is intermittent.  Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem.  In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.

    Bit flips can be caused by any number of problems.  Problems with a processor, a processor riser board, a motherboard, or RAM are common causes.  As we have documented before in this blog improper voltage or overheating can also cause these types of problems.  However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.

    I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser.  This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths.  This means that blocks of the GetAddrType code are located next to the DeliverToUser code.  As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.

    1: kd> ub ba502494


    ba502474 f6401401        test    byte ptr [eax+14h],1

    ba502478 0f84bd760000    je      tcpip!GetAddrType+0xb4 (ba509b3b)

    ba50247e 85db            test    ebx,ebx

    ba502480 0f84bd6b0200    je      tcpip!GetAddrType+0xa8 (ba529043)

    ba502486 32c0            xor     al,al

    ba502488 e967770000      jmp     tcpip!GetAddrType+0x15a (ba509bf4)

    ba50248d 33c9            xor     ecx,ecx

    ba50248f e993770000      jmp     tcpip!GetAddrType+0x19b (ba509c27)

    1: kd> u


    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    ba50249a 0fbe12          movsx   edx,byte ptr [edx]

    ba50249d 3bca            cmp     ecx,edx

    ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

    ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

    ba5024aa 90              nop

    ba5024ab 90              nop

    ba5024ac 90              nop


    Congratulations to those who have read this far.  Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem.  Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.

Page 2 of 24 (240 items) 12345»