• Ntdebugging Blog

    The Digital DNA of Bugs, Dump Analysis as Forensic Science


    Written by Jeff Dailey:

    As a debugger, have you ever reflected on the interesting parallels between your job and work being done in other industries?  When I think about solving complex computer problems, I think of it as forensics.  The core diagnostic or troubleshooting skill can be applied to anything. I bet if you were to walk around and talk to the people in our group, you would find a lot of guys who watch programs like CSI, House, Law and Order, etc.  Heck, there is even a guy just a few cubicles away with a House poster on his wall.  Years ago I used to watch Quincy and other detective shows with the same level of fascination. I started thinking: cops, coroners, doctors, and engineers all do the same type of work. In fact, at a recent conference we had a forensics expert come in and speak to us. It was really fascinating. Even though the presenter worked in a completely different industry, most of us found ourselves thinking "Hey, I could do that job!", because in essence, that's what we do already.

    I started having some fun thinking about the parallels between detective shows and our work. 

    Cut to scene:

    It's 2:00AM. The camera zooms in on a pager going off on a night stand…  It can only mean one thing.  Something bad has happened and people are reaching out for help.  The detective wakes up and tells the wife, "Sorry, they need me… I've got to go." 

    Funny,  I've done the same thing, only because someone found a dead server.

    The detective shows up at the scene of the crime.  All the officers on-site are baffled, so they just keep things roped off until the expert gets there.  His years of experience and unique insight will allow him to see things others don't.

    Hmm… This seems familiar only I typically use Live Meeting or Easy Assist...

    Using a combination of specialized tools and methods, learned both in school and handed down from others over time, evidence is gathered at the scene so that additional research can be done back at the office.  Witnesses are questioned: "So about what time did this happen?", "Did you hear any unusual noises?", "Did you see anyone or anything unusual?"  Pictures are taken; objects, fibers, and DNA samples are collected.

    Ok so the scope of the problem is determined and all available data is collected.  Hmm, I do this every day.

    The Mayor calls the office to tell the chief of detectives that we must have this case solved.   It can't happen again.  We must catch the Villain!

    Feel free to substitute Mayor with any high-level management figure.  Wow, this is either a nasty bad guy or someone's driver is causing pool corruption that is crashing a critical server!

    We now cut to a montage where the detective is in the lab, using luminol, searching for DNA evidence, reflecting on the core facts of the case, researching past crimes.

    I don't know about you, but I simply refer to this as the debugging process.

    Finally a breakthrough, DNA collected at the scene of the crime identifies a suspect that should not have been at the scene.  In doing additional research the suspect has a history of this type of activity.  The bad guy is hauled in, charges are filed, and the case is solved!

    This would equate to finding root cause, filing a bug, and getting a fix out the door!

    Ultimately that's what we do.  We are all detectives looking for the digital DNA of bugs in the wild affecting our customers.  We hunt them down using tools, expertise, and experience.

    When it comes to collecting critical forensic information and looking for that Digital DNA of a bug it often comes down to getting a dump of the process or system.

    GES (Global Escalation Services, formerly known as CPR) Escalation Engineers have probably looked at more dumps than the average person passes telephone poles in a lifetime.  Don't get me wrong, we do a lot of live debugs also; however, dumps are the staple item in our debugging diet.

    To begin, let's go over why we typically ask for a dump.  Customers often think it's drastic to bring down an entire server via a "CRASH DUMP".  Is it worth it?  The answer is ABSOLUTELY!!!

    About Dumps


    Full User-Mode dump

    This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.



    Figure 1: Scope of a Full User-Mode dump


    A Kernel Summary Dump

    This dump contains all the memory in use by the kernel at the time of the crash.
    This kind of dump file is significantly smaller than the Complete Memory Dump.


    Figure 2: Scope of a Kernel Summary dump


    A Full/Complete Memory Dump

    This is the largest dump file. This file contains all the physical memory for the machine at the time of the fault.


    Figure 3: Scope of a Full/Complete Memory dump


    When you open a dump with WinDbg, you can use the || (pipe pipe) command to determine the type of dump that you are analyzing.


    So the server is hung, and we request a dump. The dump can be created using an NMI card, Crash.exe, or the keyboard crash feature (hold the right CTRL key and press SCROLL LOCK twice, once the CrashOnCtrlScroll registry value is enabled).

    If the problem does not directly involve any user mode processes, or we suspect a driver is at fault, we may simply ask for a Kernel Summary dump.

    If the problem involves user mode (application) code along with actions taking place in the kernel, or there are multiple processes on the same machine making cross-process calls, we will need the full user and kernel mode address ranges from the machine (Full Memory dump).  That way we can debug into the various user mode parts of the code that were executing at the time the crash dump was captured.

    Now that you have crashed your server and collected a dump for us to look at, you may ask what we can tell from the full / kernel dump.  Well, the following is just a small fraction:

    • The state of every thread call stack in every process on the machine.
    • The state of every lock on the machine.
    • The destination for RPC, DCOM, and LPC calls.
    • The CPU usage of every thread in every process: how much kernel time vs. user time, and how long since it last executed.
    • How long has a thread or process been alive?
    • The version, date, size, and checksum for binaries loaded in memory (not paged out) at the time of the dump.
    • What process spawned what process or thread?
    • How much kernel and user mode memory is being used by any given process?
    • What programs are being run in what session?
    • What connections are currently made to what remote resources, things like IP Address and ports?
    • What files are open on the machine, locally and remotely?
    • How long the server has been up.
    • We can reconstruct bitmaps from GDI surfaces.
    • We can dump running scripts.
    • How long any I/O has been outstanding to any device.
    • If any I/O or device has had an I/O error.
    • What the last error was on any thread on the machine.
    • We can figure out who is corrupting memory, if memory corruption detection technologies are employed before the dump is taken.
    • We can at times figure out what driver or process is leaking memory.
    • What handles are being leaked.
    • In most cases we can match the line of source executing in every stack frame of every call to the Windows source code and determine why each discrete call in the call stack was made.
    • We can dump parts of the Windows Event log.
    • We can dump out all of the SMB connection info and tell what types of activity are happening in the Server service.
    • We can inspect the state of the system cache.
    • We can find out what filter drivers are in the I/O chain.

    What about LiveKd?

    Sometimes we need to collect a kernel dump from a server and we are just not able to actually crash the server.  Either the customer does not want us to, or their business will not allow it.  In this case we can use the Sysinternals tool LiveKd.  LiveKd works with our kernel debugger by installing a device driver and extracting kernel memory to user mode so that the debugger can open a rolling snapshot of the kernel.  You can then write this dump file to disk by doing a .dump /f C:\livekd.dmp.  While this will not crash or halt the server, it does give us some kernel information.

    Words of Caution!

    LiveKd does not provide an atomic snapshot of the server, because while the driver is collecting memory from the kernel the operating system is still running. It can take several minutes to get a LiveKd dump, and during this time LiveKd may be reading structures such as lists, arrays, and threads that are changing while the snapshot is collected.  This being the case, the timing data within the dump is not valid.  For example, you are not able to see if one thread has been waiting longer than another based on its idle ticks, because the items were not collected simultaneously.  Also, linked lists and pool may appear corrupt or inconsistent in the dump.  This may make some debugger extensions loop endlessly or even crash. However, you can use this output to get a more general idea of what is going on with respect to relatively static values, such as handle counts in handle tables.  You can often dump out handles, look at how many threads are in various processes, and see the types of things they were doing during the window in which the dump was collected.  It's just very difficult to draw definitive conclusions from this type of dump.  However, in those cases where we have no other option, using LiveKd to get a dump can provide valuable information.

    About User Dumps

    So what about "user dumps"?

    Why, when and how?  We ask for user dumps if we know the problem is limited to just one process having trouble.  User dumps are usually associated with high CPU, memory corruption, memory consumption, or a hung process.  A user dump provides memory from only a single process, not all processes (see Figure 1 above), so we can only debug the process you are targeting.

    High CPU: "Three in a Row and we're Good to Go"

    When a process is consuming a lot of CPU we will typically ask for 3-4 dumps, usually taken 10-15 seconds apart.  We recommend using userdump.exe, adplus -hang -p <process id>, or attaching windbg.exe to the process and doing a .dump /ma C:\dump1.dmp (then dump2, dump3, and so on).  The /ma switch collects some extra information from the kernel and stores it in the user dump.  Without this extra data we would not be able to get things like thread execution times and handle information.  The thread execution times allow the !runaway and .ttime commands to work on the dump.  Without this data we could not tell which thread is consuming CPU.

    We can then open the successive dumps and check the various states of the threads in the process over time.  If one thread is constantly changing in each dump and consuming more and more CPU this is the thread that is typically at fault.  We then examine the reason each call is being made in that thread's context.  We also check for things like the last error that occurred on that thread using !gle (Get Last Error).
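The cross-dump comparison boils down to simple arithmetic. The sketch below is illustrative only, not how !runaway is implemented, and `thread_time` and `busiest_thread` are hypothetical names: given per-thread CPU times captured in two successive dumps, the thread with the largest delta is the prime suspect.

```c
#include <stddef.h>

/* Illustrative sketch (not the real !runaway implementation): per-thread
   CPU time as it might be read out of two successive dumps. */
typedef struct {
    unsigned id;            /* thread id */
    unsigned long long cpu; /* user + kernel time, in arbitrary ticks */
} thread_time;

/* Assumes the same threads appear at the same indices in both snapshots.
   Returns the id of the thread whose CPU time grew the most between the
   two dumps, or 0 if no thread accumulated any time. */
unsigned busiest_thread(const thread_time *first,
                        const thread_time *second, size_t n)
{
    unsigned best_id = 0;
    unsigned long long best_delta = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long long delta = second[i].cpu - first[i].cpu;
        if (delta > best_delta) {
            best_delta = delta;
            best_id = second[i].id;
        }
    }
    return best_id;
}
```

This is why three or four dumps beat one: a single dump shows who has used CPU historically, while deltas across dumps show who is burning CPU right now.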

    When it's Hung we Just Need One

    When a process is hung, it's typically due to a deadlock, or an application making a blocking call in its window procedure that prevents repaint events from happening.  In this type of case we need to get a dump once the application has become unresponsive.  You can use WinDbg and dump the process via .dump /ma, use userdump.exe, or use adplus -hang.  Once we get this user dump we will typically look for a thread waiting on a critical section, event, semaphore, or system call.  You can pretty quickly get an idea of what is going on by doing a ~*kv and looking at the various thread states.  If you see critical sections being waited on, you can run !locks and it will tell you what the lock dependency is.
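Conceptually, what a lock-dependency walk surfaces is a cycle in the wait-for graph. The sketch below is a simplified illustration, not what the !locks extension actually does internally; `has_wait_cycle` is a hypothetical helper where `wait_for[i]` is the thread that thread i is blocked on.

```c
/* Simplified illustration of deadlock detection (not the real !locks
   logic): wait_for[i] is the index of the thread that thread i is
   blocked on, or -1 if thread i is not waiting.  Entries must be -1 or
   valid indices less than n.  Returns 1 if any chain of waits loops
   back on itself, i.e. a deadlock. */
int has_wait_cycle(const int *wait_for, int n)
{
    for (int start = 0; start < n; start++) {
        int cur = start;
        /* follow at most n+1 hops; a longer chain must contain a cycle */
        for (int steps = 0; steps <= n && cur != -1; steps++) {
            cur = wait_for[cur];
            if (cur == start)
                return 1;   /* the chain came back to where it began */
        }
    }
    return 0;
}
```

A chain with no cycle (thread A waits on B, B waits on a running thread) is just slow; a chain that loops never wakes up, which is exactly the pattern !locks helps you spot.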

    Adplus –crash and You'll Save Some Cash

    With a process that is crashing, whether due to memory corruption, divide by zero, access violations, or any number of potential unhandled exceptions, we need to have either a debugger attached ahead of time or adplus -crash monitoring the process in question.  If we get a dump of a process before it has crashed, or after it has crashed and restarted, it will not show us what we need to know.  We need to be watching the process ahead of time so we can capture its state at the time of the event.  Adplus -crash will typically catch a second chance exception.  These are exceptions that occur in the application code and are not handled by the application's exception handler.  In this case it falls through to the operating system, which typically tears down the process in question and/or invokes JIT, Just-in-Time debugging (search for the AEDebug registry key for details).  If you are using windbg.exe and have attached to the process, a "crash" or unhandled exception should break into the debugger.  At this point you can do a .dump /ma C:\mydumpfile.dmp

    What about intermittent problems?

    An intermittent problem can be one of the most difficult to isolate.  Often a condition may only last a few seconds, yet it can critically affect server or application operation.  In this type of scenario we have to get inventive, literally inventing new tools and methods of catching the bug or problem in the act.  A good example is the sample I posted earlier for catching a hung window.  That sample shows how to monitor something, in this case Windows message pump responsiveness, and take action if the response time falls outside your specified parameters. See "Detecting and automatically dumping hung GUI based windows applications."
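The monitoring idea generalizes beyond message pumps. Here is a minimal sketch of the watchdog pattern, with a hypothetical helper name (this is not the code from the posted sample): the monitored component records a heartbeat whenever it proves responsive, and a monitor triggers dump collection the moment the heartbeat goes stale.

```c
/* Hypothetical watchdog sketch (not the posted sample's code): the
   monitored component updates its heartbeat timestamp each time it
   proves responsive; a monitor thread calls this periodically and
   kicks off dump capture when responsiveness falls outside the
   allowed threshold. */
int should_take_dump(unsigned long long now_ms,
                     unsigned long long last_heartbeat_ms,
                     unsigned long long threshold_ms)
{
    /* hung if no heartbeat has arrived within the threshold */
    return now_ms - last_heartbeat_ms > threshold_ms;
}
```

The point is to capture the dump while the intermittent condition is still in progress, rather than minutes later when the evidence is gone.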

    Good luck and happy debugging!

    PS: Doesn't CSI stand for Crashed Server Investigation?

  • Ntdebugging Blog

    Debug puzzler 0x00000002 “Attack of the crazy stack”


    Hi NTDebuggers, I have another puzzler for you.  We started crash2.exe under windbg and it crashed.  Go figure!  Sometimes we have a very limited amount of data available to figure out what went wrong.  That being said, this week’s puzzler only gives you a few clues.    Given this week’s debugger output, what do you suspect the problem is?  What would you do to further isolate the issue or prove your theory?


    If there is more data you need to solve it, post a comment / request and I will provide the data for you.  We will post all comments during the week and provide our answer on Friday.  We look forward to your comments.


    CommandLine: crash2.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76180000 76290000   C:\Windows\syswow64\kernel32.dll

    (15d0.1688): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=cd7b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    779c0004 cc              int     3


    0:000> g


    (15d0.1688): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=7619140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> k 123


    ChildEBP RetAddr 

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017fefc 66692065 0x65732074

    0017ff3c 0041b9a3 0x66692065

    *** WARNING: Unable to verify checksum for crash2.exe

    0017ffa0 762019f1 crash2!_onexit+0x35

    0017ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> lm

    start    end        module name

    00400000 00438000   crash2   C (private pdb symbols)  D:\CPRRAMP\source\crash2\debug\crash2.pdb

    76180000 76290000   kernel32   (private pdb symbols)  C:\symbols\wkernel32.pdb\20F7BB5ED22344A2910B27CA7252AE792\wkernel32.pdb

    779b0000 77b00000   ntdll      (private pdb symbols)  C:\symbols\wntdll.pdb\7099E4B6A6984FD08CBC90A4EDD40FD12\wntdll.pdb


    0:000> db 66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0x65732074

    65732074  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732084  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732094  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320a4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320b4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320c4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320d4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320e4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0041b9a3

    0041b9a3  c3 e8 42 57 ff ff c3 cc-cc cc cc cc cc cc cc cc  ..BW............

    0041b9b3  cc cc cc cc cc cc ff 74-24 04 e8 42 5c ff ff f7  .......t$..B\...

    0041b9c3  d8 1b c0 f7 d8 59 48 c3-cc cc cc cc 8b 44 24 04  .....YH......D$.

    0041b9d3  a3 78 3c 43 00 a3 7c 3c-43 00 a3 80 3c 43 00 a3  .x<C..|<C...<C..

    0041b9e3  84 3c 43 00 c3 cc cc cc-cc cc cc 6a 10 68 58 16  .<C........j.hX.

    0041b9f3  43 00 e8 cc 5d ff ff 33-ff 57 e8 0f 59 ff ff 59  C...]..3.W..Y..Y

    0041ba03  89 7d fc 39 7d 08 75 1c-be 78 3c 43 00 ff 35 78  .}.9}.u..x<C..5x

    0041ba13  3c 43 00 e8 19 5e ff ff-89 45 e4 c7 45 e0 02 00  <C...^...E..E...

    0:000> db 0x66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????


    0:000> dd esp


    0017ff00  66692065 72756f20 61747320 69206b63

    0017ff10  616c2073 72656772 6f6e6520 54686775

    0017ff20  20736968 6d207369 2065726f 61746164

    0017ff30  726f6620 00737520 6f76e0ae fffffffe

    0017ff40  0041b9a3 0041b9c2 00411181 004146d9

    0017ff50  0017ffa0 00412ac4 00000001 002620a0

    0017ff60  002620d8 6f220936 00000000 00000000

    0017ff70  7efde000 0017ff9c 00000000 00000006


    0:000> dds esp


    0017ff00  66692065

    0017ff04  72756f20

    0017ff08  61747320

    0017ff0c  69206b63

    0017ff10  616c2073

    0017ff14  72656772

    0017ff18  6f6e6520

    0017ff1c  54686775

    0017ff20  20736968

    0017ff24  6d207369

    0017ff28  2065726f

    0017ff2c  61746164

    0017ff30  726f6620

    0017ff34  00737520

    0017ff38  6f76e0ae

    0017ff3c  fffffffe

    0017ff40  0041b9a3 crash2!_onexit+0x35

    0017ff44  0041b9c2 crash2!atexit+0x9

    0017ff48  00411181 crash2!ILT+380(__RTC_Terminate)

    0017ff4c  004146d9 crash2!_cinit+0x49

    0017ff50  0017ffa0

    0017ff54  00412ac4 crash2!__tmainCRTStartup+0x15e

    0017ff58  00000001

    0017ff5c  002620a0

    0017ff60  002620d8

    0017ff64  6f220936

    0017ff68  00000000

    0017ff6c  00000000

    0017ff70  7efde000

    0017ff74  0017ff9c

    0017ff78  00000000

    0017ff7c  00000006



    Good luck and happy debugging!



    [Update: more debugger output, per reader request. Posted 4/17/2008]

    Note this crashes without windbg also.


    CommandLine: crash2.exe

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 777b0000 77900000   ntdll.dll

    ModLoad: 75a10000 75b20000   C:\Windows\syswow64\kernel32.dll

    (1324.16d0): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=73fe0000 edx=00000000 esi=fffffffe edi=77890094

    eip=777c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246


    777c0004 cc              int     3

    0:000> g

    (1324.16d0): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=75a2140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> db esp

    0017ff00  65 20 69 66 20 6f 75 72-20 73 74 61 63 6b 20 69  e if our stack i

    0017ff10  73 20 6c 61 72 67 65 72-20 65 6e 6f 75 67 68 54  s larger enoughT

    0017ff20  68 69 73 20 69 73 20 6d-6f 72 65 20 64 61 74 61  his is more data

    0017ff30  20 66 6f 72 20 75 73 00-9b 7d 6b d3 fe ff ff ff   for us..}k.....

    0017ff40  a3 b9 41 00 c2 b9 41 00-81 11 41 00 d9 46 41 00  ..A...A...A..FA.

    0017ff50  a0 ff 17 00 c4 2a 41 00-01 00 00 00 98 21 03 00  .....*A......!..

    0017ff60  d0 21 03 00 03 94 3f d3-00 00 00 00 00 00 00 00  .!....?.........

    0017ff70  00 e0 fd 7e 9c ff 17 00-00 00 00 00 06 00 00 00  ...~............

    0:000> ln 0042ecc8

    (0042ecac)   crash2!`string'+0x1c   |  (0042eccc)   crash2!`string'

    0:000> .formats 0017ff044

    Evaluate expression:

      Hex:     017ff044

      Decimal: 25161796

      Octal:   00137770104

      Binary:  00000001 01111111 11110000 01000100

      Chars:   ..D

      Time:    Mon Oct 19 01:23:16 1970

      Float:   low 4.70085e-038 high 0

      Double:  1.24316e-316

    0:000> dd 0017ff044

    017ff044  ???????? ???????? ???????? ????????

    017ff054  ???????? ???????? ???????? ????????

    017ff064  ???????? ???????? ???????? ????????

    017ff074  ???????? ???????? ???????? ????????

    017ff084  ???????? ???????? ???????? ????????

    017ff094  ???????? ???????? ???????? ????????

    017ff0a4  ???????? ???????? ???????? ????????

    017ff0b4  ???????? ???????? ???????? ????????

    0:000> dds crash2!__onexitbegin

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000

    004343d0  00000000

    0:000> dds crash2!__onexitend

    00434350  59c94d56

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000


    [Update: our answer. Posted 4/18/2008]

    This week we had a lot of people who realized this was a buffer overrun.   You guys are so good I’m going to make next week’s puzzler a little harder!

    Good work all!   They are not all listed but some of the responses I liked the best were:





    Tal Rosen


    This week’s official response



    Let’s start off with wmain.  You can see here that we push a pointer to a string onto the stack and call fun1 at 004117a3.


    0:000> uf crash2!wmain

    crash2!wmain [d:\cprramp\source\crash2\crash2\crash2.cpp @ 11]:

       11 00412520 55              push    ebp

       11 00412521 8bec            mov     ebp,esp

       11 00412523 83ec40          sub     esp,40h

       11 00412526 53              push    ebx

       11 00412527 56              push    esi

       11 00412528 57              push    edi

       12 00412529 686cec4200      push    offset crash2!`string' (0042ec6c) << Pushing param onto the stack

       12 0041252e e870f2ffff      call    crash2!ILT+1950(?fun1@@YAXPAD@Z) (004117a3)  << Making call to fun1

       12 00412533 83c404          add     esp,4

       13 00412536 33c0            xor     eax,eax

       14 00412538 5f              pop     edi

       14 00412539 5e              pop     esi

       14 0041253a 5b              pop     ebx

       14 0041253b 8be5            mov     esp,ebp

       14 0041253d 5d              pop     ebp

       14 0041253e c3              ret


    Let’s dump out the value we are passing.


    0:000> da 0042ec6c

    0042ec6c  "This is a test ot see if our sta"

    0042ec8c  "ck is larger enough"


    0:000> uf 004117a3

    crash2!fun1 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 17]:

       17 00412550 55              push    ebp

       17 00412551 8bec            mov     ebp,esp

       17 00412553 83ec4c          sub     esp,4Ch

       17 00412556 53              push    ebx

       17 00412557 56              push    esi

       17 00412558 57              push    edi

       19 00412559 8b4508          mov     eax,dword ptr [ebp+8]  << Here we are moving EBP+8 into eax. This is basically just loading the address of parameter one into eax

       19 0041255c 50              push    eax  << We push it onto the stack to make our call to  strcpy

       19 0041255d 8d4df4          lea     ecx,[ebp-0Ch]  << Now we are loading the address of a local on the stack. 

    Note the local is at ebp-0xC.  That means that we can only write 0xC bytes to this location before we end up overwriting things on the stack, like our saved base pointer and return address.

       19 00412560 51              push    ecx  << Now we push our local variable address onto the stack for our call to strcpy

       19 00412561 e8c3eeffff      call    crash2!ILT+1060(_strcpy) (00411429)  << This is where things go WRONG: within the call to strcpy we overwrite our return address with string data.

       19 00412566 83c408          add     esp,8

       20 00412569 8d45f4          lea     eax,[ebp-0Ch]

       20 0041256c 50              push    eax


    Let’s look at the before and after.


    0:000> bp 00412560

    0:000> g

    Breakpoint 2 hit

    eax=0042ec6c ebx=75a2140f ecx=0017feec edx=00000000 esi=00000002 edi=00001770

    eip=00412560 esp=0017fe9c ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216


    00412560 51              push    ecx

    0:000> k 

    ChildEBP RetAddr 

    0017fef8 00412533 crash2!fun1+0x10 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 19] << The return address is ok here!

    0017ff50 00412ac4 crash2!wmain+0x13 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 12]

    0017ffa0 75a919f1 crash2!__tmainCRTStartup+0x15e [f:\sp\vctools\crt_bld\self_x86\crt\src\crt0.c @ 327]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> p  << Let’s step over the call to strcpy and look again.

    eax=0017feec ebx=75a2140f ecx=0042eca0 edx=00686775 esi=00000002 edi=00001770

    eip=00412569 esp=0017fea0 ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216


    00412569 8d45f4          lea     eax,[ebp-0Ch]


    0:000> k

    ChildEBP RetAddr 

    0017fef8 65732074 crash2!fun1+0x19 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 20]

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017ff3c 0041b9a3 0x65732074  << This is not going to be pretty when we ret out of fun1.  We will basically return to nowhere.

    0017ffa0 75a919f1 crash2!_onexit+0x35 [f:\sp\vctools\crt_bld\self_x86\crt\src\onexit.c @ 98]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23


    0:000> da 0017fef8  << If we do a da on the location of the stack frame we can see what is at that location.

    0017fef8  "st ot see if our stack is larger"  << It’s our string!

    0017ff18  " enough"


    Finally we have the C code.


    #include <windows.h>
    #include <stdio.h>
    #include <string.h>
    #include <tchar.h>

    void fun1(char * szData);

    void fun2(char * szData);


    int _tmain(int argc, _TCHAR* argv[])
    {
                    fun1("This is a test ot see if our stack is larger enough");

                    return 0;
    }


    void fun1(char * szdata)
    {
                    char szData1[10];

                    strcpy(szData1, szdata);

                    fun2(szData1);
    }


    void fun2(char *szData)
    {
                    printf("Hello from fun2");

                    strcat(szData,"This is more data for us");
    }



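The repair for this class of bug is to bound the copy to the destination's size. The helper below is my suggested sketch, not code from the original post, and it assumes truncating the over-long string is acceptable:

```c
#include <stdio.h>

/* Suggested repair sketch (not from the original post): bound the copy
   to the destination buffer, so an over-long source string is truncated
   instead of running past the local array and overwriting the saved
   return address. */
void copy_bounded(char *dst, size_t dstsz, const char *src)
{
    /* snprintf writes at most dstsz-1 characters and always NUL-terminates */
    snprintf(dst, dstsz, "%s", src);
}
```

With a 10-byte local buffer, the 50+ character test string is cut to its first 9 characters instead of smashing the stack; whether silent truncation is the right policy is a separate design decision.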

    Have a great weekend, Good luck, and happy debugging!



  • Ntdebugging Blog

    Closing the Loop: CPU Spike in Winlogon.exe


    We recently dealt with an interesting issue that I would like to share, hope you enjoy. - Jason

    Issue Summary

    Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:

    • Applications in the session (e.g. Outlook and Word) stop accepting keyboard input. When the hang subsides, the typed characters show up all at once.
    • If applications are moved, they do not redraw
    • Cannot click the start menu
    • If the user was running performance monitor (in an attempt to diagnose), there would be a gap in the perfmon data that directly correlated to the hang duration

    Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.

    Troubleshooting & Analysis

    Where to start?  To begin: given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicate that the message pump for each of the affected applications is stuck waiting for something.

    In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking, a thread either has work to do or it doesn't, and if it does, it will normally take a CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. We pegged 1 of the 4 at 100%, resulting in task manager displaying 25% CPU utilization for the Winlogon in question.

    So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.

    Sidebar: Winlogon on Windows Server 2003.

    Any time you ask what a feature is or how it works, you should begin your research with the Windows Technical Reference (try a Live search like "winlogon site:http://technet2.microsoft.com"). In this case I focused on the "How Interactive Logon Works" article.

    This article starts with:

    The Windows Server 2003 interactive logon architecture includes the following components:

    • Winlogon
    • Graphical Identification and Authentication (GINA) dynamic-link library (DLL)
    • Local Security Authority (LSA)
    • Authentication packages (NTLM and Kerberos)

    So Winlogon has something to do with Interactive logon.

    What else can we find out about the responsibilities of Winlogon?

    • Registers the SAS (the secure attention sequence, a.k.a CTRL-ALT-DEL) during the boot process to keep other programs and processes from using it
    • SAS routine dispatching - When Winlogon recognizes a SAS event or the GINA delivers a SAS, Winlogon calls one of the SAS processing functions of the GINA.
    • Desktop lockdown - Winlogon helps prevent unauthorized users from gaining access to system resources by locking down the computer desktop. At any time, Winlogon is in one of three possible states: logged on, logged off, or workstation locked
    • User profile loading - After a successful logon, Winlogon loads user profiles into the HKEY_CURRENT_USER registry key.
    • Screen saver control - Winlogon monitors keyboard and mouse activity to determine when to activate screen savers.
    • Multiple network provider support - If there are multiple network providers installed on a Windows-based system, they can be included in the authentication process and in password-updating operations

    Looking at this list of tasks performed by Winlogon, we need to try to determine which task aligns to our symptoms (hang during logon and logoff). I decided to focus on User Profiles due to the relationship between profiles and login and logoff.

    What is a User Profile? Back to Live with "user profiles site:http://technet2.microsoft.com". Out of the results, I went with the "User Profile Structure" link. The page provides the following detail:

    A user profile consists of:

    • A registry hive. The registry is a database used to store computer- and user-specific settings. Portions of the registry can be saved as files, called hives. These hives can then be reloaded for use as necessary. User profiles take advantage of the hive feature to provide roaming profile functionality. The user profile registry hive is NTuser.dat in file form, and is mapped to the HKEY_CURRENT_USER portion of the registry when the user logs on. The NTuser.dat hive maintains the user's environment preferences while the user is logged on. It stores settings that maintain network connections, Control Panel configurations unique to the user (such as the desktop color and mouse), and application-specific settings. The majority of the settings stored in the registry are opaque to user profiles; the settings are owned and maintained by individual applications and operating system components.
    • A set of profile folders stored in the file system. User profile files are stored in the file system in the Documents and Settings directory, in a per-user folder. The user profile folder is a container for applications and other operating system components to populate with subfolders and per-user data, such as shortcut links, desktop icons, startup applications, documents, configuration files and so forth. Windows Explorer uses the user profile folders extensively for special folders such as the user's desktop, Start menu and My Documents folder.

    With this we can now look at the Winlogon that is spiking the CPU. My first step when looking at a CPU spike in a process is to determine whether the time is spent in user mode or kernel mode. In Task Manager, on the Performance tab, you can monitor CPU usage. If you select the View menu and choose "Show Kernel Times", you get additional detail showing whether the time is associated with a user-mode module or a kernel module. In this case, Winlogon's spike was in kernel time. This means the application (Winlogon) asked the OS to do something and the OS is trying to get it done.

    My second step is to determine what user-mode request led to the spike in kernel time. To answer this on a production system I chose to use Sysinternals' Process Explorer. I downloaded the tool and set it up so it would be able to obtain symbols from the public MS symbol server. During the repro I monitored the thread CPU time in Winlogon. Here is a screen shot of what Process Explorer looks like when viewing the thread activity of an idle Winlogon.

    Figure 1 – Process Properties in Process Explorer

    When the CPU spike occurred we looked at the stack for the spiked thread (double click on the thread with the high CPU time or click the "Stack" button with the thread highlighted):



    Just to make sure, we repro'd again and looked at the call stack



    We can see here that both call stacks for the thread in question are RPC threads waiting for incoming requests. The request that came in was from the userenv module running in the Winlogon associated with the session of the user logging off, and it came into the Winlogon in session 0. The request is to unload the user profile (highlighted above). This led to a RegUnloadKey call, which removes a hive from the registry but does not modify the file containing the registry information.

    We now know the issue is related to unloading the user profile and specifically the registry portion of the user profile. Now we need to determine where we are spending the time in kernel mode.


    Kernrate is a tool included with the Windows 2003 resource kit. This will let us dig into the kernel mode side of what is going on. I collected the following data from kernrate during the hang condition (Summarized):

    Parsing the output, we see that 72.66% of the kernel time was spent in the idle process. This is attributed to the 3 idle threads on the 3 processors not being used by the Winlogon thread. We see that the Winlogon process accounts for 20.39% of the kernel time (the CPU spike on the 4th processor). If we move into the module summary, you can see the correlated CPU times for the modules: INTELPPM (CPU driver) running on the idle processors and NTOSKRNL running on the 4th processor (this is where we will focus). In the final drill down, we see the function summary for NTOSKRNL, and we can identify that the function in use 82% of the time was CmpCleanUpKCBCacheTable. The Cm function prefix lets us know this is related to the Configuration Manager (the Registry – see Table 2-7 from Microsoft Windows Internals 4th Edition for commonly used kernel prefixes).

    What do we know? We know the issue manifests most during user logoff. We know during the logoff we are trying to unload the profile. We know that in the process of doing this we are trying to unload a registry hive. We know that leads us to spend a lot of CPU time in the Kernel doing CmpCleanUpKCBCacheTable.

    Why does this hang the machine? The registry is protected / synchronized with an ERESOURCE named CmpRegistryLock. While this is held exclusively during this cleanup function, all registry access is blocked (both read and write). This explains all of our symptoms. Applications freeze and do not redraw due to operations that need registry access being done on the GUI thread of a process. Perfmon cannot report due to its dependency on the registry.


    Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Server 2003 SP1 (applies to both SP1 and SP2).


    From the KB cause section:

    This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.

    The fix applies to Windows Server 2003 with SP1 or SP2.

  • Ntdebugging Blog

    Hardware bitflipping


    Hello all; my name is Scott Olson and I work as an Escalation Engineer for the Microsoft Global Escalation Services team in Platforms support, and I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and ran into a problem where the system would get random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM totaling 2GB was fine; however with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in the kernel debugger.

    Here is what I found:

    The system got the following bugcheck:

    0: kd> .bugcheck
    Bugcheck code 000000D1
    Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6

    Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference

    !analyze -v provides the following information for this bugcheck:

    An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.
    Arg1: fffff80003a192d0, memory referenced
    Arg2: 0000000000000002, IRQL
    Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
    Arg4: fffff980064aa8b6, address which referenced memory

    Debugging Details:

    READ_ADDRESS: fffff80003a192d0


    So with this data I can say that the system took a page fault on a read operation trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let's get the trap frame so we can get into the context of the system when the crash happened:

    0: kd> kv 3
    Child-SP RetAddr : Args to Child : Call Site
    fffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
    fffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73
    fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)

    Here is the trap frame, and it looks like the system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0 (highlighted above):

    0: kd> .trap fffff800`03219070
    NOTE: The trap frame does not contain all registers.
    Some register values may be zeroed or incorrect.
    rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011
    rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001
    rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6
    r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000
    r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0 nv up ei pl zr na po nc
    fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13

    As you can see above fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told. I want to know what happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xd1.

    0: kd> r cr2

    Looking at this address it is clear that the trap frame does not exactly match, so let's look at how these addresses are different. Here is the stack pointer from the trap frame and the page fault converted into varying formats (focusing on the binary)

    0: kd> .formats fffff800`032192d0
    Evaluate expression:
    Hex: fffff800`032192d0
    Decimal: -8796040490288
    Octal: 1777777600000310311320
    Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
    Chars: .....!..
    Time: ***** Invalid FILETIME
    Float: low 4.74822e-037 high -1.#QNAN
    Double: -1.#QNAN

    0: kd> .formats fffff800`03a192d0
    Evaluate expression:
    Hex: fffff800`03a192d0
    Decimal: -8796032101680
    Octal: 1777777600000350311320
    Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
    Chars: ........
    Time: ***** Invalid FILETIME
    Float: low 9.49644e-037 high -1.#QNAN
    Double: -1.#QNAN

    Notice that there is a one-bit difference between these two addresses:

    11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000

    11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000

    Since the software asked the system to do one thing and it did something different this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker and they contacted their hardware vendor. This must have been a common problem with this vendor because I found out later that they replied back within 10 minutes of contacting them with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is a default. They recommended it be changed from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.

  • Ntdebugging Blog

    NTDebuggers Debug Puzzler 0x00000001 "Where did my process go?"


    Hello NTDebuggers, in the spirit of Click and Clack (The Tappet brothers), a favorite troubleshooting show of mine, we thought it would be fun to offer up some Debug puzzlers for our readers.


    That said, this week’s Debug Puzzler is in regard to Dr. Watson.  I’m sure most of you have seen Dr. Watson errors.  This typically means your application has crashed due to an unhandled exception.  Sometimes however the process just seems to disappear.  The Just-in-Time (JIT) debugging options configured via the AeDebug key do not catch the crash…  Does anyone know why this may happen?


    We will post reader’s comments as they respond during the week, and next Monday will post our answer and recognize some of the best answers we received from our readers.


    Good luck and happy debugging!


    -  Jeff Dailey

    [Update: our answer, posted 4/11/2008]

    Hello NTDebuggers.  Let me start off by saying that we were very impressed by our readers’ answers.  Our two favorite answers were submitted by Skywing and molotov.

    When a thread starts, the ntdll Run Time Library (RTL) for the process inserts an exception handler before it calls the BaseThreadInit code to hand control over to the executable or DLL running in the process (notepad in the example below).  If anything goes wrong with the chain of exception handlers, the process can’t make it back to the RTL exception handler and will simply terminate.  See http://www.microsoft.com/msj/0197/Exception/Exception.aspx for details.

    ChildEBP RetAddr 

    000ef7ac 75fbf837 ntdll!KiFastSystemCallRet

    000ef7b0 75fbf86a USER32!NtUserGetMessage+0xc

    000ef7cc 00b21418 USER32!GetMessageW+0x33

    000ef80c 00b2195d notepad!WinMain+0xec

    000ef89c 76e24911 notepad!_initterm_e+0x1a1

    000ef8a8 7704e4b6 kernel32!BaseThreadInitThunk+0xe

    000ef8e8 7704e489 ntdll!__RtlUserThreadStart+0x23  << Exception Handler is inserted here.

    000ef900 00000000 ntdll!_RtlUserThreadStart+0x1b


    Secondly, the process that crashes is actually responsible for starting the debugger via the RTL exception handler.   The debugger is registered under the AeDebug registry key.    Even if you are able to unwind to the RTL exception handler you may still run into trouble.  If the computer is low on system resources such as desktop heap, you may not be able to create a new process and thus will not be able to launch the debugger.   As SkyWing stated, it’s a relatively heavyweight operation.   Applications may also call TerminateProcess from within their own code based on an error condition.    If we have a customer that sees this symptom on a regular basis we typically recommend having them attach a debugger to monitor the process.   Simply run via ADPLUS -crash  -p (PROCESSID).

    Good work folks!  We’ll have another puzzler ready next Monday.

    Good Luck and happy debugging!

    - Jeff

  • Ntdebugging Blog

    Windows Performance Toolkit - Xperf


    I recently came across a very interesting profiling tool that is available in Vista SP1 and Server 08 called the Windows Performance Analyzer. You can use this tool to profile and diagnose different kinds of symptoms that the machine is experiencing. This tool is built on top of the Event Tracing for Windows (ETW) infrastructure. It uses the ETW providers to record kernel events and then displays them in a graphical format.

    Performance Analyzer provides many different graphical views of trace data including:

    1. CPU Sampling
    2. CPU and Disk utilization by process and thread
    3. Interrupt service routine and deferred procedure call
    4. Hard faults
    5. Disk I/O Detail

    What you will need

    Download the latest version of the Windows Performance Tools Kit, and install it on your machine. (http://www.microsoft.com/whdc/system/sysperf/perftools.mspx : Windows Performance Tools Kit, v.4.1.1 (QFE)) You will need to find the toolkit that corresponds to your processor architecture. Currently there are 3 versions available: x86, IA64 and x64.

    After installation you should be able to see 2 new tools. The first one is Xperf, which is a command line tool that is used to capture the trace. The second is called XperfView, which graphically interprets the trace that has been collected by Xperf.

    You will need to run the Xperf and XperfView from an elevated command prompt for all functionality.

    Starting a Trace

    For many tasks all you need for effective analysis is a kernel trace.  For this example, we'll use the –on DiagEasy parameter to enable several kernel events including: image loading; disk I/O; process and thread events; hard faults; deferred procedure calls; interrupts; context switches; and performance counters.  From an elevated command prompt, launch xperf –on DiagEasy.

    This starts the kernel logger in sequential mode to the default file "\kernel.etl"; uses a default buffer size of 64K, with a minimum of 64 and a maximum of 320 buffers. 

    Stopping a Trace

    To stop a trace, type xperf –d <filename>.etl at the command line. This will stop the trace and output the file.

    Viewing a Trace

    There are 2 ways to view the trace. From an Elevated Command prompt, launch xperf <filename>.etl, or
    launch the XperfView tool and open the file manually. When you open the trace file, you should see something similar to this.

    NOTE - While you need to run xperf from an elevated command prompt in order to record a trace you do not need an elevated command prompt in order to *analyze* a trace.  

    Using the Chart Selector tab, you can select all the graphs that you want to look at. To drill down in each chart, you can select the Summary table. For instance, in the CPU Sampling chart, the summary table gets you the summary of the processes that were running, with information like the amount of CPU time, CPU %, stacks (if the stacks were collected in the trace, see below). When looking at the Summary table for the Disk I/O chart, you can see which processes were writing files (the filename too!) to disk, as well as how much time it took.

    You also have the ability to zoom in on a selected area. Another really cool feature is the ability to overlay multiple graphs on one frame. This way you can correlate different pieces of data together very easily.

    Also, you can select which counter instances you want to see in each specific chart. On the top right corner of each chart is a drop-down box from which you can select the counter instances. For instance on the Disk I/O chart, you can select Disk 0, Disk 1, or a combination as well.

    You can also view detailed information about the system that the trace was taken on. Click on the Trace menu item, and select System Configuration.

    Xperf Kernel Flags and Groups

    Earlier we ran the sample Xperf command xperf –on DiagEasy. I am sure many of you were wondering what DiagEasy means. DiagEasy is a group of kernel events that are predefined by the Windows Performance Toolkit. This group includes Process, Thread, Kernel and User Image Load/Unload, Disk I/O, DPC and Context Switch events.

    When we used the xperf –on DiagEasy command, we did not specify an individual provider, so we enabled the kernel events for all the ETW providers on the system. If you want to enable events for a specific provider, you can use the following format: xperf -on (GUID|KnownProviderName)[:Flags[:Level]]. For more information about ETW providers, kernel flags and groups, you can run the xperf –help providers command.

    Stack Tracing

    One of the most powerful features in Performance Analyzer is the ability to visualize stacks. It's important to note that this requires no special instrumentation in the code – only that you have symbols for the binary components you are interested in analyzing.

    When the trace is set up to collect the stacks, Performance Analyzer will display call stack summary information for the events that had stack walking enabled. Here is an example that takes a trace (with stack tracing enabled) of the entire system while running a "find string" utility. We can use the Stack Tracing feature of Xperf to record a stack when certain events happen, or take samples at regular intervals over time. See the xperf –help stackwalk output for more info.

    Below, we will use the Stack Tracing feature of Xperf to take stack samples at regular intervals. With this output, we will be able to determine where the CPU is spending most of its time within a process.

    xperf -on latency -stackwalk Profile

    Latency is the kernel group to enable certain events, including the profile event which records the CPUs' activity every millisecond. The "-stackwalk Profile" flag tells Xperf to record stack walks on every profile event, which makes the profile information much more useful. In other words, in order to get profile information with stack walks you need to turn on the profile event, and turn on stack walking for that event.

    Note that decoding of stacks requires that symbol decoding be configured. However stacks can be recorded without symbols, and can even be viewed without symbols, although they are much less useful without symbols. I only mention this in the event you're trying to record a trace of a problematic machine with little time to mess around with _NT_SYMBOL_PATH.

    To get a trace with the stack information, do the following:

    1. Run the command, xperf -on latency -stackwalk Profile.
    2. Run the application that you wish to analyze, then
    3. You can end the trace by using the regular xperf –d <filename>.etl command.
    4. Load the trace in the viewer and then select Load Symbols from the Trace drop-down menu.
    5. Once the symbols are loaded, select an interesting region on the CPU sampling graph.
    6. Right click and select Summary Table.

    Click on the selector tab to bring up the column chooser list. Then select "Process name", "Process", "Stack", "Weight" and "%Weight". These are the most useful columns when looking at stacks from the sample profile event. You should get a view similar to this.


    At this point I need to mention a few of the restrictions on stack walking, and when and how it works.

    • Xperf stack walking is not available on XP.

    • On Vista stack walking is available for x86, and is available for x64 as of Vista SP1.

    • On Windows 7 stack walking is available.

    • Stack walking on x64 is complicated. You have to set DisablePagingExecutive in the registry, as documented here:


    REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

    Sample Case

    I recently came across a case where the customer was complaining that DPC processing was taking up too much CPU time. We ran Xperf on the machine and drilled down into the DPC activity on the machine.

    From the Xperf graph, I was able to confirm that the customer was actually seeing high DPC usage. I selected the Summary for this chart, and got the list of drivers that were actually taking up CPU time.

    Right off the bat, I could identify the driver that had a lot of DPC activity. I also noted that the average duration for each DPC from that driver was taking 788 microseconds. This is way too high. Each DPC should be taking a maximum of 100 microseconds.


    - Omer



    Performance.Analyzer.QuickStart.xps – This is shipped with the performance toolkit.

    From an elevated command prompt, launch xperf -help




  • Ntdebugging Blog

    The Case of the Low Hanging Filter Driver Fruit



    Written By Jeff Dailey:


    Not all our cases are crashes, leaks, or high CPU.  Sometimes the problems we are faced with are purely a question of why a given application runs slow on one version of Windows versus another.  In other cases an application may just start running slow for no reason.   OK, not likely.  There is ALWAYS SOME REASON.  Something changed!  In this case, the customer reported that an application started running slow when booted into “Normal Mode”, but when the OS was booted in safe mode, the application would run fast.  In this particular case the customer reported that a given operation went from taking just a few seconds (safe mode) to several minutes (normal mode).   Further research found that the problem was related to accessing the registry and registry performance in general.  At this point I’m already thinking, “Registry Access?” and “Safe Mode”.  What could affect registry access that does not run in safe mode?  Well, lots of services DO NOT start in safe mode.  What kind of services could affect registry calls?  Antivirus?  Maybe…  Let’s look deeper.


    One of the first things I typically do in such cases is to ask for a kernrate log of the slow and fast scenario. http://download.microsoft.com/download/8/e/c/8ec3a7d8-05b4-440a-a71e-ca3ee25fe057/rktools.exe  Kernrate is a sampling profiler.   It basically checks the location of the instruction pointer at regular intervals and stores the results in a hash table.   We can then get a breakdown of the % time spent in each module that is executing.    Even better, you can zoom in on each module.  Zooming in shows utilization at a function level within the module, and requires symbols to be present in a flat symbol directory on the machine being profiled.  I recommend downloading the symbol pack for this (http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx) or using symchk.exe (included in the debugging tools) to download the symbols.  We’ll talk more about symbols and symchk.exe in an upcoming post.


    In a lot of cases kernrate data is only a starting point.   We will find some code that is running a lot longer in one case versus another, and that in turn requires a follow-up code review and multiple debug sessions to further isolate the problem.  This case however was different.  The following is output from Beyond Compare that shows a comparison between the module execution times in the kernel.   The slow test run is on the right, and the fast test run is on the left.  Keeping in mind that I was looking for something different between safe mode and normal mode, I simply started by looking at the modules listed on the slow side (right) that were not on the fast side (left).   What was loaded during the normal run that was not a factor during the safe mode run?  Right away FILTERDRVXYZ, just above the HAL, jumped off the page.  (Some names were changed to protect the innocent.)  I did a http://www.live.com search to find out what this driver was.  It was a file system filter driver for an antivirus program.


    To understand why a filter driver can slow things down, you need to understand that a filter driver basically installs itself between the application and the underlying file system and intercepts nearly every call being made to the underlying I/O subsystem.   Most of the time filter drivers are just fine and run with no problems at all.   However, what do they do?  An antivirus filter driver may have several tasks that it needs to do for each I/O.  This may involve looking up information about a process or file when it’s being accessed.   If the data source associated with this information changes, it can change the amount of time that these operations take.   The interesting thing is that the timing change may only be a hundred or so extra milliseconds per call to the registry or file system.  In most cases a user would not even notice this.  If however some aspect of your application needs to make many thousands of calls, and they each take 250+ milliseconds vs. 10 milliseconds, all of a sudden the problem compounds.  The classic example of this is a VIRUS PROFILE gets uploaded or pushed to servers and all of a sudden the customer starts seeing performance problems, because the data set or comparison for the filter driver changes.   In this case I simply recommended that the customer remove the suspect filter driver to see if it was the root of the problem.  IT WAS…  They contacted the product vendor to get an update and the case was resolved.


    Good luck and happy debugging….

  • Ntdebugging Blog

    Debugging a bluescreen at home



    Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.


    A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.


    This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.


    Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.


    Let's see what we can tell from the dump!


    The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:


    kd> !analyze -v


    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************




    This is a very common bugcheck.  Usually the exception address pinpoints

    the driver/function that caused the problem.  Always note this address

    as well as the link date of the driver/image that contains this address.

    Some common problems are exception code 0x80000003.  This means a hard

    coded breakpoint or assertion was hit, but this system was booted

    /NODEBUG.  This is not supposed to happen as developers should never have

    hardcoded breakpoints in retail code, but ...

    If this happens, make sure a debugger gets connected, and the

    system is booted /DEBUG.  This will let us see why this breakpoint is happening.



    Arg1: c0000005, The exception code that was not handled

    Arg2: 8738e300, The address that the exception occurred at

    Arg3: b9b3dc7c, Trap Frame

    Arg4: 00000000


    Debugging Details:




    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".




    8738e300 0000             add     [eax],al


    TRAP_FRAME:  b9b3dc7c -- (.trap ffffffffb9b3dc7c)

    ErrCode = 00000002

    eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64

    eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0         nv up ei pl zr na po nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

    8738e300 0000             add     [eax],al                ds:0023:00000001=??

    Resetting default scope






    LAST_CONTROL_TRANSFER:  from 8051d6a7 to 8053331e



    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93



    From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.


    What’s at 0x8738e300?


    kd> dc 8738e300

    8738e300  00000000 00000001 00000000 87360350  ............P.6.

    8738e310  00000000 00000001 f71af9fe f71b0030  ............0...

    8738e320  f71afb0e f71afbb4 f71b0098 f71b0214  ................

    8738e330  f71afef6 f71aff8e 07fef800 00000000  ................

    8738e340  f71afffc 00000000 0a0e000a 644c6d4d  ............MmLd

    8738e350  8732ea58 870303e0 ffffffff ffffffff  X.2.............

    8738e360  00000012 00000000 f797f000 f7989905  ................

    8738e370  0000c000 00500050 e1971458 00160016  ....P.P.X.......


    kd> !pool 8738e300 2

    Pool page 8738e300 region is Nonpaged pool

    *8738e2f8 size:   50 previous size:    8  (Allocated) *NV 

                  Owning component : Unknown (update pooltag.txt)


    Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains data rather than code (specifically, a nonpaged pool allocation).


    So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.


    kd> kv L8

    ChildEBP RetAddr  Args to Child             

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

    77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090


    The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:


    kd> ub bf801619


    bf8015ff 8d1c82           lea     ebx,[edx+eax*4]

    bf801602 8bc1             mov     eax,ecx

    bf801604 c1e810           shr     eax,0x10

    bf801607 663b430a         cmp     ax,[ebx+0xa]

    bf80160b 75ad             jnz     win32k!ValidateHwnd+0x3d (bf8015ba)

    bf80160d 807b0801         cmp     byte ptr [ebx+0x8],0x1

    bf801611 7573             jnz     win32k!ValidateHwnd+0xff (bf801686)

    bf801613 ff15e0b298bf     call    dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]


    OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.
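    The mechanism behind that last instruction can be sketched in Python (a loose analogy, not kernel code): an indirect call fetches its target from a table in memory at call time, and execution goes wherever the fetched value points.

```python
# Sketch: an indirect call like "call dword ptr [bf98b2e0]" reads its
# target from memory at call time.  Model the import slot as a mutable
# one-entry table; whatever it holds when the call happens is what runs.
import_slot = {}

def ps_get_current_thread():
    # Stand-in for nt!PsGetCurrentThread (returns the current thread object).
    return "ETHREAD"

import_slot["_imp__PsGetCurrentThread"] = ps_get_current_thread

def indirect_call(slot_name):
    target = import_slot[slot_name]   # the "memory read" the CPU performs
    return target()                   # jump to whatever the read returned

print(indirect_call("_imp__PsGetCurrentThread"))  # ETHREAD
```

    In this crash the slot in memory held the correct target; it was the value the processor actually used that was wrong, which is what points at hardware rather than software corruption.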


    kd> dd bf98b2e0 L1

    bf98b2e0  804e4a15


    Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?


    kd> ln 804e4a15

    (804e4a15)   nt!PsGetCurrentThread   |  (804e4a51)   nt!KeSetEventBoostPriority

    Exact matches:

        nt!PsGetCurrentThread = <no type information>


    Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.


    So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some pool data.


    At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)


    Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:


    Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!


    The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!


  • Ntdebugging Blog

    NonPagedPool Depletion


    I was recently engaged on an issue where a server was depleting NonPagedPool over a period of a few days.  Ordinarily, we would just use a tool like PoolMon to identify the offending pool tag and then find the driver that uses that pool tag using the method in this article.


    However, what made this case interesting was the pool tag and that we were unable to identify the driver using the normal methodology.  You’ll see what I mean in a moment.  The engineer supplied me with a kernel dump of the server while it was in-state and this is what I found.


    Let’s start by taking a look at the virtual memory usage:


    2: kd> !vm


    *** Virtual Memory Usage ***

           Physical Memory:      851420 (   3405680 Kb)

           Page File: \??\C:\pagefile.sys

             Current:   3584000 Kb  Free Space:   3568552 Kb

             Minimum:   3584000 Kb  Maximum:      3584000 Kb

           Available Pages:      573277 (   2293108 Kb)

           ResAvail Pages:       800628 (   3202512 Kb)

           Locked IO Pages:        1067 (      4268 Kb)

           Free System PTEs:      25102 (    100408 Kb)

           Free NP PTEs:            335 (      1340 Kb)

           Free Special NP:           0 (         0 Kb)

           Modified Pages:           22 (        88 Kb)

           Modified PF Pages:        22 (        88 Kb)

           NonPagedPool Usage:    31369 (    125476 Kb)    <-- Very high

           NonPagedPool Max:      31986 (    127944 Kb)

           ********** Excessive NonPaged Pool Usage *****

           PagedPool 0 Usage:     19071 (     76284 Kb)

           PagedPool 1 Usage:       735 (      2940 Kb)

           PagedPool 2 Usage:       747 (      2988 Kb)

           PagedPool 3 Usage:       720 (      2880 Kb)

           PagedPool 4 Usage:       746 (      2984 Kb)

           PagedPool Usage:       22019 (     88076 Kb)

           PagedPool Maximum:     38912 (    155648 Kb)


           ********** 3 pool allocations have failed **********
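    As a quick sanity check, the NonPagedPool page counts in the !vm output convert to the kilobyte figures shown (x86 uses 4 KB pages) and put usage at roughly 98% of the maximum. A small Python sketch:

```python
PAGE_KB = 4  # x86 page size in kilobytes

# Page counts taken from the !vm output above.
npp_usage_pages = 31369
npp_max_pages = 31986

usage_kb = npp_usage_pages * PAGE_KB   # 125476 Kb, matching !vm
max_kb = npp_max_pages * PAGE_KB       # 127944 Kb, just under the ~128MB /3GB cap

print(f"{usage_kb} Kb of {max_kb} Kb "
      f"({100 * npp_usage_pages / npp_max_pages:.1f}% used)")
```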


    So we can see that NPP usage is very high, given that the server is using the /3GB switch, which limits NonPagedPool to 128MB by default.  We need to identify which pool tag is associated with the high NPP usage:


    2: kd> !poolused /t2 2

       Sorting by  NonPaged Pool Consumed


      Pool Used:

                NonPaged            Paged

     Tag    Allocs     Used    Allocs     Used

     None   246479 50827424         0        0      call to ExAllocatePool

     MmCm     1198 18462512         0        0      Calls made to MmAllocateContiguousMemory, Binary: nt!mm


    Interesting, so the offending tag is “None”.  This means that these allocations were made by calling the function ExAllocatePool instead of ExAllocatePoolWithTag.  ExAllocatePool is obsolete and should no longer be used. 
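    Pool tags are four ASCII characters packed into a 32-bit value (low byte first), which is why tools like PoolMon can display them as readable strings such as MmCm, while untagged ExAllocatePool allocations get lumped under the generic “None” bucket. A quick Python sketch of the decoding:

```python
import struct

def decode_pool_tag(value: int) -> str:
    """Decode a 32-bit pool tag into its four ASCII characters.
    Tags are stored little-endian, so the first character is the low byte."""
    return struct.pack("<I", value).decode("ascii")

# 'MmCm' (MmAllocateContiguousMemory) as it would appear as a DWORD in memory:
print(decode_pool_tag(0x6D436D4D))  # MmCm
```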


    Now, I need to find out which driver is calling this function.  First, I need to know where ExAllocatePool lives:


    2: kd> x nt!ExAllocatePool

    e0894d1f nt!ExAllocatePool


    Next, I need to search all the drivers to see which one is importing this function:


    2: kd> !for_each_module s-d @#Base @#End e0894d1f

    f50b8058  e0894d1f e0828e04 e089b708 e084011b  .M..............


    Hmm, looks suspiciously like an import table, let’s see:


    2: kd> dps f50b8058

    f50b8058  e0894d1f nt!ExAllocatePool

    f50b805c  e0828e04 nt!_wcsnicmp

    f50b8060  e089b708 nt!ExFreePoolWithTag

    f50b8064  e083e30a nt!KeInitializeEvent



    Yep, that’s an import table.  You can also verify that this is the import table of a particular module by checking the header (!dh on the module’s base address and look for “Import Address Table Directory”).
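    The s-d search above can be sketched as a simple scan for an aligned little-endian DWORD. Here's a Python approximation over a made-up buffer (the values are the ones from this dump, but the tiny "module image" is fabricated for illustration):

```python
import struct

def search_dwords(buf: bytes, value: int, base: int = 0):
    """Return the addresses of every 4-byte-aligned DWORD in buf equal to
    value, roughly what the debugger's s-d command does over a module."""
    needle = struct.pack("<I", value)
    hits = []
    for off in range(0, len(buf) - 3, 4):
        if buf[off:off + 4] == needle:
            hits.append(base + off)
    return hits

# Tiny fabricated "module image" whose first slot holds ExAllocatePool's address:
image = struct.pack("<4I", 0xE0894D1F, 0xE0828E04, 0xE089B708, 0xE084011B)
print([hex(a) for a in search_dwords(image, 0xE0894D1F, base=0xF50B8058)])
# ['0xf50b8058']
```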


    As you can see, we have only one driver that imports ExAllocatePool.  Let’s see which driver this is:


    2: kd> !lmi f50b8058

    Loaded Module Info: [f50b8058]


       Base Address: f50b3000




    I’ve removed the incriminating identifiers from module information displayed above to protect the guilty.  It is interesting to note that the driver name was a GUID and that this driver did not exist on the disk.  This was because the driver is dynamically created when its parent program loads.  


    The software package was removed and the server was happy again. 


    -         David

  • Ntdebugging Blog

    Wanted: Windows Internals subject matter experts


    Microsoft is looking for five Windows Internals subject matter experts to come work on a very special five to eight day project on the Redmond campus during the month of May 2008.  Candidates must have good communication skills, be non-Microsoft employees, have 5+ years of experience with Windows, be familiar with the Windows Internals book, have kernel and user mode debugging experience, and be C literate.


    We strongly prefer candidates who represent the Windows IT professional or Windows development industry.    These would typically be 2nd or 3rd tier escalation resources or Windows developers (developing Win32 code or device drivers) working for Gold Partners or ISVs, or independent consultants or MVPs.


    A Microsoft NDA (Non Disclosure Agreement) will be required.


    Please contact Jeff Dailey at jeffda@microsoft.com with the subject line “Windows Internals Project”.

    Please include a copy of your resume and details explaining your Windows background, along with your contact info and your location.


    We will cover air travel, lodging, and food for non-local candidates attending from within the US.


    Thank you.


    More Info:

    The Windows Internals experts selected for this project will be brought to Redmond to aid in developing a new certification program based on Windows Internals.  This is not a hiring effort or a product testing session.  Though candidates should be C literate, they will not be writing or reviewing any Microsoft source code.

