• Ntdebugging Blog

    Expanding Netmon to aid in debugging

    • 2 Comments

    A walk-through of creating a Netmon parser in the context of a real case

     

    As is obvious to frequent readers of our blog, our team logs a lot of time in our debugger of choice (for some windbg, for others kd).  However, a debugger is not always the best tool for the job, and sometimes even the best tool has limitations. I found this to especially true when working a few Internet Printing Protocol (IPP) cases recently.

     

    Probably the biggest challenge of many IPP cases is the mixed environments you usually find IPP running in.  The benefit customers see in IPP over other print providers is it works natively or with minimal configuration on Windows, Mac, and Linux clients.  This makes it popular in places like college computer labs, where there isn’t one standard client system.   Unfortunately, this also means that we can’t really debug both sides of the communication when something goes wrong.

     

    In a recent case, a customer was having problems printing from their Windows Vista clients to a Linux Common Unix Printing System (CUPS) IPP server.  If the CUPS server was set to allow anonymous connections, everything worked perfectly. When the administrator enabled authentication, he found that most print jobs failed to print.  After a bit more testing, he found that small jobs (perhaps a page of printed text) worked fine, but larger, multi-page documents failed.

     

    For situations like this, I prefer to use a network trace to get a feeling for where the problem is occurring.  The problem was – IPP wasn’t one of the protocols built in to Netmon (and I find Wireshark’s IPP parser to not always work well – especially with Windows clients/servers).  I decided that the amount of time it would take to decode the traffic by hand could be better spent creating a Netmon IPP parser that I could use every time I ran into one of these issues.

     

    One of the great things about Netmon is you can view the source of every included parser.  This was a big help, as I hadn’t written a parser before.  [Note: all steps noted are written using Netmon 3.4.There might be slight differences in Netmon 3.3.]  To do this, open Netmon and click the Parsers tab. Under Object View, expand parser files and double click any of the .npl files.  The source will appear on the right.

     

    The language for Netmon parsers is similar to C++, with a limited set of statements.  These are all documented in the Netmon help file, but the ones I found useful are described below.  To begin, I started by defining a number of tables.  The basic idea of a table is to provide a way to convert a value to a string.  For example, one field in an IPP packet is the status of a printer, which is represented by an integer.  In order to allow Netmon to show printer states in a readable form, I created a table to convert the values as seen in Figure 1 below.

     

    Table IPPPrinterState //2911-4.4.11

    {

          switch(value)

          {

                case 3 : "idle";

                case 4 : "processing";

                case 5 : "stopped";

                default : "Unknown Code";

          }

    }

    Figure 1: Netmon Table

     

    Each table is defined with the Table keyword, followed by a name for the table.  It may optionally be followed by a list of parameters, which I’ll use later.  In this case, I added a comment that specified which RFC and section this information comes from.  A table consists of a switch statement with a case for each value, and a default for all other cases, much like other programming languages.  I created tables like IPPPrinterState for each field that could be represented in an IPP packet from information I found in each of IPP’s RFCs.

     

    Once the tables were complete, I moved on to creating the Protocol portion of the parser.  This section of the code provides the logic that iterates through a packet and calls the tables for the appropriate data.  This section starts with either the RegisterBefore or RegisterAfter keyword.  These are used to determine how your parser is called.  Essentially, Netmon takes all of the parsers it has, and compiles them into one continuous binary (.npb) and registration tells Netmon where your parser fits.  For my case, I used the following registration code.

     

    [ RegisterAfter (HTTPPayloadData.OCSP, Ipp, Property.HTTPContentType.Contains("application/ipp")) ]

     

    This tells Netmon that, when compiling the parser, it should insert my code right after the code for the OCSP protocol in its HTTPPayloadData parser, my protocol should be called IPP, and it should enter my code path if the HTTP Payload is of content type “application/ipp”.  This allows my parser to work a bit differently than the Wireshark IPP parser – Wireshark uses a port number (631) to identify IPP traffic, whereas my code looks at HTTP content types.  The advantage of this, for me, is that Windows servers use port 80 for IPP by default, not 631, so in cases with a Windows IPP server, this parser should correctly identify the packets.  You may be wondering how or why I chose to register after OCSP.  Basically, I knew I needed my code to be registered in the section of code where HTTP does its payload content type processing.  So I opened up HTTP’s parser, and searched for the content type analysis. OCSP was the first protocol I found in HTTP’s content type logic, so I used that as the place to insert my protocol.

     

    After the registration comes the Protocol statement. I chose the following.

     

    Protocol IPP = FormatString("Status/OpCode = %s", IPPOperationsSupported(OpID))

     

    This names my protocol IPP and specifies that I want the description of the protocol to display the IPP status code.  This way, a user doesn’t need to drill down to find out if this is a print job or a printer status request.  You’ll notice FormatString is a function in Netmon that is similar to printf.  In this case, I am passing a variable (OpID, which is defined lower in my code) to my IPPOperationsSupported table to determine what this OpCode means.  Before I had a parser, I would need to look up the operations supported values in the IPP RFC for each packet I examined.

     

    Next is the body of the protocol.  Basically, this consists of a series of fields (like variables) that define how a packet is laid out.  Creating a field is similar to declaring a variable in C++. You start by choosing a data type that matches the size of the data in the packet and provide a name for that field.  For example, Figure 2 shown below contains the first seven lines of my Protocol.

     

    struct Version = FormatString("%i.%i", Major, Minor)

    {

          INT8 Major;

          INT8 Minor;

    }

    INT16 OpId = FormatString("%s (0x%X)", IPPOperationsSupported(this), this);

    INT32 RequestId;

    Figure 2: Code in the protocol block

     

    The IPP specification states that all packets begin with two 8-bit values, the first value specifies the major protocol version in use, and the second value specifies the minor.  In this case, I wrapped both in a struct so Netmon will display them as “Version: 1.0”, instead of separately as “Major: 1” “Minor: 0” on two lines.  After the version is a 16-bit field that specifies the operation requested (for example, print-job or get-printer-state).  I choose to display this value by looking it up in the IPPOperationsSupported table, then printing it as the string, followed by the hex value (e.g. “Get-Printer-Attributes (0xB)”).  The ‘this’ keyword simply uses the value of the current field, which in this case is the OpId.  Even though Netmon parses through the packet sequentially, this kind of use of a Field before its value is retrieved is allowed.  Finally, I set the RequestId field, which is a 32-bit int value.  Since this field is just a transaction ID for this conversation, I don’t need to do any formatting to it.

     

    After that, things got a little more complicated.  IPP’s structure allows for a variable number of attribute groups, each of which can contain a variable number of attributes.  For example, in response to the request “Get-Printer-Attributes” from the client, the server responds with the Printer Attributes group, which contains a number of attributes like printer-state, queued-job-count, and so on.  First, I needed to deal with the attribute groups in a loop until I’d read each one.  IPP specifies that the end of the attribute groups is specified with the value of 0x03, so I wrote a while loop to create attribute groups until FrameData[FrameOffset] is equal to 3 (See Figure 3 below).  FrameData and FrameOffset are special values provided by Netmon.  FrameData is an array of the entire contents of the frame, and FrameOffset is Netmon’s current location in the packet.  I use this instead of declaring a field here because referencing FrameData[FrameOffset] does not advance the parser frame location.  This is important because I want to consume that value further down.

     

    Inside that loop, I declared another struct that contains an attribute group.  Much like the Protocol IPP line above, we reference a field here that will be declared lower down.  This line does not advance the FrameOffset, since we don’t declare a field here.  The first line of this struct is the field declaration line that finally consumes the attribute group tag.  Below that is another While loop to process all attributes in the attribute group.  IPP differentiates between attributes and attribute groups by making all attribute group identifiers smaller than 0x10, and all attribute identifiers 0x10 or higher.  I use this as the condition for my loop.  Finally, I declare an Attribute struct inside this loop.  This struct is displayed after looking up how to properly print based on the Attribute Name and Value in the AttribDisplayTable.

     

    IPP declares attributes as an 8-bit type identifier (int, bool, string, etc.), a 16-bit int specifying the attribute name’s length, the name (a string), a 16-bit in value length, and a value.  Since I want to look up the value in various tables, depending on the Attribute Name, I store the Attribute Name as “AttName” in a property.  This way, I can continue to reference it while processing continues.  Properties are declared in brackets just above the field they will store.  In my case, I prepend the ‘Post.’ evaluation keyword to the property name.  This instructs Netmon to use the end result of the next line as its value, but before advancing the FrameOffset.  I do this again for the actual value, which I call Val.  If I did not use the Post evaluation keyword, Val would contain the unsigned int32 value of printer state, instead of the formatted string result I get by looking up printer state in its table.

     

    While [FrameData[FrameOffset] != 0x03]

    {

          struct AttributeGroup = FormatString("%s", IPPTags(TagGroup))

          {

                INT8 TagGroup = FormatString("%s (0x%X)", IPPTags(this), this);

                While [FrameData[FrameOffset] >= 0x10]

                {

                      struct Attribute = AttribDisplayTable(AttName, Val)

                      {

                            INT8 Type = FormatString("%s (0x%X)", IPPTags(this), this);

                            INT16 NameLength;

                            [Post.AttName]

                            AsciiString(NameLength) AttributeName;   

                            INT16 ValueLength;

                            switch(AttName)

                            {

                                  case "printer-state" :

                                  [Post.Val]

                                  UINT32 PrinterState = FormatString("%s (0x%X)", IPPPrinterState(this),this);

                                  ...

    Figure 3: Loops in protocol block

     

    My case statements continue like printer-state for all possible attributes of IPP.  At the very end of the protocol block, after I’ve closed my switch, structs, and whiles, I have one more line, which consumes any data remaining in the packet.  This would contain document data if the packet was a print job, and is required so all the packet data is consumed before Netmon moves on to the next packet.  That line is:

     

    BLOB (FrameLength - FrameOffset) Data;

     

    As you can see, it is a binary blob data type, set to the size of the frame, less our current location.

     

    Finally, after my Protocol block, I needed to define my own data type.  IPP defines its own data format to specify printer uptime, so I created a data type for it as shown below in Figure 4.

     

    //Uptime format spec

    Number UPTIME

    {

          Size = 4;

          DisplayFormat = (this != 0) ? FormatString("%i days %02i:%02i:%02i (%i seconds)",

                (this/86400), //Days

                (this-((this/86400)*86400))/3600, //Hours

                (this-(((this/86400)*86400)+(((this-((this/86400)*86400))/3600)*3600)))/60, //Minutes

                (this%60), //Seconds

                this) : "0"

    }

    Figure 4: Custom data type

     

    The first line of Figure 4 specifies this will be a data type composed of numeric data named UPTIME.  Size specifies how many bytes the type uses.  DisplayFormat is what Netmon displays for this type. In this case, I use the x ? y : z syntax.  Netmon doesn’t have if/then/else keywords, but instead uses this ternary operator.  I use a special case for 0 since it seems to be a common return value in the traces I’ve looked at, and having ‘Uptime: 0 days 00:00:00 (0 seconds)’ seemed excessive.

     

    Figures 5 and 6 below show what the result looks like in Netmon.

    Figure5

    Figure 5: Frame Summary

     

    Figure6

    Figure 6: Frame Detail

     

    So what did the trace show?  Windows attempts to send IPP requests with no authentication first, then if it receives an access denied, retries with authentication.  This is by design, as the IPP server replies with the authentication types it supports in the access denied message.  In the case of print jobs that are too large to fit in a single packet, IPP’s spec allows servers to either issue the access denied message as soon as it receives the first packet, or after it has received the entire job.  It turns out that the IPP Print Provider on Windows was designed to send the entire job before listening for a response, so it missed the access denied message that CUPS sent after it received the first packet.  See http://support.microsoft.com/kb/976988/ for related information.  Want a copy of the IPP parser? It will be included in a future release of the Netmon Parser Update.

     

    I hope this post have given you a better idea of how Netmon works, how IPP works, and helps if you ever need to write a parser for your protocol.

    -Matt Burrough

  • Ntdebugging Blog

    Debugging a Bugcheck 0x109

    • 4 Comments

    My name is Nader Khonsari. I am an escalation engineer in Platforms Global Escalation Services. I want to share with you a recent experience where 64-bit Windows Server 2008 servers at a customer location were encountering bugcheck 0x109 blue screen crashes.

     

    In 64-bit versions of the Windows kernel PatchGuard is present. If any driver or application attempts to modify the kernel the PatchGuard will generate the bugcheck (CRITICAL_STRUCTURE_CORRUPTION) mentioned below. PatchGuard protects the kernel from modification by malicious or badly written drivers or software.

     

    To further investigate this bugcheck you need to compare the impacted kernel function with a known reliable one. For instance, if the machine encountering this was running Windows Server 2008 service pack 2 with a post SP2 hotfix kernel you need to compare the impacted kernel function with that of service pack 2 kernel function. Usually you do not need to download and extract the post SP2 hotfix, because the vast majority of the kernel code has not been modified since the service pack.

     

    If you already have service pack 2 for Windows Server 2008 handy, expand the package using instructions included in KB928636:

     

    Windows6.0-KB948465-X64.exe /x

    expand.exe -f:* C:\WS08\SP2\windows6.0-kb948465-X64.cab C:\WS08\SP2\Expanded

     

    Locate the kernel binary from the expanded binaries and then open it up with your debugger just like you open a crash memory dump.

     

    windbg -z C:\WS08\SP2\Expanded\amd64_microsoft-windows-os-kernel_31bf3856ad364e35_6.0.6002.18005_none_ca3a763069a24eea\ntoskrnl.exe

     

    This is the bugcheck data from the dump:

     

    CRITICAL_STRUCTURE_CORRUPTION (109)

    This bugcheck is generated when the kernel detects that critical kernel code or

    data have been corrupted. There are generally three causes for a corruption:

    1) A driver has inadvertently or deliberately modified critical kernel code

     or data. See http://www.microsoft.com/whdc/driver/kernel/64bitPatching.mspx

    2) A developer attempted to set a normal kernel breakpoint using a kernel

     debugger that was not attached when the system was booted. Normal breakpoints,

     "bp", can only be set if the debugger is attached at boot time. Hardware

     breakpoints, "ba", can be set at any time.

    3) A hardware corruption occurred, e.g. failing RAM holding kernel code or data.

    Arguments:

    Arg1: a3a039d89b456543, Reserved

    Arg2: b3b7465eedc23277, Reserved

    Arg3: fffff80001778470, Failure type dependent information

    Arg4: 0000000000000001, Type of corrupted region, can be

            0 : A generic data region

            1 : Modification of a function or .pdata

            2 : A processor IDT

            3 : A processor GDT

            4 : Type 1 process list corruption

            5 : Type 2 process list corruption

            6 : Debug routine modification

            7 : Critical MSR modification

     

    Next, check the address at Arg3. This will give you the function that was modified, but not the offset of the modified instruction.

     

    3: kd> ln fffff80001778470

    (fffff800`01778470)   nt!KeSetSystemTime   |  (fffff800`01778790)   nt!BiLoadSystemStore

    Exact matches:

        nt!KeSetSystemTime = <no type information>

     

    Unassemble the same function in the SP2 kernel binary you expanded from the SP2 package. Do the same with the function of the crashed kernel and compare the two. You will find the modified opcode compared to that of the unmodified kernel.

     

    Below is the comparison of the nt!KeSetSystemTime code of the crashed kernel and that of the service pack 2 kernel respectively. They match fine except for the highlighted byte in the prefetch instruction which has been overwritten with a 0x1f.  This changed the instruction to a nop, which is done to prevent the prefetch operation from occurring on processors that don't support prefetch.

     

    nt!KeSetSystemTime+0x156:

    fffff800`017785c6 0f1f0f          nop     dword ptr [rdi]

    fffff800`017785c9 488b07          mov     rax,qword ptr [rdi]

    fffff800`017785cc 493bc7          cmp     rax,r15

    fffff800`017785cf 7516            jne     nt!KeSetSystemTime+0x177

     

    ntoskrnl!KeSetSystemTime+0x156:

    00000001`4012e5b6 0f0d0f          prefetchw [rdi]

    00000001`4012e5b9 488b07          mov     rax,qword ptr [rdi]

    00000001`4012e5bc 493bc7          cmp     rax,r15

    00000001`4012e5bf 7516            jne    

     

    After further investigation this turned out to be a known issue in the VMware environment when the VM is moved from a non-prefetch to a prefetch architecture and even then, only in a live-migration case.  The issue is documented on VMWare's site at http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1008749&sliceId=1&docTypeID=DT_KB_1_1&dialogID=74787167&stateId=0 .

  • Ntdebugging Blog

    Debugging services startup in Svchost from a kernel mode debug session

    • 3 Comments

    Windows shared services allow us to run system services together in a single service by having multiple DLLs run in a single process called Svchost. This allows Windows to have many services to run with the overhead of a single process.   You can find more information about shared services here and here.

     

    I like to do a lot of my user mode debugging from a kernel mode debugger. I can just attach to a Hyper-V VM though a virtual serial port and all my tools are available to me without having to set up a user mode debugger on the target, plus if my application calls into kernel mode I’m all set.

     

    The Svchost process won’t load the DLL until it’s time to start the service, so it can be tricky to set the correct breakpoint to catch the service starting up. This is complicated by the fact that the dll entry point will usually be <module>!ServiceMain, however this is not necessarily the case.

     

    We can leverage the fact that svchost.exe exists in the same shared memory across all the Svchost processes, and a change to this shared memory will affect all of the svchost process.  When you set a user mode breakpoint from a kernel mode debugger the debugger changes the user mode code by inserting an int 3 (the debugger changes the code back when it is executed).  We can take advantage of this functionality to set one breakpoint that will fire from any svchost process.

     

    If we can get past the Svchost process loading the service dll, we can find the host process main function or any other function we want to set a breakpoint on.

     

    First, find the process address of any Svchost. I’m just showing the first of a long list of processes:

     

    kd> !process  0 0 svchost.exe

    PROCESS fffffa80098c5970

        SessionId: 0  Cid: 0270    Peb: 7fffffd8000  ParentCid: 0204

        DirBase: 35986000  ObjectTable: fffff8a001115b20  HandleCount: 343.

        Image: svchost.exe

     

    Now let’s get into the context of this Svchost (it doesn’t matter which one) and set a breakpoint.  .process –i sets an invasive breakpoint so we can get into the active context of the process.

     

    kd> .process -i fffffa80098c5970

    You need to continue execution (press 'g' <enter>) for the context

    to be switched. When the debugger breaks in again, you will be in

    the new process context.

     

    Then go the debugger to enter the active process.

     

    kd> g

    Break instruction exception - code 80000003 (first chance)

    nt!DbgBreakPointWithStatus:

     

    GetServiceMainFunctions()  is where Svchost both loads the service DLL and gets the service main function so let’s set a breakpoint for it:

     

    kd> .reload /user

    Loading User Symbols

    .......................................................

    kd> bp svchost!GetServiceMainFunctions

     

    Now I just go the debugger and start my service.  I’ll use the workstation service as an example (I had previously stopped the workstation service for this test).

     

    kd> g

     

    At a command prompt on the target I enter:

     

    C:\> net start workstation

     

    In the debugger I get:

     

    Breakpoint 0 hit

    svchost!GetServiceMainFunctions:

    0033:00000000`ffc813cc fff3            push    rbx

    kd> k

    Child-SP          RetAddr           Call Site

    00000000`0339f7f8 00000000`ffc8128d svchost!GetServiceMainFunctions

    00000000`0339f800 000007fe`fe38a82d svchost!ServiceStarter+0xba

    00000000`0339f890 00000000`77a5f56d sechost!ScSvcctrlThreadW+0x25

    00000000`0339f8c0 00000000`77b93281 kernel32!BaseThreadInitThunk+0xd

    00000000`0339f8f0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    svchost!GetServiceMainFunctions is called from Svchost!ServiceStarter.  I just leave GetServiceMainFunctions with gu (Go Up). Now the service dll will have been loaded and I can load symbols for wkssvc.dll and get the service main() function.

     

    kd> gu

    svchost!ServiceStarter+0xba:

    0033:00000000`ffc8128d 488b7c2430      mov     rdi,qword ptr [rsp+30h] ss:002b:00000000`0339f830=000007fefb443a7c

    kd> k

    Child-SP          RetAddr           Call Site

    00000000`0339f800 000007fe`fe38a82d svchost!ServiceStarter+0xba

    00000000`0339f890 00000000`77a5f56d sechost!ScSvcctrlThreadW+0x25

    00000000`0339f8c0 00000000`77b93281 kernel32!BaseThreadInitThunk+0xd

    00000000`0339f8f0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    Now loading my user symbols I can get all the symbols for wkssvc.dll

     

    kd> .reload /user

    Loading User Symbols

    ................................................................

    kd> dps 00000000`0339f800  l7

    00000000`0339f800  00000000`00000000

    00000000`0339f808  00000000`00000000

    00000000`0339f810  00000000`00000000

    00000000`0339f818  00000000`00000000

    00000000`0339f820  00000000`00000000

    00000000`0339f828  00000000`00000000

    00000000`0339f830  000007fe`fb443a7c wkssvc!ServiceMain <- Here is the Service Main function.

     

    I can now set a breakpoint on wkssvc!ServiceMain,  and as a bonus I now have access to all available breakpoints in wkssvc.dll

     

    kd> bp wkssvc!ServiceMain

    kd> g

    Breakpoint 1 hit

    wkssvc!ServiceMain:

    0033:000007fe`fb443a7c 48895c2408      mov     qword ptr [rsp+8],rbx ss:002b:00000000`0339f800={wkssvc!SvchostPushServiceGlobals (000007fe`fb442ff4)}

    kd> k

    Child-SP          RetAddr           Call Site

    00000000`0339f7f8 00000000`ffc81344 wkssvc!ServiceMain

    00000000`0339f800 000007fe`fe38a82d svchost!ServiceStarter+0x1e8

    00000000`0339f890 00000000`77a5f56d sechost!ScSvcctrlThreadW+0x25

    00000000`0339f8c0 00000000`77b93281 kernel32!BaseThreadInitThunk+0xd

    00000000`0339f8f0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    Now I can debug the service on startup, not caring what instance of svchost it’s in with full user and kernel mode access.

  • Ntdebugging Blog

    I Want A Debugger Robot

    • 1 Comments

    Hi,

     

    My name is Sabin from the Platforms Global Escalation Services team at Microsoft, and today I want to share with you a recent experience I had debugging an issue reported by an hardware manufacturer.

     

    The customer was doing a reboot test for their new server product line. They found that after hundreds of continuous reboots there was always a single instance that the server took more than 20 minutes to start up, when compared to an average 2 minute normal startup time. This happened only once every 300+ to 1000+ reboots. The number of reboots it took before the problem happened again varied randomly so it was difficult to predict when the problem would occur.

     

    Although they setup a live kernel debugging environment, they didn’t want to watch the computer screen for 10+ hours waiting for the problem to happen so they could manually hit Ctrl+Break in windbg. So instead they setup a video camera to film the computer screen 24x7, and they managed to find that when the “mysterious delay” happened the computer showed a gray screen with “Microsoft (R) Windows (R) version 5.1 (Build 3790: Service Pack 2)”.

     

    The case came to me and the customer even shipped a problematic server to our office to troubleshoot the cause of the delay. The problem was that I didn’t want to stare at the computer screen for 10+ hours either!

     

    The first thing I thought was that it would be fantastic if there were a robot sitting in front of Windbg, watching the elapsed time for each reboot, so it could hit Ctrl+Break in windbg if the server takes more than 10 minute to start. Then I asked myself, “Why not?”

     

    I decided to build such a “robot” myself.  I went around and checked the Debuggers SDK document (which can be found in the windbg help document debugger.chm), and I realized that what I needed was a customized debugger. The functionality of the debugger is simple, it should be able to recognize the time when the server first starts and the time when the server reboots. If there is more than 10 minutes between these two times the customized debugger automatically breaks in to the server. The event callback interface IDebugEventCallbacks::SessionStatus and the client interface IDebugControl::SetInterrupt can meet my needs perfectly.

     

    It is not that difficult to build such a customized debugger, which I called DBGRobot. I would like to share some code snippets which you may find helpful when building a customized debugger for a special debugging scenario, or as the basis for building a more complicated debugging robot.

     

    First, we need to download and install the Windows Driver Kit Version 7.1.0. When installing the WDK be sure to select Debugging Tools for Windows.

     

     

     

    If you install the WDK to its default folder, which for version 7.1.0 is C:\WinDDK\7600.16385.1, the C:\WinDDK\7600.16385.1\Debuggers\sdk\samples folder will contain the sample code from the Debugger SDK. The dumpstk sample is the one particularly interesting to us. We can copy some common code from it, such as the out.cpp and out.hpp which is the implementation of the IDebugOutputCallbacks interface.

     

    Now let’s do some coding.  The common code is copied from the Debuggers SDK sample Dumpstk. I also listed it here for clarity.

     

    The first step is to create the IDebugClient, IDebugControl and IDebugSymbols interfaces (although IDebugSymbols is not used in this case). You need to call the DebugCreate() function to create the IDebugClient interface, and then use IDebugClient->QueryInterface() to query the IDebugControl and IDebugSymbols interfaces.

     

    void

    CreateInterfaces(void)

    {

        HRESULT Status;

     

        // Start things off by getting an initial interface from

        // the engine.  This can be any engine interface but is

        // generally IDebugClient as the client interface is

        // where sessions are started.

        if ((Status = DebugCreate(__uuidof(IDebugClient),

                                  (void**)&g_Client)) != S_OK)

        {

            Exit(1, "DebugCreate failed, 0x%X\n", Status);

        }

     

        // Query for some other interfaces that we'll need.

        if ((Status = g_Client->QueryInterface(__uuidof(IDebugControl),

                                               (void**)&g_Control)) != S_OK ||

            (Status = g_Client->QueryInterface(__uuidof(IDebugSymbols),

                                               (void**)&g_Symbols)) != S_OK)

        {

            Exit(1, "QueryInterface failed, 0x%X\n", Status);

        }

    }

     

    If you want to see the output from the debugging engine, you also need to implement the IDebugOutputCallbacks interface. The main function to be implemented is IDebugOutputCallbacks::Output(), which is quite simple as we only need to see the output in the command prompt stdout stream:

     

    STDMETHODIMP

    StdioOutputCallbacks::Output(

        THIS_

        IN ULONG Mask,

        IN PCSTR Text

        )

    {

        UNREFERENCED_PARAMETER(Mask);

        fputs(Text, stdout);

        return S_OK;

    }

     

    Here comes our main code logic: we need to implement the IDebugEventCallbacks interface and monitor the SessionStatus events. In order for the debugger engine to deliver the SessionStatus events to us we need to set the DEBUG_EVENT_SESSION_STATUS mask in IDebugEventCallbacks::GetInterestMask():

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::GetInterestMask(

            THIS_

            OUT PULONG Mask

            )

    {

     

        // Here we monitor all the event types, although it is not necessary

        *Mask = DEBUG_EVENT_SESSION_STATUS | DEBUG_EVENT_EXCEPTION | DEBUG_EVENT_CHANGE_DEBUGGEE_STATE | DEBUG_EVENT_CHANGE_ENGINE_STATE;    

        return S_OK;

     

    }

     

    In the RobotEventCallBacks::SessionStatus() function, if we receive the DEBUG_SESSION_ACTIVE session status we know that Windows has started and connected to the debug session. Once the OS has started we create a new thread that waits for hEvent, an unnamed global event. In the new thread we send a timeout value for 10 minutes to the WaitForSingleObject() call.  The hEvent will be set when the session status is DEBUG_SESSION_END or DEBUG_SESSION_REBOOT, which means that the server rebooted and disconnected from the debug session. If WaitForSingleObject() returns WAIT_TIMEOUT, indicating that the server has not rebooted after 10 minutes, we send a break to the server via g_Control->SetInterrupt(DEBUG_INTERRUPT_ACTIVE).

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::SessionStatus(

        THIS_

        IN ULONG Status

        )

    {

                   

        printf("Session Status changed, status = %d \n", Status);

     

        if (Status == DEBUG_SESSION_ACTIVE)

        {

                    //bSessionActive = TRUE;

                    ::CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)TimerThread, NULL, 0, NULL);

        }

        else if ((Status == DEBUG_SESSION_END) || (Status == DEBUG_SESSION_REBOOT) )

        {

                    //bSessionActive = FALSE;

                    SetEvent(hEvent);

     

        }

     

        return S_OK;

    }

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::Exception(

          THIS_

          IN PEXCEPTION_RECORD64 Exception,

          IN ULONG FirstChance

          )    

    {

          printf("Captured exception 0x%x, FirstChance = %d", Exception->ExceptionCode, FirstChance);

          if (FirstChance && Exception->ExceptionCode == 0x80000003)

          {

                bInterrupted = TRUE;

          }

         

          return DEBUG_STATUS_NO_CHANGE;

    }

     

    void TimerThread()

    {

                    HRESULT Status;

                    DWORD result;

     

                    result = WaitForSingleObject(hEvent,10*60*1000);

     

                    if (result == WAIT_OBJECT_0) //the close event is signled within 10 minutes

                    {

                                    printf("Target reboots within 10 minutes \n");             

                                   

                    }

                    else if (result == WAIT_TIMEOUT)  //the close event is not singled within 10 minute, it means the server hangs

                    {

                        if ((Status = g_Control->SetInterrupt(DEBUG_INTERRUPT_ACTIVE)) !=S_OK)  //force the target to break

                        {

                                    Exit(1, "SetInterrupt failed, 0x%x\n", Status);

                        }

                                       

                    }

                   

    }

     

    Below is our main function. In the main function we need to create the interfaces, and set the callback interfaces g_Client->SetOutputCallbacks and g_Client->SetEventCallbacks. Then we call g_Control->WaitForEvent() in the while(TRUE) loop, which is a typical main loop for a debugger program. If WaitForEvent() returns it means that a debugging event happened, such an exception, and then you can send some debug commands via g_Control->Execute. Here we simply send a “.time” command to record the problematic time. The bInterrupted boolean is set in the RobotEventCallBacks::Exception() function when it detects a first chance 80000003 exception, which means a debugger break-in.

     

    Note: Initially I wanted to send a “.crash” command to crash the server and generate the memory dump.  When I tried this I found that although the “.crash” command did trigger a blue screen it failed to generate a memory dump because the problem happened so early in the boot process and the paging file was not initialized yet. So I changed it to just send a “.time” command and performed live debugging after the problem was captured by the tool. Initially I had forgotten to change the printf output, which is why you see the output “Succeeded in capturing the problem and crashing the server!” in the below screenshot.

     

    void __cdecl

    main(int argc, char* argv[])

    {

     

        HRESULT Status;

       

        hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);  //create the global hEvent

     

        CreateInterfaces();  //create the global interfaces

     

        if ((Status = g_Client->SetOutputCallbacks(&g_OutputCb)) != S_OK)  //set the output callback interface

        {

            Exit(1, "SetOutputCallbacks failed, 0x%X\n", Status);

        }

     

        if ((Status = g_Client->SetEventCallbacks(&g_EventCb)) != S_OK)  //Set the event callback interface

        {

            Exit(1, "SetEventCallbacks failed, 0x%X\n", Status);

        }

     

     

     

        Status = g_Client->AttachKernel(DEBUG_ATTACH_KERNEL_CONNECTION, argv[1]); //Attach to the kernel debugging, the argv[1] should be “COM:Port=COM1,baud=115200”

     

        if (Status != S_OK)

        {

                    Exit(1, "AttachKernel failed, 0x%X\n", Status);              

     

        }

     

        while(TRUE)  //This is the main loop of the debugger program, it calls g_Control->WaitForEvent() repeatedly waiting for the debugging events

        {

     

                        Status = g_Control->WaitForEvent(DEBUG_WAIT_DEFAULT,

                                                              INFINITE);  //wait for 10 minutes

                                                             

                        printf("WaitForEvent returned. Status = 0x%x\n", Status);

                       

                       

                        if (Status == S_OK && bInterrupted)

                        {

                                    if ((Status = g_Control->Execute(DEBUG_OUTCTL_THIS_CLIENT, ".time", DEBUG_EXECUTE_ECHO)) !=S_OK)  //do .time to record the time the problem happens

                                        {

                                                    Exit(1, "Execute failed, 0x%x\n", Status);

                   

                                        }

                                       

                                    Exit(0, "Succeeded in capturing the problem!");

                        }

        }

       

        Exit(0, NULL);

    }

     

    Now that we have the code we can build DBGRobot.exe using the WDK 7.1.0 build environment. Remember to set the following environment variables, as mentioned in the readme.txt under \WinDDK\7600.16385.1\Debuggers\samples

     

    SET DBGLIB_LIB_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\lib

    SET DBGSDK_INC_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\inc

    SET DBGSDK_LIB_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\lib

     

    Now I have the “robot” watching the problem for me. I set up the live kernel debugging environment as usual and then ran the DBGRobot.exe tool as the debugger instead of windbg.  I used the command: dbgrobot.exe com:port=COM1,baudrate=115200.

     

    This is the screenshot when the server starts and reboots normally:

    image

     

    I let the DBGRobot.exe run over night, and in the morning when I came back to the office, the “robot” had done its job perfectly and captured the problem for me! The problematic server was left in the break-in state so I just needed to launch windbg to attach to the server, and then I could do the debugging in windbg as normal.

    image

     

    The cause of the delay turned out to be a hardware problem, but the actual problem is not important here. I hope this article can give you some new ideas about debugging, and using the debug tools in new ways.

     

  • Ntdebugging Blog

    Part 3: Understanding !PTE - Non-PAE and X64

    • 1 Comments

    Hello, Ryan Mangipano (ryanman) again with part three of my series on understanding the output of the !PTE command. In this last installment I’ll continue our manual conversion of Virtual Addresses by converting a Non-PAE VA. Afterwards I’ll convert a VA from X64 Long Mode. Then I’ll discuss the TLB. If you haven’t read part one and two of this series I recommend taking a looking before jumping into the rest of this article.

    Non-PAE system VA to Physical Address conversion

    In part 1 and part 2 I discussed using windbg to manually x86 PAE virtual to physical address conversion with 4KB and 2MB pages. Now I’ll mention the same thing for Non-PAE systems. The official processor manuals explain how systems not using PAE have only two level of tables instead of the three used by PAE. This fact is because PAE added the Page Directory Pointer Table. This means the CR3 register will contain a pointer to the Page Directory Table. Also note the size of the table entries will be 32-bit, not the 64-bit size observed in the PAE tables. PAE expanded the table size to allow for more bits for the purpose of allowing the addressing of more physical memory.

    The PAE bit is bit number five, which is the sixth bit due to bit numbering starting at zero. You can see PAE is not enabled on this system.

           0: kd> .formats @cr4

      Binary:  00000000 00000000 00000110 11010001

     Another method of checking this:

                   1: kd> j ((@cr4 & 0y00000000000000000000000000100000) != 0) '.echo PAE flag Enabled';'.echo PAE flag Disabled'

    PAE flag Enabled

     Now I'll use the following random virtual address which is valid:

     

    0: kd> !pte f72c5c00

    F72C5C00  - PDE at C0300F70        PTE at C03DCB14

              contains 01014963      contains 06CE7963

            pfn 1014 G-DA--KWV    pfn 6ce7 G-DA—KWV

     

    0: kd> .formats f72c5c00

      Binary:  11110111 00101100 01011100 00000000

     Let’s break it down:

     

    PD offset                        11110111 00

    Page Table Offset                       101100 0101

    Offset in the Physical Page             1100 00000000 (since it’s a even 12 bits,  just refer to it in hex as c00)

     

    0: kd> !dd @cr3 + (0y1111011100 * @@(sizeof(nt!HARDWARE_PTE)))L1

    # a07df70 01014963

     

    0: kd> !dd 1014000 + (0y1011000101 * @@(sizeof(nt!HARDWARE_PTE)))L1

    # 1014b14 06ce7963

     Now that I have the physical page base, I'll place  the last 3 hex digits (c00)  from the Virtual Address onto the address base.

     

    0: kd> !dd 6ce7c00 L4

    # 6ce7c00 00000001 c0000005 00000000 00000000

     

    0: kd> dd f72c5c00 L4

    f72c5c00  00000001 c0000005 00000000 00000000

    X64 VA to Physical Address Conversion

    Just as PAE added a third level to the non-PAE two-level system, x64 Long mode adds a fourth level to the hierarchy.  This table is called the Page-Map Level-4 (PML4 table). AMD refers to the entries in this table as PML4E (Page-Map Level-4 Entry).  Intel refers to each entry as PML4-Table Entry. Internally we refer to this as the eXtended Page directory Entry (PXE).  Regardless of how you refer to these entries they contain indexes into the PDP table (Page Directory Pointer Table).

     Here is the output of the !pte command against a 64-bit address:

                   7: kd> !pte fffffade`c24eb7c0

                                     VA fffffadec24eb7c0

    PXE @ FFFFF6FB7DBEDFA8     PPE at FFFFF6FB7DBF5BD8    PDE at FFFFF6FB7EB7B090    PTE at FFFFF6FD6F612758

    contains 0000000111800863  contains 0000000119826863  contains 0000000119839963  contains 0000000001FF6121

    pfn 111800     ---DA--KWEV  pfn 119826     ---DA--KWEV  pfn 119839     -G-DA--KWEV  pfn 1ff6       -G--A—KREV

    I'll break it down in binary and use data from the processor manuals to separate the bits

     

    7: kd> .formats fffffade`c24eb7c0

      Binary:  11111111 11111111 11111010 11011110 11000010 01001110 10110111 11000000

     

    Sign extend               11111111 11111111

    PML4 offset               11111010 1

    PDP offset                1011110 11

    PD offset                 000010 010

    Page-Table offset         01110 1011

    Physical Page Offset      0111 11000000

     

     Now that I have the numbers, I'll plug them in and find the physical address. If you are having problems following along, refer to part one of this blog and the AMD x64 System Programming manual. You should be comparing the output below to the !pte output above

     

    7: kd> !dq @cr3 + ( 0y111110101 * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #  147fa8 00000001`11800863

     

    7: kd> !dq 0x00111800000 + (  0y101111011  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #111800bd8 00000001`19826863

     

    7: kd> !dq 0x119826000 + ( 0y000010010  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #119826090 00000001`19839963

     

    7: kd> !dq 0x119839000 + ( 0y011101011  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #119839758 00000000`01ff6121

     

    7: kd> !dc 1ff67c0 L4

    # 1ff67c0 5085ff48 48000005 68244c8b 04a8f633 H..P...H.L$h3...

     

    7: kd> dc fffffade`c24eb7c0 L4

    fffffade`c24eb7c0  5085ff48 48000005 68244c8b 04a8f633  H..P...H.L$h3...

    TLB- Translation Lookaside Buffer and Conclusion

    The CPU’s memory management unit performs these operations to translate virtual addresses to physical. Wouldn’t it be great if we could cache the virtual address to physical page information in a location that can be accessed very quickly so that the CPU doesn’t have to look this up for future references to this page?  That is just what the Translation Lookaside Buffer (TLB) does. Hopefully this will shed some light on some basic memory structures like Large Pages, Flags, and the TLB so I encourage you to read more about these topics from the following sources-

    How PAE x86 works (on MSDN): http://technet.microsoft.com/en-us/library/cc736309(WS.10).aspx

    Intel  & AMD processor manuals: http://www.intel.com/products/processor/manuals/index.htm and http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals

    “Windows Internals, 5th Edition” Mark E. Russinovich and David A. Solomon with Alex Ionescu  -Chapter 9: Memory Management

     

     

  • Ntdebugging Blog

    ‘Hidden’ ETW Stack Trace Feature – Get Stacks All Over the Place!

    • 0 Comments

     

    I’ve heard Mark R. say that “sometimes Microsoft gave me a gift” when coming across an interesting feature not in mainstream documentation.  How true that often the odd find can prove ultimately very valuable.  Such was the case when I read Bruce Dawson’s Stack Walking in Xperf blog entry.

    Here’s the excerpt that blew me away:

    You can also record call stacks for manifest-based ETW (Event Tracing for Windows) events, but the syntax is quite different, and it only works on Windows 7 and above. When you specify your ETW provider to xperf after “-on” you can specify extra parameters after the provider name, separated by colons. These are flags, a level, and, for manifest-based providers, a list of extra data to record, which can include call stacks. You can leave the flags and level fields blank and just specify ‘stack’ (in single quotes) after three colons like this:

         xperf -on Latency -stackwalk profile -start browse -on Microsoft-IE:::'stack'  
         rem Your scenario goes here...     
         xperf -stop browse -stop -d mytrace.etl

    This sounds rather ordinary at first,  but looking at the syntax you may realize that this means for just about every manifest based provider (Microsoft-*) one can generate a stack trace on all ETW events therein.  This is really cool because if you are only interested in stacks, you don’t have to monkey around with any other trace level or flags which is often the most frustrating part about manual ETW tracing.  Recall from Ivan’s blog that there are over 600+ inbox providers to explore so we can go crazy here exploring components!  Let’s demonstrate.

    First, remember to use the Win7 SDK version of XPerf.  The XPerf(WPT) install *.msi files are found in the Bin directory after installing Win32 Development Tools, so you have to do a two-step at present to get the tools installed.

    Let’s list all the providers…

    C:\utils\xperf7sdk>xperf -providers | findstr "Microsoft-"
          .
           .
           .
           93c05d69-51a3-485e-877f-1806a8731346:0x0001000000000000:0x4       : Microsoft-Windows-LUA
           9580d7dd-0379-4658-9870-d5be7d52d6de:0x0001000000000000:0x4       : Microsoft-Windows-WLAN-AutoConfig
           959f1fac-7ca8-4ed1-89dc-cdfa7e093cb0:0x0001000000000000:0x4       : Microsoft-Windows-HealthCenterCPL
           96ac7637-5950-4a30-b8f7-e07e8e5734c1:0x0001000000000000:0x4       : Microsoft-Windows-Kernel-BootDiagnostics
           96f4a050-7e31-453c-88be-9634f4e02139:0x0001000000000000:0x4       : Microsoft-Windows-UserPnp
           98583af0-fc93-4e71-96d5-9f8da716c6b8:0x0001000000000000:0x4       : Microsoft-Windows-Dwm-Udwm
           a50b09f8-93eb-4396-84c9-dc921259f952:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-MSDE<
           ad5162d8-daf0-4a25-88a7-01cbeb33902e:0x0001000000000000:0x4       : Microsoft-Windows-WPDClassInstaller
           ae4bd3be-f36f-45b6-8d21-bdd6fb832853:0x0001000000000000:0x4       : Microsoft-Windows-Audio
           af2e340c-0743-4f5a-b2d3-2f7225d215de:0x0001000000000000:0x4       : Microsoft-Windows-Netshell
           b03d4051-3564-4e93-93db-3c34f1b5b503:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-TaskManager
           b1f90b27-4551-49d6-b2bd-dfc6453762a6:0x0001000000000000:0x4       : Microsoft-Windows-PowerCpl
           b2a40f1f-a05a-4dfd-886a-4c4f18c4334c:0x0001000000000000:0x4       : Microsoft-PerfTrack-IEFRAME
           c89b991e-3b48-49b2-80d3-ac000dfc9749:0x0001000000000000:0x4       : Microsoft-Windows-Documents
           c9bdb4eb-9287-4c8e-8378-6896f0d1c5ef:0x0001000000000000:0x4       : Microsoft-Windows-HomeGroup-ProviderService
           dbe9b383-7cf3-4331-91cc-a3cb16a3b538:0x0001000000000000:0x4       : Microsoft-Windows-Winlogon
           de513a55-c345-438b-9a74-e18cac5c5cc5:0x0001000000000000:0x4       : Microsoft-Windows-Help
           ded165cf-485d-4770-a3e7-9c5f0320e80c:0x0001000000000000:0x4       : Microsoft-Windows-DeviceUx
           e978f84e-582d-4167-977e-32af52706888:0x0001000000000000:0x4       : Microsoft-Windows-TabletPC-InputPanel
           ed6b3ba8-95b2-4cf5-a317-d4af7003884c:0x0001000000000000:0x4       : Microsoft-Windows-Sidebar
           f3f14ff3-7b80-4868-91d0-d77e497b025e:0x0001000000000000:0x4       : Microsoft-Windows-WMP
           fbcfac3f-8459-419f-8e48-1f0b49cdb85e:0x0001000000000000:0x4       : Microsoft-Windows-NetworkProfile
           ffdb9886-80f3-4540-aa8b-b85192217ddf:0x0001000000000000:0x4       : Microsoft-PerfTrack-MSHTML

    We have a lot to play with as you can see, but let’s say you are trying to track down a socket leak.  Since we know sockets are implemented in afd.sys, let’s search for that provider…

    C:\utils\xperf7sdk>xperf -providers | findstr "AFD"        
    e53c6823-7bb8-44bb-90dc3f86090d48a6                               : Microsoft-Windows-Winsock-AFD       
    Microsoft-Windows-Winsock-AFD


    Now, we can form the XPerf syntax to trace socket operations during the leak.  The “AFDSession” is just the session name we are making up, the rest is the prescriptive syntax. Just to explain the syntax we are starting multiple sessions at the time here – kernel logger session + AFDSession. Similarly we are stopping both sessions at the same time and then merging both into the one file AFD.etl file.

    xperf -on Latency -stackwalk profile -start AFDSession -on Microsoft-Windows-Winsock-AFD:::'stack'
    <reproduce leak>
    xperf -stop AFDSession -stop -d AFD.etl

    Run xperfview AFD.etl and switch on “Stack counts by type graph”. If you don’t see this frame, make sure you are on the 7SDK version

    clip_image001

    The beauty here is the Stack Counts by Type frame which allows us to explore which stack traces where collected, notice all the AFD stack count sources!

    clip_image002

    Let’s say we have the simple case of a massive number of Create’s happening and thus handle counts rocketing up (leaking) for a given process.  Perhaps exhausting all ephemeral ports on the system, ultimately resulting in any number of hang symptoms.

    Simply filter the Events for AfdCreate as follows.

    clip_image003

    Now, we have a filtered Frame of just those stack counts….

    clip_image004

    You can imagine that if there were a lot more creates here that the frequency distribution shown above would be much higher, but for the sake of just exploring the feature, let’s just look at these small number of creates anyway.  In fact, let’s select them all by just choosing Select View in the context menu which will get all of these events over the trace sample.

    clip_image005

    The frame area is now highlighted in light purple indicating the selected area.

    clip_image006

    Because we are interested in Stacks, make sure you have configured your symbol path and have checked Load Symbols

    (be patient, this could take a while to build up your symbol cache)

    clip_image007

    clip_image008

    Next, invoke the Summary Table

    clip_image009

    Click the flyout control on the left to select only the Creates again.

    clip_image010

    Next, make it easy on yourself by Filtering on the interesting process only.

    clip_image011

    After, much better.

    clip_image012

    Now, here is the fruit of our labor in the full stack trace inside iexplore.exe allocating the socket handles.

    clip_image013

    Continued…

    clip_image001[5]

     

    Consider that what we see above is read from the top down.  Iexplore.exe is doing this work through wininet, jsproxy, jscript, etc. all the way down through the calling of ws2_32.dll!socket which ultimately creates the socket and returned our file handle.  I’ve seen so many cases of ephemeral port exhaustion which is essentially a handle leak against sockets, these could have been solved in minutes with this tracing!  Keep in mind that this data isn’t perfectly suited for tracking handle leaks, especially small ones since there’s no accounting for outstanding handles.  However, sometimes the leak is egregious enough that you just need to look at activity over time to be pretty certain of the culprit.  Really, what we are doing here is just probing a particular ETW instrumented API, which is incredibly powerful.

    Typically, on 64-bit computers, you cannot capture the kernel stack in certain contexts when page faults are not allowed. To enable walking the kernel stack on AMD64, set the DisablePagingExecutive Memory Management registry value to 1 and reboot. The DisablePagingExecutive registry value is located under the following registry key: HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management

    In summary, this isn’t really a hidden feature since it’s documented in EnableTraceEx2’s Enable Parameters parameter, but I hope with this additional context you see what powerful analysis is possible.  Previously, this would have required live debugging a process, so now we have another example of the power of XPerf to decode and view key system activity, non-invasively!

    -Tate

  • Ntdebugging Blog

    x64 Manual Stack Reconstruction and Stack Walking

    • 2 Comments

    My name is Trey Nash and I am an Escalation Engineer on the Core OS team.  My experience is as a software developer, and therefore my blog posts tend to be slanted in the direction of helping developers during the feature development, testing and the support phases.

    In this installment I would like to expand a bit on a previous post of mine called Challenges of Debugging Optimized x64 Code.  In that post I discussed the nuances of the x64 calling convention (thankfully of which there is only one) and how it is used in optimized builds of software.  The calling convention is sometimes referred to as the Application Binary Interface (ABI).  In this post, I would like to discuss the x64 unwind metadata and how you can use it in the debugger to manually walk a stack.

    In some cases, you may have a corrupted stack that the debugger simply cannot effectively walk for you.  This often happens because the debugger walks a stack from the top down (assuming the stack grows upwards as if it were a stack of places on a table), and if the stack is sufficiently trashed then the debugger cannot find its bearing.  In the x86 world, a large percentage of the time, you can spot the stack frames by following the chain of base pointers and then build a crafty stack backtrace command to display the stack at some point in time.  But in the x64 calling convention there is no base pointer.  In fact, once a function’s prolog code has executed the rsp register generally never changes until the epilog code.  To read more about x64 prolog and epilog code conventions, go here.

    Moreover, the syntax for creating a crafty stack backtrace command in the x64 environment is currently undocumented, and I aim to shed some light on that near the end of that blog post. J

    The Example Code

    For this blog post I have used the following example C# code that requires the .NET 4.0 framework and can be easily built from a Visual Studio 2010 command prompt.  You can find the code below:

    using System;
    using System.Numerics;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Collections.Concurrent;

    class EntryPoint
    {
        const int FactorialsToCompute = 2000;

        static void Main() {
            var numbers = new ConcurrentDictionary<BigInteger, BigInteger>(4, FactorialsToCompute);

            // Create a factorial delegate.
            Func<BigInteger, BigInteger> factorial = null;
            factorial = (n) => ( n == 0 ) ? 1 : n * factorial(n-1);

            // Now compute the factorial of the list
            // concurrently.
            Parallel.For( 0,
                          FactorialsToCompute,
                          (i) => {
                            numbers[i] = factorial(i);
                          } );
        }
    }

    The spirit of this code is to concurrently compute the first 2000 factorials and store the results in a dictionary.  This code uses the new Task Parallel Library to distribute this work evenly across the multiple cores on the system.  To compile the example (assuming the code is stored in test.cs), you can execute the following command from a Visual Studio 2010 command prompt:

    csc /r:system.numerics.dll test.cs

    Note:    If you are using a 64bit platform, be sure to use the x64 command prompt shortcut installed by the Visual Studio 2010 installer.

    You can download a free evaluation of Visual Studio 2010
    here.

    x64 Unwind Metadata

    So how does the debugger and functions such as RtlVirtualUnwind know how to walk the x64 stack if it cannot find a base pointer?  The secret is that it uses unwind metadata that is typically baked into the Portable Executable (PE) file at link time. You can inspect this information using the /UNWINDINFO option of the command line tool dumpbin.  For example, I went to the directory on my machine which contains clr.dll (c:\Windows\Microsoft.NET\Framework\v4.0.30319) and dumped the unwind info looking for CLREvent::WaitEx, which I have pasted below:

      00013F20 000DFDB0 000DFE3C 007267D8  ?WaitEx@CLREvent@@QEAAKKW4WaitMode@@PEAUPendingSync@@@Z (public: unsigned long __cdecl CLREvent::WaitEx(unsigned long,enum WaitMode,struct PendingSync *))
        Unwind version: 1
        Unwind flags: UHANDLER
        Size of prologue: 0x20
        Count of codes: 10
        Unwind codes:
          20: SAVE_NONVOL, register=rbp offset=0xB0
          1C: SAVE_NONVOL, register=rbx offset=0xA8
          0F: ALLOC_SMALL, size=0x70
          0B: PUSH_NONVOL, register=r14
          09: PUSH_NONVOL, register=r13
          07: PUSH_NONVOL, register=r12
          05: PUSH_NONVOL, register=rdi
          04: PUSH_NONVOL, register=rsi
        Handler: 0020ADF0 __CxxFrameHandler3
        EH Handler Data: 007B3F54

    I’ll get into what all of this means shortly.

    Note:    The dumpbin.exe functionality is also exposed via the linker.  For example, the command “dumpbin.exe /?” is identical to “link.exe /dump /?”.

     

    Within the debugger, you can find this same information for a particular function using the .fnent command.  To demonstrate, I executed the example code within a windbg instance and broke in at some random point and chose one of the threads to look at which has a stack looking like the following:

      12  Id: f80.7f0 Suspend: 1 Teb: 000007ff`fffa0000 Unfrozen
     # Child-SP          RetAddr           Call Site
    00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
    01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
    02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
    03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
    04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
    05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
    06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
    07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
    08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
    09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
    0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
    0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
    0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
    0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
    0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
    0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
    10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
    11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
    12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
    13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

    At first glance, it may appear that this stack is already trashed since there is no symbol information for the bottom frames in the display.  Before jumping to this conclusion, recall that this is a managed application and therefore contains JIT compiled code.  To verify that the addresses without symbol information are JIT’ed code, you can do a couple of things.

    First, use the !EEHeap extension in the SOS extension to determine if these addresses reside in the JIT code heap.  Below, you can see the commands I used to both load the SOS extension and then display the Execution Engine (EE) Heap information:

    0:014> .loadby sos clr

    0:014> !EEHeap -loader
    Loader Heap:
    --------------------------------------
    System Domain:     000007fef50955a0
    LowFrequencyHeap:  000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
    HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
    StubHeap:          000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
    Virtual Call Stub Heap:
      IndcellHeap:     000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
      LookupHeap:      000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
      ResolveHeap:     000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
      DispatchHeap:    000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
      CacheEntryHeap:  Size: 0x0 (0) bytes.
    Total size:        Size: 0x8000 (32768) bytes.
    --------------------------------------
    Shared Domain:     000007fef5095040
    LowFrequencyHeap:  000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
    HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
    StubHeap:          000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
    Virtual Call Stub Heap:
      IndcellHeap:     000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
      LookupHeap:      000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
      ResolveHeap:     000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
      DispatchHeap:    000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
      CacheEntryHeap:  Size: 0x0 (0) bytes.
    Total size:        Size: 0x8000 (32768) bytes.
    --------------------------------------
    Domain 1:          00000000003e73c0
    LowFrequencyHeap:  000007ff00030000(2000:1000) 000007ff00140000(10000:5000) Size: 0x6000 (24576) bytes total, 0x1000 (4096) bytes wasted.
    HighFrequencyHeap: 000007ff00032000(8000:5000) Size: 0x5000 (20480) bytes.
    StubHeap:          Size: 0x0 (0) bytes.
    Virtual Call Stub Heap:
      IndcellHeap:     000007ff00040000(4000:1000) Size: 0x1000 (4096) bytes.
      LookupHeap:      000007ff0004b000(2000:1000) Size: 0x1000 (4096) bytes.
      ResolveHeap:     000007ff0007c000(54000:1000) Size: 0x1000 (4096) bytes.
      DispatchHeap:    000007ff0004d000(2f000:1000) Size: 0x1000 (4096) bytes.
      CacheEntryHeap:  Size: 0x0 (0) bytes.
    Total size:        Size: 0xf000 (61440) bytes total, 0x1000 (4096) bytes wasted.
    --------------------------------------
    Jit code heap:
    LoaderCodeHeap:    000007ff00150000(40000:2000) Size: 0x2000 (8192) bytes.
    Total size:        Size: 0x2000 (8192) bytes.

    --------------------------------------
    Module Thunk heaps:
    Module 000007fee5581000: Size: 0x0 (0) bytes.
    Module 000007ff000330d8: Size: 0x0 (0) bytes.
    Module 000007fef06f1000: Size: 0x0 (0) bytes.
    Total size:      Size: 0x0 (0) bytes.
    --------------------------------------
    Module Lookup Table heaps:
    Module 000007fee5581000: Size: 0x0 (0) bytes.
    Module 000007ff000330d8: Size: 0x0 (0) bytes.
    Module 000007fef06f1000: Size: 0x0 (0) bytes.
    Total size:      Size: 0x0 (0) bytes.
    --------------------------------------
    Total LoaderHeap size:   Size: 0x21000 (135168) bytes total, 0x1000 (4096) bytes wasted.

    =======================================

    I have highlighted the JIT heap information and you can see that the JIT’ed code instruction pointers in the stack fall within this range.

    The second sanity check you can perform is to use a variant of the u instruction to confirm that there is a call instruction just prior to that address as shown below:

    0:012> ub 0x7ff`0015098c
    000007ff`0015095e 488b01          mov     rax,qword ptr [rcx]
    000007ff`00150961 48898424b0000000 mov     qword ptr [rsp+0B0h],rax
    000007ff`00150969 488b4108        mov     rax,qword ptr [rcx+8]
    000007ff`0015096d 48898424b8000000 mov     qword ptr [rsp+0B8h],rax
    000007ff`00150975 4c8d8424b0000000 lea     r8,[rsp+0B0h]
    000007ff`0015097d 488b5308        mov     rdx,qword ptr [rbx+8]
    000007ff`00150981 488d8c24c0000000 lea     rcx,[rsp+0C0h]
    000007ff`00150989 ff5318          call    qword ptr [rbx+18h]

    So at this point we have verified that we probably have a valid stack.  But how does the debugger so effectively walk this stack for us if there is no stack frame pointer?  The answer, of course, is that it uses the unwind information.

    To explore the answer to that question, let’s focus on a particular frame within the stack such as frame 4 in the stack above.  The code at that frame is inside the function clr!CLREvent::WaitEx, and if we pass that to .fnent, we get the following output:

    0:012> .fnent clr!CLREvent::WaitEx
    Debugger function entry 00000000`04075e40 for:
    (000007fe`f48bfdb0)   clr!CLREvent::WaitEx   |  (000007fe`f48bfe3c)   clr!CLREvent::Set
    Exact matches:
        clr!CLREvent::WaitEx = <no type information>

    BeginAddress      = 00000000`000dfdb0
    EndAddress        = 00000000`000dfe3c
    UnwindInfoAddress = 00000000`007267d8

    Unwind info at 000007fe`f4f067d8, 20 bytes
      version 1, flags 2, prolog 20, codes a
      frame reg 0, frame offs 0
      handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
      00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0
      02: offs 1c, unwind op 4, op info 3    UWOP_SAVE_NONVOL FrameOffset: a8
      04: offs f, unwind op 2, op info d     UWOP_ALLOC_SMALL
      05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL
      06: offs 9, unwind op 0, op info d     UWOP_PUSH_NONVOL
      07: offs 7, unwind op 0, op info c     UWOP_PUSH_NONVOL
      08: offs 5, unwind op 0, op info 7     UWOP_PUSH_NONVOL
      09: offs 4, unwind op 0, op info 6     UWOP_PUSH_NONVOL

     Notice that this output is virtually identical to the same information provided by dumpbin using the /UNWINDINFO option.

    I have highlighted two interesting values above.  The value highlighted in green is a relative virtual address (RVA) to the unwind info that is baked into the PE file by the linker.  The value highlighted in yellow is the actual virtual address of the unwind info and can be computed by adding the module base address shown below to the RVA for UnwindInfoAddress.

    0:012> lmnm clr

    start             end                 module name

    000007fe`f47e0000 000007fe`f5145000   clr

    By examining the PE header using !dh you can confirm that the unwind information resides in the .rdata section of the module, which I have shown below:

    0:012> !dh clr

    File Type: DLL
    FILE HEADER VALUES
        8664 machine (X64)
           6 number of sections
    4BA21EEB time date stamp Thu Mar 18 07:39:07 2010

    <snip>
    SECTION HEADER #2
      .rdata name
      1FC8EC virtual size
      67F000 virtual address
      1FCA00 size of raw data
      67E200 file pointer to raw data
           0 file pointer to relocation table
           0 file pointer to line numbers
           0 number of relocations
           0 number of line numbers
    40000040 flags
             Initialized Data
             (no align specified)
             Read Only
    <snip>

    Using the Unwind Info

    Now let’s take a look at the unwind info and compare it to the prolog code of the function with which it is associated.  For convenience, I have reprinted the .fnent output for the function:

    0:012> .fnent clr!CLREvent::WaitEx
    Debugger function entry 00000000`04075e40 for:
    (000007fe`f48bfdb0)   clr!CLREvent::WaitEx   |  (000007fe`f48bfe3c)   clr!CLREvent::Set
    Exact matches:
        clr!CLREvent::WaitEx = <no type information>

    BeginAddress      = 00000000`000dfdb0
    EndAddress        = 00000000`000dfe3c
    UnwindInfoAddress = 00000000`007267d8

    Unwind info at 000007fe`f4f067d8, 20 bytes
      version 1, flags 2, prolog 20, codes a
      frame reg 0, frame offs 0
      handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
      00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0
      02: offs 1c, unwind op 4, op info 3    UWOP_SAVE_NONVOL FrameOffset: a8
      04: offs f, unwind op 2, op info d     UWOP_ALLOC_SMALL
      05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL
      06: offs 9, unwind op 0, op info d     UWOP_PUSH_NONVOL
      07: offs 7, unwind op 0, op info c     UWOP_PUSH_NONVOL
      08: offs 5, unwind op 0, op info 7     UWOP_PUSH_NONVOL
      09: offs 4, unwind op 0, op info 6     UWOP_PUSH_NONVOL

    The yellow highlighted value tells us that the prolog code for the function is 0x20 bytes in length. Using that information we can dump out the prolog code for the function:

    0:012> u clr!CLREvent::WaitEx clr!CLREvent::WaitEx+20
    clr!CLREvent::WaitEx:
    000007fe`f48bfdb0 488bc4          mov     rax,rsp
    000007fe`f48bfdb3 56              push    rsi
    000007fe`f48bfdb4 57              push    rdi
    000007fe`f48bfdb5 4154            push    r12
    000007fe`f48bfdb7 4155            push    r13
    000007fe`f48bfdb9 4156            push    r14
    000007fe`f48bfdbb 4883ec70        sub     rsp,70h
    000007fe`f48bfdbf 48c7442440feffffff mov   qword ptr [rsp+40h],0FFFFFFFFFFFFFFFEh
    000007fe`f48bfdc8 48895810        mov     qword ptr [rax+10h],rbx
    000007fe`f48bfdcc 48896818        mov     qword ptr [rax+18h],rbp

    The list of operations in the unwind info is listed in the reverse order of the operations in the assembly code.  Each of the UWOP_PUSH_NONVOL operations in the unwind info maps to a nonvolatile register that is pushed onto the stack for safe keeping in the prolog code.  I have highlighted the sections within the prolog and the .fnent output such that highlighting with like colors indicates related information.  Now, let’s take a look at the raw stack and tie all of this information together. 

    Below is the stack with the frame we are focusing on highlighted in yellow:

    0:012> kn
     # Child-SP          RetAddr           Call Site
    00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
    01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
    02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
    03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
    04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
    05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
    06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
    07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
    08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
    09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
    0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
    0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
    0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
    0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
    0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
    0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
    10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
    11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
    12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
    13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

    Note:    The symbols above look a little weird and may lead you to believe that WaitEx is calling itself recursively, but it is not.  It only appears that way because you need the private symbols for clr.dll to be able to see the real function name.  Only public symbols are available outside of Microsoft.

     

    And below is the raw stack relevant to this frame with some highlighting and annotations that I have added:

    0:012> dps 00000000`04a51f60-10 L20
    00000000`04a51f50  00000000`00000001
    00000000`04a51f58  000007fe`f48bfe23 clr!CLREvent::WaitEx+0x5e
    00000000`04a51f60  00000000`c0402388
    00000000`04a51f68  00000000`c0402500
    00000000`04a51f70  000007fe`f48afaf0 clr!SystemNative::ArrayCopy
    00000000`04a51f78  00000000`00000182
    00000000`04a51f80  00000000`04a521d0
    00000000`04a51f88  000007fe`00000001
    00000000`04a51f90  00000000`00000057
    00000000`04a51f98  00000000`c0402398
    00000000`04a51fa0  ffffffff`fffffffe
    00000000`04a51fa8  007f0000`04a521d0
    00000000`04a51fb0  fffff880`009ca540
    00000000`04a51fb8  000007fe`f483da5b clr!SVR::heap_select::select_heap+0x1c
    00000000`04a51fc0  fffff880`009ca540
    00000000`04a51fc8  000007fe`fd4e18aa KERNELBASE!ResetEvent+0xa
    00000000`04a51fd0  00000000`0043dc60
    00000000`04a51fd8  00000000`00000178
    00000000`04a51fe0  00000000`00493c10
    00000000`04a51fe8  00000000`0043dc60    
    ß saved rdi
    00000000`04a51ff0  00000000`00000001

       *** call into clr!CLREvent::WaitEx ***

    00000000`04a51ff8  000007fe`f48d51d8 clr!SVR::gc_heap::wait_for_gc_done+0x98
    00000000`04a52000  00000000`00493ba0
    00000000`04a52008  00000000`00493ba0    
    ß saved rbx
    00000000`04a52010  00000000`00000058    
    ß saved rbp
    00000000`04a52018  000007fe`f0711e0f System_Numerics_ni+0x21e0f
    00000000`04a52020  00000000`00000178
    00000000`04a52028  000007fe`f4995249 clr!SVR::GCHeap::Alloc+0xb4
    00000000`04a52030  00000000`0043a140
    00000000`04a52038  00000000`0043dc60
    00000000`04a52040  00000000`00000000
    00000000`04a52048  00000000`04a522e0

    In the stack listing I have used the same color highlighting scheme as before to show how the data on the raw stack correlates to the unwind data.  And, using green highlighting, I have shown how the Child-SP value correlates to the stack frame.

    The cyan highlighting represents nonvolatile registers that are pushed onto the stack in the prolog code.  The blue highlighting represents stack space reserved for locals and for register home space allocated for calling sub routines.  In the unwind data the stack reservation is represented by a UWOP_ALLOC_SMALL operation.  And the red highlighting represents nonvolatile registers that are stored in the home space of the previous stack frame and represented by a UWOP_SAVE_NONVOL operation stored in the unwind information.

    As you can see, we have all of the information we need in the unwind data to determine which slots on the stack are used for what.  The only thing we don’t know is the partitioning of the reserved stack space for locals, which is described by the private symbol information for the clr.dll module.

    Tying it all Together

    .fnent produces its output directly from parsing the definition of the UNWIND_INFO structure and it even gives you the address of where that structure lives in memory.  The UNWIND_INFO structure also contains a variable amount of UNWIND_CODE structures.  You can find details of the structure definitions for UNWIND_INFO and UNWIND_CODE here.  Each parsed line of unwind information in the .fnent output is backed by at least one of these structures.  In fact, you can see the correlation between the structure fields for UNWIND_INFO and the data in the .fnent output as shown below:

    From UNWIND_CODE:

    UBYTE

    Offset in prolog

    UBYTE: 4

    Unwind operation code

    UBYTE: 4

    Operation info

    From .fnent:

      05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL

    The meaning of the OpInfo (operation info) field is dependent on the UnwindOp (unwind operation code) field and is spelled out in the documentation for UNWIND_CODE.  For example, for the UWOP_PUSH_NONVOL operation shown above, the OpInfo field is an index into the following table, which indicates which nonvolatile register this push is associated with.  Note that the values in the below table are in decimal, while the .fnent values are in hex:

    0

    RAX

    1

    RCX

    2

    RDX

    3

    RBX

    4

    RSP

    5

    RBP

    6

    RSI

    7

    RDI

    8 to 15

    R8 to R15

    Therefore, the previous line from the .fnent output represents a push operation for the r14 register (05: offs b, unwind op 0, op info e   UWOP_PUSH_NONVOL).  Looking at the assembly above, we see that the topmost UWOP_PUSH_NONVOL operation in the .fnent output correlates to the last nonvolatile register push in the prolog code (push    r14).

    Note:    Remember, the push operations in the .fnent output are listed in the reverse order of where they are in the actual prolog code.  This helps the unwind code easily calculate offsets of where they should live in the stack.

     

    One thing that you will notice in the x64 calling convention is that once the prolog code has executed, the value for rsp will very rarely change.  The Child-SP value in the stack displayed by the k commands is the value of rsp for that frame after the prolog code has executed.  So the offsets to access these nonvolatile registers are then applied to the Child-SP value (previously highlighted in green) to find where they live on the stack.  So, in a way, the Child-SP value acts like the base pointer we are used to on the x86 platform.

    In the .fnent output above, you will also see the following:

      00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0

    For UWOP_SAVE_NONVOL, you see that the .fnent output shows us the offset where we can find this register, and the register in question is represented by the OpInfo value that equates to rbp.  The offset above is applied to the Child-SP value (00000000`04a51f60 in this case) to produce the address 00000000`04a52010, which indicates that’s where we can find a saved copy of rbp.  I have also annotated where it lives in the raw stack output shown previously.

    Note:    If you’re wondering why rbp is stored in the previous stack frame, check out my previous post on this topic where I describe how in optimized builds, the compiler can use the home space from the previous stack frame to save nonvolatile registers thus saving them with a MOV operation as opposed to a PUSH operation.  This is possible because in optimized builds the home space is not necessarily used to store parameters.

     

    So how does all of This Work for CLR JIT Code?

    If you have asked this question, then you are definitely paying attention!  As we have shown, the compiler and linker are responsible for placing unwind info in the Portable Executable file at build time.  But what about dynamic code that is generated at runtime?  Certainly there must be unwind information for dynamically compiled code as well, otherwise there would be no way to walk the stack or unwind the stack after an exception.

    As it turns out, APIs exist for this very situation, including RtlAddFunctionTable  and RtlInstallFunctionTableCallback.  In fact, the CLR uses RtlInstallFunctionTableCallback.  The generated unwind information is then rooted in a linked list where the head is at ntdll!RtlpDynamicFunctionTable.  The format of the linked list items is undocumented as it is an implementation detail, but using dbghelp.dll you can find the unwind information for a given instruction pointer if you so desire by calling SymFunctionTableAccess64.

    In fact, if you want to see the CLR adding dynamic unwind info in action you can run the test code above under the debugger, and then at the initial breakpoint, before the application starts running, set the following breakpoint:

    bu ntdll!RtlInstallFunctionTableCallback

    When you let the application run you should then end up with a call stack at the breakpoint that looks like the following, which clearly shows the JIT compiler adding the unwind info to the table dynamically:

    0:000> kn
     # Child-SP          RetAddr           Call Site
    00 00000000`0017dca8 000007fe`f4832cc6 ntdll!RtlInstallFunctionTableCallback
    01 00000000`0017dcb0 000007fe`f4831422 clr!InstallEEFunctionTable+0x77
    02 00000000`0017df60 000007fe`f4828ca8 clr!StubLinker::EmitUnwindInfo+0x492
    03 00000000`0017e050 000007fe`f4832c1a clr!StubLinker::EmitStub+0xe8
    04 00000000`0017e0b0 000007fe`f48328e5 clr!StubLinker::LinkInterceptor+0x1ea
    05 00000000`0017e160 000007fe`f4831e40 clr!CTPMethodTable::CreateStubForNonVirtualMethod+0xa35
    06 00000000`0017e300 000007fe`f4832926 clr!CRemotingServices::GetStubForNonVirtualMethod+0x50
    07 00000000`0017e3c0 000007fe`f48223f3 clr!MethodDesc::DoPrestub+0x38b
    08 00000000`0017e4d0 000007fe`f47e2d07 clr!PreStubWorker+0x1df
    09 00000000`0017e590 000007fe`f48210b4 clr!ThePreStubAMD64+0x87
    0a 00000000`0017e660 000007fe`f48211c9 clr!CallDescrWorker+0x84
    0b 00000000`0017e6d0 000007fe`f4821245 clr!CallDescrWorkerWithHandler+0xa9
    0c 00000000`0017e750 000007fe`f4823cf1 clr!MethodDesc::CallDescr+0x2a1
    0d 00000000`0017e9b0 000007fe`f49cdc3d clr!MethodDescCallSite::Call+0x35
    0e 00000000`0017e9f0 000007fe`f4999f0d clr!AppDomain::InitializeDomainContext+0x1ac
    0f 00000000`0017ebf0 000007fe`f49212a1 clr!SystemDomain::InitializeDefaultDomain+0x13d
    10 00000000`0017f0c0 000007fe`f4923dd6 clr!SystemDomain::ExecuteMainMethod+0x191
    11 00000000`0017f670 000007fe`f4923cf3 clr!ExecuteEXE+0x43
    12 00000000`0017f6d0 000007fe`f49a7365 clr!CorExeMainInternal+0xc4
    13 00000000`0017f740 000007fe`f8ad3309 clr!CorExeMain+0x15

    But there is one more wrinkle to this picture.  We now know that by using RtlInstallFunctionTableCallback the CLR, or any other JIT engine, can register a callback that provides the unwind information at runtime.  But how does the debugger access this information?  When the debugger is broken into the process or if you are debugging a dump, it cannot execute the callback function registered with RtlInstallFunctionTableCallback.

    This is where the sixth and final parameter to RtlInstallFunctionTableCallback comes into play.  By providing the OutOfProcessCallbackDll parameter, the CLR is providing a dll which the debugger can use to effectively parse through the JITer’s unwind information statically.  When inspecting which path the CLR passes for OutOfProcessCallbackDll on my machine, I see the following string:

    0:000> du /c 80 000007fe`f5916160
    000007fe`f5916160  "C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll"

    So, the debugger uses mscordacwks.dll to statically examine the unwind info while the process is broken in the debugger or while inspecting a dump.

    Note:    This is one of the many reasons why you must have a complete process dump to effectively post-mortem debug managed applications.

    Using the ‘k =’ Command to Dump the Stack

    If you look at the documentation for the k command, you’ll see that there is a way to override the base pointer when walking the stack.  However, the documentation leaves it a complete mystery as to how to apply this in the x64 world.  To demonstrate what I mean, consider the following stack from earlier:

    0:012> kn
     # Child-SP          RetAddr           Call Site
    00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
    01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79

    02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
    03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
    04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
    05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
    06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
    07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
    08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
    09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
    0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
    0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
    0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
    0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
    0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
    0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
    10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
    11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
    12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
    13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

    Now, imagine the top of the stack is corrupted, which I have “simulated” by blacking out the top few frames in this stack dump.  Furthermore, let’s assume that we identified a frame where the stack starts to look sane again by looking at the raw stack below:

    0:012> dps 00000000`04a51e90 
    00000000`04a51e90  00000000`00000000
    00000000`04a51e98  00000000`04a52130
    00000000`04a51ea0  00000000`ffffffff
    00000000`04a51ea8  00000000`ffffffff
    00000000`04a51eb0  00000000`00000108
    00000000`04a51eb8  000007fe`f48bffc7 clr!CLREvent::WaitEx+0x170
    00000000`04a51ec0  00000000`00000000
    00000000`04a51ec8  00000000`00000108
    00000000`04a51ed0  000007fe`00000000
    00000000`04a51ed8  00000000`00000108
    00000000`04a51ee0  ffffffff`fffffffe
    00000000`04a51ee8  00000000`00000001
    00000000`04a51ef0  00000000`00000000
    00000000`04a51ef8  000007fe`f48bff70 clr!CLREvent::WaitEx+0xf8
    00000000`04a51f00  00000000`00000000
    00000000`04a51f08  00000000`00493ba0

    From looking at this stack, we can see the typical pattern of stack frames because the return addresses resolve to symbols of sorts.

    To dump out the corrupted stack, here is the undocumented syntax for the x64 platform:

    k = <rsp> <rip> <frame_count>

    <rsp> is the stack pointer to start with.  You want to use the stack pointer that would have been in rsp when that function was active.  Remember, typically rsp does not change after the function prolog code completes.  Therefore, if you pick the stack pointer just below the return address, you should be good.

    <rip> should be an instruction pointer from within the function that was executing at the time the <rsp> value above was in play.  In this case, the return address directly above <rsp> comes from that function and I have highlighted it in green.  This piece of information is critical so that the debugger can find the unwind metadata for the function that was current at this point in the stack.  Without it, the debugger cannot walk the stack.

    Armed with this information, you can construct a k command to dump the stack starting from this frame as shown below:

    0:012> kn = 00000000`04a51ec0 000007fe`f48bffc7 10
     # Child-SP          RetAddr           Call Site
    00 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
    01 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
    02 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
    03 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
    04 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
    05 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
    06 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
    07 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
    08 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
    09 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
    0a 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
    0b 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
    0c 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
    0d 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
    0e 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
    0f 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c

    Note:    The frame count in the above k expression is required.  That is the way the debugger engine distinguishes between this variant of the command (with an overridden rip) and the documented form of k that does not provide an overridden rip.

    Conclusion

    Since the x64 calling convention does not utilize a base pointer (among other things), we need some extra information to effectively walk the stack.  That extra information comes in the form of unwind metadata and is generated by the compiler and linker for static code and baked into the portable executable file.  If you happen to code in assembly language, there are various macros that you must use to decorate your assembly code so that the assembler can generate the proper unwind metadata.  For dynamically compiled code, that information is instead provided at runtime by registering a callback with the system.  Knowing this information is critical if you encounter a corrupted stack and must piece it together manually.  In such situations you’ll need to know how to dig out the unwind metadata manually and use it to effectively reconstruct the call stack.

    That said, you could spare yourself some effort and use the undocumented variant of the k command described above to dump the stack starting at any frame. J

    Happy debugging everyone!

    "The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious.  No association with any real company, organization, product, domain name, email address, logo, person, places, or events is intended or should be inferred."

  • Ntdebugging Blog

    Uncovering How Workspaces Work in WinDbg

    • 1 Comments

    Author - Jason Epperly

    Workspaces have always been a little confusing to me. I knew how to bend them to do what I needed to get the job done, however they still remained a bit mysterious. Recently I decided to sort this out, just so I knew how they worked under the hood. But before I show you my investigation let's discuss the different types of workspaces. Windbg uses several built-in types including Base, User, Kernel, Remote, Processor Architecture, Per Dump, and Per Executable. It also uses named workspaces (or user defined workspaces). When you perform a particular type of debugging (e.g. live user-mode, post-mortem dump analysis etc.) these workspaces are combined into the final environment. Here's a diagram to illustrate the possible combination of workspaces.

    image

    • The green line is the case where WinDbg is used to open a dump file. In this scenario the Base workspace + per dump workspace is used. Note: per dump simply means each individual dump file opened gets its own workspace.
    • The blue line is the scenario where WinDbg is used to live debug a running application, using the Base+User-mode workspaces.
    • The orange line is an example of WinDbg used to perform a live kernel debug on an x86 machine. In this case windbg is using Base+Kernel+x86 workspaces.

    From the diagram you can see windbg typically uses a combination of two workspaces. While live kernel debugging it uses three workspaces.

    So what is in a workspace?

    • Session Information
      • All break point (bp's) and handling information for exceptions and events (sx settings).
      • All open source files.
      • All user-defined aliases.
    • Configuration settings
      • The symbol path.
      • The executable image path.
      • The source path.
      • The current source options that were set with l+, l- (Set Source Options).
      • Log file settings.
      • The COM or 1394 kernel connection settings, if the connection was started by using the graphical interface.
      • The most recent paths in each Open dialog box (except for the workspace file and text file paths, which are not saved).
      • The current .enable_unicode, .force_radix_output, and .enable_long_status settings.
    • WinDbg graphical interface.
      • The title of the WinDbg window
      • The Automatically Open Disassembly setting
      • The default font
      • The size and position of the WinDbg window on the desktop.
      • Which debugging information windows are open.
      • The size and position of each open window, including the window's size, its floating or docked status, whether it is tabbed with other windows, and all of the related settings in its shortcut menu.
      • The location of the pane boundary in the Debugger Command window and the word wrap setting in that window.
      • Whether the toolbar and status bar, and the individual toolbars on each debugging information window, are visible.
      • The customization of the Registers window.
      • The flags in the Calls window, Locals window, and Watch window.
      • The items that were viewed in the Watch window.
      • The cursor location in each Source window.

    All of these settings (except for the blue ones) are applied cumulatively (Base first, then the next workspace, etc). The blue items above are only loaded from the last workspace in the chain. To show this in action I created a simple walk through to illustrate the use of workspaces the debugger.

    First I opened windbg without the use of any command line options. When it opens in this dormant state (not attached to anything and has nothing opened) its using the Base workspace. If I don't change anything (e.g. window placement) I am not prompted with any workspace dialogs when I start debugging. However if I moved the debugger's main window to any location (we will call this position 1) followed by executing any of the highlighted operations below -

    image

    I am prompted with this dialog-

    image

    Choosing "Yes" on the dialog above integrates my changes into the "Base" workspace so window position 1 is now part of my Base workspace.

    Now I'm going to select "Open Executable" and browse to our old faithful target binary notepad.exe. Once the binary is opened, windbg uses Base+Notepad (per Executable file). Now I'll move the debugger's main window again (we will call this position 2) and choose the option Debug > Stop debugging. Because of the window location change, I am prompted with the following-

    image

    If I choose 'Yes', windbg will use window position 2 for anytime I open the notepad executable in the future. After closing the notepad.exe executable, windbg reverts back to using the Base workspace.

    This time I'll actually launch notepad (not from the debugger) and attach to the running notepad.exe process with the debugger. We are now at Base+User-mode. I moved the debugger window (new position 3), selected Debug > Stop Debugging and get prompted with this dialog-

    image

    Choosing "Yes" will store WinDbg window position 3 in the User-mode workspace. Once I have completed this step, Windbg is again using the Base workspace because we stopped debugging.

    To futher illustrate workspaces I'll attach a to a target Virtual Machine for Kernel Debugging but not break in. Windbg is using Base+Kernel now. I moved the window again and as soon as I break-in I get this dialog-

    image

    I chose 'No' on the dialog because I'm getting the hang of things. If I move the window again and type qd (quit and detach) to end the current kernel debug session, I will see this dialog-

    image

    So before we ended the session, we were at Base+Kernel+AMD64.

    Running through this exercise helped me understand why I usually create a named workspace, change all my settings and use the command line option -W to open my workspace. Hopefully this will clear up some of the complexities involved with workspaces. This is why the debugger help file recommends making all the changes you need at the lowest possible level (i.e. Base first, then the others).

    Hope this helps...

  • Ntdebugging Blog

    Storport ETW Logging to Measure Requests Made to a Disk Unit

    • 10 Comments

    Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

    ·         KB 979764 (Windows Server 2008)

    ·         KB 978000 (Windows Server 2008 R2)

     

    Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

    Configuring and Starting STORPORT ETW Logging

    To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

    Here is a walkthrough to enable this feature via the GUI:

    Click Start, type “perfmon”, and press ENTER to start Performance Monitor.

    image

     

    Expand “Data Collector Sets” and select “Event Trace Sessions”.

    image

     

    Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 

    image

     

    The following dialog will appear:

    image

     

    Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.

    image

     

    After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.

    image

     

    When the dialog below opens, select “Keywords (Any)” then click Edit.

    image

     

    Check the box for IOPerfNotification, and then click OK. Note:  For Windows Server 2008 SP1 and SP2, some properties may not be present (e. g. Reset, PortNotification and PauseResume). 

    image

     

     

    Next, select “Filter” and click Edit…

    image

     

    This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.

     

    Select “Binary” and check the “Filter enabled” box

    image

    In the “Filter data” field, enter the threshold value in the number of milliseconds. This must be entered in little endian format. Please refer to this table for the most common entries.

    Decimal

    Hexadecimal

    Binary(little endian)

    1ms

    5ms

    10ms

    15ms

    1

    5

    A

    F

    01 00 00 00 00 00 00 00

    05 00 00 00 00 00 00 00

    0A 00 00 00 00 00 00 00

    0F 00 00 00 00 00 00 00

     

    Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms. 

     

    WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.

    image 

    Below is the correct way to enter the value (256ms) Note: The Filter value may have to be re-entered for next run because it could get reset.

    image

    After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”

    image

     

    Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 

    image

     

    The target folder may be changed in the “Directory” tab

    image

     

    I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”

    image

     

    Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.

    image

    The name for the file is selected on the “File” tab.

    image

    When you are done setting the file and directory information click OK.

    Now right-click the log and select “Start” to begin tracing.

    image

    The “Status” should indicate that it is running.

    image

    How long to run the log depends on you issue.  You should let the log run until the issue or slow down occur then check the log for delays.

    Viewing the Log File

    First right-click and select “Stop” to flush any cached data and stop the logging.

    image

    Then use Event Viewer to examine the log.

    image

     

    Next choose "Open the saved log…"

    image

     

    image

    image

     

    After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.

    image

     

    Example: Event ID 13

    image

     

    Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.

    image

    Breaking down the Event Data we have …

     

    Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

    PortNumber:  This is the adapter port number (i.e. RaidPort1, etc.)

    PathID:  Path number if multi-pathing is in place

    TargetID:  Target LUN exposed to the Operating System

    LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

    *RequestDuration:  How long the request took in milliseconds (the number of note here for storage latency diagnostics)

    Command:  the SCSI command sent to the unit. (Usually read or write)  This is listed in decimal.  Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command

    ScsiStatus:  the status that came back from the drive

    SRBStatus:  the status as returned from adapter  (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)

     

     Example: Command 42 equals 2A which is a Write of 10 bytes.

    image

    image

    Here’s the respective SCSI command:

    image

    When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

    Hope you enjoyed the walkthrough!

    -Bob

     

     

     

     

     

     

     

  • Ntdebugging Blog

    Understanding VSS in DPM

    • 1 Comments

    Data Protection Manager 2007 utilizes Windows’ Volume Shadow Copy Service to protect Data Sources on Production Servers, and to manage Recovery Points on the DPM server. This presentation provides step-by-step illustrations and explanations of how DPM uses VSS to backup a Protected Server’s file share and System State data sources, and then manage the resulting recovery points in DPM.

     

    Get Microsoft Silverlight

Page 9 of 24 (239 items) «7891011»