• Ntdebugging Blog

    Remoting Your Debug Crash Cart With KDNET


    This is Christian Sträßner from the Global Escalation Services team based in Munich, Germany.

     

    Back in January, my colleague Ron Stock posted an interesting article about Kernel Debugging using a serial cable: How to Setup a Debug Crash Cart to Prevent Your Server from Flat Lining

     

    Today we look at a new kernel debugging transport, introduced in Windows 8 and Windows Server 2012, that makes the cabling much easier: an ordinary network cable can now be used as the debug cable. The new KDNET transport uses a PCI Ethernet network card in the Target. Most major NIC vendors have compatible NICs. You can find a list of supported NICs here:

    http://msdn.microsoft.com/en-us/library/windows/hardware/hh830880.aspx

     

    Note that this will not work with Wireless or USB attached NICs in the Target.

     

    In the example below, we utilized an Acer AC 100 Server as the Target. It ships with an onboard Intel 82579LM Gigabit NIC:

     

    [Screenshot: Network Adapters]

     

    The great thing about KDNET is that the NIC can still be used for normal network activity. The “Microsoft Kernel Debug Network Adapter” driver is the magic behind this. When KDNET.DLL is active, the NIC’s driver will be “banged out” and KDNET takes control of the NIC.

     

    BCD Configuration

    To configure KDNET, you first need to determine the IPv4 Address of the machine with the debugger. In our example, ipconfig.exe tells us that it is 192.168.1.35:

     

    [Screenshot: ipconfig]

     

    Next go to your Target machine.

     

    The kernel debug settings used to configure KDNET are stored globally in the BCD Store in the {dbgsettings} area. The kernel debug settings apply to all boot entries.

     

    Use bcdedit.exe /dbgsettings net hostip:<addr> port:<port> to set the transport to KDNET and to specify the IP address of the debugger and the port. You can connect multiple targets to the same debug host by using a different port for each target.
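
    When several targets share one debug host, it can help to generate the per-target commands rather than type them. The sketch below is a hypothetical helper, not part of bcdedit or the debugger: the function names, the target names, and the choice to count ports upward from 50000 are assumptions made for illustration. It only assembles the command string described above and checks that the host address is IPv4.

```python
import ipaddress


def dbgsettings_command(host_ip: str, port: int) -> str:
    """Build the bcdedit command that points a target at the debug host."""
    # The article configures KDNET with an IPv4 host address, so reject
    # anything else before it ends up in a boot configuration.
    addr = ipaddress.ip_address(host_ip)
    if addr.version != 4:
        raise ValueError(f"expected an IPv4 host address, got {host_ip!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"bcdedit /dbgsettings net hostip:{addr} port:{port}"


def plan_targets(host_ip: str, targets: list[str], first_port: int = 50000) -> dict[str, str]:
    """Give every target its own port on the same debug host (assumed scheme)."""
    return {
        name: dbgsettings_command(host_ip, first_port + offset)
        for offset, name in enumerate(targets)
    }
```

    For example, plan_targets("192.168.1.35", ["target-a", "target-b"]) returns one command per target, using ports 50000 and 50001. Each command is then run on the corresponding target.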

     

    bcdedit generates a cryptographic key for you automatically the first time. You can generate a new cryptographic key by appending the ‘newkey’ keyword. Copy the ‘Key’ to a secure location; you will need it in the debugger.

     

    You can display the debug settings using: bcdedit /dbgsettings

     

    Next, for safety, copy the {current} entry to a new entry (bcdedit /copy {current} /d <description>).

     

    Then enable kernel debugging on the copy (bcdedit.exe /debug {new-guid} on).

     

    If required, also use this (new) entry to enable the checked kernel (bcdedit /set {new-guid} hal <path> and bcdedit.exe /set {new-guid} kernel <path>).

     

    [Screenshot: bcdedit]

     

    Debugger

    On your Debugger Machine open WinDbg->File->Kernel Debugging (Ctrl-K) and choose the NET tab:

     

    Copy and paste the ‘Key’ here and set the port to the value specified on the Target (the default is 50000):

     

    [Screenshot: Kernel Debugging]

     

    Next a dialog from Windows Firewall might pop up (depending on your configuration). You want to allow access at this point.

     

    [Screenshot: Windows Firewall]

     

    You need to make sure that your debug host machine allows inbound UDP traffic on the configured port (50000 in this example, and by default) for the network type in use.
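
    One way to confirm that inbound UDP actually reaches the debug host, before involving the debugger, is to listen on the port yourself and send a test datagram from another machine. The following is a minimal sketch under that assumption. It does not speak the KDNET protocol, the helper names are hypothetical, and it should not run at the same time as WinDbg, which needs the same port.

```python
import socket


def open_listener(port: int, bind_addr: str = "0.0.0.0", timeout: float = 10.0) -> socket.socket:
    """Bind a UDP socket on the port the debugger will later use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.bind((bind_addr, port))
    return sock


def first_datagram(sock: socket.socket):
    """Return (payload, sender_address), or None if nothing arrives in time."""
    try:
        return sock.recvfrom(2048)
    except socket.timeout:
        # Nothing got through: check the firewall rule and any IPSec policy.
        return None
```

    On the debug host, first_datagram(open_listener(50000)) waits up to ten seconds. Sending any UDP datagram to that port from another machine on the same network should make it return the payload and the sender's address.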

     

    If your company has implemented IPSec Policies, make sure you have exceptions in place that allow unsecured communication on the port used (KDNET does not talk IPSec).

     

    The Debugger Window will now look like this:

     

    [Screenshot: WinDbg waiting to reconnect]

     

    The Debugger is now set up and ready to go.

     

    Reboot the target system now.

     

    When the target comes back online, it will try to connect to the IP Address and Port that was configured with the bcdedit.exe command. The Debugger Command Window will look something like the screenshot below.

     

    [Screenshot: WinDbg connected]

     

    You now can break in as usual. This is a good time to fix your symbol setup if you have not done it yet.

     

    Operation

    You still can communicate normally over the NIC and IP that you use on the target. You do not need an additional NIC in the target to use KDNET. When debugging production servers with heavy traffic, we recommend using a dedicated NIC for debugging (note, 10GigE NICs are currently not supported).

     

    If you don’t want the NIC to be used by the OS as well, it can be disabled via: bcdedit.exe -set loadoptions NO_KDNIC

     

    [Screenshot: Normal Network IO]

     

    Although you can use KDNET to debug power state transitions (in particular Connected Standby), it is best avoided. The KDNET protocol polls on a regular basis and as such, many systems will not drop to a lower power state. Instead, use USB, 1394, or serial.

     

    Disconnecting the NIC from media (unplugging the NIC in the target machine) is not supported and will most likely blue screen the target machine.

     

    Note 1:

    If you have more than one NIC in your target, please read the following (copied from the debugger help):

    If there is more than one network adapter in the target computer, use Device Manager to determine the PCI bus, device, and function numbers for the adapter you want to use for debugging. Then in an elevated Command Prompt window, enter the following command, where b, d, and f are the bus number, device number, and function number of the adapter:

    bcdedit /set {dbgsettings} busparams b.d.f
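
    If you copy the adapter's location out of Device Manager, a small script can turn it into the command above. This is only a sketch: the helper name is hypothetical, and the pattern assumes the location text has the form "PCI bus 0, device 25, function 0", which may differ on your system.

```python
import re

# Assumed shape of the Device Manager location text, for example
# "PCI bus 0, device 25, function 0". Adjust the pattern if yours differs.
_LOCATION = re.compile(r"PCI bus (\d+), device (\d+), function (\d+)", re.IGNORECASE)


def busparams_command(location: str) -> str:
    """Turn a Device Manager location string into the bcdedit busparams command."""
    match = _LOCATION.search(location)
    if match is None:
        raise ValueError(f"unrecognized location string: {location!r}")
    bus, device, function = (int(part) for part in match.groups())
    # Doubled braces produce the literal {dbgsettings} identifier.
    return f"bcdedit /set {{dbgsettings}} busparams {bus}.{device}.{function}"
```

    For the example string in the comment, busparams_command returns bcdedit /set {dbgsettings} busparams 0.25.0.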

     

    Note 2:

    If you use Windows NIC Teaming (LBFO) in Server 2012, note that KDNET is not compatible with NIC Teaming, as indicated by the whitepaper:

    http://download.microsoft.com/download/F/6/5/F65196AA-2AB8-49A6-A427-373647880534/[Windows%20Server%202012%20NIC%20Teaming%20(LBFO)%20Deployment%20and%20Management].docx

     

    How does it look on the network?

    This is a packet sent from the target to the debug host machine.

     

    [Screenshot: Network Packet]

     

    The TTL of the packets sent from the target to the debug host is currently set to 16 (this is not configurable).

     

    A TTL of 16 means that your connection can cross at most 16 IP hops. This is a theoretical limitation, but it highlights some important facts. Your host is not talking to the Windows IP stack on the target; instead, it talks to a basic IPv4/UDP implementation in KDNET. The transport is UDP/IPv4 based, so there is not much tolerance for poor network conditions aside from retry operations at the Debugger Transport Protocol level.

     

    A few words on performance.

    The performance is generally limited by the latency of the link between the host and target. Therefore, even with LAN-like latency (<=1 ms), you will not get close to the wire speed of a 1GigE connection. Expect to see speeds between 1.5 and 2.5 MB/s.

     

    Keep this in mind when you plan to pull large portions of memory from the target over KDNET (like the .dump command). This screenshot was taken while executing the .dump /f command (Full Kernel Dump):

     

    [Screenshot: Network Activity]
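
    To get a feel for what these rates mean for a large dump, the arithmetic can be sketched as follows. The function names are hypothetical, 1 MB is taken as 1,000,000 bytes, and the dump is assumed to be about as large as the memory being pulled; real transfers will vary.

```python
def transfer_seconds(size_bytes: float, rate_mbytes_per_s: float) -> float:
    """Seconds needed to move size_bytes at the given rate (1 MB = 10**6 bytes)."""
    return size_bytes / (rate_mbytes_per_s * 1_000_000)


def dump_time_range(size_gib: float, slow: float = 1.5, fast: float = 2.5) -> tuple[float, float]:
    """Return (best_case_minutes, worst_case_minutes) for a dump of size_gib GiB."""
    size_bytes = size_gib * 1024 ** 3
    best = transfer_seconds(size_bytes, fast) / 60
    worst = transfer_seconds(size_bytes, slow) / 60
    return best, worst
```

    With these assumptions, dump_time_range(8) comes out at roughly 57 to 95 minutes, so a dump of an 8 GiB target is an hour-scale operation over KDNET.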

     

    Even with the performance restrictions mentioned, KDNET is a valuable extension of the existing debugging methods.  It allows you to debug a Windows machine without the need for special hardware (1394) or legacy ports (serial) that not every machine has today (especially tablets and notebooks).  It also saves you from using USB2 debugging - which requires special cables and a good amount of hope that the machine’s vendor has attached the single debug capable USB port to an external port on the chassis.

     

    Also, there is no need for you to physically enter the Datacenter where the target is located.  You can do all these steps from your convenient office chair. :)

     

    To see network kernel debugging in action, watch Episode #27 of Defrag Tools on Channel 9.

     

    Thanks to Andrew Richards and Joe Ballantyne for their help in writing this article.


    Another Who Done It


    Hi, my name is Bob Golding, and I am an Escalation Engineer (EE) in GES. I want to share an interesting problem I recently worked on.  The initial symptom was a system bugcheck with Stop 0xA, which indicates an invalid memory reference.  The cause of the crash was a driver making I/O requests while Asynchronous Procedure Calls (APCs) were disabled.  The invalid memory reference behind the bugcheck was a result of the problem, not its cause.

     

    An APC is queued to a thread during I/O completion. This is to guarantee the last phase of the I/O completion occurs in the same context as the process that issued the request.

     

    The stack of the trap is presented below.  The call stack shows that APCs are being enabled allowing queued APCs to run.

     

    Child-SP          RetAddr           Call Site

    fffff880`07bf3598 fffff800`030b85a9 nt!KeBugCheckEx

    fffff880`07bf35a0 fffff800`030b7220 nt!KiBugCheckDispatch+0x69

    fffff880`07bf36e0 fffff800`030d8b56 nt!KiPageFault+0x260

    fffff880`07bf3870 fffff800`030959ff nt!IopCompleteRequest+0xc73

    fffff880`07bf3940 fffff800`0306c0d9 nt!KiDeliverApc+0x1d7

    fffff880`07bf39c0 fffff800`033f8a1a nt!KiCheckForKernelApcDelivery+0x25

    fffff880`07bf39f0 fffff800`033cce2f nt!MiMapViewOfSection+0x2bafa

    fffff880`07bf3ae0 fffff800`030b8293 nt!NtMapViewOfSection+0x2be

    fffff880`07bf3bb0 00000000`772df93a nt!KiSystemServiceCopyEnd+0x13

    00000000`0015dea8 00000000`00000000 0x772df93a

     

    The trap occurred because drivers that issue requests to lower drivers commonly implement code similar to:

     

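    KEVENT event;    // local on this function's stack; completing the IRP signals it

    NTSTATUS status;

    KeInitializeEvent( &event, NotificationEvent, FALSE );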

     

    status = IoCallDriver( DeviceObject, irp );

     

    //

    //  Wait for the event to be signaled if STATUS_PENDING is returned.

    //

    if (status == STATUS_PENDING) {

       (VOID)KeWaitForSingleObject( &event, // event is a local which is declared on the stack

                                    Executive,

                                    KernelMode,

                                    FALSE,

                                    NULL );

    }

     

    As you can see in the above code, if IoCallDriver returns anything other than STATUS_PENDING, the code skips the wait and exits. Part of the last phase of I/O processing, which takes place in the APC, is signaling the event. Because the event is on the stack, when IoCallDriver returns success it is critical that the APC execute immediately, before the stack unwinds. Since APCs were disabled, the execution of the APC was delayed, and during this time the event became invalid. The APCs were delayed because the memory manager was in a critical area and APCs could not run.
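
    The ordering problem can be illustrated with a toy model. The sketch below is not kernel code and does not reflect how Windows implements APCs; the class and function names are hypothetical. It only mimics the sequence described above: the completion APC is queued, delivery is deferred while the disable count is nonzero, the frame holding the event goes away, and the APC finally runs against a dead event.

```python
class StackEvent:
    """Stands in for a KEVENT that lives in the caller's stack frame."""

    def __init__(self):
        self.valid = True        # becomes False once the owning frame unwinds
        self.signaled = False


class ToyThread:
    """Toy thread with a special-APC disable count (0 means enabled)."""

    def __init__(self):
        self.special_apc_disable = 0
        self.pending = []        # completion APCs queued but not yet delivered
        self.corruptions = 0     # APCs that touched an event after its frame died

    def queue_apc(self, event):
        self.pending.append(event)
        self.deliver_if_enabled()

    def deliver_if_enabled(self):
        if self.special_apc_disable != 0:
            return               # delivery is deferred
        while self.pending:
            event = self.pending.pop(0)
            if event.valid:
                event.signaled = True
            else:
                self.corruptions += 1   # the real system bugchecks here

    def enable_apcs(self):
        self.special_apc_disable += 1
        self.deliver_if_enabled()


def issue_io(thread):
    """Mimic the driver pattern when the request does not return STATUS_PENDING."""
    event = StackEvent()
    thread.queue_apc(event)      # completion APC is queued during the call
    # No STATUS_PENDING, so the wait is skipped and the function returns,
    # taking the stack frame (and the event) with it.
    event.valid = False
```

    With the disable count at 0, issue_io signals the event before the frame goes away and corruptions stays at 0. With the count at -1, the APC stays pending until enable_apcs runs, and by then it touches an invalid event.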

     

    I needed to determine which driver did this so I enabled IRP logging in Driver Verifier to trace I/O requests.  With this enabled the next dump should contain a transaction log that will help identify what driver is performing I/O while APCs are disabled.  The command line to enable this is:

    verifier /flags 0x410 /all

     

    The new dump with verifier enabled also crashed after delivering an APC to the thread and completing the IRP.  From the debug output below I can find the IRPs that were issued and the thread that issued them; this is what I need in order to look for them in the log.

     

    1: kd> !thread

    THREAD fffffa80064c9b50 Cid 0200.0204  Teb: 000007fffffde000 Win32Thread: 0000000000000000 RUNNING on processor 1

    IRP List:

        fffff9800a33ec60: (0006,03a0) Flags: 40060070  Mdl: 00000000

        fffff9800a250c60: (0006,03a0) Flags: 40060070  Mdl: 00000000

        fffff9800a3f4ee0: (0006,0118) Flags: 40060070  Mdl: 00000000

    Not impersonating

    DeviceMap                 fffff8a000007890

    Owning Process            fffffa80064bbb30       Image:         csrss.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      1656           Ticks: 0

    Context Switch Count      25             IdealProcessor: 0

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000000004a061540

    Stack Init fffff88003b21c70 Current fffff88003b20890

    Base fffff88003b22000 Limit fffff88003b1c000 Call 0

    Priority 14 BasePriority 13 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffff880`03b21428 fffff800`0307a54c nt!KeBugCheckEx

    fffff880`03b21430 fffff800`030d02ee nt!MmAccessFault+0xffffffff`fff9c15c

    fffff880`03b21590 fffff800`030c8db9 nt!KiPageFault+0x16e

    fffff880`03b21728 fffff800`030e6ab3 nt!memcpy+0x229

    fffff880`03b21730 fffff800`030c4bd7 nt!IopCompleteRequest+0x5a3

    fffff880`03b21800 fffff800`0307ba85 nt!KiDeliverApc+0x1c7

    fffff880`03b21880 fffff800`0331d96a nt!KiCheckForKernelApcDelivery+0x25

    fffff880`03b218b0 fffff800`033e742e nt!MiMapViewOfSection+0xffffffff`fff36baa

    fffff880`03b219a0 fffff800`030d1453 nt!NtMapViewOfSection+0x2bd

    fffff880`03b21a70 00000000`7761159a nt!KiSystemServiceCopyEnd+0x13

    00000000`0025f078 00000000`00000000 0x7761159a

     

    The command “!verifier 100” will dump the transaction log.  Below is the relevant portion of the log containing the IRPs for our thread.

     

    IRP fffff9800a3f4ee0, Thread fffffa80064c9b50, IRQL = 0, KernelApcDisable = -4, SpecialApcDisable = -1

    fffff80003573a68 nt!IovAllocateIrp+0x28

    fffff800033b20e2 nt!IoBuildDeviceIoControlRequest+0x32

    fffff8000356d72e nt!IovBuildDeviceIoControlRequest+0x4e

    fffff880010f8bcc fltmgr!FltGetVolumeGuidName+0x18c

    fffff88004e4fbe1 baddriver+0x12be1

    fffff88004e73523 baddriver+0x36523

    fffff88004e7300c baddriver+0x3600c

    fffff88004e72cce baddriver+0x35cce

    fffff88004e5f715 baddriver+0x22715

    fffff88004e4c6c7 baddriver+0xf6c7

    fffff88004e48342 baddriver+0xb342

    fffff88004e5e44e baddriver+0x2144e

    fffff88004e5e638 baddriver+0x21638

     

    IRP fffff9800a250c60, Thread fffffa80064c9b50, IRQL = 0, KernelApcDisable = -5, SpecialApcDisable = -1

    fffff80003573a68 nt!IovAllocateIrp+0x28

    fffff800033b20e2 nt!IoBuildDeviceIoControlRequest+0x32

    fffff8000356d72e nt!IovBuildDeviceIoControlRequest+0x4e

    fffff8800101eec7 mountmgr!MountMgrSendDeviceControl+0x73

    fffff88001010a6b mountmgr!QueryDeviceInformation+0x207

    fffff8800101986b mountmgr!QueryPointsFromMemory+0x57

    fffff88001019f86 mountmgr!MountMgrQueryPoints+0x36a

    fffff8800101ea71 mountmgr!MountMgrDeviceControl+0xe9

    fffff80003574c16 nt!IovCallDriver+0x566

    fffff880010f8bec fltmgr!FltGetVolumeGuidName+0x1ac

    fffff88004e4fbe1 baddriver+0x12be1

    fffff88004e73523 baddriver+0x36523

    fffff88004e7300c baddriver+0x3600c

     

    IRP fffff9800a33ec60, Thread fffffa80064c9b50, IRQL = 0, KernelApcDisable = -5, SpecialApcDisable = -1

    fffff80003573a68 nt!IovAllocateIrp+0x28

    fffff800033b20e2 nt!IoBuildDeviceIoControlRequest+0x32

    fffff8000356d72e nt!IovBuildDeviceIoControlRequest+0x4e

    fffff8800101eec7 mountmgr!MountMgrSendDeviceControl+0x73

    fffff88001010afd mountmgr!QueryDeviceInformation+0x299

    fffff8800101986b mountmgr!QueryPointsFromMemory+0x57

    fffff88001019f86 mountmgr!MountMgrQueryPoints+0x36a

    fffff8800101ea71 mountmgr!MountMgrDeviceControl+0xe9

    fffff80003574c16 nt!IovCallDriver+0x566

    fffff880010f8bec fltmgr!FltGetVolumeGuidName+0x1ac

    fffff88004e4fbe1 baddriver+0x12be1

    fffff88004e73523 baddriver+0x36523

    fffff88004e7300c baddriver+0x3600c

     

    From the IRP log in verifier I can see that baddriver.sys is calling FltGetVolumeGuidName while APCs are disabled. Further investigation found that baddriver.sys had registered a function for image load notification, and the memory manager has APCs disabled when it calls the image notification routine. The image notification routine in baddriver.sys called FltGetVolumeGuidName, which issued the I/O.  The log output shows both KernelApcDisable and SpecialApcDisable; the issue is SpecialApcDisable being -1.  The I/O completion APCs are considered special APCs, so disabling kernel APCs alone would not affect them.
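
    When the transaction log is long, the entries of interest can be picked out mechanically. The sketch below is a hypothetical helper, not a debugger feature. It assumes the log has been saved as text and that each entry starts with a header line in the format shown above, and it reports the IRPs that were issued while SpecialApcDisable was nonzero.

```python
import re

# Header format as it appears in the log excerpt above (assumed to be stable).
_HEADER = re.compile(
    r"IRP (?P<irp>[0-9a-f]+), Thread (?P<thread>[0-9a-f]+), "
    r"IRQL = (?P<irql>[0-9a-f]+), KernelApcDisable = (?P<kernel>-?\d+), "
    r"SpecialApcDisable = (?P<special>-?\d+)"
)


def irps_with_special_apcs_disabled(log_text: str) -> list[str]:
    """Return the IRP addresses whose header shows SpecialApcDisable != 0."""
    return [
        match.group("irp")
        for match in _HEADER.finditer(log_text)
        if int(match.group("special")) != 0
    ]
```

    Applied to the three entries above, the function returns all three IRP addresses, since each was issued with SpecialApcDisable = -1.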

     

    The solution was for the driver to check whether APCs are disabled before calling FltGetVolumeGuidName, and to skip the call if they are.


    Our Bangalore Team is Hiring - Windows Server Escalation Engineer


    Would you like to join the world’s best and most elite debuggers to enable the success of Microsoft solutions?

     

    As a trusted advisor to our top customers, you will be working with the most experienced IT professionals and developers in the industry. You will influence our product teams in sustained engineering efforts to drive improvements in our products.

     

    This role involves deep analysis of product source code and debugging to solve problems in multi-million dollar configurations and will give you an opportunity to stretch your critical thinking skills. During the course of debugging, you will uncover opportunities to improve the customer experience while influencing the current and future design of our products.

     

    In addition to providing support to customers while being the primary interface to our sustained engineering teams, you will also have the opportunity to work with new technologies and unreleased software. Through our continuous investment in in-depth training and hands-on experience with tough customer challenges, you will become the world’s best in this area. Expect to partner with many different roles at Microsoft while launching a very successful career!

     

    This position is located at the Microsoft Global Technical Support Center in Bangalore, India.

     

    Learn more about what an Escalation Engineer does at:

    Profile: Ron Stock, CTS Escalation Engineer - Microsoft Customer Service & Support - What is CSS?

    Microsoft JobsBlog JobCast with Escalation Engineer Jeff Dailey

    Microsoft JobsBlog JobCast with Escalation Engineer Scott Oseychik

     

    Apply here:

    https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=1&jid=109989&jlang=en&pp=ss
