Debugging a Network Connectivity Issue - TrackNblOwner to the Rescue


Hello Debug community, this is Karim Elsaid again.  Today I’m going to discuss a recent, interesting case in which a server intermittently lost access to the network.  No communication (not even pings) was possible to or from the server while the issue was occurring.

 

We went through the normal exercise and asked the customer to obtain a kernel memory dump from the machine while it was in the problematic state, hoping to find data that would help us demystify the issue.

 

One of the very first commands we run upon receiving a hang dump is the very famous “!locks” command.  This yielded the following:

 

8: kd> !locks

**** DUMP OF ALL RESOURCE OBJECTS ****

KD: Scanning for held locks..

 

Resource @ nt!IopDeviceTreeLock (0xfffff80001a81c80)    Shared 1 owning threads

     Threads: fffffa800cd8a040-01<*>

KD: Scanning for held locks.

 

Resource @ nt!PiEngineLock (0xfffff80001a81b80)    Exclusively owned

    Contention Count = 6

     Threads: fffffa800cd8a040-01<*>

KD: Scanning for held locks

84372 total locks, 2 locks currently held

 

What I’m looking for is locks with exclusive owners and waiters.  From the above output we can see that thread fffffa800cd8a040 exclusively owns a Plug and Play lock (nt!PiEngineLock, Pi prefix) and holds shared ownership of an I/O Manager device tree lock (nt!IopDeviceTreeLock, Io prefix).
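
When the !locks output is long, the same scan for exclusive owners can be done on a saved copy of the text.  The following is a minimal Python sketch, assuming the output follows the layout shown above.  The helper names parse_locks and threads_holding_exclusive are hypothetical (they are not part of any debugger API), and the parser only understands the line shapes that appear in this dump.

```python
import re

# Sample text copied from the !locks output above (blank lines removed).
LOCKS_OUTPUT = """\
Resource @ nt!IopDeviceTreeLock (0xfffff80001a81c80)    Shared 1 owning threads
     Threads: fffffa800cd8a040-01<*>
Resource @ nt!PiEngineLock (0xfffff80001a81b80)    Exclusively owned
    Contention Count = 6
     Threads: fffffa800cd8a040-01<*>
"""

RESOURCE_RE = re.compile(r"Resource @ (\S+) \((0x[0-9a-f]+)\)\s+(Shared|Exclusively)")
THREAD_RE = re.compile(r"\b([0-9a-f]{16})-\d+")


def parse_locks(text):
    """Return one dict per 'Resource @' line, with the threads listed under it."""
    resources = []
    for line in text.splitlines():
        match = RESOURCE_RE.search(line)
        if match:
            resources.append({
                "name": match.group(1),
                "address": match.group(2),
                "exclusive": match.group(3) == "Exclusively",
                "threads": [],
            })
        elif "Threads:" in line and resources:
            resources[-1]["threads"].extend(THREAD_RE.findall(line))
    return resources


def threads_holding_exclusive(resources):
    """Return the sorted set of threads that own at least one exclusive lock."""
    return sorted({t for r in resources if r["exclusive"] for t in r["threads"]})


if __name__ == "__main__":
    for thread in threads_holding_exclusive(parse_locks(LOCKS_OUTPUT)):
        print("inspect with: !thread", thread)
```

Each thread address it prints is a candidate for the !thread command.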

 

There are no waiters for the exclusive lock; however, PnP locks are always worth investigating.  While debugging I treat everything as a possible suspect until proven otherwise, so let’s dump this thread:

 

8: kd> !thread fffffa800cd8a040 e

THREAD fffffa800cd8a040  Cid 0004.005c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

    fffff88002b0f118  SynchronizationEvent

IRP List:

    fffffa8016527510: (0006,0310) Flags: 00000000  Mdl: 00000000

Not impersonating

DeviceMap                 fffff8a000006100

Owning Process            fffffa800cd56040       Image:        System

Attached Process          N/A            Image:         N/A

Wait Start TickCount      14791337       Ticks: 15577 (0:00:04:03.002)

Context Switch Count      835317         IdealProcessor: 2            

UserTime                  00:00:00.000

KernelTime                00:00:26.863

Win32 Start Address nt!ExpWorkerThread (0xfffff8000188f530)

Stack Init fffff88002b0fc70 Current fffff88002b0ee30

Base fffff88002b10000 Limit fffff88002b0a000 Call 0

Priority 12 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

*** ERROR: Module load completed but symbols could not be loaded for myfault.sys

Child-SP          RetAddr            Call Site

fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a

fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2

fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f

fffff880`02b0f0e0 fffff880`01618fcd nt! ??::FNODOBFM::`string'+0x12ff6

fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d

fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe

fffff880`02b0f1f0 fffff880`00f884b7 NDIS!ndisPnPNotifyBinding+0xbf

fffff880`02b0f280 fffff880`00fa1911 NDIS!ndisPnPNotifyAllTransports+0x377

fffff880`02b0f3f0 fffff880`00fa2c5b NDIS!ndisCloseMiniportBindings+0x111

fffff880`02b0f500 fffff880`00f3bbc2 NDIS!ndisPnPRemoveDevice+0x25b

fffff880`02b0f6a0 fffff880`00fa5b69 NDIS!ndisPnPRemoveDeviceEx+0xa2

fffff880`02b0f6e0 fffff800`01aec8d9 NDIS!ndisPnPDispatch+0x609

fffff880`02b0f780 fffff800`01c6c1e1 nt!IopSynchronousCall+0xc5

fffff880`02b0f7f0 fffff800`0197f733 nt!IopRemoveDevice+0x101

fffff880`02b0f8b0 fffff800`01c6bd34 nt!PnpRemoveLockedDeviceNode+0x1a3

fffff880`02b0f900 fffff800`01c6be40 nt!PnpDeleteLockedDeviceNode+0x44

fffff880`02b0f930 fffff800`01cfcd04 nt!PnpDeleteLockedDeviceNodes+0xa0

fffff880`02b0f9a0 fffff800`01cfd35c nt!PnpProcessQueryRemoveAndEject+0xc34

fffff880`02b0fae0 fffff800`01be65ce nt!PnpProcessTargetDeviceEvent+0x4c

fffff880`02b0fb10 fffff800`0188f641 nt! ?? ::NNGAKEGL::`string'+0x5ab9b

fffff880`02b0fb70 fffff800`01b1ce5a nt!ExpWorkerThread+0x111

fffff880`02b0fc00 fffff800`01876d26 nt!PspSystemThreadStartup+0x5a

fffff880`02b0fc40 00000000`00000000 nt!KiStartSystemThread+0x16

 

Interesting.  Looking at the stack above, we can see that the thread is doing NDIS PnP work.  The thread has been waiting for more than 4 minutes.  But hold on, what is “nt! ??::FNODOBFM::`string'”?  That doesn’t look like a useful function name, and it isn’t.  It is a side effect of Basic Block Tools (BBT) optimization.  With public symbols the debugger has a hard time resolving such an address to the right symbol, but there is a nice trick you can use to get to the right function.

 

P.S. For a nice x64 deep dive, please refer to our archive.

 

Let’s display the function data for the return address fffff800`018e1816:

 

8: kd> .fnent fffff800`018e1816

Debugger function entry 000000e8`f28f14f8 for:

(fffff800`018c4790)   nt! ?? ::FNODOBFM::`string'+0x12ff6   |  (fffff800`018c47c8)   nt!vDbgPrintExWithPrefixInternal

 

BeginAddress      = 00000000`000da7d0

EndAddress        = 00000000`000da81c

UnwindInfoAddress = 00000000`001c8a54

 

Unwind info at fffff800`019cfa54, 10 bytes

  version 1, flags 4, prolog 0, codes 0

 

Chained info:

BeginAddress      = 00000000`000182f0

EndAddress        = 00000000`00018358

UnwindInfoAddress = 00000000`001bf910

 

Unwind info at fffff800`019c6910, 6 bytes

  version 1, flags 0, prolog 4, codes 1

  00: offs 4, unwind op 2, op info c      UWOP_ALLOC_SMALL. 

 

For optimized binaries you will find a “Chained info” section.  Add its BeginAddress to the base address of the module and you should land on the correct function:

 

8: kd> ln nt+000182f0

(fffff800`0181f2f0)   nt!ExWaitForRundownProtectionReleaseCacheAware  |  (fffff800`0181f358)   nt!KeGetRecommendedSharedDataAlignment

Exact matches:

    nt!ExWaitForRundownProtectionReleaseCacheAware (<no parameter info>)

 

Bingo!  You got the function.  So tcpip!FlPnpEvent was calling ExWaitForRundownProtectionReleaseCacheAware.  This function basically waits for the rundown protection reference count to drop to 0.
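
The address arithmetic behind this trick can be checked by hand.  The sketch below is only a worked example using numbers copied from the .fnent and ln output above.  The one derived value is the load address of nt, which the article never prints: it is assumed here to be the difference between the “Unwind info at” address and the UnwindInfoAddress offset of the same entry.

```python
# Values copied from the debugger output above.
RETURN_ADDRESS = 0xFFFFF800018E1816    # RetAddr of the KeWaitForSingleObject frame
FRAGMENT_BEGIN_RVA = 0x000DA7D0        # BeginAddress of the function entry
FRAGMENT_END_RVA = 0x000DA81C          # EndAddress of the function entry
UNWIND_INFO_RVA = 0x001C8A54           # UnwindInfoAddress of the function entry
UNWIND_INFO_VA = 0xFFFFF800019CFA54    # "Unwind info at ..." for the same entry
CHAINED_BEGIN_RVA = 0x000182F0         # BeginAddress under "Chained info"

# Assumption: .fnent shows the same unwind data once as a module-relative
# offset and once as a virtual address, so the difference is the nt base.
nt_base = UNWIND_INFO_VA - UNWIND_INFO_RVA

# Offset of the mystery return address inside nt.
return_rva = RETURN_ADDRESS - nt_base

# The chained entry points at the real (parent) function.
parent_function = nt_base + CHAINED_BEGIN_RVA

print(f"nt base         = {nt_base:#x}")
print(f"return addr RVA = {return_rva:#x}")
print(f"parent function = {parent_function:#x}")
```

The return address offset falls between BeginAddress and EndAddress of the BBT fragment, and the parent function address is the same one that ln resolved to nt!ExWaitForRundownProtectionReleaseCacheAware.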

 

A thread can call ExAcquireRundownProtectionEx against a shared object to access it safely.  Rundown protection keeps an object from being deleted until all outstanding accesses have finished (run down).  “ExWaitForRundownProtectionReleaseCacheAware” is the other half of that contract: it waits for all outstanding rundown protection references to be released.
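
To make the contract concrete, here is a small user-mode model of rundown protection in Python.  It is an analogy only, not the kernel implementation: the class name RundownProtection and its methods are hypothetical stand-ins for the acquire, release and wait-for-release routines, and a condition variable replaces the kernel's own synchronization.

```python
import threading


class RundownProtection:
    """Toy model: count outstanding references, refuse new ones once rundown starts."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0
        self._rundown_started = False

    def acquire(self):
        """Take a reference; fails once rundown has begun."""
        with self._cond:
            if self._rundown_started:
                return False
            self._count += 1
            return True

    def release(self):
        """Drop a reference and wake any waiter when the count reaches zero."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_release(self, timeout=None):
        """Begin rundown and block until every reference has been released.

        Returns True when the count reached zero, False if the timeout expired.
        """
        with self._cond:
            self._rundown_started = True
            return self._cond.wait_for(lambda: self._count == 0, timeout)


if __name__ == "__main__":
    rp = RundownProtection()
    rp.acquire()                              # one outstanding reference
    print(rp.wait_for_release(timeout=0.1))   # the waiter cannot finish yet
    rp.release()                              # the reference is finally dropped
    print(rp.wait_for_release(timeout=0.1))   # now the wait completes
```

In this model a reference that is never released keeps wait_for_release blocked forever when no timeout is given, which mirrors the thread in the dump sitting in its wait for more than 4 minutes.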

 

The question is: which structure’s rundown are we waiting on?  That depends on what we are dealing with.  Because of code optimization, the debugger is not showing the full picture.  Through code review I found that in this particular dump there is an inlined call to the function “FlpUninitializePacketProviderInterface”.

 

So the stack in reality should look like this:

 

Child-SP          RetAddr           Call Site

fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a

fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2

fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f

fffff880`02b0f0e0 fffff880`01618fcd nt!ExWaitForRundownProtectionReleaseCacheAware

----inline function----             tcpip!FlpUninitializePacketProviderInterface

fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d

fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe

 

So we need to uninitialize a network interface, but before doing that we must make sure that there are no outstanding references to packets and no packets still pending.  Starting with NDIS 6, “packets” basically means “NET_BUFFER” and “NET_BUFFER_LIST” structures.  So we need to check for any outstanding NET_BUFFER_LISTs (NBLs) that are pending; each reference corresponds to one pending NBL.

 

To the rescue: the “NDISKD” debugger extension has a very handy command that displays all pending NBLs and their owners, “!pendingnbls”.  For the command to work you must first enable “TrackNblOwner” through the registry.  By default this registry value is not enabled on server SKUs, as it may cause a performance hit.  On client SKUs it is enabled by default.

 

When you run !pendingnbls on a clean Windows Server 2008 R2 install you get:

 

8: kd> !ndiskd.pendingnbls

    This command requires NBL tracking to be enabled on the debugee target

    machine.  (By default, client operating systems have level 1, and servers

    have level 0).  To enable, set this REG_DWORD value to a nonzero value on

    the target machine and reboot the target machine:

   

    HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters ! TrackNblOwner

    Possible Values (features are cumulative)

    * 0:  Disable all tracking.

    * 1:  Track the most recent owner of each NBL (enables !ndiskd.pendingnbls)

 

    Show me all allocated NBLs so I can manually find the one I want

 

You can find all allocated NBLs with the command “!ndiskd.nblpool -force -find ((@$extin.Flags)&0x108)==0x100)”, but that still doesn’t tell you who owns them.
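
The -find expression in that command is a mask-and-compare test.  Reading it as (Flags & 0x108) == 0x100, it selects entries whose 0x100 bit is set and whose 0x008 bit is clear, regardless of any other bits.  The article does not say what those two bits mean, so the sketch below only illustrates the bit arithmetic; the function name matches_filter is hypothetical.

```python
MASK = 0x108       # the two bits the expression looks at
EXPECTED = 0x100   # required pattern: 0x100 set, 0x008 clear


def matches_filter(flags):
    """Return True when flags has bit 0x100 set and bit 0x008 clear."""
    return (flags & MASK) == EXPECTED


if __name__ == "__main__":
    for flags in (0x100, 0x108, 0x008, 0x1F7):
        print(f"{flags:#05x} -> {matches_filter(flags)}")
```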

 

So I asked the customer to turn on “TrackNblOwner”, reboot, wait for the next occurrence of the issue, and collect a new memory dump.
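
For reference, one way to set the value is with the built-in reg.exe tool.  The Python sketch below only builds the command line; it does not touch the registry.  The key path, value name and REG_DWORD type come from the !ndiskd.pendingnbls message above, while the helper name build_reg_add_command is hypothetical.  Running the resulting command requires an elevated prompt on the target machine, followed by a reboot.

```python
# Key path and value name as printed by !ndiskd.pendingnbls.
NDIS_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters"
VALUE_NAME = "TrackNblOwner"


def build_reg_add_command(level=1):
    """Return the reg.exe argument list that sets TrackNblOwner to `level`."""
    if not isinstance(level, int) or level < 0:
        raise ValueError("TrackNblOwner level must be a non-negative integer")
    return [
        "reg", "add", NDIS_PARAMETERS_KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(level),
        "/f",  # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(" ".join(build_reg_add_command(1)))
```

Level 1 is the lowest setting that the message lists as enabling !ndiskd.pendingnbls; level 0 turns tracking off again.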

 

Two days later we received the memory dump file.  I verified that it showed the same issue I had found in the previous dump and that TrackNblOwner was configured correctly:

 

23: kd> dp NDIS!ndisTrackNblOwner L1

fffff880`00ef1a30  00000000`00000001

 

Then I immediately checked all pending NBLs to claim the prize, and it was no longer surprising that the NIC was not uninitializing:

 

23: kd> !ndiskd.pendingnbls

 

PHASE 1/3: Found 20 NBL pool(s).                

PHASE 2/3: Found 550 freed NBL(s).                                   

 

    Pending Nbl        Currently held by                                       

    fffffa801dc559f0   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]                   

    fffffa801dc81680   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]                   

    fffffa80131d2aa0   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]

……………………………….

Rest of the repeated output omitted

 

PHASE 3/3: Found 1854 pending NBL(s) of 3005 total NBL(s).                     

Search complete.
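
With well over a thousand rows, it helps to group the pending NBLs by holder rather than read them one by one.  The sketch below assumes the !ndiskd.pendingnbls output was saved as text in the layout shown above.  It uses only the three rows and the summary line that appear in this article, and the helper names count_by_holder and pending_fraction are hypothetical.

```python
import re
from collections import Counter

# Rows and summary line copied from the output above (omitted rows stay omitted).
PENDING_NBLS_SAMPLE = """\
    Pending Nbl        Currently held by
    fffffa801dc559f0   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]
    fffffa801dc81680   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]
    fffffa80131d2aa0   fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter  [Miniport]
PHASE 3/3: Found 1854 pending NBL(s) of 3005 total NBL(s).
"""

ROW_RE = re.compile(r"^\s*([0-9a-f]{16})\s+([0-9a-f]{16}) - (.+?)\s+\[(\w+)\]\s*$")
SUMMARY_RE = re.compile(r"Found (\d+) pending NBL\(s\) of (\d+) total NBL\(s\)")


def count_by_holder(text):
    """Count pending NBL rows per (handle, name, kind) holder."""
    holders = Counter()
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            _nbl, handle, name, kind = match.groups()
            holders[(handle, name, kind)] += 1
    return holders


def pending_fraction(text):
    """Return pending/total from the PHASE 3/3 summary line, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)) / int(match.group(2))


if __name__ == "__main__":
    for (handle, name, kind), count in count_by_holder(PENDING_NBLS_SAMPLE).most_common():
        print(f"{count:5d}  {handle}  {name}  [{kind}]")
    print(f"pending share: {pending_fraction(PENDING_NBLS_SAMPLE):.1%}")
```

A single holder dominating the list, as in this dump, points straight at the component to inspect next.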

 

So we currently have 1854 NBLs pending on the NIC miniport “fffffa80142d31a0”.  This is the miniport that is currently holding all of the pending NBLs:

 

23: kd> !ndiskd.miniport fffffa80142d31a0

 

 

MINIPORT

 

    My Ethernet 1Gb 4-port Adapter 

 

    Ndis handle        fffffa80142d31a0

    Ndis API version   v6.20

    Adapter context    fffffa80138cc000

    Miniport driver    fffffa800d4f7530 - MyMiniPortDriver  v1.0

    Network interface  fffffa800d25e870

 

    Media type         802.3

    Device instance    PCI\VEN_1111&DEV_1111&SUBSYS_169D103C&REV_01\4&2263a140&0&0010

    Device object      fffffa80142d3050    More information

    MAC address        xx-xx-xx-xx-xx-xx

 

 

STATE

 

    Miniport           Running

    Device PnP         QUERY_REMOVED

    Datapath           Normal

    Operational status DORMANT

    Operational flags  DORMANT_PAUSED

    Admin status       ADMIN_UP

    Media              Connected

    Power              D0

    References         9

    Total resets       0

    Pending OID        None

    Flags              BUS_MASTER, 64BIT_DMA, SG_DMA, DEFAULT_PORT_ACTIVATED,

                       SUPPORTS_MEDIA_SENSE, DOES_NOT_DO_LOOPBACK,

                       MEDIA_CONNECTED

    PnP flags          PM_SUPPORTED, DEVICE_POWER_ENABLED, RECEIVED_START,

                       HARDWARE_DEVICE

 

What you notice from the above is that the device received a “QUERY_REMOVED” PnP event and is currently in the DORMANT_PAUSED state.

 

From: http://msdn.microsoft.com/en-us/library/ff566737.aspx:

NET_IF_OPER_STATUS_DORMANT_PAUSED

The operational status is set to NET_IF_OPER_STATUS_DORMANT because the miniport adapter is in the paused or pausing state.

 

NDIS 6.0 and later allow miniport adapters to be paused, and the NDIS documentation describes what the miniport driver should do when it receives a pause request.

 

Because the adapter was in a paused state, basic network commands like “ping” ceased to work, as described earlier in the symptoms.  The next action is definitely to involve the miniport adapter vendor to trace this further and find out why all these pending NBLs were never completed.

 

Until the next adventure!

Best Regards,

Karim
