Hello Debug community, this is Karim Elsaid again. Today I’m going to discuss a recent, interesting case in which a server intermittently lost access to the network. While the issue was occurring, no communication (not even a ping) was possible to or from the server.
We went through the normal exercise and asked the customer to capture a kernel memory dump from the machine while it was in the problematic state, hoping to find some data that would help us demystify the issue.
One of the very first commands we run upon receiving a hang dump is the famous “!locks” command. It yielded the following:
8: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks..
Resource @ nt!IopDeviceTreeLock (0xfffff80001a81c80) Shared 1 owning threads
KD: Scanning for held locks.
Resource @ nt!PiEngineLock (0xfffff80001a81b80) Exclusively owned
Contention Count = 6
KD: Scanning for held locks
84372 total locks, 2 locks currently held
What I’m looking for are locks with exclusive owners and waiters. The !locks output identifies thread fffffa800cd8a040 as the exclusive owner of a Plug and Play lock (Pi prefix) and a shared owner of the I/O Manager device tree lock (Io prefix).
There are no waiters for the exclusive lock; however, PnP locks are always worth investigating. While debugging I treat everything as a possible suspect until proven otherwise, so let’s dump this thread:
8: kd> !thread fffffa800cd8a040 e
THREAD fffffa800cd8a040 Cid 0004.005c Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa8016527510: (0006,0310) Flags: 00000000 Mdl: 00000000
Owning Process fffffa800cd56040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 14791337 Ticks: 15577 (0:00:04:03.002)
Context Switch Count 835317 IdealProcessor: 2
Win32 Start Address nt!ExpWorkerThread (0xfffff8000188f530)
Stack Init fffff88002b0fc70 Current fffff88002b0ee30
Base fffff88002b10000 Limit fffff88002b0a000 Call 0
Priority 12 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
Child-SP RetAddr Call Site
fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a
fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2
fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f
fffff880`02b0f0e0 fffff880`01618fcd nt! ??::FNODOBFM::`string'+0x12ff6
fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d
fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe
fffff880`02b0f1f0 fffff880`00f884b7 NDIS!ndisPnPNotifyBinding+0xbf
fffff880`02b0f280 fffff880`00fa1911 NDIS!ndisPnPNotifyAllTransports+0x377
fffff880`02b0f3f0 fffff880`00fa2c5b NDIS!ndisCloseMiniportBindings+0x111
fffff880`02b0f500 fffff880`00f3bbc2 NDIS!ndisPnPRemoveDevice+0x25b
fffff880`02b0f6a0 fffff880`00fa5b69 NDIS!ndisPnPRemoveDeviceEx+0xa2
fffff880`02b0f6e0 fffff800`01aec8d9 NDIS!ndisPnPDispatch+0x609
fffff880`02b0f780 fffff800`01c6c1e1 nt!IopSynchronousCall+0xc5
fffff880`02b0f7f0 fffff800`0197f733 nt!IopRemoveDevice+0x101
fffff880`02b0f8b0 fffff800`01c6bd34 nt!PnpRemoveLockedDeviceNode+0x1a3
fffff880`02b0f900 fffff800`01c6be40 nt!PnpDeleteLockedDeviceNode+0x44
fffff880`02b0f930 fffff800`01cfcd04 nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`02b0f9a0 fffff800`01cfd35c nt!PnpProcessQueryRemoveAndEject+0xc34
fffff880`02b0fae0 fffff800`01be65ce nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`02b0fb10 fffff800`0188f641 nt! ?? ::NNGAKEGL::`string'+0x5ab9b
fffff880`02b0fb70 fffff800`01b1ce5a nt!ExpWorkerThread+0x111
fffff880`02b0fc00 fffff800`01876d26 nt!PspSystemThreadStartup+0x5a
fffff880`02b0fc40 00000000`00000000 nt!KiStartSystemThread+0x16
Interesting. Looking at the stack above, we can see that the thread is doing some NDIS PnP work and has been waiting for more than 4 minutes. But hold on, what is “nt! ?? ::FNODOBFM::`string'”? That doesn’t look like a useful function name, and it isn’t. It is a side effect of Basic Block Tools (BBT) optimization: with public symbols only, the debugger has a hard time resolving the right symbol. There is a nice trick you can use to get to the right function.
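As an aside, the “Ticks: 15577 (0:00:04:03.002)” field in the thread header can be cross-checked with a little arithmetic. The sketch below is only an illustration: the tick length of 156001 units of 100 ns (about 15.6 ms) is an assumption inferred from the numbers in this dump, not a value read from it, and the helper name is hypothetical.

```python
# Assumption: one clock tick is 156001 units of 100 ns (about 15.6 ms).
# This value is inferred from the dump output, not read from the dump itself.
TICK_100NS = 156001


def ticks_to_elapsed(ticks, tick_100ns=TICK_100NS):
    """Format a tick count as d:hh:mm:ss.mmm, truncated to whole milliseconds."""
    total_ms = (ticks * tick_100ns) // 10_000  # 100 ns units -> milliseconds
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}:{hours:02}:{minutes:02}:{seconds:02}.{ms:03}"


print(ticks_to_elapsed(15577))  # 0:00:04:03.002
```

Under that assumption the result matches the wait time shown for the thread, a little over four minutes.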
P.S. For a nice x64 deep dive, please refer to our archive.
Let’s display the function data for the return address fffff800`018e1816:
8: kd> .fnent fffff800`018e1816
Debugger function entry 000000e8`f28f14f8 for:
(fffff800`018c4790) nt! ?? ::FNODOBFM::`string'+0x12ff6 | (fffff800`018c47c8) nt!vDbgPrintExWithPrefixInternal
BeginAddress = 00000000`000da7d0
EndAddress = 00000000`000da81c
UnwindInfoAddress = 00000000`001c8a54
Unwind info at fffff800`019cfa54, 10 bytes
version 1, flags 4, prolog 0, codes 0
BeginAddress = 00000000`000182f0
EndAddress = 00000000`00018358
UnwindInfoAddress = 00000000`001bf910
Unwind info at fffff800`019c6910, 6 bytes
version 1, flags 0, prolog 4, codes 1
00: offs 4, unwind op 2, op info c UWOP_ALLOC_SMALL.
For optimized binaries, the output includes a “Chained Info” section (flags 4 in the unwind info marks a chained entry). Add the BeginAddress from that section, 000182f0 here, to the base address of the module and you should hit the correct function:
8: kd> ln nt+000182f0
(fffff800`0181f2f0) nt!ExWaitForRundownProtectionReleaseCacheAware | (fffff800`0181f358) nt!KeGetRecommendedSharedDataAlignment
nt!ExWaitForRundownProtectionReleaseCacheAware (<no parameter info>)
Bingo! You got the function. So tcpip!FlPnpEvent was calling ExWaitForRundownProtectionReleaseCacheAware, which basically waits for the rundown protection reference count to drop to 0.
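The address arithmetic behind “ln nt+000182f0” can be sketched in a few lines. This is only an illustration: the nt load address used here is not printed anywhere in the output above, it is derived by subtracting the RVA from the resolved address, and the helper names are hypothetical.

```python
# Assumed nt load address, derived as fffff800`0181f2f0 - 0x182f0.
NT_BASE = 0xFFFFF80001807000
# BeginAddress (an RVA) of the chained function entry from .fnent.
CHAINED_BEGIN_RVA = 0x182F0


def rva_to_va(module_base, rva):
    """Convert a module-relative address (RVA) to a virtual address."""
    return module_base + rva


def fmt(addr):
    """Format a 64-bit address the way the debugger prints it."""
    s = f"{addr:016x}"
    return f"{s[:8]}`{s[8:]}"


print(fmt(rva_to_va(NT_BASE, CHAINED_BEGIN_RVA)))  # fffff800`0181f2f0
```

The same base also reproduces the “Unwind info at” addresses in the .fnent output from their UnwindInfoAddress values, which is a quick sanity check that the base is right.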
A thread can call ExAcquireRundownProtectionEx against a shared object to access it safely. Rundown protection provides a way to prevent an object from being deleted until all outstanding accesses have finished (run down). ExWaitForRundownProtectionReleaseCacheAware does exactly that: it waits until all outstanding rundown protection references have been released.
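To make the semantics concrete, here is a toy user-mode model of rundown protection. It is a minimal sketch, not the kernel implementation: the class and method names are hypothetical, and the timeout parameter exists only so the demo can return instead of hanging the way the thread in this dump does.

```python
import threading


class RundownProtection:
    """Toy model of rundown semantics (not the kernel implementation)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._refs = 0
        self._rundown_started = False

    def acquire(self):
        """Take a reference; fails once rundown has begun."""
        with self._cond:
            if self._rundown_started:
                return False
            self._refs += 1
            return True

    def release(self):
        """Drop a reference and wake the waiter when the count reaches 0."""
        with self._cond:
            self._refs -= 1
            if self._refs == 0:
                self._cond.notify_all()

    def wait_for_release(self, timeout=None):
        """Block new acquires, then wait for outstanding references to drain.

        Returns True once drained, False if the (demo-only) timeout expires.
        """
        with self._cond:
            self._rundown_started = True
            return self._cond.wait_for(lambda: self._refs == 0, timeout)


rp = RundownProtection()
rp.acquire()                      # one outstanding reference
print(rp.wait_for_release(0.05))  # False: the reference is still held
print(rp.acquire())               # False: rundown has already started
rp.release()
print(rp.wait_for_release(0.05))  # True: nothing is outstanding any more
```

In the model, the first wait never completes while a single reference is outstanding, which mirrors the stuck wait seen in the stack.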
The question is: which structure are we waiting on to run down? That depends on what we are dealing with. Because of code optimization, the debugger is not showing the full picture. Through code review I found that in this particular dump there is an inlined call to the function “FlpUninitializePacketProviderInterface”.
So the stack in reality should look like this:
Child-SP RetAddr Call Site
fffff880`02b0f0e0 fffff880`01618fcd nt!ExWaitForRundownProtectionReleaseCacheAware
----inline function---- tcpip!FlpUninitializePacketProviderInterface
So we need to un-initialize a network interface, but before doing that we need to make sure that there are no outstanding references to packets and no packets still pending. Starting with NDIS 6, “packets” basically means “NET_BUFFER” and “NET_BUFFER_LIST” structures. So we need to check for any pending NET_BUFFER_LISTs (NBLs); one reference corresponds to one pending NBL.
To the rescue: the “NDISKD” debugger extension has a very handy command, “!pendingnbls”, that displays all pending NBLs and their owners. For the command to work, you must first enable “TrackNblOwner” through the registry. By default, this registry value is not enabled on server SKUs, as it may cause a performance hit. On client SKUs it is enabled by default.
When you run !pendingnbls on a clean Windows 2008 R2 install you get:
8: kd> !ndiskd.pendingnbls
This command requires NBL tracking to be enabled on the debugee target
machine. (By default, client operating systems have level 1, and servers
have level 0). To enable, set this REG_DWORD value to a nonzero value on
the target machine and reboot the target machine:
HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters ! TrackNblOwner
Possible Values (features are cumulative)
* 0: Disable all tracking.
* 1: Track the most recent owner of each NBL (enables !ndiskd.pendingnbls)
Show me all allocated NBLs so I can manually find the one I want
You can find all allocated NBLs with the command “!ndiskd.nblpool -force -find ((@$extin.Flags)&0x108)==0x100)”, but that still does not show you an owner.
So I asked the customer to turn on “TrackNblOwner”, reboot, wait for the next occurrence of the issue, and capture a new memory dump.
Two days later we received the memory dump file. I verified that it showed the same issue I found in the previous dump and that TrackNblOwner was configured correctly:
23: kd> dp NDIS!ndisTrackNblOwner L1
Then I immediately checked all pending NBLs to claim the prize, and it was no surprise why the NIC was not un-initializing:
23: kd> !ndiskd.pendingnbls
PHASE 1/3: Found 20 NBL pool(s).
PHASE 2/3: Found 550 freed NBL(s).
Pending Nbl Currently held by
fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa801dc81680 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa80131d2aa0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
Rest of the repeated output omitted
PHASE 3/3: Found 1854 pending NBL(s) of 3005 total NBL(s).
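With thousands of lines of output, a quick tally by owner saves scrolling. The sketch below assumes the !pendingnbls output was saved as text and that each entry has the “NBL address, owner handle, dash, owner name” layout shown above; the function name and the regular expression are illustrative, not part of NDISKD.

```python
import re
from collections import Counter

# Matches lines such as:
# fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
LINE_RE = re.compile(r"^\s*([0-9a-f]{16})\s+([0-9a-f]{16})\s+-\s+(.*\S)\s*$")


def tally_pending_nbls(text):
    """Count pending NBLs per (owner handle, owner name) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            _nbl, owner, name = match.groups()
            counts[(owner, name)] += 1
    return counts


sample = """\
PHASE 1/3: Found 20 NBL pool(s).
Pending Nbl Currently held by
fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa801dc81680 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa80131d2aa0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
"""

for (owner, name), count in tally_pending_nbls(sample).items():
    print(f"{count} pending NBL(s) held by {owner} - {name}")
```

Header and PHASE lines do not match the pattern, so only the per-NBL rows are counted.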
So we currently have 1854 NBLs pending on the NIC miniport “fffffa80142d31a0”. This is the miniport that is currently holding all of the NBLs:
23: kd> !ndiskd.miniport fffffa80142d31a0
My Ethernet 1Gb 4-port Adapter
Ndis handle fffffa80142d31a0
Ndis API version v6.20
Adapter context fffffa80138cc000
Miniport driver fffffa800d4f7530 - MyMiniPortDriver v1.0
Network interface fffffa800d25e870
Media type 802.3
Device instance PCI\VEN_1111&DEV_1111&SUBSYS_169D103C&REV_01\4&2263a140&0&0010
Device object fffffa80142d3050 More information
MAC address xx-xx-xx-xx-xx-xx
Device PnP QUERY_REMOVED
Operational status DORMANT
Operational flags DORMANT_PAUSED
Admin status ADMIN_UP
Total resets 0
Pending OID None
Flags BUS_MASTER, 64BIT_DMA, SG_DMA, DEFAULT_PORT_ACTIVATED,
PnP flags PM_SUPPORTED, DEVICE_POWER_ENABLED, RECEIVED_START,
Notice from the above that the device has received a query-remove PnP request (Device PnP state QUERY_REMOVED) and is currently in the DORMANT_PAUSED state.
The operational status is set to NET_IF_OPER_STATUS_DORMANT because the miniport adapter is in the paused or pausing state.
NDIS 6.0 and later allow miniport adapters to be paused, and the documentation here shows what the miniport driver should do when it receives a pause request.
Because the adapter was in a paused state, basic network commands like “ping” stopped working, as described in the symptoms earlier. The next action is definitely to involve the miniport adapter vendor to trace this further and find out why all these pending NBLs were not completed.
Until the next adventure!