All your NBL are belong to !ndiskd

Last time we talked about !ndiskd.pendingnbls.  This command shows you which component currently holds an NBL.  But what if you want to see how the NBL got there?  That sounds like a job for !ndiskd.nbl-log!

Starting with Windows 8 and Windows Server 2012, NDIS can be configured to record a log of all NBL activity.  NDIS holds a large circular buffer, and writes an event to the buffer each time an NBL changes hands or is cloned.  You can use !ndiskd.nbl -log to search through that giant ringbuffer and collect all the events mentioning this NBL.

Here’s some example output:

kd> !ndiskd.nbl @rdx -log
Allocated
ProtocolSent       ffffcf80012b2c70 - QoS Packet Scheduler-0000
FilterSent         ffffcf800121cc70 - NDIS Sample LightWeight Filter-0000
FilterSent         ffffcf800125ec70 - WFP Native MAC Layer LightWeight Filter-0000
FilterSent         ffffe000012261a0 - Microsoft Hyper-V Network Adapter
SentToMiniport     ffffe000012261a0 - Microsoft Hyper-V Network Adapter

What does this show?  Read the log from top to bottom, where each line is one event in the history of the NBL.  On the first line, we see the NBL was Allocated from NDIS.  The next line shows that a protocol (not named here) sent the NBL to QoS Packet Scheduler.  Next a filter (presumably QoS) sent the NBL to another filter, the NDIS Sample LightWeight Filter.  Skipping down to the last line, we see the last event was the NBL getting sent to a miniport.  So this NBL is a pretty classic case of an NBL being allocated by a protocol, then sent down to the NIC, where it’s currently pending transmission.

That was an easy one.  Let’s take a look at a more complex example:

kd> !ndiskd.nbl @rdx -log
Allocated
    Child-1: Cloned    Parent: fffff8021269ae00, new child: fffffa800ae31a80
    Child-1-1: Cloned  Parent: fffffa800ae31a80, new child: fffffa8007c625c0
    Child-1-2: Cloned  Parent: fffffa800ae31a80, new child: fffffa8008ff1600
    Child-1-2: FreedClone
    Child-1-1: FreedClone
    Child-1: FreedClone
    Child-2: Cloned    Parent: fffff8021269ae00, new child: fffffa800ae31a80
ProtocolSent       fffffa800a0cbc80 - QoS Packet Scheduler-0000
FilterSent         fffffa800af711a0 - Microsoft Hyper-V Network Adapter
SentToMiniport     fffffa800af711a0 - Microsoft Hyper-V Network Adapter
    Child-2-1: Cloned  Parent: fffffa800ae31a80, new child: fffffa8007c625c0
    Child-2-1: FreedClone
    Child-2: FreedClone
MiniportSendCompleted fffffa800a0cbc80 - QoS Packet Scheduler-0000
FilterSendCompleted send complete in NDIS, sorting to Opens
SendCompleted      fffffa8008260010 - TCPIP
Freed

What is going on here?  The first thing you notice is the “Child” NBLs.  The log output will show the activity of the NBL you specify, and also any derived NBLs.  So the output shows that a driver allocated a Clone NBL, which !ndiskd names “Child-1”.  (There’s no significance to the name, other than to make it unique in the !ndiskd output.)  Then the driver cloned the clone, so we have a grandchild NBL named “Child-1-1”.  (The naming scheme is to just append a unique number to the parent’s name, so the third grandchild of the second child would be named “Child-2-3”.)  Another grandchild named “Child-1-2” is allocated.  Then all three clones are freed (presumably the driver used them outside of NDIS, perhaps in WFP, so NDIS didn’t log any activity on these NBLs).

Next we see something interesting.  Another clone is created, “Child-2”.  Then the original NBL is sent down the stack.  This is illegal!  NDIS requires that you hold onto the parent NBL while there are outstanding child NBLs.  But we can see that the parent NBL was sent down the stack before the “Child-2” NBL was freed.

Now let’s take a look at one final example:

kd> !ndiskd.nbl @rdx -log
AllocatedNb
ProtocolSent       fffffa800e3518a0 - QoS Packet Scheduler-0000
FilterSent         fffffa800d23a1a0 - Microsoft Hyper-V Network Adapter
SentToMiniport     fffffa800d23a1a0 - Microsoft Hyper-V Network Adapter
Smuggled to        fffffa800d385010 - BUGGY_PROTOCOL
ProtocolSent       fffffa800d2481a0 - Sample Network Adapter
SentToMiniport     fffffa800d2481a0 - Sample Network Adapter
MiniportSendCompleted NDIS
SendCompleted      fffffa800d385010 - BUGGY_PROTOCOL
Freed

This log was recorded just before a bugcheck.  The interesting line is “Smuggled to … BUGGY_PROTOCOL”.  What is “smuggling”?  !ndiskd defines smuggling as any time that an NBL seems to magically hop from one NDIS driver to another, without going through NDIS.  Sometimes, this is perfectly legal.  For example, the miniport and protocol edges of an IM driver exchange packets between them using internal bindings, without going through NDIS.  So IM drivers always smuggle packets between their miniport and protocol edges; this is not a bug.  Likewise, the Microsoft Virtual WiFi Miniport Adapter” is in cahoots with the “Microsoft Virtual WiFi Filter Driver”; they smuggle packets around by-design.

But in a case like this, you see a packet magically hopping from the Hyper-V network adapter to an unrelated protocol driver, the aptly-named BUGGY_PROTOCOL.  That is suspicious, since these two drivers are not in cahoots, and should have no secret backchannel through which to smuggle packets.  The reality is that BUGGY_PROTOCOL has a use-after-free bug, so that protocol tried to send the packet to another NIC while the packet was still in the network adapter.

Before we go, you need to know a few important details.  This NBL logging uses a global ringbuffer, which means that the history of an NBL might be incomplete if the ringbuffer has wrapped around.  The ringbuffer is sized based on how much physical RAM you have, but even on a system with terrabytes of RAM, the ringbuffer can only hold a few milliseconds of events for full 10Gbps traffic.

NBL logging is disabled by default.  You can enable it by setting this registry value to 4 and rebooting the computer:

HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters ! TrackNblOwner [REG_DWORD]

Some people ask what are the performance implications of NBL history logging.  This depends highly on the number of CPUs you have, the amount of traffic you’re pushing, the number of NDIS drivers that are installed, etc.  As a rule-of-thumb, I measured the CPU path length increase to be 3x the default configuration — large enough that you wouldn’t want to enable this during perf testing, but small enough that you probably won’t notice it if you enable it during functional testing.