Hello Debug community this is Karim Elsaid again. Today I’m going to discuss a recent interesting case where intermittently the server is losing access to the network. No communication (even pings) can be done from / to the server when the issue hits.
We went through the normal exercise and asked the customer to obtain a Kernel memory dump from the machine while it was in the problematic state, hoping that we will find some data to help us to demystify the issue.
One of the very first commands we run upon receiving a hang dump is the very famous “!locks” command. This yielded the following:
8: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks..
Resource @ nt!IopDeviceTreeLock (0xfffff80001a81c80) Shared 1 owning threads
KD: Scanning for held locks.
Resource @ nt!PiEngineLock (0xfffff80001a81b80) Exclusively owned
Contention Count = 6
KD: Scanning for held locks
84372 total locks, 2 locks currently held
What I’m looking for is Locks with exclusive owners and waiters. From the above output we can see that thread fffffa800cd8a040 exclusively owns a Plug and Play (Pi prefix) lock and shared owns an I/O Manager (Io prefix) device tree lock.
There are no waiters for the exclusive lock, however PnP locks always worth investigating. While debugging I always treat everything a possible suspect unless proven otherwise, so let’s dump this thread:
8: kd> !thread fffffa800cd8a040 e
THREAD fffffa800cd8a040 Cid 0004.005c Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa8016527510: (0006,0310) Flags: 00000000 Mdl: 00000000
Owning Process fffffa800cd56040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 14791337 Ticks: 15577 (0:00:04:03.002)
Context Switch Count 835317 IdealProcessor: 2
Win32 Start Address nt!ExpWorkerThread (0xfffff8000188f530)
Stack Init fffff88002b0fc70 Current fffff88002b0ee30
Base fffff88002b10000 Limit fffff88002b0a000 Call 0
Priority 12 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
Child-SP RetAddr Call Site
fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a
fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2
fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f
fffff880`02b0f0e0 fffff880`01618fcd nt! ??::FNODOBFM::`string'+0x12ff6
fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d
fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe
fffff880`02b0f1f0 fffff880`00f884b7 NDIS!ndisPnPNotifyBinding+0xbf
fffff880`02b0f280 fffff880`00fa1911 NDIS!ndisPnPNotifyAllTransports+0x377
fffff880`02b0f3f0 fffff880`00fa2c5b NDIS!ndisCloseMiniportBindings+0x111
fffff880`02b0f500 fffff880`00f3bbc2 NDIS!ndisPnPRemoveDevice+0x25b
fffff880`02b0f6a0 fffff880`00fa5b69 NDIS!ndisPnPRemoveDeviceEx+0xa2
fffff880`02b0f6e0 fffff800`01aec8d9 NDIS!ndisPnPDispatch+0x609
fffff880`02b0f780 fffff800`01c6c1e1 nt!IopSynchronousCall+0xc5
fffff880`02b0f7f0 fffff800`0197f733 nt!IopRemoveDevice+0x101
fffff880`02b0f8b0 fffff800`01c6bd34 nt!PnpRemoveLockedDeviceNode+0x1a3
fffff880`02b0f900 fffff800`01c6be40 nt!PnpDeleteLockedDeviceNode+0x44
fffff880`02b0f930 fffff800`01cfcd04 nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`02b0f9a0 fffff800`01cfd35c nt!PnpProcessQueryRemoveAndEject+0xc34
fffff880`02b0fae0 fffff800`01be65ce nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`02b0fb10 fffff800`0188f641 nt! ?? ::NNGAKEGL::`string'+0x5ab9b
fffff880`02b0fb70 fffff800`01b1ce5a nt!ExpWorkerThread+0x111
fffff880`02b0fc00 fffff800`01876d26 nt!PspSystemThreadStartup+0x5a
fffff880`02b0fc40 00000000`00000000 nt!KiStartSystemThread+0x16
Interesting, by looking at the stack above we can see that thread is doing some NDIS PnP stuff. This thread has been waiting for more than 4 minutes, but hold on, what is “ nt! ?? ::FNODOBFM::`string”? This doesn’t seem to be a useful function name, no its not! This is a side effect of Basic Block Tools optimization (BBT). Using public symbols the debugger will find it hard to get to the right symbol, there is a nice a trick you can use in order to get to the right function.
P.S for a nice x64 Deep Dive please refer to our archive.
Let’s display the function data for the return address fffff800`018e1816:
8: kd> .fnent fffff800`018e1816
Debugger function entry 000000e8`f28f14f8 for:
(fffff800`018c4790) nt! ?? ::FNODOBFM::`string'+0x12ff6 | (fffff800`018c47c8) nt!vDbgPrintExWithPrefixInternal
BeginAddress = 00000000`000da7d0
EndAddress = 00000000`000da81c
UnwindInfoAddress = 00000000`001c8a54
Unwind info at fffff800`019cfa54, 10 bytes
version 1, flags 4, prolog 0, codes 0
BeginAddress = 00000000`000182f0
EndAddress = 00000000`00018358
UnwindInfoAddress = 00000000`001bf910
Unwind info at fffff800`019c6910, 6 bytes
version 1, flags 0, prolog 4, codes 1
00: offs 4, unwind op 2, op info c UWOP_ALLOC_SMALL.
For optimized binaries, you will find a section “Chained Info”. Add the BeginAddress to the start address of the module and you should hit the correct function so:
8: kd> ln nt+000182f0
(fffff800`0181f2f0) nt!ExWaitForRundownProtectionReleaseCacheAware | (fffff800`0181f358) nt!KeGetRecommendedSharedDataAlignment
nt!ExWaitForRundownProtectionReleaseCacheAware (<no parameter info>)
Bingo! You got the function. So tcpip!FlPnpEvent was calling ExWaitForRundownProtectionReleaseCacheAware. This function will basically wait for the rundown protection to drop down to 0.
A thread can call ExAcquireRundownProtectionEx against a shared object for safe access. Rundown Protection provides a way to protect an object from being deleted unless all outstanding access has been finished (Run Down). The “ExWaitForRundownProtectionReleaseCacheAware” will do exactly the same; it will wait for all rundown protection calls to be completed.
The question is which structure are we waiting for its rundown to drain, that will depend on what we are dealing with. Because of code optimization the debugger is not showing you the full picture. Through code review I found that in this particular dump there is an inline call to function “FlpUninitializePacketProviderInterface”.
So the stack in reality should look like this:
Child-SP RetAddr Call Site
fffff880`02b0f0e0 fffff880`01618fcd nt!ExWaitForRundownProtectionReleaseCacheAware
----inline function---- tcpip!FlpUninitializePacketProviderInterface
So we need to un-initialize a network interface but before doing that we need to make sure that there are no outstanding references to packets and that there are no outstanding packets pending. When we say packets, starting in NDIS 6 we basically mean “NET_BUFFER” and “Net_Buffer_List” structures. So we need to check for any outstanding Net_Buffer_Lists (NBLs) that are pending, one reference will correspond to one pending NBL.
To the rescue, the “NDISKD” debugger extension has a very nice and handy command to display all pending NBLS and their owners, it is “!pendingnbls”. For the command to work it you must first enable “TrackNblOwner” through the registry. By default, this registry key is not enabled on server SKUs as it may cause a performance hit. On client SKUs this is enabled by default.
When you run !pendingnbls on a clean Windows 2008 R2 install you get:
8: kd> !ndiskd.pendingnbls
This command requires NBL tracking to be enabled on the debugee target
machine. (By default, client operating systems have level 1, and servers
have level 0). To enable, set this REG_DWORD value to a nonzero value on
the target machine and reboot the target machine:
HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters ! TrackNblOwner
Possible Values (features are cumulative)
* 0: Disable all tracking.
* 1: Track the most recent owner of each NBL (enables !ndiskd.pendingnbls)
Show me all allocated NBLs so I can manually find the one I want
You can find all allocated NBLs with the command “!ndiskd.nblpool -force -find ((@$extin.Flags)&0x108)==0x100)”, but still you don’t get any owner.
So I asked the customer to turn on “TrackNblOwner” and reboot, wait for the next occurrence of the issue and get a new memory dump.
Two days later we received the memory dump file. I verified that they are having the same issue I found in the last dump and that TrackNblOwner is configured correctly:
23: kd> dp NDIS!ndisTrackNblOwner L1
Then I immediately checked all pending NBLs to claim the prize, and it was not surprising to see why the NIC card was not un-initializing:
23: kd> !ndiskd.pendingnbls
PHASE 1/3: Found 20 NBL pool(s).
PHASE 2/3: Found 550 freed NBL(s).
Pending Nbl Currently held by
fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa801dc81680 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa80131d2aa0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
Ret of the repeated output omitted
PHASE 3/3: Found 1854 pending NBL(s) of 3005 total NBL(s).
So we currently have 1854 NBLs pending on the NIC miniport driver “fffffa80142d31a0”. This is the Miniport that currently holding all NBLs:
23: kd> !ndiskd.miniport fffffa80142d31a0
My Ethernet 1Gb 4-port Adapter
Ndis handle fffffa80142d31a0
Ndis API version v6.20
Adapter context fffffa80138cc000
Miniport driver fffffa800d4f7530 - MyMiniPortDriver v1.0
Network interface fffffa800d25e870
Media type 802.3
Device instance PCI\VEN_1111&DEV_1111&SUBSYS_169D103C&REV_01\4&2263a140&0&0010
Device object fffffa80142d3050 More information
MAC address xx-xx-xx-xx-xx-xx
Device PnP QUERY_REMOVED
Operational status DORMANT
Operational flags DORMANT_PAUSED
Admin status ADMIN_UP
Total resets 0
Pending OID None
Flags BUS_MASTER, 64BIT_DMA, SG_DMA, DEFAULT_PORT_ACTIVATED,
PnP flags PM_SUPPORTED, DEVICE_POWER_ENABLED, RECEIVED_START,
What you notice from the above that the device received a “Query_Removed” PNP and is currently in a Dormant_Paused state.
The operational status is set to NET_IF_OPER_STATUS_DORMANT because the miniport adapter is in the paused or pausing state.
NDIS 6.0 and up allow miniport adapters to be paused and the documentation here shows what the miniport driver should do when it receives a pause request.
Because the adapter was in a pause state, basic network commads like “ping” ceased to work as described earlier in the symptoms. The next action is definitely to involve the miniport adapter vendor to trace this further and find out why all these pending NBLs were not completed.
Until a next adventure!
Hello, my name is Fred Jeng from the Global Escalation Services team. For today’s post, I want to go over how Windows 7 and Windows Server 2008 R2 load file system mini-filters in a mixed environment when legacy filters are also present. I recently came across an issue where the filters were being loaded out of order based on their altitudes. This can cause all sorts of problems with a filter driver’s functionality if they are incorrectly positioned on the stack. Take for example the following filter stack, obtained using the fltmc command from the cmd prompt:
Filter Name Num Instances Altitude Frame
------------------------------ ------------- ------------ -----
AVLegacy 389998.99 <Legacy>
EncryptionLegacy 149998.99 <Legacy>
AVMiniFilter 3 328000 0
luafv 1 135000 0
FileInfo 13 45000 0
At first glance it looks like there is a problem causing the legacy encryption filter to be loaded above the antivirus minifilter, which has a higher altitude. This may cause issues with AVMiniFilter as the IOs that it receives are still encrypted. Due to limitations in how the filter drivers attach to the driver stack, this is actually the intended behavior. However, there is a solution to manipulate the load order to load the legacy filters correctly based on their altitude.
First some background information regarding legacy filters and minifilters.
In the old days before minifilters, legacy drivers can only attach at the top of the driver stack so the load order also controlled the attachment order. The earlier a legacy driver loads, the lower it can attach on the file system stack. Minifilters on the other hand can load at any time, but their positions relative to other minifilters are controlled by their altitude. When a minifilter loads, it needs to register with an appropriate frame created by fltmgr. Each frame is a fltmgr device object and represents a range of altitudes. There can be more than one frame on the file system stack but the range of altitudes that each frame represents cannot overlap with the altitude range of another frame. For interoperability with legacy drivers, minifilters must still maintain a load order group. The frames are created and managed by fltmgr, which itself is a legacy driver. The ramification of this is that fltmgr must follow the old legacy filter driver rules and attach only at the top of the stack.
From the above example, let’s walk through how the legacy and minifilters above are loaded to get us into the state such that the altitude appears to be out of order.
First, here are the details for the 5 drivers.
MSDN has an article that describes load order groups and altitudes for minifilters: http://msdn.microsoft.com/en-us/library/windows/hardware/ff549689%28v=vs.85%29.aspx.
Referencing this article regarding load order groups and altitudes for minifilter drivers, we can determine that our filters will load in the following order.
On system bootup, when fltmgr.sys loads it will create Frame 0 with a default altitude range of 0 to 49999. When FileInfo loads with an altitude of 45000, it will fit into the default Frame 0. Next to load is EncryptionLegacy. Since this is a legacy driver, it will attach on top of the legacy driver fltmgr.sys. So this is how our file system stack looks right now.
Next up is the AVLegacy driver. This is a legacy driver, so it has to attach above EncryptionLegacy.
Now the AVMinifilter will load with an altitude of 328000. The OS will check if it will fit in the Frame 0 Fltmgr, but this frame only supports an altitude of 0-45000. Before deciding to create a new fltmgr Frame instance, it will check if there are any legacy filters attached above Frame 0 and adjust Frame 0’s altitude if there are. So in our case, we do have legacy filters on the file system stack at this point and so we go up the list of legacy drivers. First we see EncryptionLegacy with an altitude of 149998.99 so we adjust Frame 0 to cover from 0 to 149998.99. We continue up the list and see AVLegacy with an altitude of 329998.99 so we again adjust the altitude of Frame 0 to now cover 0-329998.99. The reason we do this is because Frame 0 now must handle all minifilters below 329998.99. Since we can only attach legacy filters to the top of the stack, if we add an additional FltMgr frame instance, it has to sit above AVLegacy and can only support minifilters with altitude of 329998.99 or higher. Now that Frame 0 supports 0-329998.99, we can register AVMinifilter with Frame 0.
At this point, you can already see that AVMinifilter which has a higher altitude then EncryptionLegacy will be loaded below EncryptionLegacy. The last driver to load is the luafv minifilter, and it will fit into Frame 0.
A couple of things to point out.
Why can’t we insert a Frame between AVLegacy and EncryptionLegacy when ACMinifilter loads?
This is due to how the file system stack is constructed with legacy drivers only being able to attach to the top of the stack. Since FltMgr is a legacy driver, it has to conform to these rules.
Why do we adjust the altitude in Frame 0 to cover 0-329998.99? Why not stop at 149998.99?
If Frame 0 only adjusts its altitude to the legacy filter directly attached above it and not all the way to the highest attached legacy filter, we won’t be able to handle some range of mini-filters. For example, assume we only adjust Frame 0 to cover 0-149998.99, then when the AVMinifilter with an altitude of 328000 comes along, it won’t fit in Frame 0, and we’re unable to insert a Frame between AVLegacy (329998.99) and EncryptionLegacy(149998.99) so we would either be unable to load AVMinifilter, or we would have to create Frame 1 above AVLegacy and load AVMinifilter there. In which case we would again be faced with the altitude disordering issue.
If this is the expected behavior, how do we resolve the problem of EncryptionLegacy being loaded above AVMinifilter? The solution is to inject a dummy minifilter that loads at the appropriate time to force fltmgr to create a Frame between the legacy filters. For our case above, I used the DDK to create the NullFilter minifilter driver and changed the load order to FSFilter Compression and gave it an altitude of 160030 which is within the assigned altitude for FSFilterCompression and set the start type to SERVICE_BOOT_START. Please note that I only used this driver in a test environment, production minfilter drivers must use an altitude assigned by Microsoft.
For information on minifilter load order groups and altitude, reference http://msdn.microsoft.com/en-us/windows/hardware/gg462963.aspx.
Our list of filters are as follows:
So with the new NullFilter dummy driver, our filter load order should be as follows:
After FileInfo and EncryptionLegacy loads, the stack is the same as what we had earlier.
Now when the NullFilter minifilter loads with an altitude of 160030, we see that it doesn’t fit in Frame 0. As before, we check for any attached legacy filter drivers and see EncryptionLegacy so we adjust Frame 0 to cover 0-149998.99. Since NullFilter still does not fit in Frame 0, we will create a new Frame and attach it above the EncryptionLegacy driver.
The AVLegacy driver will load next, and since it is a legacy driver, it will attach above the Frame 1 instance of FltMgr.
The last two minifilters to load are AVMiniFilter and luafv. When AVMinifilter loads into Frame 1 with an altitude of 328000, it will see that Frame 1 at the time only supports 149998.99-160030. It follows the same algorithm to check if there are any legacy filters attached above the frame. In this case, we have AVLegacy attached above Frame 1 so we adjust Frame 1 to cover 149998.99-329998.99 before inserting AVMinifilter into Frame 1.
By strategically injecting a dummy minifilter driver, we can get the legacy and minifilter drivers to all load at the correct altitude.
Recently, I was contacted by a customer who was advised by an ISV to set a registry value under one of the sub keys in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\. Let's call it UseQuantumComputing = 1 (value name has been changed to protect the ISV). The customer wanted to know what this value actually did and no one could find any documentation explaining it. These issues often come to our team because we have access to the Windows source code. I did a bit of code review to find out what this value does. As it turns out, nowhere in Windows source code between Windows 2000 and Windows Server 2012 do we ever check for or set UseQuantumComputing.
I can think of a few reasons the ISV would suggest setting this value. Perhaps they were under the impression this did something but got confused about the value name. It's possible they hoped making a registry change would have a placebo effect. Or, perhaps their software actually checks this value, not Windows.
The latter of these possibilities is probably the worst case scenario. An ISV should not create a registry value inside of keys used for Windows’ own internal use. Why? The first reason is that there's no guarantee that Microsoft won't end up coincidentally using that same value name later. This would cause a conflict between the two users of the value. Second, we have to consider the scenario where two different ISVs both decide to use the same value. That would be bad too. Lastly, there's no guarantee that the key in use will still exist in later versions, or that it will be writeable or readable by the ISV due to permission changes. In addition to all these reasons, there is the common sense issue that it is just confusing. Now the ISV's software and uninstaller needs to look all over the registry, not just in their own keys.
On a similar note, I also recently had a case where a "Windows Tips" blog (not created, endorsed, or run by Microsoft) suggested using a registry value that was implemented in Windows but was not documented by Microsoft. It turns out this value wasn't thoroughly tested (because it was undocumented and wasn't intended to be used in production), and using it would cause server hangs under certain conditions. These hangs were only discovered after a large customer decided to implement the undocumented value across their enterprise.
Here are a few tips for IT Pros, developers, and users alike:
I'll leave you with the warning displayed in many of our KBs - it's there for a reason!
WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.