Hi again! This is Tate from the CPR team and I’m going to show you how to debug a Server Service hang and the sometimes dreaded Event ID: 2021 and Event ID: 2022. There is much Voodoo about troubleshooting these two events but never fear, it’s possible to debug quickly given the right approach.
Recall, I’ve talked about two other common Server Service events prior, 2019 and 2020. (Event ID: 20xx usually denotes a srv.sys series event) Also, from now on I’m going to refer to the Server Service as srv. It’s in the driver srv.sys that the bulk of the work is happening and as such that’s just what we call it.
Anyway, if you have seen these events before you likely have queried the web with the ID or description and have seen the big KB317249 article.
This is an oldie but goodie, but the problem is that, well it’s big and it would help to have perspective on how srv is supposed to work because all the root causes for the event seem quite varied…but often there is only one of two things happening here.
1.) Blocking in the I/O stack preventing normal amounts of SMB requests from getting processed in a timely fashion. (this is the most common that we see, the SMB pump failure case)
2.) Abnormally large amount of requests directed at srv(this is usually a really chatty SMB application or usually accumulation of ChangeNotify requests).
I’ll show you how to determine which of these two are happening but first let’s start with the event text as a refresher…
Event ID: 2022Source: Srv Description: Server was unable to find a free connection n times in the last s seconds.
Event ID: 2021Source: SrvDescription: Server was unable to create a work item %2 times in the last %3 seconds.
What does this mean?
We really have to start with an explanation of Connections and Work Items. So, internally to srv these are just structures used to represent a connection into srv and an item of work…couldn’t be that straightforward right? It is, when a request comes into srv it has to use an Endpoint, Connection, a Work Item, and other structures to know about where the request came from, what the work is, and how to get the response back to what client.
At its core, srv is just a Kernel Mode Driver that does SMB processing. It takes in network SMB/CIFS requests, converts them into local I/O, and pumps the return back out to the client. But the problem here is why would you be “unable to find” or “unable to create” one of these or what breaks this pump/processing?
But first, let’s comment on storage. Kernel memory is not free and as mentioned in prior blogs these structures take up space in the kernel pools and or may have preconfigured srv nominal limits (see MaxWorkItems in KB317249, etc.) if you hit either of these limits the event will be logged. So as we learned prior running out of Paged or NonPaged pool can hang a machine no problem but can also log these events should Srv not be able to allocate memory on behalf of client requests. (See the Event ID: 2019 2020 blog for that info and troubleshooting). What usually happens here though are these WorkItems build up due to blocking or high load and cause the Event ID: 2019 or 2020…(with the top consumer usually LSwn – LanmanServer WorkItems) and then you will see these along with the 2021/2022s. An event party if you will with the 2021/2022 issue in srv actually exhausting kernel memory pools.
Could you bump up the limits of everything in \LanmanServer via KB317249 not read further and be okay?
Maybe, but it depends how much blocking you may have or how high the load gets before the point of failure. If you are logging this even after the server has been working fine under your normal ebb and flow of load in the past, it may be likely that this article will not completely fix your problem so further work is needed. Keep in mind too that if you bump up the limits in the registry per the above explanation you could actually hit the pool limits sooner! The values in KB317249 are pretty conservative so they shouldn’t hurt but again it might not completely help. Also, if you don’t like changing things until you understand more…read on!
REASON #1 for Event ID: 2021 and 2022
The pump mentioned above is really the Srv!WorkerThread and KQUEUE architecture. When requests come in they are put in KQUEUEs via KeInsertQueue and as you might imagine, the processing/draining of those queues is done by the Srv!WorkerThread(s) via a call to KeRemoveQueue. These threads are the heart of Srv and are the key to determining what is usually happening to break the SMB processing here, so we usually look there for the answer.
To check in on srv you gather a kernel memory.dmp via KB244139 during the problem or with livekd (Sysinternals) perhaps…but the best way is KB244139 or by attaching the kernel debugger. The best time to gather this dump is when the server is somewhat unresponsive via the network (SMB) or you are constantly logging the 2021/2022 events…or if the machine is completely hung after hitting these Events.
With memory.dmp in hand, just load it up and do the following to get the system process’ EPROCESS address via !process 0 0 system
0: kd> !process 0 0 system
PROCESS 89067520 SessionId: 0 Cid: 0008 Peb: 00000000 ParentCid: 0000
DirBase: 00030000 ObjectTable: 89067008 TableSize: 490.
Now change process context using that address with .process /r /p <system EPROCESS address>
0: kd> .process /r /p 89067520
Implicit process is now 89067520
Loading User Symbols
Now let’s dump out all the system threads using !process <system EPROCESS address>
0: kd> !process 89067520
Dumping out the threads…
Once the output of threads is finished, I like to search up from the bottom for “Start Address srv!WorkerThread” looking at the call stacks…
Note: There’s a shortcut here for Windows 2003 to just do a “!stacks 2 Srv!WorkerThread” right away. This output will give you a really quick look at the threads without having to .process into system, etc.
Here’s an example of a “good thread” just waiting for work to come in…sitting in KeRemoveQueue
THREAD 887c14e0 Cid 8.474 Teb: 00000000 Win32Thread: 00000000 WAIT: (WrEventPairLow) KernelMode Non-Alertable
Owning Process 89067520
Wait Start TickCount 160832116 Elapsed Ticks: 203335
Context Switch Count 134294
Start Address srv!WorkerThread (0xbea06880)
Stack Init beb11000 Current beb10d3c Base beb11000 Limit beb0e000 Call 0
Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
beb10d54 8042ef5f 88a10eb8 be9fd320 00000001 nt!KiSwapThread+0x1b1
beb10d78 bea068ec 00000000 00000000 00000000 nt!KeRemoveQueue+0x197
beb10da8 804578c2 be9fd300 00000000 00000000 srv!WorkerThread+0x6c
beb10ddc 8046c966 bea06880 be9fd320 00000000 nt!PspSystemThreadStartup+0x54
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
What does a “bad thread” look like?
Here is a blocked Srv!WorkerThread that is trying to process a FindFirst SMB request…
THREAD 887dd4e0 Cid 8.45c Teb: 00000000 Win32Thread: 00000000 WAIT: (UserRequest) KernelMode Non-Alertable
86941288: (0006,01fc) Flags: 00000884 Mdl: 00000000
Impersonation token: e5d56990 (Level Impersonation)
Wait Start TickCount 160783273 Elapsed Ticks: 252178
NOTE the Tick count here on this thread….it’s been waiting a while, not good.
Context Switch Count 12993
Stack Init bec29000 Current bec284f4 Base bec29000 Limit bec26000 Call 0
bec2850c 8042d893 00000000 882549c8 00000000 nt!KiSwapThread+0x1b1
bec28534 f5ae3408 88255db8 00000006 00000000 nt!KeWaitForSingleObject+0x1a3
bec285bc f5ae4066 bec285e4 00000000 00000000 SomeFiltr+0x3408<<<<<<<making a blocking call here…
bec285fc f5ae19fb 884ce3a0 01941288 8041eecb SomeFiltr+0x4066
bec287a4 804c4994 884ce3a0 00000000 bec28894 SomeFiltr+0x19fb
bec287dc 804531c2 884766c8 00000000 bec28894 nt!IopParseFile+0x44
bec28854 804da4d8 0000051c bec28950 00000040 nt!ObpLookupObjectName+0xf8
bec28964 804a4495 00000000 00000000 00000000 nt!ObOpenObjectByName+0xc8
bec28a40 804a403a 8690d008 00000001 bec28b4c nt!IopCreateFile+0x407
bec28a88 bea0282c 8690d008 00000001 bec28b4c nt!IoCreateFile+0x36
bec28af8 bea050f6 88a0d2f8 8690d008 00000001 srv!SrvIoCreateFile+0x32a
bec28b8c bea0acdb 88a0d2f8 00000001 8046b000 srv!SrvQueryDirectoryFile+0x1c8
bec28c5c bea0936d 88a0d2f8 00000001 00000000 srv!SrvFind2Loop+0x171
bec28cd0 bea07d2f 88a0d201 bea0a9c8 88a0d2f8 srv!DoFindFirst2+0x373
bec28cd8 bea0a9c8 88a0d2f8 88a0d2f8 804636f0 srv!SrvSmbFindFirst2+0x31
bec28cf0 bea089b7 88994f38 88a0d2f8 00000000 srv!ExecuteTransaction+0xdc
bec28d68 be9f27ae 88a0d2f8 be9f2cde 88a0d300 srv!SrvSmbTransaction+0x5af
bec28d70 be9f2cde 88a0d300 88a144a0 bec28da8 srv!SrvProcessSmb+0x46
bec28d84 bea069a9 00000000 00000000 00000000 srv!SrvRestartReceive+0xa2
bec28da8 804578c2 88a14400 00000000 00000000 srv!WorkerThread+0x129
bec28ddc 8046c966 bea06880 88a144a0 00000000 nt!PspSystemThreadStartup+0x54
Note: The tick count (Elapsed Ticks) was not bad for the prior “good thread” because we’ve just been waiting in the call to KeRemoveQueue for that long for work to arrive. No work to do in that queue, no problem. However, in this “bad thread “case we’ve grabbed a WorkItem to process and been waiting for a long time (about 1hr 5 min.) stuck while processing it, yikes.
Why is this bad?
So if the Srv!WorkerThread is the worker thread that drains the WorkItems from this list of request in the KQUEUEs and number of these threads are finite(they are), then we have a problem! As more requests come into srv they will continue to get placed in the KQUEUEs by different event handler threads at the transport layer but with no WorkerThread available to process the queue on the back end (and considering these Connections and WorkItems are stored in Pool and the number is finite as well) this is how and usually why the pump stops and we log these events! So not only might we run out of the configured limits in the registry (KB317249) for srv, we might run out of Non Paged Pool, and we are definitely not giving clients a timely response if at all in this state.
What if I see “good” and “bad” threads in my !process of the system process?
This is likely because of the two queues that we have in service. There are blocking and a non-blocking queues so you may see the threads that are waiting in KeRemoveQueue are actually for the blocking queue whereas your other bad threads are blocked on things like CreateFile, Find, etc. requests…supposed to be shorter time requests, appropriately so, on the the non-blocking work queues. In general, if you see any Srv!WorkerThread blocked in anything but KeRemoveQueue for very long, it’s bad.
So what is the solution?
Referencing the above “bad thread” for SomeFiltr.sys note that the filter driver in question has chosen to make a blocking call to KeWaitForSingleObject. This is blocking a FindFirst request from a client but more importantly it’s consuming an Srv!WorkerThread in the process as stated….so it’s actually the filter driver making the decision to stop the processing so the fault lies with it in this case.
Now it’s worth mentioning that this is a fairly simple 1st tier blocking example but suffice to say any blocking call taking too long in a srv!WorkerThread can have bad consequences. This could have easily been any filter driver or device driver in the I/O stack (between the network and srv and the disk) which is blocking doing some work to “help” process the I/O or more difficult to debug the filter or function driver could be waiting on some other queuing mechanism that it implements…so you see how srv can encounter a dependency failure here.
Another trick that we use is looking at all other thread stacks containing the name of the filter driver to see if it’s doing something recognizable like waiting on another Kernel api, filter, etc…in other words guessing through the black box of it’s decision to wait on this thread and making an educated guess of what work it is trying to do on another…A common case is when the blocked thread may be waiting on a response from an LPC to LSASS which is performing some security check. Then it’s necessary to jump …..
In summary, since we can’t see further into the blocking here through SomeFiltr.sys and there’s no other threads in process with this driver in the call stack we’d recommend testing updating, removing, or contact the vendor of the SomeFiltr.sys filter to see if this is fixed, etc. You can do an “lmvm SomeFiltr” to get the time/date information for the file.
So, what if all the threads are just waiting like in the good thread example?
This could be the unfortunate side of gathering a single memory.dmp to represent a problem that may be transient. If a fraction of a second before you hit the Ctrl+Scroll+Scroll to issue the dump the Srv!WorkerThread unblocks, it may look completely normal. What you can do for this class of issue that has temporary blocking is either try again to get it in state, or have a look at the device stack that srv has to work through to get things done(see devstack below). Usually you will find out what filter drivers you are running, etc. and since that is the #1 reason for this event updating them is usually the solution because we or the vendor have likely already debugged this for you and with the filter vendor’s help have likely fixed it in a later version.
This is why usually the first thing we target is filter drivers before debugging anything for hangs or if a dump is not available. By frequency this is far and away the usual root cause for hangs of many many (did I say many?) symptom types.
How do I check the members of device stack that Srv.sys has to work through to get I/O requests processed, filter drivers I need to update, etc?
This is cool…so did you notice that there was an Irp listed in the prior “bad thread”?
Dump it out…
0: kd> !irp 86941288
Irp is active with 11 stacks 11 is current (= 0x86941460)
No Mdl Thread 887dd4e0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
Args: 00000000 00000000 00000000 00000000
>[ 0, 0] 1 0 884ce3a0 878977a8 00000000-00000000
Args: bec28778 01000000 00070000 00000000
Now grab the Device Object for SomeFiltr…
0: kd> !devstack 884ce3a0
!DevObj !DrvObj !DevExt ObjectName
> 884ce3a0 \FileSystem\SomeFiltr884ce458
88ba5c80 \Driver\InnocentDriver 88ba5d38
88ba5460 \FileSystem\Ntfs 88ba5518
Now you can see how this thing works….ntfs talks to InnocentDriver and it to SomeFiltr, etc….all these guys have to work flawlessly to not kill the srv request dependent upon this device stack.
What if you didn’t have an Irp, no problem…everthing hangs off Object Manager so start at Ntfs’s device objects…This is cooler…
0: kd> !drvobj \FileSystem\Ntfs <<<<note that’s !DRVobj for the Driver Object
Driver object (88dd2710) is for:
Driver Extension List: (id , addr)
Device Object list:
882b4120 8717a620 88b91800 88b981e0
88b9e600 88ba5460 88f583e0 88dd25f0
0: kd> !devstack 882b4120
!DevObj !DrvObj !DevExt ObjectName
86576020 \FileSystem\SomeFiltr865760d8 <<<<<<there you are Mr. SomeFiltr
> 882b4120 \FileSystem\Ntfs 882b41d8
0: kd> !devstack 8717a620
8792ae00 \FileSystem\SomeFiltr8792aeb8 <<<<<<and here
> 8717a620 \FileSystem\Ntfs 8717a6d8
0: kd> !devstack 88dd25f0
886c9020 \FileSystem\SomeFiltr886c90d8 <<<<<<<and here
88f74020 \Driver\InnocentDriver 88f740d8
> 88dd25f0 \FileSystem\Ntfs 00000000 Ntfs
The prudent thing to do if you aren’t sure of root cause by debugging the Srv!WorkerThreads you have captured in a memory.dmp is at least make sure all the filter drivers in these stacks are updated and perhaps even your storage drivers that are being used under ntfs serving your file share(s).
REASON #2 for Event ID: 2021 and 2022
What we have just discussed was the most common reason for hitting the Event ID: 2021 and 2022 event in blocking by a filter or function driver on the I/O stack and how to identify. There are of course other reasons like bursts of demand and a lower than the maximum configured Srv (the nominal limits mentioned above and discussed in KB317249) or just simply a lot of long term requests that are on the system such as ChangeNotify requests. How would you find out about these? Analyze the network traffic.
There are tons of network sniffers which allow you to gauge what type of SMB traffic is going in between servers so I won’t get into that here. I will only say that if you have the ability to generate Statistics of top SMB commands, that’s the way to go. You could however just filter on SMB and ballpark what kind of request and from which clients are occurring most and given your client configuration if that’s normal. What we see here usually is a misbehaving network application in a loop, etc. hitting the file server with repeated requests for the same file, etc. This is also a great strategy for high system process CPU issues where when you look at the thread call stacks generating the CPU via Process Explorer (sysinternals) you find the base of the activity in Srv!WorkerThreads coming and going rapidly.
I will mention however there are a few known ways to reduce the cumulative load of ChangeNotifications and QueryFile and QueryPathInfo requests that Windows Clients actually generate. You can check out the following KBs for these configuration changes on the client side:
For the client SMB side’s shell: NoRemoteChangeNotify.
“An update is available in Windows to turn off directory notification SMB requests”
For the client SMB side (redirector): InfoCacheLevel
(also has shell change NoRemoteRecursiveEvents )
“Windows XP Explorer Pane flickers on mapped network drives”
I hope you’ve enjoyed this post and hopefully the next time you see these srv events you can nail the root cause with windbg no problem!
Keep that device stack updated and until next time…
I was aware of this blog just recently, and am very happy with all the debugging information provided.
Continue with the great work.
I was able to resolve my issue with from what i read here. Thanks Tate
AWESOME !! I never miss a blog from Tate. Keep up the gud work :-)