I have a customer with a two-node (W2K3 R2 SP2+ x64) failover cluster (MSCS) whose nodes are mainly dedicated to hosting 11 instances of SQL Server 2005 and 7 instances of Analysis Services 2005.
When the total percentage of processor time reached 100% and stayed pegged there for a few seconds, the Network Interface/Output Queue Length counter for the MS TCP Loopback interface started increasing, and any packet routed to that pseudo-interface remained queued until the system load came down a bit.
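The correlation described above can be sketched as a simple check over sampled counter values, for example after exporting Performance Monitor data to CSV. This is a hypothetical illustration; the sample values below are made up, not taken from the actual incident.

```python
# Hypothetical sketch: flag sampling intervals where CPU is pegged at 100%
# AND the MS TCP Loopback "Output Queue Length" counter is growing,
# i.e. the symptom correlation observed in this incident.

def flag_correlated_samples(samples, cpu_threshold=100.0):
    """samples: list of (cpu_percent, loopback_output_queue_length) tuples,
    one per sampling interval. Returns the indices where CPU is at or above
    the threshold while the queue length grew relative to the prior sample."""
    flagged = []
    prev_queue = None
    for i, (cpu_pct, queue_len) in enumerate(samples):
        growing = prev_queue is not None and queue_len > prev_queue
        if cpu_pct >= cpu_threshold and growing:
            flagged.append(i)
        prev_queue = queue_len
    return flagged

# (cpu %, loopback output queue length) per interval -- illustrative data
samples = [(42.0, 0), (88.0, 0), (100.0, 3), (100.0, 9), (71.0, 2)]
print(flag_correlated_samples(samples))  # intervals showing both symptoms
```

In the real case, this pattern was visible directly in the Performance Monitor graph: the queue length only climbed while processor time sat at 100%.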
The problem with that is, as you may already know, that the cluster resource library for the SQL Server resource type implements the IsAlive callback so that it tries to connect to the instance of SQL Server and executes the “SELECT @@SERVERNAME” query. Since the IP of the clustered instance is local, the connection is routed through this loopback interface; and since that interface wasn’t responding because of the situation explained above, the cluster service ends up assuming the resource is not operational and restarts it (or eventually fails it over to the other node, depending on how many failures have occurred for the resource in a given period, and on its configuration).
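To make the failure mode concrete, here is a minimal sketch of an IsAlive-style liveness probe. The names and structure are hypothetical; the real resource DLL connects through the SQL client stack and runs “SELECT @@SERVERNAME”, while this sketch models only the connect step: if the TCP handshake to the instance’s IP and port does not complete within the timeout, the resource is declared not operational.

```python
import socket

def is_alive(host, port, timeout_sec=1.0):
    """Hypothetical IsAlive-style probe: attempt a TCP connection to the
    clustered instance. A timeout or refusal is treated as 'not alive',
    which is roughly what made the cluster service restart the resource."""
    try:
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:  # timeout, refusal, unreachable, etc.
        return False
```

The key point is that such a probe, running on the node itself against a locally bound IP, sends its packets through the congested loopback interface, so it times out even while remote clients connecting over the public interface succeed.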
While this was occurring over the loopback interface, if we tried to connect to any of the instances of SQL Server from outside the node, so that all traffic went over the public interface, connections succeeded and queries returned complete result sets in a timely manner. Therefore, what the cluster resource library concluded (i.e. that the resource wasn’t operational) wasn’t true, since clients were still able to connect to the resource and it responded to all queries sooner rather than later.
It was McAfee’s mini-firewall driver that was starving the Delayed Work Queue threads. Uninstalling this component from the system solved the issue.
Below is a screenshot of real-time activity captured with Performance Monitor when the problem occurred. In it one can see the relationship between CPU consumption and the queuing of network packets targeting any locally bound IP.
This is a screenshot showing McAfee’s system drivers that were causing the problem.
And, finally, here’s an analysis of the kernel dump taken just as the problem occurred.
===== Dump analysis =====
A hidden moral of the above story is:
Stopping an anti-virus service does not unload that software's filter drivers. We may think we know what a software vendor's driver did back when such an issue was first caught, but what about now? The vendor may have since changed their code. So while selectively unloading or disabling filter drivers might be tempting, one might also step on a land mine. In contrast, uninstalling the anti-virus software is a far safer and more certain practice, especially when one incorrectly assumed that stopping a service accomplished the same thing as uninstalling the service's software, filter drivers included.
Excellent post, Nacho!
I've seen something similar in a different context, on an IIS server where PING responses were 'timing out'. Stop the IIS worker process, and the PING would suddenly come back to life. In that case, the diagnosis was similar: based on the increasing Output Queue Length counter for the network card, we figured something was wrong with a 3rd-party filter driver for a firewall component of an antivirus product — the manufacturer will not be named here. Uninstallation was what we had to do, though it must be said that in that specific case the network card drivers were very old as well.