We recently dealt with an interesting issue that I would like to share, hope you enjoy. - Jason
Issue Summary
Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:
Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.
Troubleshooting & Analysis
Where to start… to begin, given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicates the message pump for each of the affected applications is stuck waiting for something.
In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking a thread has work to do, or it doesn't. If there was work to do it would normally take the CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. So we pegged 1 out of the 4 @ 100% resulting in task manager displaying 25% CPU utilization for the Winlogon in question.
So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.
Sidebar: Winlogon on Windows Server 2003.
Any time you ask what a feature is or how it works, you should begin your research with the Windows Technical Reference (try a Live search like "winlogon site:http://technet2.microsoft.com"). In this case I focused on the "How Interactive Logon Works" article.
This article starts with:
The Windows Server 2003 interactive logon architecture includes the following components:
So Winlogon has something to do with Interactive logon.
What else can we find out about the responsibilities of Winlogon
Looking at this list of tasks performed by Winlogon, we need to try to determine which task aligns to our symptoms (hang during logon and logoff). I decided to focus on User Profiles due to the relationship between profiles and login and logoff.
What is a User Profile? Back to Live with "user profiles site:http://technet2.microsoft.com". Out of the results, I went with the "User Profile Structure" link. The page provides the following detail:
A user profile consists of:
With this we can now look at the Winlogon that is spiking the CPU. My step when looking at a CPU spike for a process is to determine if the time spent is in User-mode or Kernel mode. In task manager on the Performance tab you can monitor CPU usage. If you select the View menu and chose to "Show Kernel Times" you can get additional detail showing you if the time is associated with a User-mode module or a Kernel module. In this case the Winlogon spike showed the spike to be in Kernel time. This means the application (Winlogon) asked the OS to do something and it is trying to get it done.
My second step is to determine what user-mode request led to the spike in Kernel time. To answer this on a production system I chose to use Sysinternal's Process Explorer. I downloaded the tool and set it up so it would be able to obtain symbols from the public MS symbols server. During the repro I monitored the thread CPU time in Winlogon. Here is a screen shot of what Process Explorer looks when looking at the thread activity of an idle Winlogon.
Figure 1 – Process Properties in Process Explorer
When the CPU spike occurred we looked at the stack for the spiked thread (double click on the thread with the high CPU time or click the "Stack" button with the thread highlighted):
ntdll.dll!KiFastSystemCall+0x3ntdll.dll!KiFastSystemCallRetntdll.dll!ZwUnloadKey+0xcADVAPI32.dll!LocalBaseRegUnLoadKey+0x51ADVAPI32.dll!RegUnLoadKeyW+0x73USERENV.dll!MyRegUnLoadKey+0x6dUSERENV.dll!CUserProfile::UnloadUserProfileP+0x2a2USERENV.dll!UnloadUserProfileI+0x198RPCRT4.dll!Invoke+0x30RPCRT4.dll!NdrStubCall2+0x299RPCRT4.dll!NdrServerCall2+0x19RPCRT4.dll!DispatchToStubInCNoAvrf+0x38RPCRT4.dll!RPC_INTERFACE::DispatchToStubWorker+0x11fRPCRT4.dll!RPC_INTERFACE::DispatchToStub+0xa3RPCRT4.dll!LRPC_SCALL::DealWithRequestMessage+0x42cRPCRT4.dll!LRPC_ADDRESS::DealWithLRPCRequest+0x127RPCRT4.dll!LRPC_ADDRESS::ReceiveLotsaCalls+0x430RPCRT4.dll!RecvLotsaCallsWrapper+0xdRPCRT4.dll!BaseCachedThreadRoutine+0x9dRPCRT4.dll!ThreadStartRoutine+0x1bkernel32.dll!BaseThreadStart+0x34
Just to make sure, we repro'd again and looked at the call stack
We can see here that both call stacks for the thread in question are RPC threads waiting for incoming requests. The request that came in was from userenv module running in Winlogon associated with the session of the user logging off and came into the Winlogon in session 0. The request is to unload the User Profile (highlighted above). This led to a call to RegUnloadKey call which removes a hive from the registry but does not modify the file containing the registry information.
We now know the issue is related to unloading the user profile and specifically the registry portion of the user profile. Now we need to determine where we are spending the time in kernel mode.
Kernrate
Kernrate is a tool included with the Windows 2003 resource kit. This will let us dig into the kernel mode side of what is going on. I collected the following data from kernrate during the hang condition (Summarized):
Parsing the output, we see that 72.66% of the Kernel time was spent in the idle process. This is attributed to the 3 idle threads that are on the 3 processors not being used by the Winlogon thread. We see that the Winlogon process accounts for 20.39% of the kernel time (the CPU spike on the 4th processor). If we move into the module summary, you can see the correlated CPU times for the modules, INTELPPM (CPU driver) running on the idle processors and NTOSKRNL running on the 4th processor (this is where we will focus). In the final drill down, we see the function summary for NTOSKRNL and we can identify the function in use 82% of the time was CmpCleanUpKCBCacheTable. The Cm function prefix lets us know this is related to Configuration Manager (the Registry – see Table 2-7 from Microsoft Windows Internals 4th Edition for Commonly used Kernel Prefixes).
What do we know? We know the issue manifests most during user logoff. We know during the logoff we are trying to unload the profile. We know that in the process of doing this we are trying to unload a registry hive. We know that leads us to spend a lot of CPU time in the Kernel doing CmpCleanUpKCBCacheTable.
Why does this hang the machine? The registry is protected / synchronized with an ERESOURCE named CmpRegistryLock. While this is held exclusively during this cleanup function, all registry access is blocked (both read and write). This explains all of our symptoms. Applications freeze and do not redraw due to operations that need registry access being done on the GUI thread of a process. Perfmon cannot report due to its dependency on the registry.
Resolution
Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Sever 2003 SP1 (applies to both SP1 and SP2)
KB927182
From the KB cause section:
This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.
The fix applies to Windows Server 2003 with SP1 or SP2.
PingBack from http://lostdriveblog.com/2008/04/10/closing-the-loop-cpu-spike-in-winlogonexe/
Nice post, Jason - good use of the tools!
(The link to Process Explorer is actually pointing at the Process Monitor utility page. And, interestingly, it's pointing to a prior version of Process Monitor - Process Monitor 1.3 was released this week, and the page that is linked to is apparently an old page for Process Monitor 1.26. The Sysinternals Site theme is even the old one...)
"USERENV.dll!MyRegUnLoadKey" -- Love it! Also "ReceiveLotsaCalls" is pretty funny too :)