Consider the following scenario:

You have an ASP.NET application which intermittently responds sluggishly. As the problem occurs memory usage is about average, as is CPU usage, but still certain pages respond slower and slower. The machine acts just as if it is under heavy load, but judging from the CPU it isn't. In fact CPU usage might even decrease rather than increase. When trying to access non .NET content such as plain text/html or images you usually find that response times are just as fast as ever.

When you find that your application has hung in a low CPU state it is usually because of one of the following reasons (in order of likelihood based on my experience) :

  1. External resources responding slowly. E.g. a database query or AD lookup is taking a long time to complete and so the application is simply waiting for another server to respond.
  2. One of the threads of the application has entered a critical section and now several other threads are waiting to enter the same critical section.
  3. A classic deadlock. Thread A waits for thread B which waits for thread C which waits for thread A.

Quite often you'll see a mixture of items 1 and 2. A thread enters a critical section, calls a remote database, or similar, and all other threads attempting to enter the same critical section will have to nicely stay in line until the struggling database has responded. Still, it is quite possible that the critical section is waiting for a local event to finish, but then you'd usually see a higher CPU load on the machine. (Though not necessarily in the w3wp.exe process.)

What are Critical Sections?

In case you've forgotten or simply not worked that much with critical sections it might be necessary to refresh your memory on this.

A Critical Section is a segment of code that you do not want to be executed more than one at a time. In order to ensure this you use the lock keyword to ensure that a certain block of code is able to complete with no interruption from any other threads. To illustrate this I've created the following code sample:

public class FileIO
{
    private static object myLock = new object();

	public FileIO()
	{
	}

    public string performLockOperation()
    {
        string strResult = "The Lock Operation started on " + System.DateTime.Now.ToLongTimeString();
        lock (myLock)
        {
            // Let the thread sleep for 10 seconds
            // This way we simulate a File IO operation that takes some time to complete.
            System.Threading.Thread.Sleep(10000);
        }
        strResult += " and ended on " + System.DateTime.Now.ToLongTimeString();
        return strResult;
    }

    public string performNoLockOperation()
    {
        string strResult = "The No Lock Operation started on " + System.DateTime.Now.ToLongTimeString();
        System.Threading.Thread.Sleep(10000);
        strResult += " and ended on " + System.DateTime.Now.ToLongTimeString();
        return strResult;
    }

}

Let's take a quick look at the parts of this class that I've written, called FileIO:

myLock

The class has a static variable called myLock that I will be using to identify the lock I'm about to create. This is why the variable is static. Otherwise I'd be creating new locks for each instance of the FileIO class leaving the lock operation completely redundant. Now that I use a static variable the lock spans all instances of the FileIO class.

constructor

The constructor is empty. Move along. Nothing to see here.

performLockOperation

This is the most interesting part of the class. It is the function we'll be using to actually reproduce the potential problems. The function returns a string which will contain information about when the call to the function was made and when it completed. We store this data in a string called strResult which we will eventually be returning as the result of the function call.

Having declared this variable and put the initial time of the function call in it we then create our lock. This is simply done with the following syntax:

lock (object)
{
}

Any code within this block will first make sure that there is not already a lock taken on this object. If there is, then they will pause execution and wait for the lock to become available before continuing execution.

Inside our lock we then simulate a slow File IO operation by letting the thread sleep for 10 seconds.

After that we append the current time to the result and return it.

performNoLockOperation

This function is identical to the previous one, except this time we do not have a critical section. This function is included just for comparison.

The potential problem

Now, let's get down to business. If we have a web page that makes a call to the performLockOperation-function and that page is requested at the same time by four different clients. What would be the response times?

The answer is 10, 20, 30 & 40 seconds.

Here's why:

  1. The thread handling the first request enters the critical section, effectively locking it for everyone else. When it is complete 10 seconds have passed due to the Sleep-call.
  2. One of the other threads enters the critical section once the first request is finished. It has already waited for 10 seconds, and the thread will now sleep for an additional 10 seconds.
  3. As the second thread finishes one of the remaining two enters the critical section. It has now waited for 20 seconds and will sleep for an additional 10.
  4. The final request is able to enter the critical section after having waited for 30 seconds. Once it is complete 40 seconds have passed.

If we'd been using the performNoLockOperation instead then all requests would have responded after 10 seconds.

Now, I know that 10 seconds is an exceptionally long time to let the thread sleep, but this was all for illustration purposes. Imagine instead that the sleeping time was 1 second and the number of requests was 200...

Troubleshooting this issue

Okay, so what would it look like when this problem occurs? Let's have a look. If you haven't looked at my posts on getting started with Windbg, then I suggest reading them before continuing. (Part I & Part II)

First of all I fire up WinDbg, then I attach to the w3wp.exe. I then use "~*e!clrstack" to look at the managed threads of the w3wp.exe application. Here is a snippet of what I get:

OS Thread Id: 0xce0 (24)
Child-SP         RetAddr          Call Site
000000000447e0f0 000007ff001e13d2 FileIO.performLockOperation()
000000000447e170 000007feee323ec9 _Default.Page_Load(System.Object, System.EventArgs)
000000000447e1b0 000007fee7956aea System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
000000000447e1e0 000007fee794d2c4 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
000000000447e210 000007fee794d322 System.Web.UI.Control.OnLoad(System.EventArgs)
000000000447e250 000007fee79498ac System.Web.UI.Control.LoadRecursive()
000000000447e2a0 000007fee7948db0 System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
000000000447e370 000007fee7948cdb System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
000000000447e3d0 000007fee7948c70 System.Web.UI.Page.ProcessRequest()
000000000447e430 000007ff001e0c09 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
000000000447e490 000007fee7950117 ASP.default_aspx.ProcessRequest(System.Web.HttpContext)
000000000447e4c0 000007fee791449b System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
000000000447e570 000007fee7ffbd41 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
000000000447e610 000007fee7fed132 System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
000000000447e7a0 000007fee7fcf599 System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
000000000447e7f0 000007fee80f5344 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
000000000447e910 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000447ea80 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000447eae0 000007fef72701ea DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
000000000447f310 000007fee80f5424 DomainNeutralILStubClass.IL_STUB(IntPtr, System.Web.RequestNotificationStatus ByRef)
000000000447f3f0 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000447f560 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000447f5c0 000007fef727043b DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
OS Thread Id: 0x15ec (25)
Unable to walk the managed stack. The current thread is likely not a 
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0x1958 (26)
Child-SP         RetAddr          Call Site
0000000004c1dd10 000007ff001e13d2 FileIO.performLockOperation()
0000000004c1dd90 000007feee323ec9 _Default.Page_Load(System.Object, System.EventArgs)
0000000004c1ddd0 000007fee7956aea System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
0000000004c1de00 000007fee794d2c4 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
0000000004c1de30 000007fee794d322 System.Web.UI.Control.OnLoad(System.EventArgs)
0000000004c1de70 000007fee79498ac System.Web.UI.Control.LoadRecursive()
0000000004c1dec0 000007fee7948db0 System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
0000000004c1df90 000007fee7948cdb System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
0000000004c1dff0 000007fee7948c70 System.Web.UI.Page.ProcessRequest()
0000000004c1e050 000007ff001e0c09 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
0000000004c1e0b0 000007fee7950117 ASP.default_aspx.ProcessRequest(System.Web.HttpContext)
0000000004c1e0e0 000007fee791449b System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
0000000004c1e190 000007fee7ffbd41 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
0000000004c1e230 000007fee7fed132 System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
0000000004c1e3c0 000007fee7fcf599 System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
0000000004c1e410 000007fee80f5344 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
0000000004c1e530 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
0000000004c1e6a0 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
0000000004c1e700 000007fef72701ea DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
0000000004c1ef30 000007fee80f5424 DomainNeutralILStubClass.IL_STUB(IntPtr, System.Web.RequestNotificationStatus ByRef)
0000000004c1f010 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
0000000004c1f180 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
0000000004c1f1e0 000007fef727043b DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
OS Thread Id: 0x1e28 (27)
Child-SP         RetAddr          Call Site
000000000492dcf0 000007ff001e13d2 FileIO.performLockOperation()
000000000492dd70 000007feee323ec9 _Default.Page_Load(System.Object, System.EventArgs)
000000000492ddb0 000007fee7956aea System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
000000000492dde0 000007fee794d2c4 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
000000000492de10 000007fee794d322 System.Web.UI.Control.OnLoad(System.EventArgs)
000000000492de50 000007fee79498ac System.Web.UI.Control.LoadRecursive()
000000000492dea0 000007fee7948db0 System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
000000000492df70 000007fee7948cdb System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
000000000492dfd0 000007fee7948c70 System.Web.UI.Page.ProcessRequest()
000000000492e030 000007ff001e0c09 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
000000000492e090 000007fee7950117 ASP.default_aspx.ProcessRequest(System.Web.HttpContext)
000000000492e0c0 000007fee791449b System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
000000000492e170 000007fee7ffbd41 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
000000000492e210 000007fee7fed132 System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
000000000492e3a0 000007fee7fcf599 System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
000000000492e3f0 000007fee80f5344 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
000000000492e510 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000492e680 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000492e6e0 000007fef72701ea DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
000000000492ef10 000007fee80f5424 DomainNeutralILStubClass.IL_STUB(IntPtr, System.Web.RequestNotificationStatus ByRef)
000000000492eff0 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000492f160 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000492f1c0 000007fef727043b DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
OS Thread Id: 0x1c38 (28)
Child-SP         RetAddr          Call Site
0000000004d0e0b0 000007ff001e13d2 FileIO.performLockOperation()
0000000004d0e130 000007feee323ec9 _Default.Page_Load(System.Object, System.EventArgs)
0000000004d0e170 000007fee7956aea System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)
0000000004d0e1a0 000007fee794d2c4 System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)
0000000004d0e1d0 000007fee794d322 System.Web.UI.Control.OnLoad(System.EventArgs)
0000000004d0e210 000007fee79498ac System.Web.UI.Control.LoadRecursive()
0000000004d0e260 000007fee7948db0 System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
0000000004d0e330 000007fee7948cdb System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
0000000004d0e390 000007fee7948c70 System.Web.UI.Page.ProcessRequest()
0000000004d0e3f0 000007ff001e0c09 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
0000000004d0e450 000007fee7950117 ASP.default_aspx.ProcessRequest(System.Web.HttpContext)
0000000004d0e480 000007fee791449b System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
0000000004d0e530 000007fee7ffbd41 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
0000000004d0e5d0 000007fee7fed132 System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
0000000004d0e760 000007fee7fcf599 System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
0000000004d0e7b0 000007fee80f5344 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
0000000004d0e8d0 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
0000000004d0ea40 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
0000000004d0eaa0 000007fef72701ea DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
0000000004d0f2d0 000007fee80f5424 DomainNeutralILStubClass.IL_STUB(IntPtr, System.Web.RequestNotificationStatus ByRef)
0000000004d0f3b0 000007fee80f65bb System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
0000000004d0f520 000007fee80f4994 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
0000000004d0f580 000007fef727043b DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32

So, what does this tell us?

Well, obviously we have four threads that call my function FileIO.performLockOperation() from Page_Load.

Please note that we don't have any additional callstack, and since the performLockOperation-function was named by me it could just as well have been named "foo". In order to find out that these threads are actually waiting for a lock we need to look at the native callstack using the kb-command. Let's take a look at thread 24:

0:029> ~24kb
ntdll!NtDelayExecution+0xa
kernel32!SleepEx+0x84
mscorwks!EESleepEx+0x2d
mscorwks!Thread::UserSleep+0x71
mscorwks!ThreadNative::Sleep+0xf9
App_Code_tvh3usvo!FileIO.performLockOperation()+0x8a
App_Web_uywrbugj!_Default.Page_Load(System.Object, System.EventArgs)+0x32
System_Web_RegularExpressions_ni!System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)+0x19
System_Web_ni!System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)+0x2a
System_Web_ni!System.Web.UI.Control.OnLoad(System.EventArgs)+0x84
System_Web_ni!System.Web.UI.Control.LoadRecursive()+0x42
System_Web_ni!System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)+0x97c
System_Web_ni!System.Web.UI.Page.ProcessRequest(Boolean, Boolean)+0xa0
System_Web_ni!System.Web.UI.Page.ProcessRequest()+0x5b
System_Web_ni!System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)+0xf0
App_Web_uywrbugj!ASP.default_aspx.ProcessRequest(System.Web.HttpContext)+0x9
System_Web_ni!System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()+0x257
System_Web_ni!System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)+0xab
System_Web_ni!System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)+0x501
System_Web_ni!System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)+0x72

Apparently we entered FileIO.performLockOperation, and then called sleep, (which is what we're doing in this repro in order to simulate a slow IO operation.) So, what do the other threads look like? Here's thread 26:

0:029> ~26kb
ntdll!ZwWaitForMultipleObjects+0xa
kernel32!WaitForMultipleObjectsEx+0x10b
mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
mscorwks!Thread::DoAppropriateAptStateWait+0x41
mscorwks!Thread::DoAppropriateWaitWorker+0x191
mscorwks!Thread::DoAppropriateWait+0x5c
mscorwks!CLREvent::WaitEx+0xbe
mscorwks!AwareLock::EnterEpilog+0xc9
mscorwks!AwareLock::Enter+0x72
mscorwks!AwareLock::Contention+0x1fb
mscorwks!JITutil_MonContention+0xdf
App_Code_tvh3usvo!FileIO.performLockOperation()+0x7f
App_Web_uywrbugj!_Default.Page_Load(System.Object, System.EventArgs)+0x32
System_Web_RegularExpressions_ni!System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr, System.Object, System.Object, System.EventArgs)+0x19
System_Web_ni!System.Web.Util.CalliEventHandlerDelegateProxy.Callback(System.Object, System.EventArgs)+0x2a
System_Web_ni!System.Web.UI.Control.OnLoad(System.EventArgs)+0x84
System_Web_ni!System.Web.UI.Control.LoadRecursive()+0x42
System_Web_ni!System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)+0x97c
System_Web_ni!System.Web.UI.Page.ProcessRequest(Boolean, Boolean)+0xa0
System_Web_ni!System.Web.UI.Page.ProcessRequest()+0x5b

This is one of the threads waiting for the critical section. It then found that it was unable to get a lock and so began waiting for the critical section on thread 24 to end.

Resolving the problem

This is the tricky part. There is no quick-fix for this. We can use the debugger to identify the bottleneck, but after that we'll need to review our design in order to get this working smoothly. A couple of generic things to try would be the following:

  • Review what you're doing inside the critical section. Is it really necessary to perform all of the operations you're doing, or would it be possible to do some/all of them outside the scope of the critical section?
  • Identify the reason for the lock taking so much time. Can it be resolved by load-balancing or clustering the resource we're waiting for?
  • If we're waiting for another server, make sure the physical infrastructure is okay. Are network response times within the expected parameters or do you need to fix your wiring, router, etc?
  • Are all calls to the critical section taking the same amount of time, or are there certain calls that take an exceptional amount of time? Perhaps most calls to the function are just fine, but intermittently a call comes along that tries to get everything from the AD, database, etc. In order to investigate this I'd use the commands outlined in Getting started with windbg -part II. Identify the thread owning the lock and investigate the callstack and stack objects using !clrstack and !dso. Use !dso and !do to query the stack objects and try to find the SQL query or AD query that is causing the thread to take such a long time to finish. Is the query significantly different from the ones waiting to execute, then this is most likely our culprit.

Post Script

A quick thought on locks and their limits:

Locks do not scale well. A lock can only be made within a specific process, so if you're using a web farm, or in any other way more than one w3wp.exe, then you will not be able to lock all processes. In other words. Server 1 can get a lock, and so can Server 2. Both will happily execute their code independent of each other. For products such as SQL server you'd deal with this in your SQL query, but if you're performing other operations, such as File IO you might have to work around this by using the same approach as Ms Word. Create a lock file in a shared space, and before you enter your critical section you simply check if the file exists:

  1. You are entering your critical section. Look for lock file in shared location
  2. If file exists, then wait
  3. If/when file doesn't/no longer exist, create file and enter critical section
  4. Make sure you have solid exception handling within the critical section so that you under no circumstances leave the critical section without removing the lock file.
  5. When done, remove lock file

/ Johan