The problem: When using a particular application on one specific device, users saw severely degraded UI responsiveness for a period of a few seconds.  After those seconds, UI responsiveness went back to normal.


The investigation: I used ActiveSync to drop celog.dll and oscapture.exe onto a standalone device.  I could have hooked the device up to Platform Builder and run Remote Kernel Tracker over KITL, but in this case the device I had was configured to run without KITL.  I just didn’t bother reconfiguring it. 


I made a .lnk file to start oscapture, and a second .lnk file to trigger it (run a second instance of oscapture with the “-c” parameter).  I dropped these onto the Start menu of the device.  I ran oscapture, and played around with the errant application until the UI slowdown occurred.  As quickly as I could (due to the UI slowdown it took a few seconds) I brought up the start menu and clicked the link to trigger oscapture.  OScapture saved its data to a file.  I then hooked up ActiveSync again, and copied the log file off the device.  Actually now that I think of it, I probably could have just connected Remote Kernel Tracker to the device over ActiveSync, but you know I’m not that smart.


Here is what I saw when opening up the file in Remote Kernel Tracker:

[Screen shot: Remote Kernel Tracker view of the captured CeLog data]

The main thread of pimg.exe is taking all the CPU time, while a few other threads (notably a couple in telshell.exe, which is responsible for the Smartphone UI) get little blips of time here and there, contending over a critical section.  I know they are contending because CeLog only logs critical section entry and exit events when there is contention; critical sections are acquired and released far too often to log uncontended operations.  You don’t see any critical section exit events in this screen shot, but they’re there; they happen so close to the entry events that the “exit” icons are hidden underneath the “entry” icons.  If I had zoomed in some more, you’d see them resolve into distinct events.

All of these threads are at the same priority, 251 (THREAD_PRIORITY_NORMAL).  I know the priorities because I hovered over the thread lines and read them from the resulting tooltips.  Despite the equal priorities, only one thread, the one from pimg.exe, is getting much CPU time, and it is running to the completion of its quantum.  I know that because I set “time marker 1 & 2” in the time scale bar; you can see them in the screen shot.  In the bottom right of the Remote Kernel Tracker window, “Marker Time Delta” shows the time between the two markers: about 100ms, which is the default thread quantum in Windows CE.  Each time the pimg.exe thread runs, it runs for about that same length of time.


I immediately recognized this as a lock convoy problem.  Let me describe what that means.  A lock convoy is a situation which occurs when two or more threads at the same priority frequently (several times per quantum) acquire a synchronization object, even if they each hold that object for only a very short time.  It happens most often with critical sections, but can occur with mutexes, etc., as well.

For a while the threads may go along happily without contending over the object.  But eventually some thread’s quantum will expire while it holds the object, and then the problem begins.  The expired thread (let’s call it Thread A) stops running, and the next thread at that priority level begins.  Soon that thread (let’s call it Thread B) reaches a point where it needs to acquire the object, and it blocks on it.  The kernel chooses the next thread in the priority queue.  If more threads at that priority end up trying to acquire the object, they block on it too.  This continues until the kernel returns to Thread A, which owns the object.  Thread A begins running again, and soon releases the object.

Here are the two important points.  First, once Thread A releases the object, the kernel chooses a thread that’s blocked waiting for it (probably Thread B), makes that thread the next owner of the object, and marks it as “runnable.”  Second, Thread A hasn’t expired its quantum yet, so it continues running rather than switching to Thread B.  Since the threads in this scenario acquire the object frequently, Thread A soon comes back to a point where it needs the object again.  This time, however, Thread B owns it.  So Thread A blocks on the object, and the kernel again chooses the next thread in the priority queue.  It eventually gets to Thread B, which does its work while owning the object, then releases it.  The next thread blocked on the object receives ownership, and this cycle continues until eventually the threads stop acquiring so often.
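The hand-off dynamic above can be sketched with a toy simulation.  To be clear, this is not kernel code; it is a deterministic model of two equal-priority threads under round-robin scheduling, with made-up tick counts, that only contrasts “release hands ownership to a waiter” against “release leaves the lock free”:

```c
#include <assert.h>

#define Q    97   /* scheduler quantum, in ticks (odd, so expiry lands mid-hold) */
#define WORK 4    /* ticks of other work between lock acquisitions */
#define HOLD 6    /* ticks spent holding the lock */

enum { WORKING, HOLDING, BLOCKED };

typedef struct { int state, left; } Thr;

/* Two equal-priority threads under round-robin scheduling.
 * handoff = 1: releasing the lock passes ownership straight to the
 *              waiter, as described above (convoy-prone).
 * handoff = 0: releasing leaves the lock free; the waiter merely
 *              becomes runnable and must re-try when it next runs.
 * Returns the number of context switches over 'total' ticks. */
static int simulate(int total, int handoff)
{
    Thr t[2] = { { WORKING, WORK }, { WORKING, WORK } };
    int owner = -1, waiter = -1;          /* one waiter suffices for 2 threads */
    int cur = 0, qleft = Q, switches = 0;

    for (int tick = 0; tick < total; tick++) {
        Thr *c = &t[cur];
        int switched = 0;

        if (--c->left == 0) {
            if (c->state == WORKING) {    /* needs the lock right now */
                if (owner < 0) {
                    owner = cur; c->state = HOLDING; c->left = HOLD;
                } else {                  /* contended: block, switch to owner */
                    c->state = BLOCKED; waiter = cur;
                    cur = owner; qleft = Q;
                    switches++; switched = 1;
                }
            } else {                      /* HOLDING: release the lock */
                c->state = WORKING; c->left = WORK;
                owner = -1;
                if (waiter >= 0) {
                    if (handoff) {        /* waiter owns it immediately */
                        owner = waiter;
                        t[waiter].state = HOLDING; t[waiter].left = HOLD;
                    } else {              /* waiter will simply re-try */
                        t[waiter].state = WORKING; t[waiter].left = 1;
                    }
                    waiter = -1;
                }
            }
        }
        if (!switched && --qleft == 0) {  /* quantum expiry */
            if (t[1 - cur].state != BLOCKED) { cur = 1 - cur; switches++; }
            qleft = Q;
        }
    }
    return switches;
}
```

Once the convoy ignites, the hand-off mode settles into roughly one context switch per lock acquisition, while the free-release mode switches only a couple of times per quantum.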


The real performance killer in a lock convoy is that every time the threads involved need to acquire the synchronization object, they block and take at least a couple of context switches.  Since the premise of this situation is that the threads acquire the object frequently (several times a quantum), they run for much less than a quantum between context switches.  The less time they run between acquires, the more context switches are involved; and remember, this is true regardless of how long or short a time they spend owning the object.


The other performance killer in a lock convoy situation is that it dramatically increases the percentage of CPU time that non-contending threads at the same priority level get.  If all the threads at that priority level are contending over the object, what you see is very frequent context switching between them all.  But if one or more threads at that level run continuously without blocking on any objects, those threads get to run for a full quantum each time the contending threads loop on the object.  That is what you see in the screen shot above: the pimg.exe thread is not involved in the lock convoy, but since it’s at the same priority level as all the threads in the convoy, it gets nearly all the CPU time.  In this case the UI threads ran only for tiny periods, once per quantum of pimg.exe execution, so the UI was very slow to respond to user input.  Eventually the pimg.exe thread completed its work, the UI threads completed their convoy, and the UI response time went back to its snappy old self.


The solution: Lock convoys are not always easy to solve.  In this case, I don’t know the solution for sure, but I believe they either lowered the priority of the pimg.exe main thread or made it do its work at another time.  While that doesn’t actually get rid of the lock convoy, it speeds it up to the point where it is unnoticeable to the user.


Another work-around might have been to make the pimg.exe thread yield frequently, by calling Sleep(0), to allow the other threads to do their work more often than once per quantum.  Note, however, that this isn’t an ideal work-around, because it leads to very frequent thread switching.  Context-switching very often is a problem in itself; you spend your time making the kernel reschedule threads instead of letting those threads run.


A different work-around is to reduce contention over the object by allowing it to be shared more often.  For example, suppose the object is a critical section protecting some data.  You don’t want threads to read the data while it’s being written, or to write concurrently.  If reads are relatively frequent while writes are relatively rare, one way to reduce contention is to convert the lock to a reader/writer lock.  If writes are frequent the improvement is minimal, but otherwise such a change can allow threads to read at the same time without serializing on the lock.


Sometimes you can work around the problem by acquiring the lock at some other time.  For example, I note that there’s a thread in filesys.exe involved in the convoy.  So I suspect the critical section that was undergoing contention was inside filesys.exe.  (While critical sections cannot be shared between processes, the non-filesys.exe threads may have used its critical section during calls into filesys.exe APIs.)  Maybe the UI threads were reading from files to draw icons, or reading from the registry, or writing to a log file.  In the read case, perhaps the UI could have kept the objects in a cache so as to avoid entering the file system.  In the write case, perhaps it could queue up file system writes to execute at a later time.  Either solution amounts to caching data and performing operations asynchronously, instead of during time-critical UI processing.
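The write-queuing idea can be as simple as an in-memory buffer that the UI thread appends to, with the actual file system work done later from a background or idle-time task.  A minimal sketch (all names are hypothetical, and no locking is shown):

```c
#include <string.h>

#define MAX_PENDING 64
#define MAX_LINE    128

/* UI code calls queue_log_write() instead of touching the file
 * system; a low-priority or idle-time task calls flush_log_writes()
 * later, taking the file system locks while no UI thread needs them. */
static char pending[MAX_PENDING][MAX_LINE];
static int  npending;

/* Returns 1 on success, 0 if the queue is full. */
static int queue_log_write(const char *line)
{
    if (npending >= MAX_PENDING)
        return 0;
    strncpy(pending[npending], line, MAX_LINE - 1);
    pending[npending][MAX_LINE - 1] = '\0';
    npending++;
    return 1;
}

/* Run outside UI-critical time; returns the number of lines flushed. */
static int flush_log_writes(void)
{
    int flushed = npending;
    for (int i = 0; i < npending; i++) {
        (void)pending[i];   /* real code would WriteFile() the line here */
    }
    npending = 0;
    return flushed;
}
```

A real version would need a lock of its own around the queue, but that lock is held only for a memory copy, which is far cheaper than holding file system locks during UI processing.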


While lock convoys are also possible in Windows XP, they’re a little less problematic because of the priority boosting and aging that the Windows XP scheduler does.  If a thread blocks on an object, then when it becomes un-blocked the XP scheduler temporarily boosts its priority a little.  When a thread runs to the completion of its quantum, the XP scheduler ages its priority a little.  In this manner, threads that block often end up at higher priority than threads that run continuously without blocking.  Then the lock convoy isn’t eliminated, but there is less chance of another unrelated thread skunking up the works by running a full quantum each thread switch.


A better solution, if you can change all the code that acquires the contentious lock, is to use TryEnterCriticalSection to acquire the lock only when it’s uncontended. 


    int retries = 0;
    while (!TryEnterCriticalSection(&CritSec)) {
        if (retries++ < SpinLimit) {
            Sleep(0);               /* yield; maybe the owner will release it */
            continue;
        }
        /* Spun long enough; fall back to a blocking acquire */
        EnterCriticalSection(&CritSec);
        break;
    }


This code calls TryEnterCriticalSection a number of times, yielding the processor on each failure, until it reaches SpinLimit and then falls back to a regular EnterCriticalSection.  The SpinLimit is necessary in case the critical section is owned by a lower-priority thread, which would never get a chance to run and release the critical section while the current thread loops.  In that case the blocking EnterCriticalSection call is required so that the kernel can resolve the priority inversion, briefly boosting the owning thread until it releases the critical section and the higher-priority thread can acquire it.


You may expect this code, which basically polls for the critical section, to hurt performance.  But believe it or not, it can greatly improve it.  The difference between this loop and a regular EnterCriticalSection call is that a polling thread does not enter the queue of waiters, and so never becomes the designated next owner, while some other thread holds the critical section.  So in the example above, when Thread A releases the critical section, Thread B does not immediately become the owner.  Thread A is free to continue running, and can even acquire and release the critical section a few more times, before Thread B comes back and tries to acquire it again.  Even with a SpinLimit of 2 (that is, a single Sleep(0) call) two threads can recover from critical section contention without forming a convoy.
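For readers outside Windows CE, the same pattern translates directly to POSIX, with pthread_mutex_trylock standing in for TryEnterCriticalSection and sched_yield() for Sleep(0).  A sketch, where SPIN_LIMIT and the function names are my own:

```c
#include <pthread.h>
#include <sched.h>

#define SPIN_LIMIT 2                /* matches the discussion above */

static pthread_mutex_t g_cs = PTHREAD_MUTEX_INITIALIZER;

/* Try a few times, yielding between attempts, then block for real. */
static void enter_polite(pthread_mutex_t *m)
{
    int retries = 0;
    while (pthread_mutex_trylock(m) != 0) {
        if (retries++ < SPIN_LIMIT) {
            sched_yield();          /* like Sleep(0): let the owner run */
            continue;
        }
        pthread_mutex_lock(m);      /* gave up spinning; queue up and wait */
        break;
    }
}

/* Tiny single-threaded exercise of the helper. */
static int demo(void)
{
    enter_polite(&g_cs);
    int acquired = 1;
    pthread_mutex_unlock(&g_cs);
    return acquired;
}
```

One difference worth noting: a default POSIX mutex does not do the priority boosting described above, so on a priority-sensitive system you would want a mutex created with the PTHREAD_PRIO_INHERIT protocol for the final blocking acquire.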


Alas, this solution is unavailable if you cannot change the code which acquires the critical section.


To those of you who might ask whether we intend to make changes to Windows CE to eliminate lock convoys, I’m not at liberty to discuss what we may or may not do in future products.  I can only say the standard non-informative spiel: Microsoft is committed to improving the quality of its products and considers all improvements in the context of other improvements we wish to make, to provide our customers the biggest benefits possible in the time available blah blah blah.  (By the way that spiel, while hokey, really is true.)


In the meantime I hope I’ve provided you with some information you can use to identify and escape this problem if you encounter it.


UPDATE 26 June 06:  Discussion of whether the TryEnter/Sleep solution hurts perf.  I don’t really think the Sleep() will hurt perf at all.  Here is what would happen:

  1. Thread 1 quantum expires while holding the CS (which is what starts a convoy)
  2. Thread 2 does TryEnter, fails, and sleeps
  3. Thread 1 runs again, acquires and releases the CS a bunch of times, and quantum expires while not holding the CS
  4. Thread 2 runs again, does TryEnter again, succeeds.  Now it can acquire and release the CS a bunch of times.

So the sleep causes a thread switch, but it’s a thread switch you’d have taken anyway with a regular EnterCriticalSection.  And now that I think about it more, I believe the suggested change to hold the CS for a longer period would negatively impact the perf of the TryEnter solution, because it’d increase the chances of your quantum expiring while you hold the CS.  Though even then the impact wouldn’t be bad, because in general the threads would run to quantum completion.