Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

Concurrency, Part 9 - APIs that enable scalable programming

Concurrency, Part 9 - APIs that enable scalable programming

  • Comments 7
Sorry, I just noticed that this didn't get posted yesterday - I don't know what happened, but...

As I mentioned the other day, Win32 provides a number of tools for writing scalable applications.

Today I want to catalog some of them.  Remember, that the #1 goal you have when attempting to design MP scalable applications is to avoid blocking your thread.  You absolutely never, ever want to let your thread go through the CPU scheduler, because that means that your processor is spending time switching threads when it could be doing work.  So what are some of the tools that Win32 provides for this?

First off, the two I mentioned yesterday, Fibers and Interlocked operations.  Fibers allow an application to switch between tasks without exhausting the currently running threads' quantum, but their cost can be quite high in terms of complexity - you essentially have to write your own scheduler to effectively use them.

The Interlocked functions, however are really, really neat.  You can't really appreciate the power of the APIs from their description.

Interlocked Increment and Decrement are straightforward - they increment and decrement a long, returning the new value of the variable.

InterlockedExchangeAdd is again straightforward - it adds (or subtracts) from a 32bit value.  Again, straightforward.  So is InterlockedExchangePointer - it swaps two pointer values.

The magic of the interlocked functions starts with InterlockedCompareExchangePointer - It turns out that if you're clever, you can use InterlockedCompareExchangePointer to implement a LIFO queue that doesn't require any external synchronization structures.  I'd write down how to make this work, but the good news is that I don't have to, because the NT base team formalized it with the Interlocked Singly Linked List primitives.  These allow you to add and remove entries to a list, again without requiring a critical section.

The next tools that are available for scalability are, somewhat paradoxically, the Win32 critical section structures.  It turns out that there are two big issues with critical sections.  One is that critical sections can block on contention.  The other is that if you use a critical section to protect a commonly accessed variable (like the head of your work queue), a phenomenon known as a lock convoy.  Lock convoys are horribly bad when you're designing an application to be scalable, because all your threads eventually wind up being serialized on a single critical section.  So a great deal of effort has been put into reducing the likelihood of lock convoys in critical sections.  The first of the changes made (for NT4 SP3 (and Windows 98+)) was to add the InitializeCriticalSectionAndSpinCount API.  This API allows the user to specify a "spin count" for critical sections - essentially, when you attempt to enter the critical section, if the critical section is owned by another thread, instead of simply blocking (and throwing away your quantum), the EnterCriticalSection API will "spin" trying repeatedly to acquire the critical section.  So if your critical section is "hot" (meaning it's a high contention critical section, and the critical section is held for a very short period of time), then the odds are very good that the critical section will be released while the other thread is spinning.  And that means that the thread that's trying to grab the critical section won't have to give up its quantum.

Now ICSASC isn't a panacea - if you own the critical section for any length of time, then spinning on contention just wastes time.  But for some critical sections, it can be a huge win.

In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended tor resolve many of the lock convoy issues.  Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first -in, first-out) basis.  Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads.  If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics.

As I mentioned in my "So you want a worker thread pool" article, another incredibly useful way of achieving scalability is to use an I/O completion port for a work queue - because the completion port queue is protected by the same kernel structure that protects the scheduler in NT, if there are any work items on the queue, your thread won't block when it calls GetQueuedCompletionStatus.

And, in addition to all these, there's one other facility - Process Affinity.  The SetProcessAffinityMask, SetThreadAffinityMask and SetThreadIdealProcessor APIs can be used to "pin" a thread to a CPU.  Otherwise the thread will be run on whichever processor is available.  Switching a thread from one processor to another can have serious negative consequences, since there is internal per-CPU state that gets tossed when a thread is moved.

There may be other APIs and facilities for improving the scalability of your application in Win32, but (honestly) they're not coming to mind immediately :)

Tomorrow, I want to talk about a couple of the real-world problems that you start encountering when you build highly scalable applications.

Edit: Added a paragraph about the Win2003 SP1 EnterCriticalSection behavior changes (I just got approval to post that behavior change).

Edit2: Fixed InterlockedIncrement/Decrement description, thanks Dave

  • > InitializeCriticalSectionAndSpinCount API

    For multiprocessor systems the advantage is obvious, as long as the thread trying to enter the critical section is on a different processor from the thread that already owns the critical section. The current owner might really leave quickly.

    But for uniprocessor systems it still looks like a disadvantage. The caller still has to go to sleep and the kernel still has to switch the context back to the current owner at some time in order for the current owner to ever have a chance of exiting the critical section. So why waste CPU time spinning instead of letting some other executable thread do something? Sure you've reduced lock convoys but you've done it by turning the same CPU cycles into pure waste.

    You say that this API was implemented in Windows 98. There were no multiprocessor Windows 98 implementations, right? How could this ever be anything other than a loss on Windows 98?

    If you want the caller of EnterCriticalSection to get the rest of the quantum that it was going to have, then the theoretical answer is to give that thread a minor priority boost and the rest of the quantum that it was using. But from everything I've read about Windows, priorities still aren't fine-grained enough to do this without impinging on the genuinely higher priorities of higher priority tasks. Anyway, the theoretical answer isn't to spin when you've only got one CPU.

    > Sorry, I just noticed that this didn't get
    > posted yesterday - I don't know what
    > happened, but...

    The race between the process that submits threads for posting and the thread that actually posts them went the other way ^_^
  • Good points Norman.

    I suspect that on Win98, the "and spin count" part of the API was a NOP (but I don't know, having never written code for Win98). My suspicion is that they just added a wrapper for the API.

    For UP machines ICSASC is almost always a loss (I can't think of a situation where it's a win). But the vast majority of machines are MP machines these days, so it's more important. Also, the big reason you use this API is if you're trying to remove lock convoys in your application. If you're seeing lock convoys in your application then it's likely to be a server type application. And...
  • > But the vast majority of machines are MP
    > machines these days

    Now the principal asks why you misspelled "servers" ^_^
  • MSDN says that InterlockedIncrement() returns the RESULTING value, not what you said (the previous value). This is an important difference. BTW, excellent column, Larry, I just discovered it and am enjoying it immensely.
  • This is more of a .NET question, but what is the difference between concurrency with threads and asynchronous events? Is it the same under the hood or does the CLR do something funky?

    Would the basics that you mention work for message passing languages like Erlang?

    How does Micro C++ fit change the equation?

    I now you already mentioned that you aren't the expert in concurrency. Just fielding the questions to see if you may now any of the answers.
  • What about the memory barriers imposed by use of the InterlockedFoo() operations? My (very superficial) understanding is that they shouldn't be viewed as a panacea for 'lock free' programming because of the potential costs of the memory barriers...?
  • Matt, you're right, there are memory barriers, and they do have a cost associated with them. But their cost is effectively 0 when compared to the cost of a context switch.

    Giampiero, I don't think I can answer that, I don't know how the CLR handles asynchronous events.
Page 1 of 1 (7 items)