|
|
-
-
This is the first in a series of posts about the improvements we are making to the CLR thread pool for CLR 4.0 (which will ship with Visual Studio 2010). This post will cover changes to the queuing infrastructure in the thread pool, which aim to enable high-performance fine-grained parallelism via the Task Parallel Library. Future posts will cover the “thread injection” algorithm, and any other topics that readers would like to see.
Please note that all the usual caveats apply here: I’m discussing pre-release software, and all details are subject to change before final release. In fact, one goal of this post is to solicit feedback, so we can know what changes we need to make before we ship. J
A thread pool basically has two functions: It maintains a queue (or queues) of work to be done, and a collection of threads which execute work from the queue(s). So designing a thread pool really comes down to a) finding ways to enqueue and dequeue work items very quickly (to keep the overhead of using the thread pool to a minimum) and b) developing an algorithm for choosing an optimal number of threads to service the queues. This post will cover part (a).
In all prior releases of the CLR, the thread pool exposed a single way to queue work: ThreadPool.QueueUserWorkItem (which I will abbreviate as QUWI from here on out). There are a couple of overloads of this method, and also a version called UnsafeQueueUserWorkItem, but these all amount to basically the same thing: you give us a delegate, and we stick it on a queue for later execution.
(Really there are even more ways to queue work. As mentioned in my previous post we really have two pools of threads – the “worker threads” and the “I/O threads.” Work is queued to the I/O threads in response to the completion of asynchronous I/O, or manually via ThreadPool.UnsafeQueueNativeOverlapped. We currently do not plan any significant changes to the I/O pool for CLR 4.0, as our focus for this release is on enabling fine-grained computational parallelism. For the remainder of this post, we will only discuss the mechanisms behind the “worker threads.”)
QUWI conveys basically zero information about each work item, aside from that it exists. This places some important constraints on the execution of these items. For example, the thread pool does not know whether individual work items are related or not, so it has to assume they are all completely independent, implying that we cannot reorder work to optimize its execution, as independent work items typically must be executed in FIFO order to ensure fairness. (Imagine each work item represents a request from a user - you would not want to keep earlier requests waiting while later requests are processed, as this would result in unacceptably long latency for the users who made their requests first).
This means that we are basically forced to use a single FIFO queue for all work queued via QUWI. In prior versions of the CLR, this queue was a simple linked list, protected by a Monitor lock. This incurs some overhead: we must allocate nodes for the list (and pay the cost of the GC having to traverse the list each time a GC occurs), and we must pay the cost of acquiring that lock every time we enqueue or dequeue a work item.
(Another aside: please do not take the above to mean that we make any guarantees about strict FIFO ordering of work item execution. In fact, we violate this “rule” already: since .NET 3.5, the CLR thread pool has maintained separate FIFO queues for each AppDomain in the process, and an additional independent FIFO queue for “native” work items such as those queued by a host (ASP.net being the prime user of this feature). We round-robin between these work queues, allowing each to execute work for some time before moving on to the next.
This strategy is motivated by performance concerns, as it greatly reduces the number of transitions between AppDomains, which are fairly expensive. But it is designed to maintain fairness, which is the chief concern which we can never completely abandon. We may make further changes in the future which further deviate from the strict FIFO model, but we are unlikely to ever make QUWI really unfair, which, as we will see, is crucial to achieving good performance for fine-grained workloads.)
This was fine for the kinds of workloads for which the thread pool was originally designed. These are relatively "coarse" workloads, where each work item represents a fairly large amount of work. The canonical example is an ASP.net web application, where each work item represents the generation of an entire web page. In such workloads, the work itself takes long enough that the overhead of allocating queue nodes and acquiring locks is barely noticeable.
However, in the new world of machines with rapidly increasing core counts, there is increased interest in more "fine grained" work. Where before the job of the thread pool was to take a large number of independent, coarse-grained, tasks and funnel them onto a few threads, we are increasingly being asked to execute many very small tasks representing tiny pieces of some larger operation.
Anyone who has tried executing such a workload on the existing CLR thread pool has probably found that it's not a simple matter of calling QUWI for each piece of the calculation; with such tiny work items, the overhead of enqueuing and dequeuing the work can be much greater than the work itself, resulting in slower execution than if we had just run the work on a single thread to begin with! It is possible to make this work, by “batching” work into a smaller number of calls to QUWI. There are many strategies for this, all of which are fairly complex in the general case. We would like to make this easy, but the current QUWI is insufficient for this goal.
We can improve this situation in a couple of ways: we can implement a more efficient FIFO queue, and we can enhance the API to allow the user to give us more information, allowing us to turn to even more efficient queuing strategies. For CLR 4.0, we are doing both of these.
Faster FIFO
Recall that the overhead of the existing FIFO queue comes from the expense of allocating and traversing the data structure, and the cost of acquiring the lock on each enqueue and dequeue operation. For 4.0, we are switching to a lock-free data structure with much lower synchronization overhead. More importantly, this new queue is much friendlier to the GC; we still need to allocate a new object for each call to QUWI, but these objects are smaller, and are tracked in large “chunks” which are much easier for the GC to traverse than the simple linked list used previously. This new queue is virtually identical to System.Collections.Concurrent.ConcurrentQueue<T>, which is also new in 4.0.
Improving the performance of QUWI is nice, as it benefits existing applications which use the thread pool without requiring any changes to the application code. How much of a speedup you can expect will depend greatly on many factors, including your application’s workload and the details of the particular hardware on which it executes, but for fine-grained workloads on multi-core hardware the speedup should be significant.
However, we are still restricted in what we can do here – we still have very little information about the work we’re executing, and so we still need to use the same basic strategy to execute it. We can trim overhead here and there, but QUWI will probably never be a great way to execute very fine-grained workloads. We need a new API.
The Task Parallel Library
The Task Parallel Library (TPL) is a collection of new classes specifically designed to make it easier and more efficient to execute very fine-grained parallel workloads on modern hardware. TPL has been available separately as a CTP for some time now, and was included in the Visual Studio 2010 CTP, but in those releases it was built on its own dedicated work scheduler. For Beta 1 of CLR 4.0, the default scheduler for TPL will be the CLR thread pool, which allows TPL-style workloads to “play nice” with existing, QUWI-based code, and allows us to reuse much of the underlying technology in the thread pool - in particular, the thread-injection algorithm, which we will discuss in a future post.
I won’t discuss all of the details of the TPL API, which better covered by its authors. From the point of view of the performance of the thread pool, the important thing about TPL is that it is a much richer API than QUWI, giving the thread pool much more information about the work being executed. In particular, the new Task type exposes the notion of parent/child relationships, giving us some idea of the structure of the overall computation being performed by the individual work items. Having this information opens up possibilities for much more efficient execution of these tasks.
Even without parent/child relationships, Task is a major improvement over QUWI. QUWI returns nothing of use to the caller; it simply queues a delegate, and leaves it up to the implementation of that delegate to coordinate its activities with the rest of the application. QUWI provides no means of waiting for the completion of the work item, for handling exceptions, or getting the result of a computation. Task provides all of this in a very easy-to-use form, while adding very little overhead vs. QUWI.
The fact that Task has a Wait method is not just a convenience; it eliminates one of the most common problems people face when using QUWI. It is fairly common for one work item to need to wait for the execution of another work item to complete. If the second work item has not yet begun executing, it will be sitting in the queue waiting for a worker thread to pick it up. It is possible that there are no available worker threads – maybe they’re all waiting for other work items to complete! This can cause deadlock in the worst case, and very slow execution in the best, as the thread pool may be slow to add more worker threads to pick up these work items. Task.Wait, on the other hand, knows it’s waiting for another task, and is tightly integrated with the thread pool such that it is able to determine whether the task has started executing, and if not it executes it immediately, in-line on the current thread. This greatly improves performance and eliminates the possibility of deadlock in this situation.
For new code, Task is now the preferred way to queue work to the thread pool.
Top-level Tasks have no parent. These are Tasks created by non-thread-pool threads, or with certain options specified at Task-creation time. These tasks are queued to the same FIFO queue we use for QUWI, and thus benefit from the improvements we’ve made there – but they are also subject to the same limitations. Tasks queued in this way are simply a better QUWI – but now the fun starts: A parent task can create child tasks. This happens whenever a Task creates another Task (unless it overrides this behavior). These children are implicitly treated as sub-tasks of the larger task. We assume that sub-tasks can be executed in any order – fairness is not necessary – because all that matters is that the overall operation be completed as fast as possible. This lets us throw those FIFO restrictions out the window, and opens up the possibility for much more efficient work scheduling strategies.
Work Stealing
Since a child task is just a piece of a larger task, we don’t need to worry about execution order. We just need to execute these things quickly. One well-known strategy for fast execution of unordered work items is “work stealing.” Joe Duffy and Daniel Moth explain this very well; click on the links if you’re interested.
The most important aspect of work-stealing is that it enables very fast enqueue and dequeue in the typical case, often requiring no synchronization at all. This virtually eliminates a large part of the overhead of QUWI, when working with child tasks. We still do need to allocate memory for the Task itself, and for the work-stealing queue, but like the improvements to the FIFO queue these data structures have been optimized for good GC performance. Parent tasks are fast; child tasks are much faster.
There are still some limitations to how quickly tasks can be executed. If all tasks are top-level (non-child) tasks, they are subject to the FIFO ordering constraints of QUWI (albeit with much richer functionality). And even with work-stealing, we need to allocate and queue Task instances for every work item. To get even better performance, we need even more information about the work. Which brings us to…
Parallel.For and PLINQ
While not, strictly speaking, features of the CLR thread pool, the methods of the new Parallel class and PLINQ are critical new features of the public concurrency APIs in CLR 4.0. In fine-grained parallel applications, it is very common to need to execute the same code, over and over, with different data inputs. With QUWI or Task, this means allocating and queuing separate workitems for each input. The thread pool infrastructure does not know that all of these work items do the same thing, so it literally has to execute them, one at a time, as if they were completely different tasks.
Parallel.For, Parallel.ForEach, and PLINQ, provide a better way. These are essentially different ways of expressing the same thing: here is some code that needs to execute N times, as quickly as possible.
Just as with the parent/child relationships that Task provides, this extra information enables more aggressive optimization. These “parallel loops” do not need to be broken down into separate work items for each loop iteration. All that is needed is to break them into enough chunks (“partitions”) that they can be efficiently load-balanced across all available machine resources. A typical scenario might be that 1,000,000 iterations need to be broken into, say, four work items.
There is, of course, some overhead introduced by the need to dynamically partition the data (done automatically by the framework). But this pales in comparison to the savings of not having to allocate, queue, and dequeue millions (or more!) work items. In a test I just tried, for a particular work load on one of my machines, Parallel.For executed more than 300 times as fast as the equivalent naïve usage of QUWI.
Where do we go from here?
By now you’ve probably got the general theme: the more information we have about a workload, the faster we are likely to be able to execute it. I expect this theme to continue in future releases. The challenge is to find new ways to easily express (or automatically extract) useful information about parallel workloads, which can be used by the thread pool (or higher-level abstractions like PLINQ) to enable more optimizations. I think this is going to be an interesting space to watch for quite some time.
In the meantime, please try out the new mechanisms we are providing in Visual Studio 2010 Beta 1. Try it with the kinds of workloads you expect to use in production; one of the biggest challenges we face is that we can really only guess at what developers will do with this stuff in the real world, so feedback from our customers is extremely important to ensure that these new mechanisms will meet your needs in the final product.
Feel free to post any questions you may have in the comments for this post; I’ll try to answer what I can. My next post will cover the thread pool’s “thread injection” algorithm, which is how we determine how many threads should be servicing the various queues I’ve discussed here. If there’s something else you’d like me to cover, please let me know.
|
-
It's been a while since I've posted; for the past several months I've been working on deep changes in the CLR ThreadPool to support the Parallel Extensions to the .NET Framework.
I hope to say a few things about the changes we're making in the CLR to support this new programming model over the coming months. But for now I encourage everyone to watch Daniel Moth's excellent PDC presentation on this subject. He does a fantastic job of outlining what we're doing, and why it matters to you.
And a quick note about the CTP release of the CLR that was distributed at PDC: this does not include the new ThreadPool. The Parallel Extensions are running on their own task scheduler in this release. However, as Daniel mentions we do intend to ship CLR 4.0 with the Parallel Extensions deeply integrated with a much-improved CLR thread pool. Stay tuned!
|
-
A question recently came up on an internal discussion forum, which I'll paraphrase: The Windows QueueUserWorkItem API has an option to queue to an I/O thread. Why doesn't the managed ThreadPool.QueueUserWorkItem support this option?
First, some background:
In the Windows thread pool (the old one, not the new Vista thread pool), an "I/O thread" is one that processes APCs (Asynchronous Procedure Calls) queued by other threads, or by I/O initiated from the I/O threads. One example of an I/O functions that completes via APCs is ReadFileEx. The "non-I/O" threads get their work from a completion port, either as a result of QueueUserWorkItem, or I/O initiated on a handle bound to the threadpool with BinIoCompletionCallback. So they are both geared toward processing I/O completions, but they just use different mechanisms.
In the managed ThreadPool, we use the terms "worker thread" and "I/O thread." In our case, an I/O thread is one that waits on a completion port; i.e., it's exactly equivalent to Windows' non-I/O thread. How confusing! Our "worker threads" wait on a simple user-space work queue, and never enter an alertable state (unless user code does so), and so explicitly do not process APCs. Managed "worker threads" have no equivalent in the Windows thread pool, just as Windows "I/O threads" have no managed equivalent.
The managed QueueUserWorkItem queues work to the "worker threads" only. UnsafeQueueNativeOverlapped queues to the I/O threads, as do completions for asynchronous I/O performed on kernel objects that have been bound to the ThreadPool via BindHandle.
Why don't we support APCs as a completion mechanism? APCs are really not a good general-purpose completion mechanism for user code. Managing the reentrancy introduced by APCs is nearly impossible; any time you block on a lock, for example, some arbitrary I/O completion might take over your thread. It might try to acquire locks of its own, which may introduce lock ordering problems and thus deadlock. Preventing this requires meticulous design, and the ability to make sure that someone else's code will never run during your alertable wait, and vice-versa. This greatly limits the usefulness of APCs.
And APCs don't scale well, except in certain very constrained scenarios, because there's no load-balancing of completions across threads; all I/O initiated by a given thread always completes with an APC queued to that same thread. You can, of course, implement your own load balancing, by using the APC to notify another thread of the completion, but you'll never do better in user-space than the kernel does with completion ports. So we provide a rich async I/O infrastructure based on completion ports, and nothing else.
The real question to answer, then, is: why does Windows allow this option in the first place? I would guess that this is because there is a lot of unmanaged code out there that uses APCs, so the unmanaged thread pool needs to support APCs in order to support running that large body of code on the thread pool. This doesn't apply to managed code; when the managed thread pool was first introduced there was no managed code in existence, so there was no backward-compatibility requirement to support APCs.
|
-
This has been discussed fairly frequently on the Web.
Chris Brumme discusses this here: http://blogs.msdn.com/cbrumme/archive/2003/04/15/51351.aspx
Raymond Chen discusses this here: http://blogs.msdn.com/oldnewthing/archive/2004/12/31/344799.aspx.
There’s some discussion about the pros and cons of Fibers in SQL Server here: http://msdn.microsoft.com/en-us/library/aa175385.aspx
Those are just the first three relevant links from a simple search. But still, the subject comes up.
Most people think of threads vs. fibers in terms of the "expense" of each option. Fibers are often thought to be "cheaper" than threads (they're often referred to a "light-weight threads,") and are thought to perform better, or at least enable your app to do things to make it perform better. So what are the expenses involved here?
For most apps, the most important expense of threads is the physical address space consumed by the stack. Fibers don’t fix this; they carry just as much stack as a thread.
Another expense of threads is that there is overhead to preemptive scheduling. If you have more threads than CPUs, and they routinely use up all of their scheduling quanta, you have to pay the expense of the kernel periodically forcing a (possibly unnecessary) context switch. This is a fairly rare problem for real apps, which almost never use their full quanta. Usually an app has blocked on some kernel object, waiting for input from the user, the network, etc., long before the full quantum is used. And even if preemption does occur, it’s rarely the most significant expense an app incurs.
If preemptive context switches are your biggest problem, the only solution is to reduce the number of threads. Fibers don’t solve that problem. However, reducing the number of threads introduces a new problem: you’ve still got the same amount of work to do, and now it’s up to you to multiplex all that work onto just a few threads.
What fibers actually do (try to) solve is the problem of manually multiplexing lots of work onto a few threads. There are lots of ways to do this – asynchronous I/O being the prime example – but fibers try to make it really easy by letting you keep your stack, so you can just write code the way you’re used to.
Another common use of fibers is to implement coroutines. This is what Raymond Chen’s doing in his Enumerator example. The idea here, again, is to make switching contexts easier by allowing you to do everything on the stack. But here there’s no pretense of a performance benefit; it’s quite obvious to everyone that allocating two or more 1MB stacks, and switching between them every few instructions, is going to suck, perf-wise. It’s just (supposed to be) easier to write code this way than the alternatives (allocating everything on the heap, splitting your logic across lots of methods, etc.).
So really fibers aren’t about performance at all, but about making manual context-switching easier.
The downside of fibers is that they’re not a general-purpose mechanism like threads. Any code that depends on thread-local storage (which is almost all code that does anything) doesn’t work with fibers. By going with fibers, you’re taking a lot of matters into your own hands; this is obvious in the case of scheduling, but it’s not so obvious that you’re also taking responsibility for most of what your language’s runtime library does, or vast swaths of Win32, etc. You also can’t write managed code. And you’ll never be able to call out to 3rd party code.
So in exchange for easy context-switching, you get increased memory usage, and you lose the ability to use virtually any basic software building block ever written.
For something like Microsoft SQL Server, which is one of the most highly-optimized apps on the planet, and which pretty much lives in its own world where memory is “cheap” and library code is suspect anyway, it probably made sense to go with fibers to help reduce context-switch overhead without having to completely re-write SQL Server. For almost literally everyone else, it just doesn’t make sense.
And if all you want is coroutines, ask your favorite language vender to provide proper support for them. J C# has iterators, which beat the pants off of Raymond’s enumeration example; that’s a good start, but iterators are awkward for anything other than, well, enumerating things.
|
-
.NET 3.5 SP1 (A.K.A. CLR 2.0 SP2) is now available as a Beta release.
This beta release has the fix for the ThreadPool ramp-up issue discussed here and here. It also has some other improvements to the ThreadPool's thread creation algorithm that can dramatically improve performance in some situations.
The "thread injection" algorithm in the ThreadPool is a tricky beast. There's simply no perfect solution to the problem - so every change is a complex balancing act between improving performance for a specific customer scenario on the one hand, and not creating more problems for other customers on the other.
We've tried to make these changes as carefully as possible, running them through all the threadpool scenario tests we've accumulated over the years. But we can't possibly test every real-world workload - we don't even know about them all! - and we necessarily have to depend on things like beta releases to find the problems our tests will never find.
If you have code that uses the thread pool, I encourage you to download this beta and try it out. Feel free to leave a comment here about any problems you might find, or you can use the official discussion forum and feedback center for the beta.
And if you have any suggestions for future improvements, and especially if you have examples of workloads that the current ThreadPool does not handle well, please let us know.
I look forward to any feedback you might have.
|
-
Shortly after the release of CLR 2.0 SP1 (a.k.a. Orcas or .NET 3.5), several customers noticed some very odd behavior in the ThreadPool. The ThreadPool is supposed to create threads as fast as possible, up to the current setting for MinThreads - but it turns out that if you queue workitems very quickly (like in a tight loop that just calls QueueUserWorkitem), the pool will expand very slowly - we'll only create one thread every half second (which is the normal behavior once we've reached MinThreads).
Michael C. Kennedy's blog has sample code and some perfmon graphs demonstrating this behavior. He also posted our suggested workaround, which unfortunately boils down to "don't queue the work items so quickly."
We plan to fix this in CLR 2.0 SP2 (i.e., .NET 3.5 SP1). In the meantime, if you should encounter this, please try the workaround.
|
-
The managed ThreadPool provides a way to asynchronously wait for WaitHandles, via ThreadPool.RegisterWaitForSingleObject. This method returns a new instance of RegisteredWaitHandle, which has a single method: Unregister. It's obvious from the name that this will cancel a pending wait operation. What's not obvious is that even after a registered wait has completed, you should still call Unregister.
The reason is that RegisteredWaitHandle holds a reference to your WaitHandle, and holds a refcount on the underlying SafeHandle. If you don't call Unregister, then RegisteredWaitHandle will have to run its finalizer sometime later, before it can release the WaitHandle. Calling Unregister prevents the finalizer from running (which helps perf), and may get your handle closed faster.
RegisteredWaitHandle really, really should implement IDisposable.
|
-
In this article, Vance Morrison describes some of the issues involved in writing managed multithreaded code that avoids the use of locks. In particular, he discusses the impact of memory models on lock-free (or low-lock) programming. For managed code running on the CLR, there are two models that matter: - The ECMA CLI memory model
- The Microsoft CLR memory model
The ECMA model is very "weak," meaning that there are many possibilities for reordering of memory operations. The CLR model incorporates the ECMA model, but strengthens it with additional rules that prevent reorderings. According to Chris Brumme, the CLR memory model was strengthened because it was thought that a stronger model was easer to code to. So, when you're writing your own managed code, you should code to the strong CLR model, to make your life easier, right? I'm not sure. The process of writing correct lock-free code is inherently difficult. You have to think very carefully about every single memory operation in your algorithm, and how each might interact with operations on other threads. You must meticulously identify the required orderings. Then you must consult the rules of the underlying memory model to find a way to enforce each ordering requirement. In a weak memory model, such as the ECMA standard, most ordering requirements will not be automatically satisfied by the rules of the model; you will have to do something explicit (insert a memory barrier, mark a variable as volatile, etc) to enforce your requirements. In a strong model, many requirements will be automatically satisfied - but some will not, and you still need to think about each one and make a decision about what needs to be done (or not done). In my experience, programming to a strong model is not appreciably easier than a weak model. If anything, it's a bit harder, because there are so many more rules to keep in mind. Add to that the fact that the CLR model is not standardized, and is not portable to other CLI implementations (such as our own Compact Framework), and it's hard for me to see why anyone would want to target this model. So what is the value of the CLR memory model? The key is in the statement above, that "many requirements will be automatically satisfied." If, in the course of your analysis, you should happen to miss something (don't worry, it happens to everybody), you have a chance of being "saved" by the underlying strong memory model, and your code will run correctly. The CLR memory model acts as a sort of "safety net," just in case you make a mistake. In my opinion, this is a good thing to have. Lock-free/low-lock programming is so hard to get right, that it's good to know that the runtime may be able to save me from a mistake. My advice: code to the ECMA model for portability, and take comfort in the fact that you may be able to get away with some mistakes.
|
-
The .NET framework currently offers several synchronization primitives: - Monitors (via the Monitor class. C# users know this as the "lock" statement)
- Two varieties of reader/writer lock (ReaderWriterLock / ReaderWriterLockSlim)
- Events (ManualResetEvent/AutoResetEvent)
- Mutexes (for inter-process locking)
- Semaphores
- Static constructors (which synchronize initialization of a class)
We also provide some basic building blocks for rolling your own synchronization mechanisms, via the Interlocked class, volatile variables, and Thread members like MemoryBarrier. It's not uncommon to encounter a situation where the synchronization primitives we supply are not adequate. So all too often, people end up building their own (by far the most common case of this is the "double-checked locking" pattern). Building your own synchronization is time-consuming, and very difficult to get right. See my previous article for a simple example, or do a web search for double-checked locking. I think there's clearly a need for some more specialized synchronization primitives in the Framework. Some ideas on my mind are: - SpinLock (a super lightweight lock for very small critical sections)
- Some sort of lazy initialization primitive (see Joe Duffy's idea)
- Barriers
But I'd really like to hear from you. Have you found the existing primitives to be inadequate for your needs? Have you ever had to create your own? What could we add to the Framework to make your life easier?
|
-
Can the program below ever print “oops?”
#include <stdio.h>
#include <process.h>
struct Globals {
volatile int start;
int a;
int b;
volatile int end;
};
Globals globals;
void WriterThread(void*) {
int i = 0;
while (true) {
globals.start = i;
globals.a = i;
globals.b = i;
globals.end = i;
i++;
}
}
void ReaderThread(void*) {
while (true) {
int end = globals.end;
int a = globals.a;
int b = globals.b;
int start = globals.start;
if (start == end)
if (a != b)
printf("oops! %d != %d\n", a, b);
}
}
void main() {
_beginthread(WriterThread, 0, NULL);
_beginthread(ReaderThread, 0, NULL);
getchar();
}
Answer below....
Keep going....
The code does indeed print “oops” fairly reliably.
So what exactly is going on with this code? First, let’s see what it’s trying to do. Here’s the code again – I’ve annotated some important lines with “statement numbers” to make them easier to discuss.
#include <stdio.h>
#include <process.h>
struct Globals {
volatile int start;
int a;
int b;
volatile int end;
};
Globals globals;
void WriterThread(void*) {
int i = 0;
while (true) {
globals.start = i; //S1
globals.a = i; //S2
globals.b = i; //S3
globals.end = i; //S4
i++;
}
}
void ReaderThread(void*) {
while (true) {
int end = globals.end; //S5
int a = globals.a; //S6
int b = globals.b; //S7
int start = globals.start; //S8
if (start == end)
if (a != b)
printf("oops! %d != %d\n", a, b);
}
}
void main() {
_beginthread(WriterThread, 0, NULL);
_beginthread(ReaderThread, 0, NULL);
getchar();
}
We have one thread that is writing values to globals.a and globals.b, and another thread that is reading these values. The reader expects to see the values in a consistent state – which here means that they satisfy the invariant that a == b, but in real life would probably be much more complex. The simple way to do this, of course, would be to protect these variables with a lock – but presumably the author decided that this would not perform adequately, and so chose to go with a hand-rolled synchronization mechanism.
The idea is that the writer will increment “start” just before updating “a” and “b”, and then after making the updates it will increment “end” to match “start.” The reader starts by reading “end”, then goes ahead and reads “a” and “b”. Having done that, it reads “start”. If start does not equal the previous value of end, then that means the writer might have written to a or b while we were reading them, so we might have read these variables in an inconsistent state – so we throw out the values and try again.
Assuming everything happens as described, this should avoid the race that would otherwise exist between the reader and the writer.
(Yes, this can be foiled by the extremely unlikely case where the reader thread is interrupted for exactly 3^32 iterations of the writer. Let’s ignore that problem for now.)
Our algorithm requires that statement S1 take place before S2, S3, and S4. S4 must take place after S1, S2, and S3. The order of S2 and S3 does not matter. Similarly, S5 must take place before S6, S7, and S8; and S8 must take place after S5, S6, and S7. These sorts of temporal relationships are sometimes notated as “A->B”, meaning that A must precede B. So we can state all of the required temporal relationships in our algorithm as:
S1->S2
S1->S3
S1->S4
S2->S4
S3->S4
S5->S6
S5->S7
S5->S8
S6->S8
S7->S8
C++ generally permits a program to be executed in any order that would produce the same results as if it was executed in “program order,” on a single thread. In fact, the C++ standard says absolutely nothing about threading, and this is one practical consequence of that fact. So without any special treatment, none of our temporal relationships will be enforced. We might get lucky, or we might not. Guess which one is more likely.
The only “escape hatch” that C++ provides for controlling the order of execution is this thing called “volatile,” which our program does use on the “start” and “end” variables. So what does “volatile” do?
The C++ standard has little to say about volatile. It was originally intended for the very specific application of writing device drivers that communicate with hardware via memory-mapped I/O. To make that work, compilers must follow these rules:
-
A read or write to a volatile variable must not be eliminated by the compiler.
-
Reads and writes to volatile variables must not be reordered by the compiler, relative to one-another.
Note that this does not place any restrictions on non-volatile accesses, nor does it say anything about what the hardware is allowed to do. It is assumed that if you’re reading and writing memory-mapped I/O ports, then the hardware will know better than to reorder things.
Visual C++ has an augmented definition of volatile. It adds the following rules:
-
Reads and writes to any global memory (even if not volatile) must never be moved before a volatile read.
-
Reads and writes to any global memory must never be moved after a volatile write.
-
These rules will be enforced at the hardware level as well.
Back to the quiz problem: this code attempts to enforce its required temporal relationships by tagging the “start” and “end” variables with the “volatile” keyword. Given the VC++ rules for volatile variables, this enforces the following:
S1->S4
S2->S4
S3->S4
S5->S6
S5->S7
S5->S8
In other words, nothing may move after the volatile write in S4, and nothing may move before the volatile read in S5.
Notice that we are missing some of our required temporal relationships. This is because the volatile write in S1 does not affect anything occurring after it in program order, nor does the volatile read in S8 affect anything coming before it.
(Careful readers may notice that there are two additional temporal relationships that must be enforced for this algorithm to work. As we go around each loop, the last statement must come before the first one (i.e., a previous iteration of S4 must occur before a subsequent iteration of S1). I didn’t know how to notate this, so I left it out. :) Luckily, these relationships are satisfied by the original program, since S1, S4, S5, and S8 are all volatile accesses, which cannot be reordered relative to each other.)
So this program is not correct. But do we get lucky? Let’s look at how VC++ compiles this code for x86:
void WriterThread(void*) {
int i = 0;
00401000 xor eax,eax
00401002 jmp WriterThread+10h (401010h)
00401004 lea esp,[esp]
0040100B jmp WriterThread+10h (401010h)
0040100D lea ecx,[ecx]
while (true) {
globals.start = i; //S1
00401010 mov dword ptr [globals (403370h)],eax
globals.a = i; //S2
00401015 mov dword ptr [globals+4 (403374h)],eax
globals.b = i; //S3
0040101A mov dword ptr [globals+8 (403378h)],eax
globals.end = i; //S4
0040101F mov dword ptr [globals+0Ch (40337Ch)],eax
i++;
00401024 add eax,1
}
00401027 jmp WriterThread+10h (401010h)
void ReaderThread(void*) {
00401030 push esi
00401031 mov esi,dword ptr [__imp__printf (4020A0h)]
00401037 jmp ReaderThread+10h (401040h)
00401039 lea esp,[esp]
while (true) {
int end = globals.end; //S5
00401040 mov edx,dword ptr [globals+0Ch (40337Ch)]
int a = globals.a; //S6
int b = globals.b; //S7
int start = globals.start; //S8
if (start == end)
00401046 cmp dword ptr [globals (403370h)],edx
0040104C mov eax,dword ptr [globals+4 (403374h)]
00401051 mov ecx,dword ptr [globals+8 (403378h)]
00401057 jne ReaderThread+10h (401040h)
if (a != b)
00401059 cmp eax,ecx
0040105B je ReaderThread+10h (401040h)
printf("oops! %d != %d\n", a, b);
0040105D push ecx
0040105E push eax
0040105F push offset string "oops! %d != %d\n" (4020F4h)
00401064 call esi
00401066 add esp,0Ch
}
00401069 jmp ReaderThread+10h (401040h)
Notice that we did get lucky in WriterThread. The compiler did not reorder any of our global variable accesses. Another compiler, or a new version of this one, might decide to compile this code differently.
ReaderThread is a different story. S5 precedes S6, S7, and S8, as required by the semantics of volatile reads, but S8 has been moved to before S6 and S7!
When I run this program on my machine, it does in fact display “oops.”
Now here’s a surprise: on my machine it doesn’t print “oops” until approximately 5 million iterations have passed. Even though the code emitted by the compiler is clearly not a correct implementation of this synchronization algorithm, it still has only a 0.00002% chance of failing. Imagine this was just a small piece of a much larger system; what are the chances that anyone would find this bug in testing?
So how do we fix this? I know you all think I’m going to say, “use a lock.” And I am. Later. For now, let’s figure out how to make the original algorithm work.
Recall our required temporal relationships:
S1->S2
S1->S3
S1->S4
S2->S4
S3->S4
S5->S6
S5->S7
S5->S8
S6->S8
S7->S8
I’ve highlighted the ones that were not satisfied by our use of volatile. We need a mechanism to prevent S2 and S3 from moving before S1, and S6 and S7 from moving past S8. We can do this with explicit “memory barriers,” which are instructions to the compiler and the hardware to restrict reorderings in a specific way. There is nothing in the C++ standard that provides for memory barriers, but luckily most compilers do step outside of the standard and provide some sort of support for these.
In VC++, we have a function called MemoryBarrier() which implements a “full memory barrier,” meaning that nothing will be moved in either direction across a call to MemoryBarrier(). In our case, if we want to prevent S2 and S3 from moving before S1, we can insert a memory barrier after S1:
void WriterThread(void*) {
int i = 0;
while (true) {
globals.start = i; //S1
MemoryBarrier();
globals.a = i; //S2
globals.b = i; //S3
globals.end = i; //S4
i++;
}
}
Similarly, we need a memory barrier before S8:
void ReaderThread(void*) {
while (true) {
int end = globals.end; //S5
int a = globals.a; //S6
int b = globals.b; //S7
MemoryBarrier();
int start = globals.start; //S8
if (start == end)
if (a != b)
printf("oops! %d != %d\n", a, b);
}
}
Now all of our required temporal relationships are satisfied, and the program no longer prints “oops.”
Now to say the least that was an awful lot of mental effort to arrive at a working implementation of a very simple algorithm. You certainly wouldn’t want to have to do this kind of analysis for every line of your code! Hopefully the end product is worth it; it had better be much faster than it would have been with a lock, or we just wasted a lot of time (not to mention that in real-life code, we’d probably get it wrong even with all this analysis).
Attached you will find two versions of the original problem, one using the original custom synchronization mechanism, and the other implemented with a basic spin lock. I’ve added some simple timing code so we can see how fast each one goes.
(Please, please, please don’t take the spin lock implementation from the attached code and use it in production code. It is meant as an illustration only).
On my machine, with maximum contention (meaning each thread is running full-speed-ahead on its own CPU), the lock-free implementation does about 0.3 successful iterations of the reader per millisecond. The spin lock version does about 3500 iterations/ms. So in this scenario, all our hard work bought us a 12,000-fold decrease in performance.
Of course, this is unfair - the algorithm is clearly designed to perform best with little contention - i.e., the writer only writes every so often, while the reader reads very frequently.
We can simulate minimum contention by simply not running the writer thread. In this scenario, the lock-free reader does 38,000 iterations/ms, while the spin lock implementation does 33,000 iterations/ms. So the lock-free implementation is just 6% faster.
So was all this work worth a best-case perf improvement of 6%, with a risk of such horrible worst-case performance? And the very real risk that we might have gotten it wrong? Well, it depends on the situation. :) What I can say for sure is that this sort of code should not be entered into lightly. Getting it right requires the sort of rigorous, mind-numbing analysis we had to do here, but typically on a larger scale. And it usually has very little, if any, benefit.
|
-
Welcome to my new blog. I just moved to the Common Language Runtime team here in rainy Redmond, after spending nearly seven years in sunny California developing storage services for Hotmail (now Windows Live Mail). Luckily, I spent three years prior to that here in Redmond developing tests for Windows 2000, during which time I learned to love the rain. Hopefully my wife and kids will come around to my view soon. :)
I plan to use this blog to discuss the usual interesting details of our system that I come across in the course of my work here, and that I think might be valuable to the developer community at large, but for now I probably don't know much more than you do about the inner workings of the CLR. I'd also like to use this forum to discuss some of the philosophy of software development, and how it relates to the CLR, programming languages, and you.
Enjoy, and please let me know if there's anything else you'd like to see here.
|
|
|
|