Parallel Scalability Isn’t Child’s Play, Part 3: The Problem with Fine-Grained Parallelism

In the last blog entry in this series, I introduced the model for parallel program scalability proposed by Neil Gunther, which I praised as a realistic antidote to more optimistic, but better known, formulas. Gunther’s model adds a new parameter to the more familiar Amdahl’s law. The additional parameter k, representing coherence-related delays, enables Gunther’s formula to model behavior where the performance of a parallel program actually degrades as the level of parallelism increases. Although I am not sure that the coherence delay factor Gunther’s formula adds fully addresses the range and depth of the performance issues surrounding fine-grained parallelism, it is certainly a key factor that Gunther’s law captures and earlier formulations do not.
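For reference, Gunther’s Universal Scalability Law is usually written (using k for the coherence term, as in the previous post) as:

C(N) = N / (1 + σ(N − 1) + k·N(N − 1))

where N is the number of processors or threads, σ is the contention (serialization) parameter that Amdahl’s law also captures, and k is the coherence-delay parameter. With k > 0, throughput C(N) eventually peaks and then declines as N continues to grow, which is precisely the retrograde behavior that Amdahl’s law cannot express.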

Developers experienced in building parallel programs recognize that Gunther’s formula echoes an inconvenient truth, namely, that the task of achieving performance gains using parallel programming techniques is often quite arduous. For example, in a recent blog entry entitled “When to Say No to Parallelism,” Sanjiv Shah, a colleague at Intel, expressed similar sentiments. One very good piece of advice Sanjiv gives is that you should not even be thinking about parallelism until you have an efficient single-threaded version of your program debugged and running.

Let’s continue, for a moment, in the same vein as “When to Say No to Parallelism.” Let’s look at the major sources of coherence-related delays in various kinds of parallel programs, how and why they occur, and what, if anything, can be done about them. Ultimately, I will try to tie this discussion into one about tools, especially some great new tools in Visual Studio Team System (see Hazim Shafi’s blog for details) that help you understand contention in your multi-threaded apps. When you use these new tools to gather and analyze the thread contention data for your app, it helps when you understand some of the common patterns to look for.

The first aspect of the coherence delays Gunther is concerned with that we will look at is the bare minimum additional cost that a parallel program running multiple threads must pay, compared to running the same program single-threaded. To simplify matters, we will look at the best possible prospect for parallel programming: an algorithm that is both embarrassingly parallel and easy to partition into roughly equal-sized subprogram chunks. There are two basic costs to consider: one that is paid up front to initialize the parallel runtime environment, and one that is paid incrementally each time one of the parallel tasks executes. It is also worth noting that these costs are unavoidable. I will lump both into an overhead category associated with Gunther’s coherence delay factor k. Even the embarrassingly parallel programs we consider here incur these minimum processor overhead penalties when they are transformed to execute in parallel.

Fine-grained parallelism.

To frame this part of the discussion, let’s also consider the common characterization of workloads, based on their amenability to parallelization, as either fine-grained or coarse-grained. This distinction implicitly recognizes the impact of coherency delay factors on scalability. With fine-grained parallelism, the overhead of setting up the parallel runtime and executing the tasks in parallel can easily exceed the benefits. By definition, the initialization overhead of spinning up multiple threads and dispatching them is not nearly so big an issue when the program lends itself to coarse-grained parallelism. Moreover, when you are looking at a very long-running process with many opportunities to exploit parallelism, keep in mind that you should only have to incur the setup cost once.

Coarse-grained parallelism occurs when each parallel worker thread is assigned to computing a relatively long running function:


Figure 5. Coarse-grained parallelism.

while fine-grained parallelism looks more like this:


Figure 6. Fine-grained parallelism.

The difference, of course, lies in how long, relatively speaking, the worker thread processing the unit of work executes.

The rationale for the fine-grained/coarse-grained distinction is its significance for performance. We classify as coarse-grained those parallel algorithms whose worker threads execute long enough to recover the costs of setting up and running the parallel environment. The benefits of running such programs in parallel are much easier to realize. On the other hand, the one foolproof way to identify fine-grained parallelism is to find embarrassingly parallel programs with very short execution time spans for each parallel task. When executing fine-grained parallel programs, there is a very high risk of slowing down the performance of the program instead of improving it. (If this sounds like a bit of circular reasoning, it most surely is.)

Now let’s drill into these costs. In the .NET Framework, setting up a parallel execution environment is usually associated with the ThreadPool object. (If you are not very familiar with how to set up and use a ThreadPool in .NET, you might want to read this bit of documentation first, which shows some simple C# examples. If you really want to understand all the ins and outs of the .NET ThreadPool, you should consider picking up a copy of Joe Duffy’s very thorough and authoritative book, Concurrent Programming on Windows.)

The thing about using the ThreadPool object in .NET is that you don’t need to write a lot of code on your own to get it up and running. With very little coding effort, you can be running a parallel program. In the .NET Framework, there are newer programming constructs in the Task Parallel Library that are designed to make it easier for developers to express parallelism and exploit multi-core and many-core computers. But underneath the covers, many of the new TPL constructs are still using the CLR ThreadPool. So whatever overheads are associated with this older, less elegant approach still apply to any of the newer parallel programming constructs.

The ThreadPool in .NET

The basic pattern for using the ThreadPool is to call the QueueUserWorkItem() method on the ThreadPool class, passing a delegate that performs the actual processing of the request, along with a set of parameters that are wrapped into a single state Object. The parameters delineate the unit of work that is being requested. Typically, you also pass the delegate a ManualResetEvent, a synchronization primitive. The delegate uses this event to signal the main thread that the worker thread has finished processing the work item request.

Structurally, you have to write:

  1. a class that wraps the Work Item parameter list,
  2. a C# delegate that runs in the worker thread to process the Work Item request,
  3. a dispatcher routine that queues work items for the thread pool delegate to process, and
  4. finally, an event handler to get control when the worker thread completes.

For example, to implement the Fork/Join pattern in .NET using the built-in ThreadPool Object, create (1) a wrapper for the parameter list:

public class WorkerThreadParms
{
    private ManualResetEvent _thisevent;

    public ManualResetEvent thisevent
    {
        get { return _thisevent; }
    }

    public WorkerThreadParms(ManualResetEvent signalwhendone, …)
    {
        _thisevent = signalwhendone;
        …
    }
}

(2) the ThreadPool delegate that unwraps the parameter list, performs some work, and then signals the main thread when it is done:

public void ThreadPoolDelegate(Object parm)
{
    WorkerThreadParms p = (WorkerThreadParms) parm;
    ManualResetEvent signal = p.thisevent;

    … // Do some work here …

    signal.Set(); // Signal the main task when done
}

(3) a (simple) dispatcher loop:

ManualResetEvent[] thisevent = new ManualResetEvent[tasks];

for (int j = 0; j < tasks; j++)
{
    thisevent[j] = new ManualResetEvent(false);
    WorkerThreadParms p = new WorkerThreadParms(thisevent[j], …);
    WorkerThread worker = new WorkerThread();
    ThreadPool.QueueUserWorkItem(worker.ThreadPoolDelegate, p);
}

followed by (4) a wait in the Main thread for all the worker events to be signaled (the managed equivalent of Win32’s WaitForMultipleObjects):

ManualResetEvent.WaitAll(thisevent);

You can see there isn’t very much code for you to write to get up and running in parallel and start taking advantage of all those multi-core processor resources. You should take Sanjiv Shah’s advice and write and debug the delegate code you intend to parallelize by testing it in single-threaded mode first. Once the single-threaded program is debugged and optimized, you can easily restructure it to run in parallel by encapsulating that processing in your worker thread delegate, following this simple recipe.

Even though there isn’t very much code for you to write, there are overhead considerations that you need to be aware of. Let’s look at the simplest case where the program is embarrassingly parallel (as discussed in the previous blog entry). This allows us to ignore complications introduced by serialization and locking (which we will get to later). These overheads include

(1) work done in the Common Language Runtime (CLR) on your behalf to spin up the worker threads in the ThreadPool initially, and

(2) the additional cost when running the parallel program associated with queuing work items, dispatching them to a thread pool thread to process, and signaling the main dispatcher thread when done.

In the case of the initialization work, this is something that only needs to be done once. The other costs accrue each time you need to dispatch a worker thread. For parallelism to be an effective scaling strategy, it is necessary to amortize this overhead cost over the life of the parallel threads. The worker thread delegates need to execute long enough that there is a net benefit to executing in parallel. And this is the best case for parallelism, where the underlying program is both embarrassingly parallel and easy to partition into roughly equivalent work requests.
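To get a feel for these costs on your own hardware, you can time a trivially small work item executed serially versus dispatched through the ThreadPool. What follows is a rough, illustrative sketch of my own (not a proper benchmark): the iteration count and the trivial Math.Sqrt body are arbitrary choices, and the exact numbers will vary by machine. It uses a single shared counter and one ManualResetEvent, rather than one event per work item, just to keep the sketch short. On most hardware the ThreadPool version of this loop runs slower than the serial one, which is exactly the fine-grained overhead being described.

using System;
using System.Diagnostics;
using System.Threading;

class OverheadSketch
{
    static void Main()
    {
        const int tasks = 100000;   // arbitrary; large enough to make the overhead visible
        double sum = 0;

        // Serial baseline: the "work" per item is deliberately tiny.
        Stopwatch sw = Stopwatch.StartNew();
        for (int j = 0; j < tasks; j++)
            sum += Math.Sqrt(j);
        sw.Stop();
        Console.WriteLine("Serial:     {0} ms (sum = {1})", sw.ElapsedMilliseconds, sum);

        // The same tiny work items dispatched through the ThreadPool.
        int remaining = tasks;
        ManualResetEvent allDone = new ManualResetEvent(false);
        sw = Stopwatch.StartNew();
        for (int j = 0; j < tasks; j++)
        {
            int i = j;   // capture the loop variable for the closure
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Math.Sqrt(i);   // stand-in for a tiny unit of work
                if (Interlocked.Decrement(ref remaining) == 0)
                    allDone.Set();   // last work item signals the main thread
            });
        }
        allDone.WaitOne();   // fork/join: wait for every work item to finish
        sw.Stop();
        Console.WriteLine("ThreadPool: {0} ms", sw.ElapsedMilliseconds);
    }
}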

A Parallel.For example.

Using the new Parallel.For construct in the Task Parallel Library, by the way, there is even less code you have to write. All you need to write is the worker thread delegate, because the TPL handles the remaining boilerplate. The underlying overhead considerations, however, are almost identical.
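For instance, a minimal sketch of the same fork/join structure expressed with Parallel.For might look like the following, where DoWork is a hypothetical stand-in for whatever your worker thread delegate actually does:

// requires System.Threading.Tasks
// A minimal sketch of the fork/join pattern using Parallel.For.
Parallel.For(0, tasks, j =>
{
    DoWork(j);   // body runs on ThreadPool threads; Parallel.For blocks until all iterations complete
});

The events, the dispatcher loop, and the WaitAll call all disappear, but the work items are still being queued to and dispatched from the ThreadPool underneath.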

As it becomes easier for developers to write concurrent programs, the challenge of speeding up programs that exhibit fine-grained parallelism only grows. Take a look at the following example of parallelization in C# using the new Parallel.ForEach construct, taken verbatim from a Microsoft white paper entitled “Taking Parallelism Mainstream.” The white paper describes some of the new language facilities in the Task Parallel Library that make it easier for developers to write parallel programs. These new language facilities include Parallel For loops, Parallel LINQ, Parallel Invoke, Futures and Continuations, and messaging using asynchronous agents. To the extent that these parallel computing initiatives succeed, they will generate the need for better performance analysis tools to understand the performance of concurrent programs, because not everyone who implements these new constructs is going to see impressive speed-up of his or her applications. Some developers will even see the retrograde performance predicted by Gunther’s scalability formula.

Here’s the C# program that illustrates one of the new parallel programming constructs:

IEnumerable<StockQuote> Query(IEnumerable<StockQuote> stocks) {
    var results = new ConcurrentBag<StockQuote>();
    Parallel.ForEach(stocks, stock => {
        if (stock.MarketCap > 100000000000.0 &&
            stock.ChangePct < 0.025 &&
            stock.Volume > 1.05 * stock.VolumeMavg3M) {
            results.Add(stock);
        }
    });
    return results;
}

The example uses a Parallel.ForEach enumeration loop, along with one of the new concurrent collection classes, the ConcurrentBag, to evaluate stock prices based on some set of selection criteria. The problem, the white paper author notes, is one that is considered “embarrassingly parallel” because it is easily decomposed into independent sub-problems that can be executed concurrently. The new C# Task Parallel Library language features provide an elegant way to express this parallelism. Underneath the Parallel.ForEach expression is a run-time library that understands how to partition the body of the parallel loop into multiple work items and dispatch them to separate worker threads that are then scheduled to execute concurrently.

The .NET Task Parallel Library (TPL) provides the run-time machinery to turn this program into a parallel program. At run-time, it automatically parallelizes the lambda expression that the Parallel.ForEach construct references. In this example, the Task Parallel Library takes the lambda expression containing the if statement and queues it up to run in parallel using the concurrent runtime library in .NET. The concurrent runtime library creates a thread pool and then delegates the processing of the lambda expression to these worker threads. The concurrent runtime attempts to allocate and schedule an optimal number of worker threads for this parallel task.
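For comparison, the straightforward serial version of the same query (my own rewrite, not taken from the white paper) would look something like this:

IEnumerable<StockQuote> QuerySerial(IEnumerable<StockQuote> stocks) {
    var results = new List<StockQuote>();
    foreach (var stock in stocks) {
        // same selection criteria as the parallel version
        if (stock.MarketCap > 100000000000.0 &&
            stock.ChangePct < 0.025 &&
            stock.Volume > 1.05 * stock.VolumeMavg3M) {
            results.Add(stock);
        }
    }
    return results;
}

The serial version needs no concurrent collection and pays no work item queuing or dispatching costs; those are precisely the overheads the parallel version has to recover before it can show any speed-up.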

While TPL makes it easier to express parallelism in your programs and eliminates most of the grunt work in setting up the runtime environment associated with parallel threads, it cannot guarantee that running this program in parallel on a machine with four or eight processors will actually speed up its execution time. That is the essence of the challenge of fine-grained parallelism. There is overhead associated with queuing work items to the thread pool the parallel run-time manages, overhead that the serial version of the program does not encounter. Only when the work done inside the lambda expression executes for a long enough time does the benefit of parallelizing it exceed this cost, which must be amortized over each concurrent execution of the inner body of the Parallel.ForEach loop. Note that this particular set of overheads is unavoidable. When you are dealing with fine-grained parallelism, the overhead of setting up the parallel run-time environment alone often exceeds the potential benefit of executing in parallel, notwithstanding other possible sources of contention-related delays that could further slow down execution time.

So, it is important to realize that a code sample like the one I have taken here from the parallel programming white paper was chosen to illustrate the expressive power of the new language constructs. The example was intended to show the pattern that developers should adhere to; I am certain it wasn't intended to illustrate something specific you would actually do. When you take advantage of these new parallel programming language extensions, you need to be aware of the fine-grained/coarse-grained distinction. This is emphatically not an example of a program that you will necessarily speed up by running it in parallel. It will take considerably more analysis to figure out whether parallelism is the right solution here. Speeding up a serial program by running portions of it in parallel isn’t always easy, even when that program has sections that are “embarrassingly parallel.”

What we’d like to be able to do is estimate the actual performance improvement we can expect from this parallel program, compared to its original serial version. “Embarrassingly parallel” is another way of saying that, once parallelized, the serial portion of this program is expected to be quite small. However, this is also an example of fine-grained parallelism, because the amount of code associated with the lambda expression that is passed to the worker thread delegate to execute is also quite small. An experienced developer at this point should be asking whether the overhead of creating these worker threads and dispatching them might, in fact, be greater than the benefit of executing the task in parallel. This relatively fixed overhead is especially important when the amount of work performed by each delegate is quite small, because there is only a limited opportunity to amortize the cost of initiating and managing the multithreaded operation across the execution time of each of the worker threads. It is extremely important to understand this in the case of fine-grained parallelism.
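One rough way to frame that question is a simple back-of-envelope model (my own sketch, not Gunther’s formula). Suppose the loop processes n items, each item does t_work of useful computation, each work item dispatch adds t_dispatch of overhead, setting up the parallel runtime costs T_setup once, and p worker threads run concurrently. Then, very roughly:

T_serial ≈ n × t_work

T_parallel ≈ T_setup + (n / p) × (t_work + t_dispatch)

Parallelism only pays off when T_parallel < T_serial. For fine-grained work, t_work is small relative to t_dispatch, so the overhead term dominates and the parallel version can easily run slower than the serial one.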

Next, we will turn to coarse-grained parallelism, where the odds are much better that you may be able to speed up program execution substantially using concurrent programming techniques. In the next blog entry, I will start to tackle more promising examples. The analysis of the performance costs associated with parallel programming tasks will become more complex. I will try to illustrate this analysis using a concrete programming example that will simulate coarse-grained parallelism. As we look at how this simple programming example scales on a multi-core machine, it will bring us face-to-face with the pitfalls even experienced developers can expect to encounter when they attempt to parallelize their existing serial applications.

We will also start to look into the analysis tools that are available in the next version of the Visual Studio Profiler, which greatly help with understanding the performance of your .NET parallel programs. In the meantime, if you'd like to get a head start on these new tools in the Visual Studio Profiler, be sure to check out Hazim Shafi’s blog for more details.