Achieving Speedups with Small Parallel Loop Bodies

The Parallel class represents a significant advancement in parallelizing managed loops.  For many common scenarios, it just works, resulting in terrific speedups.  However, while ideally Parallel.For could be all things to all people, such things rarely work out, and we’ve had to prioritize certain scenarios over others.

One area where Parallel.For may fall a bit short is when it's used with very small loop bodies, such as:

int[] array = new int[100000000];
Parallel.For(0, array.Length, i =>
{
    array[i] = i * i * i;
});

Such an operation should be readily parallelizable; after all, every iteration is completely independent of every other iteration, and we’re dealing with a big data parallel problem.  The devil is in the details, however.  To handle typical scenarios where the time it takes to complete each iteration ends up being non-uniform, Parallel.For’s implementation takes a lot of care to load balance across all threads participating in the loop, and that load balancing comes at a small performance cost.  This cost is typically trumped by the performance benefits that come from doing the load balancing; however, when the body of the loop is as tiny as it is in the example above, even small overheads add up.  Another contributing overhead is the delegate invocation required to invoke the loop body.  It can be easy to forget when looking at a Parallel.For call that Parallel.For is really just a method, accepting as a parameter a delegate to be invoked for every iteration.  That invocation isn’t free, and in the above case may even be more expensive than the body of the loop itself.
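
To see the effect concretely, a quick timing comparison can be illustrative.  The sketch below is mine rather than part of the original samples, and the exact numbers will vary by machine; it simply contrasts the sequential loop with the naive Parallel.For above:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class TinyBodyTiming
{
    static void Main()
    {
        int[] array = new int[100000000];

        // Sequential baseline: the loop body is just a few instructions.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < array.Length; i++) array[i] = i * i * i;
        Console.WriteLine("Sequential:   {0}", sw.Elapsed);

        // Naive Parallel.For: a delegate invocation per iteration, plus
        // load-balancing overhead, can swamp such a tiny body.
        sw = Stopwatch.StartNew();
        Parallel.For(0, array.Length, i => { array[i] = i * i * i; });
        Console.WriteLine("Parallel.For: {0}", sw.Elapsed);
    }
}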

Fear not, however: there are still ways to achieve good speedups in such cases.  One is to create larger chunks of work for Parallel to operate on; as the chunk size increases, the overhead costs start to pale in comparison, and speedups are realized.

Consider a new ForRange method you could implement:

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body);

Unlike For, which invokes the body once per iteration, ForRange invokes the body once per range, passing the inclusive start and exclusive end of that range.  Thus, given an initial sequential loop like the following:

for (int i = 0; i < N; i++)
{
    DoWork(i);
}

with For it would be parallelized by replacing the for loop with a Parallel.For:

Parallel.For(0, N, i =>
{
    DoWork(i);
});

and with ForRange, it would be parallelized by wrapping the for loop with a ForRange:

ForRange(0, N, (from, to) =>
{
    for (int i = from; i < to; i++)
    {
        DoWork(i);
    }
});

There are several ways we can now implement ForRange.  The first is simply by doing a little math.  We can calculate the boundaries of each range and use a Parallel.For to run the user-supplied body action for each range, e.g.

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body)
{
    int numberOfRanges = Environment.ProcessorCount;
    int range = toExclusive - fromInclusive;
    int stride = range / numberOfRanges;
    if (range <= 0) numberOfRanges = 0; // empty loop: run zero ranges
    else if (stride == 0) { stride = 1; numberOfRanges = range; } // fewer iterations than cores
    return Parallel.For(0, numberOfRanges, i => {
        // Offset each range by fromInclusive; the last range absorbs any remainder.
        int start = fromInclusive + (i * stride);
        int end = (i == numberOfRanges - 1) ? toExclusive : start + stride;
        body(start, end);
    });
}
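
With that in place, the opening example can be rewritten so that the delegate-invocation and load-balancing costs are paid once per range rather than once per element (a sketch, reusing the array example from the top of the post):

int[] array = new int[100000000];
ForRange(0, array.Length, (from, to) =>
{
    // Each invocation of this delegate processes an entire chunk.
    for (int i = from; i < to; i++)
    {
        array[i] = i * i * i;
    }
});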

Another way is to use Parallel.ForEach under the covers.  Rather than doing the math as was done above, we can write an iterator in C# that yields the ranges, and then use Parallel.ForEach over those ranges, e.g.

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body)
{
    // Aim for roughly one range per core, with at least one element per range.
    int rangeSize = (toExclusive - fromInclusive) / Environment.ProcessorCount;
    if (rangeSize == 0) rangeSize = 1;
    return Parallel.ForEach(
        CreateRanges(fromInclusive, toExclusive, rangeSize), range =>
    {
        body(range.Item1, range.Item2);
    });
}

private static IEnumerable<Tuple<int,int>> CreateRanges(
    int fromInclusive, int toExclusive, int rangeSize)
{
    // Enumerate all of the ranges
    for (int i = fromInclusive; i < toExclusive; i += rangeSize)
    {
        int from = i, to = i + rangeSize;
        if (to > toExclusive) to = toExclusive;
        yield return Tuple.Create(from, to);
    }
}
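
For intuition into how the iterator carves up the iteration space, here's what it yields for a small input (a quick illustration of my own; the values follow directly from the code above):

// CreateRanges(0, 10, 3) yields (0,3), (3,6), (6,9), (9,10)
foreach (var range in CreateRanges(0, 10, 3))
{
    Console.WriteLine("({0},{1})", range.Item1, range.Item2);
}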

(You can download an implementation of ForRange as part of the Beta 1 samples at http://code.msdn.microsoft.com/ParExtSamples.)

In general, we expect the design and implementation of Parallel.For will be right for the vast majority of scenarios.  However, solutions like those above can be used to accommodate cases that don’t quite fit the mold.

Comments

  • Thanks! Just the guidance I was looking for. I implemented something similar to the first example, but I like the elegance of the second solution.

    Any news on when we might see the TPL in Silverlight? Working on an interactive computational tool for light simulation in Biomedical Optics...would love to use all those procs!

  • Using foreach will be faster than for, because when using for, the system will check the range automatically.

  • Hi David-

    Thanks for the enthusiasm and feedback.  Regarding Silverlight, no news, but thanks for the scenario and it's great to know you're interested.

  • Hi Stephen,

    Thanks for letting me know re: Silverlight. Could you say whether there's anything truly "missing" in the CoreCLR to port a library like the June CTP, or is it just a productization issue?

    Thanks,

    David

  • Hi David-

    There's nothing I know of inherent to Silverlight or CoreCLR that would prevent a library similar to TPL from working.  Building TPL purely in transparent code would require some changes and would likely suffer from a performance perspective (to what extent I can't say), but the bulk of the public API surface can surely be made to work.

  • That's great to hear, thanks for following up! FWIW, the types of parallelization we use PFX for on the desktop (that we'd like to port to our SL tool) are very granular parameter sweeps, so any performance hit for coordination/task-stealing/etc. would be minimal for us. We wait in patient anticipation. ;) In the meantime, we have the ThreadPool, PowerThreading, and Joe Duffy's masterpiece.

  • I notice that many examples (and I know that they're for the purposes of example) use the ProcessorCount as a guide to division of work.

    However, in a highly parallel environment such as those we hope to be building as standard practice in the coming years or in ASP or WCF services today, the code may already be executing down a "branch" of concurrent work, so it may not be desirable to parallelize a task further.

    Has the team thought about a supervising component that can be 'asked' how many concurrent tasks are executing and whether to split?

    Where I have recursive calls to a parallel method I use a static field to keep a count of DOP; work may continue on the current thread and then if some other work completes the next iteration takes up the slack by going parallel again.

  • Hi Luke-

    Thanks for the note.  Yes, we have thought about it, and in fact that's similar in principle to how Parallel.For/ForEach behave.  They don't immediately partition the work into a fixed number of partitions.  Rather, they scale up dynamically as resources become available; if all of the other threads in the ThreadPool are busy doing other work, Parallel.For/ForEach will in effect execute sequentially, until such time as other threads are available, at which point they'll increase their degree of parallelism appropriately. (There are a few issues with this behavior in Beta 1, but they're being addressed.)

    As you mention, this is important for environments like ASP.NET.  You should be able to implement a parallelized component and use that component in ASP.NET.  At peak times when your servers are saturated with requests, the loops will likely run sequentially.  But at off-hours when there are more CPU resources to go around, the loops will be able to increase their degree of parallelism, helping to decrease the latency for individual requests.

    Now, there may still be times when you want to employ a counter like what you've described.  As much as we try to keep the cost down, there is some overhead to using Parallel.* and Tasks, so it's possible on a scenario-by-scenario basis you could avoid some of this cost by using interlocks to keep track of a work count and only going parallel if that current work count is below some threshold, as sketched below.
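
(A minimal sketch of the interlocked work-count idea described above; the ThrottledParallel name and the MaxConcurrentLoops threshold are invented for illustration, and a real component would tune them:)

using System;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledParallel
{
    // Invented threshold: beyond this many concurrent parallel regions,
    // fall back to sequential execution.
    private const int MaxConcurrentLoops = 4;
    private static int _activeLoops;

    public static void For(int fromInclusive, int toExclusive, Action<int> body)
    {
        // Only go parallel if the current work count is below the threshold.
        if (Interlocked.Increment(ref _activeLoops) <= MaxConcurrentLoops)
        {
            try { Parallel.For(fromInclusive, toExclusive, body); }
            finally { Interlocked.Decrement(ref _activeLoops); }
        }
        else
        {
            Interlocked.Decrement(ref _activeLoops);
            for (int i = fromInclusive; i < toExclusive; i++) body(i); // run sequentially
        }
    }
}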

  • Thanks very much for the prompt response, Stephen.

    From what you say, I now understand that I can code with Parallel.* and not have to think about this issue.

    I'm considering the view of a developer that knows not of her code's 'position' within a call stack, or a component that may end up being called recursively (even if indirectly via side-effects).

    So last question before lunch:

    Would your guidance be to base sync/async decisions on the available ThreadPool threads or is there a TPL way to just keep on banging out new tasks safe in the knowledge that the DOP is being managed by people much smarter than me :) ?

    (I'm sorry if I've drifted away from topic, but I hope I'm not the only one who worries about writing something that if used unexpectedly, explodes into zillions of queued items) --Luke

  • Hi Luke-

    In an ideal world, I'd like to say that Task and Parallel.For/ForEach have no overhead so you can use them without concern for that.  Unfortunately, parallelization does incur overheads, and while we work hard to keep those overheads minimal, they still exist.  Thus, there are several facets to an answer to your question.

    We have explicitly designed our APIs with composition in mind, such that if you for example create nested Parallel.For calls, they'll play nicely together.  It doesn't matter whether they're both in your assembly, or whether your assembly has a Parallel.For that calls into another assembly that has a Parallel.For, they should still play nicely together.  So, if you have an individual safe loop that has enough work to warrant parallelizing it, by all means, parallelize it.  Others can consume your library in their code parallelized with other TPL constructs, and they'll play together nicely.

    Now, a key statement in my previous paragraph was "loop that has enough work to warrant parallelizing it".  It's often the case that we want to use Tasks, for example, to parallelize a large operation, such as a recursive divide-and-conquer decomposition, e.g. quicksort.  Using Tasks towards the root of the decomposition is likely a good thing to do, as the work represented by those tasks will be significant.  If, however, you continue to divide-and-conquer, using tasks for every level of the recursion, eventually you'll end up sorting just one or two elements in the body of a Task, which is extremely little work when compared to the cost of a Task (which, as I mentioned, while small, isn't negligible).  Thus, in order to get the best speedups, you do want to consider switching over to a sequential implementation at some point in the recursion, similar to how in a recursive sequential implementation of quicksort, at some point in the recursion you likely switch over to a different algorithm, like insertion sort, for better overall performance.  At what point you switch over to sequential in such a problem really needs to be based on the details of the problem and on performance testing.  If, however, each of the Tasks you're spinning up contains enough work in and of itself to justify the costs of that Task, then by all means keep on "banging out new tasks", as that will enable your code to scale well on larger and larger machines.  (A sketch of such a sequential cutoff follows this comment.)

    I hope that helps.
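
(To make the sequential-cutoff idea concrete, here is a minimal sketch of a parallel quicksort; the Threshold value and the Partition helper are invented for illustration, and in practice the cutoff should be chosen by performance testing:)

using System;
using System.Threading.Tasks;

static class ParallelQuickSort
{
    // Invented cutoff: below this size, spinning up Tasks costs more than it saves.
    private const int Threshold = 2048;

    // Sorts items[left, right) in place.
    public static void Sort(int[] items, int left, int right)
    {
        if (right - left <= Threshold)
        {
            Array.Sort(items, left, right - left); // sequential below the cutoff
            return;
        }
        int pivot = Partition(items, left, right);
        // Near the root, each half is large enough to justify parallelism.
        Parallel.Invoke(
            () => Sort(items, left, pivot),
            () => Sort(items, pivot + 1, right));
    }

    // Simple Lomuto partition around the last element; returns the pivot's final index.
    private static int Partition(int[] items, int left, int right)
    {
        int pivot = items[right - 1], store = left;
        for (int i = left; i < right - 1; i++)
        {
            if (items[i] < pivot)
            {
                int tmp = items[i]; items[i] = items[store]; items[store] = tmp;
                store++;
            }
        }
        int t = items[right - 1]; items[right - 1] = items[store]; items[store] = t;
        return store;
    }
}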

  • Thanks again. You're remarkably perceptive, because I am indeed working on a Conqueror<T> class which is designed to take a lambda to split the sides of a collection. It then, depending upon DOP and a 'work size' threshold, works on one side of the results (using anon methods or recursive calls) on a new thread while the calling thread falls through to execute the other 'side' delegate.

    It got me thinking about how even MS approaches this puzzle, considering that even within the AppDomain some code may call into, say, the Robotics SDK with its CCR API (if both look to the same threadpool to make decisions then no worries, I guess)

    -- it's like "I'm going parallel", "Well, so am I", "Well, that'd just create a switching storm, back away from my threadpool!"

    Anyway, it's way off topic now and I'm suitably furnished with what you've said. Thanks.
