Is it ok to use nested Parallel.For loops?

Is it ok to use nested Parallel.For loops?

Rate This
  • Comments 5

Every now and then, I get this question: “is it ok to use nested Parallel.For loops?” The short answer is “yes.”  As is often the case, the longer answer is, well, longer.

Typically when folks ask this question, they’re concerned about one of two things.  First, they’re concerned that each nested loop will assume it “owns the machine” and will thus try to use all of the cores for itself, e.g. if their code was running on an eight-core machine, they’re concerned that the outer loop would run eight of the nested loops in parallel, each of which would use eight threads, such that there would be 64 threads all churning on the inner loop.  Second, they’re concerned that the outer loop’s threads will end up blocking waiting for the inner parallel loops to finish.

Let me start by waving my hands a bit and saying that, for the most part, you don’t need to worry about such things, in that Parallel.For was designed to accommodate this usage pattern.  To understand why, it helps to have a high-level understanding for how Parallel.For is implemented.  Parallel.For doesn’t just queue MaxDegreeOfParallelism tasks and block waiting for them all to complete; that would be a viable implementation if we could assume that the parallel loop is the only thing doing work on the box, but we can’t assume that, in part because of the question that spawned this blog post.  Instead, Parallel.For begins by creating just one task.  When that task is executed, it’ll first queue a replica of itself, and will then enlist in the processing of the loop; at this point, it’s the only task processing the loop.  The loop will be processed serially until the underlying scheduler decides to spare a thread to process the queued replica.  At that point, the replica task will be executed: it’ll first queue a replica of itself, and will then enlist in the processing of the loop.  Starting to see a pattern?  In effect, Parallel.For makes sure there’s always an additional task hanging around that will join into the loop’s processing if another thread becomes available.  If no additional threads become available, when the loop’s processing has completed, that additional task will be canceled.  Further, the first task the loop creates is actually run synchronously (using Task.RunSynchronously).

What can we infer from this hand-wavy description of how Parallel.For achieves parallelism?  Because the first task the loop creates is run synchronously, and because the loop only spreads itself out to additional threads when the underlying scheduler gives it more threads, if the underlying scheduler has no threads to share (because they’re all occupied processing, say, other iterations of the outer parallel loop), the Parallel.For will be content to complete on just the thread that invoked it.  Going back to the original question, if I have a nested set of Parallel.For loops, the outer loop can spread itself out across available threads, and the inner loops can mostly complete on just the thread used by the outer loop to invoke the inner loop.

We can see this with a small code example:

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        const int OUTER_ITERS = 2000;
        const int INNER_ITERS = 2000;

        Parallel.For(0, OUTER_ITERS, i =>
        {
            int innerConcurrencyLevel = 0;
            int innerConcurrencyLevelSum = 0;
            Parallel.For(0, INNER_ITERS, j =>
            {
                Interlocked.Increment(ref innerConcurrencyLevel);
                for (int spin = 0; spin < 50000; spin++);
                Interlocked.Add(ref innerConcurrencyLevelSum, Volatile.Read(ref innerConcurrencyLevel));
                for (int spin = 0; spin < 50000; spin++);
                Interlocked.Decrement(ref innerConcurrencyLevel);
            });
            Console.Write("{0}, ", Math.Round(innerConcurrencyLevelSum / (double)INNER_ITERS));
        });
    }
}

Here I have nested parallel for loops.  The inner loop’s body keeps track of how many other iterations of this loop are currently running, and outputs the average for the loop to the console.  Here’s what I see when I run this on my quad-core:

2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2,
1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2
, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
… // and on and on

Note that while there are some “2”s sprinkled in there (in particular towards the beginning as processing was getting going), for the most part each inner loop was running with a concurrency level close to 1.

We can see this as well by using the Concurrency Visualizer in Visual Studio 11 Beta.  I tweaked the above example to remove the concurrency-level tracking code (since that introduces unnecessary I/O and synchronization for the purposes of this experiment), and here’s what I see:

image

For the purposes of this screenshot, I hid some things the Concurrency Visualizer was showing me (e.g. the GPU and disk I/O channels) so that the important results here really pop.  The key things to see are that 1) only five threads are being used in the processing of this loop (not 16 as we might expect if, on my quad-core, each inner parallel loop was utilizing four threads), and 2) there’s no red here, meaning there’s little unnecessary blocking happening.

Now, I’m not saying that there’s no way to mess with this logic; I’m sure you could come up with a case.  I’m also not saying that I advocate using nested parallel loops; if it’s possible to easily transform your code into one that uses a single parallel loop, that’s likely going to result in better performance, as you’ll be minimizing the number of parallel loops you invoke (each of which has some overhead) and you’ll be avoiding any potential problems that could arise from corner cases.  What I am saying is that we designed Parallel.For (and Parallel.ForEach, of course) to behave decently well in these cases, so that if you need to use nested parallel loops, you should feel comfortable in doing so.  This is particularly important from a composability perspective.  If you implement a method in a library that uses a parallel loop, a consumer of your library should be able to invoke your method from within a parallel loop and not be overly thoughtful about how your library method was implemented.  This is also important for non-nested cases, for other cases where multiple parallel loops may be run in parallel.  For example, with ASP.NET, we want to make sure the system behaves well if the processing of a request involves a parallel loop; many requests may be processed in parallel, and the parallel loop for any given request should not dominate the system and starve the processing of other requests.

I hope that helps.

Leave a Comment
  • Please add 1 and 1 and type the answer here:
  • Post
  • Could you also explain why Parallel.For does not always utilize all available hardware threads when running long running tasks. E.g. in our case we have 24 long running task we want to run in parallel 8 at a time on a 8 hardware thread machine, but typically at the end of the run less than 8 tasks are running, slowing the total run down considerably. It is as if tasks are not executed even though a thread should be available...

  • Harry, if/when this occurs, typically it's because either:

    1) You have a significant load imbalance in terms of how long it takes to process certain items.  It could be that the threads you see running for a while at the end are all processing their last item, which is just taking longer than some of the other items, and thus the other threads have all finished with no more work to be done.

    2) The parallel loop partitioned the data in chunks such that, based on the how the data was distributed, some of the threads got starved for data towards the end of the loop.  If that's the case, you could control the partitioning in a manner that's better for your data set.  The limit of this is ensuring that each thread never takes more than 1 item at a time... to achieve that, you could use a partitioner like the one described at drdobbs.com/.../224600406, or if you're using .NET 4.5, you can use the built-in variant of this via Partitioner.Create(source, EnumerablePartitionerOptions.NoBuffering).

  • Is't Parallel.For is a locking call?

    I assume that it locks and waits till child tasks will complete it's execution.

    Would't 15 nested Parallel.For cause thread starvation on ThreadPool and deadlock it this case?

  • Hi Valera-

    Parallel.For is a blocking call in the same way that a for loop is blocking: the caller won't progress past the loop until the loop has finished its work.

    It's true that the Parallel.For won't return until the tasks it's used for processing have completed, but as noted in this post, the loop will run the first task it creates synchronously on the current thread, it will only have a single additional task sitting around for an additional thread to pick up, and it'll cancel any outstanding tasks that aren't being used when all of the work for the loop has completed.

    Your example of 15 nested Parallel.Fors should not cause thread starvation or deadlock... give it a try.

  • Too messy explanation. I've read it several times but still not understand completely.

Page 1 of 1 (5 items)