Parallel loop performance

We've received several questions on the MSDN Forums for Parallel Extensions about the performance of the Parallel class, and specifically of the loop constructs we provided in the CTP.  We're very much aware that the performance of Parallel.For/ForEach in the CTP is not optimal, and that for some situations the overhead of these constructs will simply be too much.  That's ok: we released the preview of the Task Parallel Library in the December CTP to solicit early feedback on the APIs we're providing (e.g. Are they useful?  What's missing?), and getting the preview out early came at the expense of performance and reliability (it is a technology preview, after all).  We'll be spending a whole lot of time getting the performance of these loops to be stellar; for now, please let us know whether the APIs are right so that we can adjust if they're not.  And if you have interesting situations where the performance isn't great, we'd love to know about that, too, so that we can factor it in when determining the best implementations to ship.

That said, there will always be some cases that we won't handle perfectly, due simply to the sheer number of variations and use cases that exist.  And for those situations, we provide the lower-level Task construct with its replication capabilities to allow you to write custom loops that are best suited to your specific needs (as an example, we previously posted about how to write a parallel while loop).
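
For reference, the heart of that approach looks roughly like the following (a simplified sketch rather than the full implementation from that post; the condition delegate is assumed to be safe to call concurrently):

static void ParallelWhile(Func<bool> condition, Action body)
{
    Task.Create(delegate
    {
        // Each replica of the task keeps executing the body
        // until the condition reports that no work remains.
        while (condition()) body();
    },
    TaskCreationOptions.SelfReplicating).Wait();
}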

Consider another situation: you want a for loop, but the body of the for loop contains so little work that even the cost of a delegate invocation (to execute the body) is too much.  For example, consider this implementation of a very simple ParallelFor loop:

static void ParallelFor(
    int fromInclusive, int toExclusive, Action<int> body)
{
    int index = fromInclusive;
    Task.Create(delegate
    {
        // Each replica of this task competes for iterations;
        // the interlocked increment hands out the next index.
        int i;
        while ((i = Interlocked.Increment(
            ref index) - 1) < toExclusive)
        {
            body(i);
        }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}
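
To use it, you pass the loop bounds and a delegate for the body; for example (a trivial usage sketch, where results and ComputeValue stand in for your own data and per-element work):

double[] results = new double[1000];
ParallelFor(0, results.Length, i =>
{
    results[i] = ComputeValue(i); // ComputeValue is hypothetical
});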

For loop bodies with a significant amount of work, this implementation isn't all that bad.  But for small bodies, there is a fairly large overhead here per iteration, in that each iteration incurs both an interlocked operation (to advance the index variable) and a delegate invocation (to run the body).  Following this same approach, I could cut down on the number of interlocked operations by incrementing each time by an amount larger than 1:

static void ParallelFor(
    int fromInclusive, int toExclusive, Action<int> body)
{
    const int STRIDE = 8;
    int index = fromInclusive;
    Task.Create(delegate
    {
        int i;
        // Claim STRIDE iterations at a time to amortize the cost
        // of the interlocked addition across the whole chunk.
        while ((i = Interlocked.Add(
            ref index, STRIDE) - STRIDE) < toExclusive)
        {
            int endExclusive =
                Math.Min(i + STRIDE, toExclusive);
            for (; i < endExclusive; i++)
            {
                body(i);
            }
        }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}

In spirit, this is actually very similar to the implementation of Parallel.For in the CTP (which, based on my previous comments about CTP performance, should hint at the performance quality of this example ;).  Each CPU grabs 8 elements using an interlocked addition.  This decreases the number of interlocked operations per iteration from 1 to approximately 1/8, but it does so at the expense of load balancing.  And still, for each iteration we have the cost of a delegate invocation in order to execute the body of the loop as provided by the caller.  In fact, no matter what we do here, with this overall design of ParallelFor, we won't be able to avoid that delegate invocation, which could be prohibitive for certain problems with very, very small bodies.

To fix that, we could go with a different design:

static void ParallelForRange(
    int fromInclusive, int toExclusive,
    Action<int, int> body);

The idea here is that rather than providing a loop body that accepts the current iteration, the loop body accepts two integers, one representing the start of a range and the other representing the end (just as with fromInclusive and toExclusive, but on a smaller scale):

static void ParallelForRange(
    int fromInclusive, int toExclusive,
    Action<int, int> body)
{
    const int STRIDE = 8;
    int index = fromInclusive;
    Task.Create(delegate
    {
        int i;
        while ((i = Interlocked.Add(
            ref index, STRIDE) - STRIDE) < toExclusive)
        {
            // A single delegate invocation covers the entire chunk.
            body(i, Math.Min(i + STRIDE, toExclusive));
        }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}

To use this, rather than doing the following as you would with the ParallelFor method:

ParallelFor(0, N, i =>
{
    // original loop body here
});

you could do the following:

ParallelForRange(0, N, (fromInclusive, toExclusive) =>
{
    for (int i = fromInclusive; i < toExclusive; i++)
    {
        // original loop body here
    }
});

Just as with the previous ParallelFor implementation, where we used a stride to decrease the number of interlocked operations, here in ParallelForRange we've used that same stride to decrease the number of delegate invocations, at the expense of forcing the user to implement their own loop inside of the parallel loop.  To put numbers on it: with N = 1,000,000 iterations and STRIDE = 8, the strided ParallelFor performs roughly 125,000 interlocked additions but still 1,000,000 delegate invocations, whereas ParallelForRange performs roughly 125,000 of each.

Now, as previously mentioned, this striding approach has a downside of decreasing the possibility for load balancing.  The implementation of Parallel.For that we eventually ship will do its best to both decrease overhead per iteration and maximize load balancing.  But there may still be cases where the cost of the delegate invocation per iteration will be too much.  Forgetting the implementation details for a moment, would it be useful for you if in addition to Parallel.For, we also provided a Parallel.ForRange with a design similar to the method I just showed? We'd love your feedback on issues like this, as we want to make sure we're providing the right solutions for your problem domain.  And are there other common looping constructs that you'd like to see us provide?  Now is a great time to download the CTP, try it out, and send us your feedback... we're listening.

Comments

  • It would be great if TPL had a Parallel.For variant that just broke up the for loop into NUM_THREAD blocks and then let each thread work on one of the blocks.  This would work well when the loop body does a trivial amount of work.

    For example (poor example, I know):

    int n = 1e10 + 1;
    ...
    Parallel.ForXXX(0, n, i =>
    {
        y[i] = x[i]*someScalar + y[i];
    });

    Assuming a quad processor with NUM_THREAD set to 4, threads 1-3 each process a 2.5e9-element block and thread 4 gets 2.5e9+1 elements.  This would be similar to your last example, but would require no thought on the part of the TPL user.
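
    A minimal sketch of that kind of static partitioning, written against the same CTP Task API used in the post, might look like the following (ParallelForBlocks and degreeOfParallelism are made-up names, Task.WaitAll is assumed to be available in the CTP, and error handling is omitted):

    static void ParallelForBlocks(
        int fromInclusive, int toExclusive,
        int degreeOfParallelism, Action<int> body)
    {
        // Statically divide the iteration space into one contiguous
        // block per worker, rounding up so every iteration is covered.
        int count = toExclusive - fromInclusive;
        int blockSize =
            (count + degreeOfParallelism - 1) / degreeOfParallelism;
        Task[] tasks = new Task[degreeOfParallelism];
        for (int t = 0; t < degreeOfParallelism; t++)
        {
            int start = fromInclusive + t * blockSize;
            int end = Math.Min(start + blockSize, toExclusive);
            tasks[t] = Task.Create(delegate
            {
                // No interlocked operations, but still one delegate
                // invocation per iteration.
                for (int i = start; i < end; i++) body(i);
            });
        }
        Task.WaitAll(tasks);
    }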

  • I think adding an overloaded version of Parallel.For that would allow the user to give the TPL a hint on how it should divide up the tasks would work too:

    enum PfxHint
    {
        SHORT_RUNTIME,
        LONG_RUNTIME,
        UNKNOWN_RUNTIME
        // ... others? ...
    }

    then you could do something like:

    int max = 1000000000;  // 1E9
    int[] squares = new int[max];
    Parallel.For(0, max, PfxHint.SHORT_RUNTIME,
        n => { squares[n] = n*n; });

    or, you could do this...

    Func<int, bool> SomeNonUniformFunction = ...;
    bool[] answers = new bool[max];
    Parallel.For(0, max, PfxHint.UNKNOWN_RUNTIME,
        n =>
        {
            answers[n] = SomeNonUniformFunction(n);
        });

    Using a construct like this might give the TPL enough information to make a smart decision about how to partition up the work.  If you expect short runtimes for each iteration, then the TPL can use the Parallel.ForRange() approach without too much fear of messing up the load balance.  If the runtimes are long, or you simply don't know, then you might want to deal with the overhead of more, but smaller, tasks to keep the load balanced.

    As a developer, I think I would have enough insight into the loop that I'm writing to be able to provide a reasonable hint.  If not, then I would call the standard Parallel.For() and let TPL use the default behavior.

    -Frank
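
    For illustration, one way such a hint might be consumed (a purely hypothetical sketch, not the CTP's API: this standalone ParallelFor stands in for the suggested Parallel.For overload, PfxHint is the enum from the comment above, and the chunk sizes are invented):

    static void ParallelFor(
        int fromInclusive, int toExclusive,
        PfxHint hint, Action<int> body)
    {
        // Hypothetical policy: short, uniform bodies get large chunks
        // (fewer interlocked operations); long or unknown bodies get
        // small chunks (better load balancing).
        int stride = (hint == PfxHint.SHORT_RUNTIME) ? 1024 : 8;
        int index = fromInclusive;
        Task.Create(delegate
        {
            int i;
            while ((i = Interlocked.Add(
                ref index, stride) - stride) < toExclusive)
            {
                int endExclusive = Math.Min(i + stride, toExclusive);
                for (; i < endExclusive; i++) body(i);
            }
        },
        TaskCreationOptions.SelfReplicating).Wait();
    }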

  • nobody:

    Thanks for the suggestion!  Is this a performance suggestion, or is there a use case where you actually need the iteration space divided evenly like this?  Note that while the implementation I showed previously is similar to the implementation in the CTP, the current implementations we're working on are different from all of these, optimizing at the same time for issues such as overhead, load balancing, etc. Underlying partitioning aside, your ForXXX example and the ParallelForRange example I showed differ in that the ParallelForRange limits the number of delegate invocations to one per chunk, whereas the ForXXX you showed (just like Parallel.For) requires a delegate invocation for each iteration.

    Frank:

    Also thanks for the suggestion.  This is similar to CallbackMayRunLong and WT_EXECUTELONGFUNCTION in the Windows thread pools, and we have been (and still are) considering hints somewhat similar to this for Task and the higher-level abstractions built on top of it.

  • > Is this a performance suggestion, or is there a use
    > case where you actually need the iteration space
    > divided evenly like this?

    It is a performance suggestion, to minimize delegate creation and interlocking for cases where the loop body does a trivial amount of work.

    > whereas the ForXXX you showed (just like Parallel.For)
    > requires a delegate invocation for each iteration.

    It looks like I wasn't clear.  The example was to suggest that the ForXXX would only create four delegates, one for each thread - not one for each iteration.  Since the method body is trivial, there really isn't a need for load balancing - though it could be tweaked a bit: maybe NUM_THREAD * N partitions are created, where N is very small, say 2 or 4 (so NUM_THREAD * N delegates would be created).  The idea was to remove the need for users to define their own ranges, as I understand your ForRange example requires.  The ForXXX variant would calculate the ranges automatically based on the number of threads being used.

    This variant would come in handy when you're doing simple operations on elements of a large array.

  • Thanks, nobody.  

    I think the confusion here is stemming from the fact that none of the examples in this blog post create a delegate for each iteration: only one delegate is instantiated and provided to the various Parallel* methods.  So the issue isn't the creation of one delegate per iteration, it's the invocation of one delegate per iteration, which both the Parallel.For method and your suggested ForXXX method incur.

    Regarding load balancing, even with small body sizes there's still a need for load balancing.  There are lots of other things happening on your computer while your application using Parallel.For is running: the OS is doing processing, the shell is doing processing, your virus scanner may kick in, network packets are being received and analyzed, and so on (my laptop right now has 85 processes running).  All of those require processors, the same processors that are being used to run the work of your parallel loop.  If one of those processors is chosen to do even slightly more extraneous work than the others (which is extremely likely), load balancing becomes important, so that one processor doesn't sit there twiddling its thumbs, with no loop work left to do, waiting on the other processors to finish their share of the parallel loop.

    In my ForRange example, the user of ForRange doesn't need to create their own range.  Rather than the body being passed a single iteration value i, it's passed a range from->to, and so rather than processing a single iteration, the body just loops from from->to, processing each iteration in that range.  If the average range size provided to the delegate is N, that cuts down on the number of delegate invocations by a factor of approximately N (at the expense, as mentioned, of some load balancing opportunities).  Internally, the developer of ForRange could choose from a variety of techniques to decide how big the ranges should be.  What I've implemented above is a simple "take the next 8 elements" approach, but one could definitely implement more sophisticated logic.

    Does that help?

    Thanks, again, for your suggestions!  We appreciate them, so please keep it up :)
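
    As one illustration of that "more sophisticated logic", a guided policy could size each chunk in proportion to the work remaining, so chunks start large and shrink as the loop drains, recovering some load balancing near the end (a hypothetical sketch, not the CTP's implementation; workerCount is just a tuning knob here):

    static void ParallelForRangeGuided(
        int fromInclusive, int toExclusive,
        int workerCount, Action<int, int> body)
    {
        int index = fromInclusive;
        Task.Create(delegate
        {
            while (true)
            {
                // Size the next chunk from an approximate view of
                // what's left; chunks shrink as the loop drains.
                int remaining = toExclusive - index;
                if (remaining <= 0) break;
                int chunk = Math.Max(1, remaining / (2 * workerCount));
                int start = Interlocked.Add(ref index, chunk) - chunk;
                if (start >= toExclusive) break;
                body(start, Math.Min(start + chunk, toExclusive));
            }
        },
        TaskCreationOptions.SelfReplicating).Wait();
    }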

  • Hi,

     One other thing I find myself dealing with in my Parallel.For loops: the system will perform best as long as the various threads are working on data that is close together in the index variable.  For example, as the index is incremented, it causes some data to be read in.  If there is a lot of skipping around (because the STRIDE size is large), then that will cause unnecessary I/O.

     Another edge case to consider.

     This was also a very nice sample of how to code up my own loops, which I might do to see if I can further tune the behavior.

  • Gordon, absolutely.  In addition to interlocked/delegate invocation overhead and load balancing issues and the like, there are certainly other forms of overhead that we need to factor in when deciding on the best general implementation for our loops, and one of these factors is locality.  Not just for disk I/O, but also for things like cache and memory.  You should definitely see this improve in future releases.  Of course, there are always counterexamples one can construct where locality of index may translate into poor memory/cache locality; we want to optimize for the most common use cases (which is why feedback from folks like yourself at this stage of the game is so important), but still provide a mechanism for folks to get the performance they need in other cases, even if it means writing a bit more code on top of lower-level primitives we provide (like replicating tasks).

    I'm glad you found the sample useful!  I know from your forum posts that you were interested in thread-local state as well, which this sample doesn't provide.  I'll add that to the list of future blog posts to write ;)


  • Hello,

    I am experimenting with the CTP and tried benchmarking the ParallelFor against a normal for loop:

    NonThreaded Total = 2.66666668666615E+25 In 1828.1601 ms
    Threaded Total =    2.66666668666614E+25 In 4516.0957 ms
    NonThreaded Total = 2.66666668666615E+25 In 1765.6589 ms
    Threaded Total =    2.66666668666614E+25 In 4781.7258 ms

    Can you see what I am doing wrong, or suggest a better way of doing this?

    Thanks...

    Nigel...

    public void UnThreadedDemo()
    {
        double a = 10;
        double b = 20;
        double c = 30;
        double Total = 0;
        int Iterations = 200000000;
        DateTime Start = DateTime.Now;
        for (int x = 0; x < Iterations; x++)
        {
            Total += a * x * x + b * x + c;
        }
        DateTime Finish = DateTime.Now;
        TimeSpan delta = Finish.Subtract(Start);
        Console.WriteLine("NonThreaded Total = " + Total + " In " + delta.TotalMilliseconds + " ms");
    }

    static void ParallelFor(
        int fromInclusive, int toExclusive, Action<int> body)
    {
        int index = fromInclusive;
        Task.Create(delegate
        {
            int i;
            while ((i = Interlocked.Increment(
                ref index) - 1) < toExclusive)
            {
                body(i);
            }
        },
        TaskCreationOptions.SelfReplicating).Wait();
    }

    object locker = new object();

    public void ThreadedDemo()
    {
        double a = 10;
        double b = 20;
        double c = 30;
        double Total = 0;
        int Iterations = 200000000;
        DateTime Start = DateTime.Now;
        //Parallel.For(0, Iterations, x => { Total += a * x * x + b * x + c; });
        ParallelFor(0, Iterations, x =>
        {
            // Every iteration takes the same lock to protect Total.
            lock (locker)
            {
                Total += a * x * x + b * x + c;
            }
        });
        DateTime Finish = DateTime.Now;
        TimeSpan delta = Finish.Subtract(Start);
        Console.WriteLine("Threaded Total = " + Total + " In " + delta.TotalMilliseconds + " ms");
    }
