Concurrency Runtime on Channel 9
09 October 08 11:46 AM | rickmolloy | 2 Comments   

A few weeks ago several of us from the Parallel Computing Platform sat down with Charles Torre from Channel 9 to discuss the Concurrency Runtime that we've blogged about here.  We talk about the motivation of the Concurrency Runtime, how its scheduler and resource manager interact and discuss the C++ programming models built on top of it, the Parallel Pattern Library mentioned here earlier here by Don McCrady and myself and the Asynchronous Agents Library (a concurrent data flow and message passing library) that we haven’t discussed yet but will likely soon.

 

The  video is now available on Channel 9: http://channel9.msdn.com/posts/Charles/The-Concurrency-Runtime-Fine-Grained-Parallelism-for-C/.

 

We hope you enjoy it, and as always feedback more than welcome and greatly appreciated!

Avoiding Contention using Combinable Objects
25 September 08 09:51 AM | dmccrady | 10 Comments   

When attempting to parallelize an algorithm, programmers are frequently thwarted by the presence of shared state.  Any state that can potentially be modified by multiple threads simultaneously during a parallel operation must be synchronized somehow to prevent race conditions, data corruption, and all kinds of nasty behaviour.  If the shared state is modified infrequently, the easiest and arguably best solution is simply to lock it using a critical section or some other simple mutual exclusion primitive.  However, if the shared state is modified very frequently, then this approach will exhibit a correspondingly frequent amount of lock contention, which will ultimately result in a loss of scalability.  Sometimes this loss of scalability is so severe that the parallel algorithm will run more slowly than the serial version.

One common pattern that programmers use to reduce the effect of shared state is to split the data into separate thread-local pieces.  Each thread in the parallel algorithm then operates only on its thread-local piece, thus removing all contention on it.  After the parallel algorithm completes, all the thread-local pieces are combined to produce the final result.  If the operations performed on this shared state are associative (i.e., (x + y) + z = x + (y + z)) and commutative (i.e., x + y = y + x), this pattern will produce the same result as the serial algorithm.  In Microsoft’s Parallel Pattern Library, this pattern is exposed via the combinable<T> construct.

To illustrate, let us use a very simple example:

int sum = 0;

for (vector<int>::iterator it = myVec.begin(); it != myVec.end(); ++it) {

    int element = *it;

    SomeFineGrainComputation(element);

    sum += element;

}

 

Now we would like to parallelize this.  Using the Microsoft Parallel Pattern Library, we can simply replace the for loop with a call to parallel_for_each, and use C++’s fancy new lambda syntax:

int sum = 0;

parallel_for_each(myVec.begin(),  myVec.end(), [&sum] (int element) {

    SomeFineGrainComputation(element);

    sum += element;

});

 

The parallelized code looks almost as simple (perhaps simpler) than the original code thanks to lambdas.  However, as written it has an obvious race condition:  multiple threads can be updating our shared “sum” variable at the same time, and this will almost certainly produce an incorrect result.  We have to synchronize it somehow.  As I mentioned above, the naïve way is to use a mutual exclusion primitive, such as a Windows critical section.  This would look like:

CRITICAL_SECTION sumCS;

InitializeCriticalSection(&sumCS);

 

int sum = 0;

parallel_for_each(myVec.begin(),  myVec.end(), [&sum] (int element) {

    SomeFineGrainComputation(element);

 

    EnterCriticalSection(&sumCS);

    sum += element;

    LeaveCriticalSection(&sumCS);

});

 

DeleteCriticalSection(&sumCS);

 

This fixes our race condition and produces the same result as the serial algorithm.  However, since the shared “sum” variable is accessed every iteration, the critical section we’ve introduced will cause a lot of contention, quite likely enough to cause the parallel algorithm to run slower than the serial algorithm.  (Ninja programmers may note that in this case we could use “InterlockedExchangeAdd” instead of a critical section to achieve the same effect; however, this only works for integers.  What if the shared state was not an integer or we were doing some operation other than “+”?  As well, although interlocked operations are usually faster than full-blown locks, they still have a large negative impact on performance because they involve expensive cache coherency operations.)

Using Microsoft’s Parallel Pattern Library and the combinable<T> object we can virtually eliminate the shared state altogether and thus eliminate the need for a lock or an interlocked operation.  The initial portion of the modified parallel algorithm looks strikingly similar to the original serial algorithm above:

combinable<int> localSums;

parallel_for_each(myVec.begin(),  myVec.end(), [&localSums] (int element) {

   SomeFineGrainComputation(element);

   localSums.local() += element;

});

 

The body of the loop now performs its computation on a thread-local integer by calling “localSums.local()”.  This member function returns a reference to the thread local integer, so the body of the loop can perform whatever operation(s) on it are necessary, subject to the associativity and commutativity requirements.  No locks are required and we have virtually eliminated all contention.

Hey wait a minute, didn’t we simply replace a shared “int” variable with a shared “combinable<int>” variable?  How has this solved our contention problem?  Yes, each thread still operates on a shared combinable object.  However, the combinable object employs a simple trick such that synchronization (i.e., a fence) is only required the first time a thread-local sub-computation is initialized.  Subsequent accesses to the thread-local value are fenceless.  (For example, if a 10000 iteration loop is parallelized over 8 threads, there will be only 8 fences.)  The combinable object hides all this trickiness so you don’t have to be concerned about it.

Now the only thing missing is the final sum.  When this parallel algorithm finishes, it has computed some number of thread-local sub-computations which we must combine (hence the name combinable) to produce the final result.  Using lambdas again, this looks like:

int sum = localSums.combine([] (int left, int right) {

    return left + right;

});

 

Or we could be even more succinct by using the “std::plus” functor from the standard <functional> header:

int sum = localSums.combine(std::plus<int>);

The performance improvement that can be realized by using combinable varies, of course, depending on how much the contention dominates the other computations done during the parallel algorithm.  In extreme cases (e.g. no other work done except updating shared state), my 4-core machine showed a 5-7x speedup over using a critical section.  The improvement for the same extreme case was even more dramatic on a 24-core machine we have in our lab:  over 100x faster!

The combinable object has a second combine function “combine_each”.  Consider a slightly different algorithm that collects items into a result set<int>.

combinable<set<int>> localSets;

parallel_for_each(myVec.begin(), myVec.end(), [&localSets] (int element) {

    if (SomeFineGrainComputation(element))

        localSets.local().insert(element);

});

 

If we were to use the binary combine function above, it would look like:

set<int> result = localSets.combine([] (const set<int>& left,

                                        const set<int>& right) -> set<int> {

    set<int> temp = left;

    temp.insert(right.begin(), right.end());

    return temp;

});

 

Unfortunately, this form of combine will create and destroy lots of “set<int>” objects, which would be very expensive and wasteful.  The “combine_each” function will accumulate each thread-local sub-computation into a final result using no temporary values:

set<int> result;

localSets.combine_each([&result] (const set<int>& localSet) {

    result.insert(localSet.begin(), localSet.end());

});

 

The combinable<T> object is a very simple but powerful construct that can help eliminate shared state in parallel algorithms that would otherwise limit scalability.  It adds very little extra complexity, and can be many times faster than alternative solutions that require traditional synchronization primitives.  While it is optimized for use with the Microsoft Parallel Pattern Library, it also works just fine in any other parallel algorithm, whether it’s built on OpenMP or your own hand-rolled threading.

Guided Tour of the Concurrency Runtime and Parallel Libraries Part 1: supporting fine-grained tasks
01 July 08 08:48 AM | rickmolloy | 10 Comments   

In the next several blog posts I want to continue the early look of what we're considering in the Concurrency Runtime and the Parallel Libraries built on top of it.  Today, I'll share primitives for expressing fine-grained parallelism and concurrency.  In the next few posts in this series, I'll provide a look at the additional primitives for expressing concurrency that we are considering, touch on higher level APIs like the parallel_for I showed briefly last time, and finally take a deeper look into the Concurrency Runtime and how it works.

Introducing task_handle and task_group

When we think about enabling programmers to decompose logical work into a collection of finer-grained “tasks” suitable for concurrent execution there are two fundamental abstractions that we want to ensure are provided: 

·       task is a computation that may be internally decomposed into additional tasks. 

·       A task group is a collection of tasks that form a logical computation or sub-computation for the purpose of either completion or exception processing.

The task object we're considering is an extremely lightweight template object; effectively it contains a reference to some “work” as described by a functor or lambda and looks like this:

 template <class Function>
  class
task_handle{
     
public:
        task_handle(
const Function& f );

};

The task group object has a few more methods on it which in addition to construction allow scheduling tasks with the overloaded run methods and waiting for or cancelling the scheduled tasks to complete.  Also note the constructor which takes a function object parameter, this is intended to help process exceptions on the task_group.

class task_group {

   public:

task_group();
     
template <class
_ExceptionFunc>
      task_group(_ExceptionFunc Fn);

      template <class _Func>

      void run(const _Func& Fn);

      template <class _Func>

      void run(task_handle<Function>& t);

      void interrupt();

      task_group_status wait();

     

};

Dividing and conquering concurrently with tasks

So let’s take a look at a recursive example, quicksort.  One thing that’s nice about divide and conquer algorithms is that the divide step often ends up splitting the work into two disjoint sets. When this happens it makes parallelizing them significantly easier because there isn’t any data sharing.  Here’s an abbreviated serial implementation of quicksort:

void quicksort(int * a, int n) {
   if (n <= 1)
      return;    
   int s = partition(a,n);
   quicksort(a,s);
   quicksort(a+s,n-s);
}

We can parallelize this relatively easily with task_group by first creating a task_group, then scheduling the two recursive calls with C++0x lambdas via the overloaded task_group::run template method, finally there’s a call to task_group::wait which waits for both halves to finish.  I’ll leave parallelizing the partition step for a future blog post.

void quicksort(int * a, int n) {
   if (n <= 1)
      return;    
   int s = partition(a,n);
  
task_group g;
   g.run([&]{quicksort(a,s);});
   g.run([&]{quicksort(a+s,n-s);});
   g.wait();
}

Note that you can also do this directly with a task_handle if you’d like to manage the storage more directly, here I’m declaring 2 task_handles on the stack and the scheduling them by reference with the recursive calls to quicksort, but the tasks could just as easily been allocated with another custom allocator:

void quicksort(int * a, int n) {
   if (n <= 1)
      return;    
   int s = partition(a,n);
  
task_group g;
   task_handle t1([&]{quicksort(a,s);});
   g.run(t1);
   task_handle t2([&]{quicksort(a+s,n-s);});
   g.run(t2);
   g.wait();
}

That’s a very brief look at task_handle and task_group.  In part 2, I’ll continue talking about task_group, specifically handling exceptions and cooperatively cancelling work with the task_group::interrupt method.

As always your feedback & discussion is incredibly valuable, it’s a major reason for providing this early look into what we’re thinking about so if you’re reading this, please take a moment to comment or ask questions.

-Rick

Welcome to the Native Concurrency blog!
06 June 08 04:22 AM | rickmolloy | 5 Comments   

Welcome to the Parallel Programming in Native Code blog.  I started this blog so that I and others on my team would have a place to talk about topics relating to native concurrency.  I want to use this blog to provide early looks into what we’re thinking about, give announcements about any publicly available content or CTPs and of course respond to feedback that we receive from readers and customers.

We’ve talked before about the shift in hardware that is occurring as hardware vendors move towards multi- and many-core machines. We’ve also suggested that to take full advantage of this new hardware, the development platform will very likely need to grow and change. 

Today, I’d like to share some of the recurring major obstacles we see for C++ developers, provide insight into our design goals and a very brief glimpse of a technology we’re exploring to help address this, the Concurrency Runtime.  I’ll also touch upon a parallel library built on top of it intended to help C++ developers be more productive at building applications that scale.

If you’re at TechEd 2008 in Florida you may have already seen this content in part of a talk titled “Parallelize Your C++ Applications with the Concurrency Runtime.”  I’m really only talking about the first few minutes here and in future posts, I’ll walk through most of what was presented in that talk.

What are the major problems?

I’ve talked to a lot of C++ developers and ISVs about concurrency and the technical challenges that they face.   There are 2 major themes that always seem to come up. 

First and almost invariably I hear that multi-threading programming is considered quite difficult because of the challenges ensuring that concurrent code is correct, reliable and race and deadlock free.  Often folks will tell me that only a very small number of senior developers or concurrency experts do this work within an organization or company.

Another thing I hear a lot is that the amount of actual work and code required to take a portion of code and make it concurrent is significant.  The APIs for threading, events and locks are relatively low level and it isn’t always particularly obvious how to build concurrency into an application with them.  Building an application that scales well across a mix of hardware with these APIs is even harder.

Enabling productive concurrency

This is where our team comes in, we’re looking to help overcome these barriers and improve productivity for C++ developers.

We want to make expressing concurrency easier by adding abstractions for describing opportunities for parallelism that maintain the original intent, readability and composability of the code. 

We’re trying to minimize the number of new concepts we introduce to ensure that the model remains approachable and familiar to mainstream C++ developers.

We’re exploring ways for developers to overcome the challenges of shared memory by providing a means of describing applications as isolated components that communicate with a rich message passing interface.

We’re looking at providing a common and efficient Concurrency Runtime that supports a broad range of parallel abstractions and removes the need for developers to build this infrastructure.

A simple example with matrix multiply

I’ll provide a brief example before I close this post down…  Here’s an example of a naive matrix multiplication, I’d like to show how easy it can be to express concurrency in something simple like a for loop:

void MatrixMult(int size, double** m1, double** m2,double** result){

    for (int i = 0; i < size; i++){

        for (int j = 0; j < size; j++){

            for (int k = 0; k < size; k++) {

                result[i][j] += m1[i][k] * m2[k][j];

            }

        }

    }

}

 

And here’s a possible parallel version (note how similar it is):

void MatrixMult(int size, double** m1, double** m2,double** result){

    parallel_for (0,size,1,[&](int i){

        for (int j = 0; j < size; j++){

            for (int k = 0; k < size; k++){

                result[i][j] += m1[i][k] * m2[k][j];

            }

        }

    });

}

In this example I’ve taken the outer for loop and replaced it with a call to a parallel_for template function.  I’m also using a new C++0x language feature called lambdas to automate the work of manually creating a functor and capturing the variables used in the function.

We’ll be sure to discuss in more detail the libraries and runtime in future posts, but that’s all for now.  See you next time!

-Rick

Page view tracker