# Recursion and Concurrency

## Parallel Programming with .NET


When teaching recursion in an introductory computer science course, one of the most common examples involves a tree data structure.  Trees are useful in this regard as they are simple and recursive in nature (a tree's children are themselves trees), and they allow for teaching different kinds of traversals (in-order, pre-order, post-order, level-order, and so forth).  But when these introductory concepts meet multithreading, things are no longer so simple, at least not with the tools available in mainstream languages today like C#, C++, and Java. Consider a simple Tree data structure:

```csharp
class Tree<T>
{
    public Tree<T> Left, Right; // children
    public T Data;              // data for this node
}
```

Let's say we want to execute an Action<T> for each datum stored in the tree, and let's assume I don't care about order (introducing parallelism could be questionable if I did).  That's straightforward to do sequentially and recursively:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Process the current node, then the left, then the right
    action(tree.Data);
    Process(tree.Left, action);
    Process(tree.Right, action);
}
```
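As a quick usage sketch (the example data here is hypothetical, not from the article):

```csharp
// Build a tiny tree and process it with the sequential routine above
var root = new Tree<int>
{
    Data = 1,
    Left = new Tree<int> { Data = 2 },
    Right = new Tree<int> { Data = 3 }
};

// Pre-order: current node, then left, then right
Process(root, x => Console.Write(x + " ")); // writes "1 2 3 "
```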

I could also do so without recursion by maintaining an explicit stack (or queue, or some other data structure with different ordering guarantees):

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    var toExplore = new Stack<Tree<T>>();

    toExplore.Push(tree);
    while (toExplore.Count > 0)
    {
        // Grab the next node, process it, and push its children
        var current = toExplore.Pop();
        action(current.Data);
        if (current.Left != null)
            toExplore.Push(current.Left);
        if (current.Right != null)
            toExplore.Push(current.Right);
    }
}
```

Now, let's assume that the action we're performing on each node of the tree is independent, that it's relatively expensive, and/or that the tree is relatively large, and that as a result we want to process the tree in parallel (we're of course also assuming that the action delegate is thread-safe), meaning that we want multiple threads each running the action delegate on distinct tree nodes.  How do we do this with what we have in .NET today?

There are multiple approaches, some more valid than others.  The first thing someone might try is to follow the original recursive implementation but using the ThreadPool, which could look something like this:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to prevent this method from
    // returning until its children have completed
    using (var mre = new ManualResetEvent(false))
    {
        // Process the left child asynchronously
        ThreadPool.QueueUserWorkItem(delegate
        {
            Process(tree.Left, action);
            mre.Set();
        });

        // Process current node and right child synchronously
        action(tree.Data);
        Process(tree.Right, action);

        // Wait for the left child
        mre.WaitOne();
    }
}
```

The idea behind this implementation is to, given a node, spin up a work item to process that node's left child in parallel with the current node, and then process the current node's data as well as its right child.  Of course, I could be losing out on some parallelism here as I delay processing of the right child until I'm done processing the current data.  So we modify it slightly:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to wait for the children
    using (var mre = new ManualResetEvent(false))
    {
        int count = 2;

        // Process the left child asynchronously
        ThreadPool.QueueUserWorkItem(delegate
        {
            Process(tree.Left, action);
            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        });

        // Process the right child asynchronously
        ThreadPool.QueueUserWorkItem(delegate
        {
            Process(tree.Right, action);
            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        });

        // Process the current node synchronously
        action(tree.Data);

        // Wait for the children
        mre.WaitOne();
    }
}
```

There's a serious problem with both of these implementations, however.  Every call to Process blocks a ThreadPool thread while it waits for its children to finish, so processing a large tree can tie up every thread in the pool; the queued child work items then have no thread to run on, and the whole operation can deadlock.  On top of that, creating and waiting on a ManualResetEvent for every node requires a (sometimes) unacceptable number of kernel transitions.  Obviously, we need a better implementation. Next up, we can walk the tree sequentially, queuing up a work item in the pool for each node in the tree.  Here we take that approach based on the recursive implementation of a tree walk:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to wait for all of the nodes to complete
    using (var mre = new ManualResetEvent(false))
    {
        int count = 1;

        // Recursive delegate to walk the tree
        Action<Tree<T>> processNode = null;
        processNode = node =>
        {
            if (node == null) return;

            // Asynchronously run the action on the current node
            Interlocked.Increment(ref count);
            ThreadPool.QueueUserWorkItem(delegate
            {
                action(node.Data);
                if (Interlocked.Decrement(ref count) == 0)
                    mre.Set();
            });

            // Process the children
            processNode(node.Left);
            processNode(node.Right);
        };

        // Start off with the root node
        processNode(tree);

        // Signal that no more work items will be created
        if (Interlocked.Decrement(ref count) == 0) mre.Set();

        // Wait for all of the work to complete
        mre.WaitOne();
    }
}
```

This implementation is better from a resources perspective and won't lead to the same kind of deadlocks as the previous example.  We're still creating a work item for each node, but now each work item simply executes the action on the node's data asynchronously, while the tree walk itself still happens sequentially on the main thread (taking advantage of a recursive delegate).  To maintain blocking behavior similar to the previous examples (the call to Process won't return until all of the work has completed), though not identical (the processing of child nodes doesn't block the parent), a counter is incremented before each work item is created and decremented when each work item finishes; only when that counter reaches 0 will the main thread continue.  Note how we need to start the counter at 1 and then decrement it after all of the work items have been created: this ensures that a work item completing before all of the work has been kicked off doesn't set the event prematurely.  We can take the same approach with a variant of the iterative implementation shown earlier:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to wait for all of the nodes
    // to complete
    using (var mre = new ManualResetEvent(false))
    {
        int count = 1;

        var toExplore = new Stack<Tree<T>>();
        toExplore.Push(tree);

        // Process all of the nodes
        while (toExplore.Count > 0)
        {
            // Get the current node and push its children
            var current = toExplore.Pop();
            if (current.Left != null)
                toExplore.Push(current.Left);
            if (current.Right != null)
                toExplore.Push(current.Right);

            // Asynchronously process the data
            Interlocked.Increment(ref count);
            ThreadPool.QueueUserWorkItem(delegate
            {
                action(current.Data);
                if (Interlocked.Decrement(ref count) == 0)
                    mre.Set();
            });
        }

        // Signal that no more work items will be created
        if (Interlocked.Decrement(ref count) == 0) mre.Set();

        // Wait for all work items to be completed
        mre.WaitOne();
    }
}
```

These approaches work by queueing a work item per node in the tree.  There are, of course, other approaches.  Rather than adding a work item per node, we can create N work items, and have each work item process approximately 1/Nth of the nodes in the tree, where N is the number of threads.  For example, we could store all of the nodes into a list and then split that list into N pieces and have a work item process each piece:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Create a list of all nodes in the tree
    var nodes = new List<Tree<T>>();
    var toExplore = new Stack<Tree<T>>();
    toExplore.Push(tree);
    while (toExplore.Count > 0)
    {
        var current = toExplore.Pop();
        nodes.Add(current);
        if (current.Left != null)
            toExplore.Push(current.Left);
        if (current.Right != null)
            toExplore.Push(current.Right);
    }

    // Divide the list up into chunks
    int workItems = Environment.ProcessorCount;
    int chunkSize = Math.Max(nodes.Count / workItems, 1);
    int count = workItems;

    // Use an event to wait for all work items
    using (var mre = new ManualResetEvent(false))
    {
        // Each work item processes appx 1/Nth of the data items
        WaitCallback callback = state =>
        {
            int iteration = (int)state;
            int from = chunkSize * iteration;
            int to = iteration == workItems - 1 ?
                nodes.Count :
                chunkSize * (iteration + 1);

            while (from < to) action(nodes[from++].Data);

            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        };

        // The ThreadPool is used to process all but one of the
        // chunks; the current thread is used for that chunk,
        // rather than just blocking.
        for (int i = 0; i < workItems; i++)
        {
            if (i < workItems - 1)
                ThreadPool.QueueUserWorkItem(callback, i);
            else
                callback(i);
        }

        // Wait for all work to complete
        mre.WaitOne();
    }
}
```

Rather than statically dividing up the work, I could also rewrite the code such that each work item is itself responsible for pulling work to handle:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Get an enumerator for the tree
    IEnumerator<T> enumerator =
        GetNodes(tree).GetEnumerator();
    int workItems = Environment.ProcessorCount;
    int count = workItems;

    // Use an event to wait for all work items to complete
    using (var mre = new ManualResetEvent(false))
    {
        // Each work item will continually pull data from the
        // enumerator and process it until there is no more data
        // to process
        WaitCallback callback = delegate
        {
            while (true)
            {
                T data;
                lock (enumerator)
                {
                    if (!enumerator.MoveNext()) break;
                    data = enumerator.Current;
                }
                action(data);
            }
            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        };

        // The ThreadPool is used to process all but one of the
        // chunks; the current thread is used for that chunk,
        // rather than just blocking.
        for (int i = 0; i < workItems; i++)
        {
            if (i < workItems - 1)
                ThreadPool.QueueUserWorkItem(callback, i);
            else
                callback(i);
        }

        // Wait for all work to complete
        mre.WaitOne();
    }
}
```

```csharp
// An enumerator for a tree
public static IEnumerable<T> GetNodes<T>(Tree<T> tree)
{
    if (tree != null)
    {
        yield return tree.Data;
        foreach (var data in GetNodes(tree.Left))
            yield return data;
        foreach (var data in GetNodes(tree.Right))
            yield return data;
    }
}
```

Here I use a C# iterator to create an enumerator for all of the nodes in the tree.  A work item is queued for each processor, and each of those work items retrieves an item from the enumerator, processes it, and goes back for more.  Since multiple threads are accessing the enumerator (which isn't thread-safe), retrieving an item from it must be done under lock.  Unfortunately, that's expensive; a better implementation would shift to grabbing more items from the enumerator so that the lock could be taken fewer times.
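As a sketch of that improvement (a hypothetical variant of the method above; the batch size of 16 is an arbitrary assumption worth tuning):

```csharp
public static void ProcessBatched<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    const int batchSize = 16; // arbitrary; tune for your workload

    IEnumerator<T> enumerator = GetNodes(tree).GetEnumerator();
    int workItems = Environment.ProcessorCount;
    int count = workItems;

    using (var mre = new ManualResetEvent(false))
    {
        WaitCallback callback = delegate
        {
            var batch = new List<T>(batchSize);
            while (true)
            {
                // One lock acquisition per batch rather than per item
                batch.Clear();
                lock (enumerator)
                {
                    while (batch.Count < batchSize &&
                           enumerator.MoveNext())
                        batch.Add(enumerator.Current);
                }
                if (batch.Count == 0) break;
                foreach (var data in batch) action(data);
            }
            if (Interlocked.Decrement(ref count) == 0) mre.Set();
        };

        // Use the pool for all but one worker; run the last
        // worker on the current thread rather than just blocking
        for (int i = 0; i < workItems; i++)
        {
            if (i < workItems - 1)
                ThreadPool.QueueUserWorkItem(callback);
            else
                callback(null);
        }

        mre.WaitOne();
    }
}
```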

There are still further possible approaches.  For example, we could walk the tree until we get to nodes of a certain depth (such as the log of the number of cores) and then process all of the subtrees at that depth in parallel with each other.
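One way to sketch that subtree approach (hypothetical helpers in the style of the earlier ThreadPool examples; the log-of-processor-count depth is just a heuristic and assumes a reasonably balanced tree):

```csharp
public static void ProcessByDepth<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Cut the tree at roughly log2(#cores) levels, yielding
    // about one subtree per core
    int targetDepth = (int)Math.Log(Environment.ProcessorCount, 2);

    // Walk the top of the tree sequentially, processing those
    // nodes and collecting the subtrees rooted at the cut
    var subtrees = new List<Tree<T>>();
    CollectAtDepth(tree, targetDepth, action, subtrees);

    // Process each collected subtree in its own work item
    int count = subtrees.Count + 1;
    using (var mre = new ManualResetEvent(false))
    {
        foreach (var subtree in subtrees)
        {
            var st = subtree; // capture a per-iteration copy
            ThreadPool.QueueUserWorkItem(delegate
            {
                SequentialProcess(st, action);
                if (Interlocked.Decrement(ref count) == 0) mre.Set();
            });
        }
        if (Interlocked.Decrement(ref count) == 0) mre.Set();
        mre.WaitOne();
    }
}

private static void CollectAtDepth<T>(Tree<T> node, int depth,
    Action<T> action, List<Tree<T>> subtrees)
{
    if (node == null) return;
    if (depth == 0) { subtrees.Add(node); return; }
    action(node.Data);
    CollectAtDepth(node.Left, depth - 1, action, subtrees);
    CollectAtDepth(node.Right, depth - 1, action, subtrees);
}

private static void SequentialProcess<T>(Tree<T> t, Action<T> a)
{
    if (t == null) return;
    a(t.Data);
    SequentialProcess(t.Left, a);
    SequentialProcess(t.Right, a);
}
```

A nice side effect is that each subtree is processed entirely by one thread, which also helps with the locality concerns discussed below.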

All of these implementations also have potential locality issues.  It's reasonable to assume (though certainly not guaranteed) that nodes in the tree near to each other were created near to each other in time and are likely near to each other in memory (this assumption may be more applicable to other recursive problems, such as ones to be discussed shortly).  For best cache performance, we'd like to try to have nodes near each other processed by the same thread and near to each other in time so that cache hits are optimized.  But all of these solutions have been, in effect, randomly assigning nodes to threads.

One of the nice things about the tree walk example is that it's very simple; there are no pre- or post-processing steps necessary for a node, which means I don't actually need to block at the end of a parent in order to wait for its children, as I did in some of the previous examples.  The problem also allows me to start processing the children of a node before I've finished processing its parent.  Many recursive problems are not like that, however.  Consider a typical recursive sort, like quicksort:

```csharp
public static void Quicksort<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int pivot = Partition(arr, left, right);
        Quicksort(arr, left, pivot - 1);
        Quicksort(arr, pivot + 1, right);
    }
}
```

Order here definitely makes a difference.  If we think of the Partition step like the execution of the Action in the tree walk example, I'm not able to do the Partition in parallel with the processing of the children, since the processing of the children is dependent on the outcome of the call to Partition.  That's a pretty easy problem to work around, of course; I just wouldn't start the children executing asynchronously until after the Partition step is complete.  But now consider a recursive sort like merge sort:

```csharp
public static void Mergesort<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int mid = (right + left) / 2;
        Mergesort(arr, left, mid);
        Mergesort(arr, mid + 1, right);
        Merge(arr, left, mid + 1, right);
    }
}
```

The merge step at a particular node must occur after the recursive calls, which means in a parallel implementation either we'd need to block waiting for the two recursive calls to complete before starting the call to Merge, or we'd have to use a form of continuation-passing style to join on the two merge calls and have their completion start the execution of the merge operation.  Either way, this kind of processing with the ThreadPool is incredibly tricky to implement efficiently.
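To make the continuation-passing idea concrete, here's a rough sketch against the ThreadPool (hypothetical code, not from Parallel Extensions; a simple Merge is included for completeness, and exception handling is omitted):

```csharp
// Sketch of the continuation-passing alternative: rather than
// blocking, each level counts down as its two recursive sorts
// complete, and whichever finishes second runs the Merge and
// then signals its parent via the onDone continuation.
private static void MergesortAsync<T>(
    T[] arr, int left, int right, Action onDone)
    where T : IComparable<T>
{
    if (right <= left) { onDone(); return; }

    int mid = (right + left) / 2;
    int pending = 2;
    Action childDone = delegate
    {
        if (Interlocked.Decrement(ref pending) == 0)
        {
            Merge(arr, left, mid + 1, right);
            onDone();
        }
    };

    ThreadPool.QueueUserWorkItem(
        delegate { MergesortAsync(arr, left, mid, childDone); });
    MergesortAsync(arr, mid + 1, right, childDone);
}

public static void Mergesort<T>(T[] arr) where T : IComparable<T>
{
    using (var done = new ManualResetEvent(false))
    {
        MergesortAsync(arr, 0, arr.Length - 1, () => done.Set());
        done.WaitOne();
    }
}

// A simple Merge, shown only for completeness; mid is the start
// of the right half, matching the article's Merge call
private static void Merge<T>(T[] arr, int left, int mid, int right)
    where T : IComparable<T>
{
    var buffer = new T[right - left + 1];
    int i = left, j = mid, k = 0;
    while (i < mid && j <= right)
        buffer[k++] = arr[i].CompareTo(arr[j]) <= 0
            ? arr[i++] : arr[j++];
    while (i < mid) buffer[k++] = arr[i++];
    while (j <= right) buffer[k++] = arr[j++];
    Array.Copy(buffer, 0, arr, left, buffer.Length);
}
```

Notice how much bookkeeping this takes for such a small algorithm, which is exactly the difficulty the paragraph above describes.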

One of our goals when designing Parallel Extensions to the .NET Framework was to make such recursive parallel operations much easier to implement, and much easier to implement efficiently.

Let's start with PLINQ and our tree walk.  Taking the Tree<T> iterator we already implemented, processing the tree in parallel is now a cakewalk:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    GetNodes(tree).AsParallel().ForAll(action);
}
```

Under the covers, this is very much like our implementation where we created several work items, and each of those work items pulled items (under lock) from the enumerator.  PLINQ is more efficient, however, and uses varying chunk sizes (rather than always pulling just one element at a time) in an attempt to minimize the number of times the lock must be taken, hopefully reducing contention on the lock.  As with the similar previous implementation, though, this is still doing a sequential tree walk while running the actions in parallel.  If we actually want to walk the tree in parallel, we can use the Task Parallel Library:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    Parallel.Do(
        () => action(tree.Data),
        () => Process(tree.Left, action),
        () => Process(tree.Right, action));
}
```

It's almost scary how simple that is.  The Parallel.Do statement potentially runs the three provided operations in parallel, blocking until all three have completed.  A really cool thing about this though is that it will attempt to reuse the current thread as much as possible, so we don't run into the same kinds of issues that we had with the ThreadPool where we were blocking a whole slew of ThreadPool threads.  The underlying work-stealing nature of the implementation also goes a long way towards helping to address the cache locality issues we previously mentioned.  If we want more control, we can dive down below the Parallel class and explicitly use Tasks:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    Task left = Task.Create(
        delegate { Process(tree.Left, action); });
    Task right = Task.Create(
        delegate { Process(tree.Right, action); });
    action(tree.Data);
    Task.WaitAll(left, right);
}
```

The same techniques can apply to the quicksort implementation:

```csharp
public static void Quicksort<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int pivot = Partition(arr, left, right);
        Parallel.Do(
            () => Quicksort(arr, left, pivot - 1),
            () => Quicksort(arr, pivot + 1, right));
    }
}
```

as well as to the mergesort implementation:

```csharp
public static void Mergesort<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int mid = (right + left) / 2;
        Parallel.Do(
            () => Mergesort(arr, left, mid),
            () => Mergesort(arr, mid + 1, right));
        Merge(arr, left, mid + 1, right);
    }
}
```

I love how simple the parallelization of this code becomes.  However, this code still has some issues.  There is necessary overhead to an operation like Parallel.Do, which under the covers creates Tasks for the individual operations and waits on those Tasks; that overhead may not be a big deal if the work being done is significant, but for fine-grain concurrency that overhead can end up dominating the computation, potentially causing the parallel implementation to run slower than the sequential implementation.  For now, the best way to work around that is to complicate the code slightly with thresholds.  The idea behind a threshold is that you introduce enough parallelism to keep the system saturated, but once enough has been expressed, you switch over to a sequential implementation so as to avoid the extra overhead.  With the tree walk, using a depth as a threshold might look like this:

```csharp
private static void Process<T>(
    Tree<T> tree, Action<T> action, int depth)
{
    if (tree == null) return;
    if (depth > 5)
    {
        action(tree.Data);
        Process(tree.Left, action, depth + 1);
        Process(tree.Right, action, depth + 1);
    }
    else
    {
        Parallel.Do(
            () => action(tree.Data),
            () => Process(tree.Left, action, depth + 1),
            () => Process(tree.Right, action, depth + 1));
    }
}
```

I have both a sequential and a parallel implementation here, and I switch from one to the other based on how deep in the tree I am.  Depth may not always be the right gating factor to use as a threshold, especially if the tree isn't balanced.  We're working through ways to reduce the overhead we currently see for these operations in the early CTP we released in December, and we're testing out some approaches that may eliminate the need to use thresholds in certain cases.  For now, though, as you're playing around with the bits, keep thresholds in mind.  The same technique could be used with both sorting examples, as well.
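For instance, applying a size-based threshold to the earlier quicksort might look like this (a hypothetical sketch using the CTP's Parallel.Do; the 4096-element cutoff is an arbitrary assumption to tune, and a simple Partition is included for completeness):

```csharp
public static void Quicksort<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    if (right > left)
    {
        int pivot = Partition(arr, left, right);

        // Below the threshold, parallelization overhead would
        // dominate, so recurse sequentially instead
        if (right - left < 4096)
        {
            Quicksort(arr, left, pivot - 1);
            Quicksort(arr, pivot + 1, right);
        }
        else
        {
            Parallel.Do(
                () => Quicksort(arr, left, pivot - 1),
                () => Quicksort(arr, pivot + 1, right));
        }
    }
}

// A simple Partition, shown only for completeness
private static int Partition<T>(T[] arr, int left, int right)
    where T : IComparable<T>
{
    T pivot = arr[right];
    int store = left;
    for (int i = left; i < right; i++)
    {
        if (arr[i].CompareTo(pivot) <= 0)
        {
            T tmp = arr[i]; arr[i] = arr[store]; arr[store] = tmp;
            store++;
        }
    }
    T t = arr[right]; arr[right] = arr[store]; arr[store] = t;
    return store;
}
```

Partition size is a more natural gate here than depth, since quicksort's recursion can be badly unbalanced depending on the pivots chosen.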

All in all, parallelism is a very interesting problem when it comes to recursion, and we hope the Parallel Extensions to the .NET Framework go a long way towards making the merging of these two concepts significantly easier.

• Nice one! Actually, I did some similar work (http://portal.acm.org/citation.cfm?id=1272462) on parallel XML processing in C#, mainly about parallel DOM tree traversal. A thread keeps traversing the tree by manipulating a stack just as the sequential algorithm does, but other threads, when out of work, can steal work from the bottom of that stack. I think your Task-based solution has the same effect.

Also, controlling the depth of the parallel traversal is exactly right; my experiments show that it is key to the performance win of parallel XML processing.

• Cool, thanks for the link and information.

• Great article! I love the work you guys are doing with this, and am looking forward to a solid concurrency framework!

I have a question though: in one section, you mention that creating and waiting on wait handles requires a (sometimes) unacceptable number of kernel transitions.

... how does Task.WaitAll (that you reference later in the document) avoid this? Are you using a user-mode wait object in there, and each task enlists in that, or does this mechanism suffer the same issues with Wait handle creation and use?

• Thanks, Chris. Task.WaitAll can, in effect, just wait on each Task in sequence.  In many situations, waiting on a Task that hasn't started yet will actually cause the Task to be executed inline, rather than blocking waiting for it to complete execution on another thread.  In addition, even if we do need to do a true wait, we may spin a bit in user-mode first.

• An excellent article from the PFX (Parallel Framework) team on how to parallelize a recursive algorithm.

• Synchronization mechanisms based on counters and events are very complicated and hard to get right, especially for non-expert programmers.

Some proof of that is that the author overlooked a possible simplification in the code below:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to wait for all of the nodes to complete
    using (var mre = new ManualResetEvent(false))
    {
        //int count = 1;
        int count = 0;   // NOTE THIS !!!

        // Recursive delegate to walk the tree
        Action<Tree<T>> processNode = null;
        processNode = node =>
        {
            if (node == null) return;

            // Asynchronously run the action on the current node
            Interlocked.Increment(ref count);
            ThreadPool.QueueUserWorkItem(delegate
            {
                action(node.Data);
                if (Interlocked.Decrement(ref count) == 0)
                    mre.Set();
            });

            // Process the children
            processNode(node.Left);
            processNode(node.Right);
        };

        // Start off with the root node
        processNode(tree);

        // Signal that no more work items will be created
        //if (Interlocked.Decrement(ref count) == 0)
        //    mre.Set();  WE CAN DROP GIVEN STATEMENT !!!

        // Wait for all of the work to complete
        mre.WaitOne();
    }
}
```

For synchronization purposes, more transparent and easy to use are the so-called "chords" constructions (see languages such as Polyphonic C#, COmega, and MC# (www.mcsharp.net)).

• Yury, thanks, but that is not a valid simplification. Let's say the count did begin at 0, as you suggest.  We then kick off a work item to process the first node; before doing so, we increment the count to 1, and after doing so, we decrement the count and check whether it's now 0, setting the ManualResetEvent if it is.  Now, let's say the tree has 500 nodes in it, but we end up finishing the first work item before we even get to the pre-increment for the second work item.  The completion of the first item sets the event, and now Process will not wait for all of the nodes to complete before exiting.  That's why I initialize the count to 1 and then decrement it after all of the work items have been queued, so that I can be sure Process will wait for all items to complete before returning.

• > The completion of the first item sets the event, and now Process will not wait for all of the nodes to complete before exiting.

No, it's impossible, because the initial call processNode(tree); can't complete (and so we can't reach mre.WaitOne();) until Interlocked.Increment(ref count); has been performed for all nodes of the tree.

• No, it's not impossible.  Here's the sequence of events:

1) We call Process() on the root node.  The count is 0.

2) We call Interlocked.Increment on count, increasing it to 1, and we queue a work item.

3) Before the main thread that's iterating over the nodes moves on to queue a work item for the next node, the work item for the first node executes and does an Interlocked.Decrement on count, setting count to 0.  It sees that count is 0, and sets the event.

4) The main thread continues to queue work items for the rest of the nodes.

5) When it's done queueing work items, it waits on the event.  But at this point, even though all of the work items may not have completed, the event has already been set (by the completion of the first work item).

• Yes, you are right. Nevertheless, this subtle issue shows that a programmer needs simpler and more transparent synchronization constructs/tools.

Does Parallel.Do use the ThreadPool to run the provided operations? In any case, how is the possibility of deadlock avoided there?

To my view, the method

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    Parallel.Do(
        () => action(tree.Data),
        () => Process(tree.Left, action),
        () => Process(tree.Right, action));
}
```

is an abbreviation of (prone to deadlocks)

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    // Use an event to wait for the children
    using (var mre = new ManualResetEvent(false))
    {
        int count = 2;

        // Process the left child asynchronously
        ThreadPool.QueueUserWorkItem(delegate
        {
            Process(tree.Left, action);
            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        });

        // Process the right child asynchronously
        ThreadPool.QueueUserWorkItem(delegate
        {
            Process(tree.Right, action);
            if (Interlocked.Decrement(ref count) == 0)
                mre.Set();
        });

        // Process the current node synchronously
        action(tree.Data);

        // Wait for the children
        mre.WaitOne();
    }
}
```

Thanks.

• Many thanks, but, if you please, another question concerning replicable tasks in TPL. In the paper

http://msdn.microsoft.com/msdnmag/issues/07/10/Futures/default.aspx?loc=en

there is the following example of the use of a replicable task (in fact, it's a Parallel.For implementation):

```csharp
static void For(int from, int to, Action<int> body)
{
    int index = from;
    int i;
    while ((i = Interlocked.Increment(ref index) - 1) < to) {
        body(i);
    }
});
}
```

Let's look at the body of the given method more abstractly as

```csharp
int index = from;
int i;
F(i, index);  // some function
});
```

or even as

```csharp
<delegate body>
});
```

The question is: how many (ordinary) Tasks (i.e., action delegates running in other threads) will be created to perform