It’s been over six months since the Parallel Extensions to .NET Framework 3.5, June 2008 CTP release, and I’ve been wanting to play around with that stuff for a while. It’s all shipping in .NET Framework 4.0 and is considered by Soma to be a key cloud-enabling technology. So I finally jumped in and decided to “parallelize” the reaction-diffusion visualizer I discussed in Using WriteableBitmap to Display a Procedural Texture.

Here’s a snapshot:

Reaction-diffusion visualizer

My first implementation used a single worker thread to compute each frame. Here’s how the single-thread implementation uses the CPU resources on my dual-core Inspiron laptop. Frame time averages about 127 ms.

One processor is at nearly 100% usage, but the other is underutilized.

To parallelize my reaction-diffusion visualizer, I simply replaced the outer for loop with a Parallel.For method call:

```
//for (int i = 1; i < vesselHeight - 1; i++)
Parallel.For(1, vesselHeight - 1, i =>
{
    for (int j = 1; j < vesselWidth - 1; j++)
    {
        c = -W1 / weight(2, i, j, reaction);
        f = -W2 * weight(1, i, j, reaction);

        e_to_c = Math.Exp(c);
        e_to_f = Math.Exp(f);
        d = 1.0 + K1 * e_to_c;
        g0 = (K1 * K2 * e_to_c * e_to_f) / d;

        Xc = b * (g0 / (g0 + 1));
        Xb = (K1 * e_to_c / (1 + K1 * e_to_c)) * (b - Xc);
        Xa = b - Xb - Xc;

        // The out buffer is shared among processors/threads.
        reactionOut[2, i, j] = Xc;
        reactionOut[1, i, j] = Xb;
        reactionOut[0, i, j] = Xa;
    }
});
```

Of course, it can’t be quite this simple. Here’s the output after a few frames:

The problem, of course, is that access to the out buffer, reactionOut, is not synchronized (Not so! See UPDATE below). I can put the inner loop inside a lock, and this produces correct behavior again, but the frame rate is actually slower than in the single-thread case.
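To illustrate why the lock version is correct but slow, here is a minimal sketch rather than the visualizer's actual code. The names `LockedLoopSketch`, `Cell`, `gate`, and `temp` are hypothetical stand-ins: `Cell` replaces the real chemistry, and `temp` plays the role of the temporaries declared outside the loop.

```csharp
using System;
using System.Threading.Tasks;

public static class LockedLoopSketch
{
    // Hypothetical stand-in for the per-cell computation.
    public static double Cell(int i, int j) => Math.Exp(-(i + j) / 100.0);

    public static double[,] ComputeWithLock(int height, int width)
    {
        var output = new double[height, width];
        var gate = new object();
        double temp = 0.0; // declared outside the lambda, so all threads share it

        Parallel.For(0, height, i =>
        {
            // Only one row can be inside this block at a time. That protects
            // the shared temporary, but the rows now run one after another.
            lock (gate)
            {
                for (int j = 0; j < width; j++)
                {
                    temp = Cell(i, j);
                    output[i, j] = temp;
                }
            }
        });
        return output;
    }

    public static void Main()
    {
        double[,] frame = ComputeWithLock(64, 64);
        Console.WriteLine(frame[3, 5] == Cell(3, 5)); // True
    }
}
```

Because the lock wraps the whole inner loop, the work is serialized and the lock and scheduling costs come on top, which is consistent with the slower frame rate.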

Fortunately, it’s easy to solve the problem without the performance penalty caused by lock overhead. I factored the inner loop code into a ComputeConcentrations method – I hadn’t done this before, because I didn’t want the method-call overhead.
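The factored method might look roughly like the sketch below. The `Concentrations` type and its `_Xa`, `_Xb`, `_Xc` fields come from the calling code in this post; the constant values and the body of `weight` are assumptions made up for illustration, since the real ones are not shown here. The point is that every temporary is now a local of the method, so each call gets its own copy.

```csharp
using System;

public struct Concentrations
{
    public double _Xa, _Xb, _Xc;
}

public static class ReactionSketch
{
    // Assumed values; the real constants are not given in the post.
    const double W1 = 1.0, W2 = 1.0, K1 = 2.0, K2 = 3.0, b = 1.0;

    // Hypothetical stand-in for weight(): it just returns the stored value.
    static double weight(int species, int i, int j, double[,,] reaction)
    {
        return reaction[species, i, j];
    }

    public static Concentrations ComputeConcentrations(int i, int j, double[,,] reaction)
    {
        // All temporaries are locals, so nothing is shared between calls.
        double c = -W1 / weight(2, i, j, reaction);
        double f = -W2 * weight(1, i, j, reaction);

        double e_to_c = Math.Exp(c);
        double e_to_f = Math.Exp(f);
        double d = 1.0 + K1 * e_to_c;
        double g0 = (K1 * K2 * e_to_c * e_to_f) / d;

        double Xc = b * (g0 / (g0 + 1));
        double Xb = (K1 * e_to_c / (1 + K1 * e_to_c)) * (b - Xc);
        double Xa = b - Xb - Xc;

        return new Concentrations { _Xa = Xa, _Xb = Xb, _Xc = Xc };
    }

    public static void Main()
    {
        // A tiny 3-species, 4x4 vessel filled with an arbitrary value.
        var reaction = new double[3, 4, 4];
        for (int s = 0; s < 3; s++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    reaction[s, i, j] = 0.5;

        Concentrations r = ComputeConcentrations(1, 1, reaction);
        Console.WriteLine(r._Xa + r._Xb + r._Xc); // the three fractions sum to b
    }
}
```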

```            //for (int i = 1; i < vesselHeight - 1; i++)
Parallel.For(1, vesselHeight - 1, i =>
{
for (int j = 1; j < vesselWidth - 1; j++)
{
Concentrations c = ComputeConcentrations(i, j, reaction);
reactionOut[2, i, j] = c._Xc;
reactionOut[1, i, j] = c._Xb;
reactionOut[0, i, j] = c._Xa;
};
}); ```

Now the output is correct and both cores are fully engaged. Frame time averages about 83 ms, which is a 35% improvement.

Now I need to get my hands on a quad-core machine.

UPDATE: Stephen Toub, lead PM for our concurrency development platform team, kindly reviewed my code and corrected my misconception about access to the out buffer. In fact, the problem is with closures: I had declared all my temporary variables outside the outer loop, so the lambda captured them and every thread was sharing the same variables, which is almost never what you want. The colors were pretty, though.
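A minimal sketch of the difference, under the assumption that a single captured temporary is enough to show the effect. `ClosureSketch`, `Cell`, `SharedTemp`, `LocalTemp`, and `CountWrong` are hypothetical names, not part of the visualizer. Each thread writes a distinct element of the output array, so the array itself needs no synchronization; only the captured temporary is contended.

```csharp
using System;
using System.Threading.Tasks;

public static class ClosureSketch
{
    // Hypothetical stand-in for the per-cell computation.
    public static double Cell(int i, int j) => Math.Exp(-(i + j) / 100.0);

    // Racy: temp is captured by the lambda, so one copy is shared by all threads.
    public static double[,] SharedTemp(int height, int width)
    {
        var output = new double[height, width];
        double temp = 0.0;
        Parallel.For(0, height, i =>
        {
            for (int j = 0; j < width; j++)
            {
                temp = Cell(i, j);   // another thread may overwrite temp...
                output[i, j] = temp; // ...before it is read back here
            }
        });
        return output;
    }

    // Safe: temp is declared inside the lambda, so each invocation has its own.
    public static double[,] LocalTemp(int height, int width)
    {
        var output = new double[height, width];
        Parallel.For(0, height, i =>
        {
            for (int j = 0; j < width; j++)
            {
                double temp = Cell(i, j);
                output[i, j] = temp;
            }
        });
        return output;
    }

    // Counts cells that differ from a direct, sequential evaluation.
    public static int CountWrong(double[,] output)
    {
        int wrong = 0;
        for (int i = 0; i < output.GetLength(0); i++)
            for (int j = 0; j < output.GetLength(1); j++)
                if (output[i, j] != Cell(i, j)) wrong++;
        return wrong;
    }

    public static void Main()
    {
        // The shared version may or may not show errors on a given run.
        Console.WriteLine("shared temp, wrong cells: " + CountWrong(SharedTemp(512, 512)));
        Console.WriteLine("local temp, wrong cells:  " + CountWrong(LocalTemp(512, 512)));
    }
}
```

Moving the declarations inside the lambda has the same effect as factoring the inner loop into a method: either way, the temporaries become private to each iteration.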

UPDATE 2: Here’s the same code running on my 3.6 GHz quad-core. Average frame time is 62 ms, slightly less than half the frame time of the 2 GHz single-thread case.

UPDATE 3: The code is now posted on my Code Gallery page: WPF and Parallel .NET.