Parallelizing a query with multiple “from” clauses

Parallelizing a query with multiple “from” clauses

  • Comments 4

Consider a simplified version of Luke Hoban's LINQ ray tracer

var Xs = Enumerable.Range(1, screenWidth);

var Ys = Enumerable.Range(1, screenHeight);

 

var sequentialQuery =  

from x in Xs

from y in Ys

select new { X = x, Y = y, Color = TraceRay(x, y) };

 

If the screen width is much larger than the screen height, we would choose to parallelize the computation along the screen width. This is because PLINQ would be more beneficial when the overhead to parallelize a given of piece work into n “chunks” is much lower than the running the “chunk of work” itself. Since the screen width is larger, there is more potential for creating bigger chunks of work and in turn getting a better payoff for the overhead of parallelizing as compared to the height. To change sequentialQuery to use PLINQ, we need to modify it to   

var parallelAlongWidthQuery =

from x in Xs.AsParallel() //Parallel along screen width

from y in Ys

select new { X = x, Y = y, Color = TraceRay(x, y) };

Let us look at parallelAlongWidthQuery more closely. We can derive the exact methods that will be invoked for this query by referring to the MSDN documentation on C# 3.026.7.1 Query Expression Translation. Section 26.7.1.4 explains that the parallelAlongWidthQuery will be translated to something like -

(Xs.AsParallel()).SelectMany(

x => Ys,

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

 

The AsParallel () call converts Xs to an IParallelEnumerable type. [IParallelEnumerable enables the “parallel” implementations of the standard query operators] So, this query will bind to the Parallel version of SelectMany. When we run the application, we will notice that the query executes across all available processors on the machine.

We have successfully parallelized the application!!

 

However, assume that our dataset was different to begin with and the screen height was much larger than the screen width. Using the same logic for parallelizing along the larger sized dataset, we would choose Ys this time. To achieve this, it might be tempting to change the query to

var parallelAlongHeightQuery =     

from x in Xs

from y in Ys.AsParallel()//Parallel along screen height??

select new { X = x, Y = y, Color = TraceRay(x, y) };

 

Let us apply the query translation to parallelAlongHeightQuery. Using the MSDN documentation on C# 3.0 rules, we can say that parallelAlongHeightQuery will be translated to something like -

(Xs).SelectMany(

x => Ys.AsParallel(),

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

 

Xs is of Enumerable type and the compiler will bind to the LINQ version of SelectMany and not the expected parallel version! When we run the application, we will notice that the query does not execute across all available processors on the machine.

The 2 “from” clauses are not symmetric as they seem!

A good way to force parallelization across the screen height is to specify it as the first data source -

var parallelAlongHeightQuery =     

from y in Ys.AsParallel()//Parallel along screen height

from x in Xs

select new { X = x, Y = y, Color = TraceRay(x, y) };

This will translate to something like -

(Ys.AsParallel()).SelectMany(

x => Xs,

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

Since Ys is now an IParallelEnumerable, this query will bind to the “parallel” version of SelectMany. When executed, we will notice that the query now runs across all available processors.

One more point to note is that even though SelectMany accepts mutliple datasources, PLINQ will partition work across the first datasource only. This is because it assumes that partioning across just one of the dataset will give enough parallelization and speedup.

Leave a Comment
  • Please add 5 and 8 and type the answer here:
  • Post
  • One thing I'm seeing a lot of in the PLINQ samples but I'm not getting is the use of .ToArray(). If Range is returning an Enumerable, what is the difference (outside the extra method and array alloc) between running a select from Array instead of Enumerable?

    (sorry if this is explained somewhere right in front of my face, but I've been saturating myself with all things 3.5 the last few days, and I'm sure I'm missing some stuff)

  • Good question, Keith.

    PLINQ uses the type of the data source to determine the partitioning scheme. For data sources like arrays, lists and IParallelEnumerable.Range, PLINQ can determine the size of the dataset and partition it into "ranges" of pre-calculated lengths. For a data source of type Enumerable, the size cannot be pre-determined. PLINQ uses chunk-partitioning scheme for such datasets. You can read more about Range and Chunk Partitioning at http://blogs.msdn.com/pfxteam/archive/2007/12/02/6558579.aspx

    Now, this important is to know while writing PLINQ queries because some applications might perform better with Range partitioning and some with Chunk partitioning.

    That said ToArray() in this post is probably confusing since it conflicts with the Ray Tracer code. I will edit the post to make it clear. Thanks for pointing it out!

  • PingBack from http://blog.robinzhong.com/index.php/archives/2007/12/15/232.html

  • PingBack from http://www.ekinoderm.com/wordpress/2008/11/closures/

Page 1 of 1 (4 items)