LINQ 101, "Parallelism Blockers," and PLINQ

LINQ 101, "Parallelism Blockers," and PLINQ

Rate This
  • Comments 5

PLINQ is a very cool technology, and I believe it will prove useful for parallelizing operations in a wide range of important scenarios.  Moreover, I believe that the programming model it provides will enable a wide-range of developers to easily take advantage of concurrency in their applications.  However, one of the risks involved in our providing such a simple programming model is that we don't currently shield developers from all of the possible issues they may run into from using it to parallelize code.

As an example, there is a set of 101 LINQ samples available at http://msdn2.microsoft.com/aa336746.aspx.  Unfortunately, many of these samples rely on implementation details and behaviors that don't necessarily hold when moving to a parallel model like the one employed by PLINQ.  In fact, some of them are dangerous when it comes to PLINQ.

On our team, we've been referring to these kinds of dangers as "parallelism blockers," and they're one of the primary reasons we're not planning to automatically replace all LINQ-to-Objects queries with PLINQ queries, instead supporting the opt-in model available through the AsParallel extension method.  We've talked about some of these issues before (see the CTP documentation for a more thorough look at some of these issues), but as a refresher, consider a few examples:

  1. In your Windows Forms application, you're using LINQ to loop over a long list of UI controls, setting properties on each or searching for one control in particular.  You decide to parallelize this with PLINQ.  Boom!  You get an InvalidOperationException, "Cross-thread operation not valid." In general, properties on UI controls can only be accessed by code running on the thread that created those controls; by using PLINQ, background threads were executing portions of the query, and those portions were accessing controls.
  2. You're using LINQ to search a list containing just a few items for an element with a property set to a specific value.  You switch over to PLINQ, and the search takes longer!  Why?  There is overhead in PLINQ to partitioning the work in order to execute on multiple threads, to merging that work back together in order to produce the output, and so forth.  If you're only dealing with a little work, PLINQ likely won't help, and it may make execution slower.  (Of course, if it's not taking much time to execute in the first place, one might question the value of parallelism for it.)
  3. You're using LINQ to search a list, and in the process, you're updating a value, such as a count.  You switch over to PLINQ, and now your count isn't reflecting the amount of work you've actually done.  Why?  LINQ was executing the updates sequentially, but PLINQ is executing them in parallel, and thus without synchronization, you have a bad race condition. (NOTE: please, please, please do not modify shared state in LINQ-to-Objects queries) 

The 101 LINQ samples are not good exemplars for PLINQ.  Consider the Query Execution category.  These samples contain code like:

int i = 0;
var q = from n in numbers select ++i;

NOOOOOOO!!!!!!!!!!!!!! With PLINQ, those increments to i will potentially happen in parallel, and they won't be atomic.  You could fix this by using Interlocked.Increment rather than ++, but then you'd likely destroy your performance.

Luckily, only a few of the 101 samples modify state.  But that doesn't mean switching over to PLINQ will produce the same results as LINQ for all of these. Almost all of the 101 samples have the potential to produce different output orderings than the LINQ versions due to lack of order preservation by default (at least in the CTP; we're reconsidering whether that is a good default, so if you have any opinion on the issue, please let us know).  For most of the samples, this doesn’t matter semantically (except for a hypothetical tool that does a straight output to output comparison between LINQ-to-Objects and PLINQ runs); for example,  the first sample lists all of the numbers from the array { 5, 4, 1, 3, 9, 8, 6, 7, 2, 0 } where the numbers are less than 5, and the second sample lists all of the sold out products from a dynamically generated list of products.  However, there are some where (depending on the intended usage) ordering may matter.  For example, the eighth sample generates the name of a number in the array (e.g. 5 to “five”); the output ordering may not match up to the input array in this case, and thus while you’re successfully generating all of the relevant outputs,  the correlation between input and output is lost (you’d either need to update the query to include in the results the original position, or you’d need to enable order preservation).  There are other sample queries that will have this same issue.

Additionally, almost all of the samples deal with very small data sets, and do very little work on each item.  It's likely that PLINQ's overhead will dominate in many of these cases, producing slower runs than if LINQ-to-Objects were used.  We're spending a lot of time now on PLINQ performance, and we're hoping to see that overhead decrease as much as possible, but even with all of the optimizations we plan to throw at it, it's likely there will still be some queries for which the non-parallel LINQ-to-Objects will still be preferred.  That's one place where we need your help.  If you have queries that you believe should be executing much faster in parallel than they are, please send them our way; we'd love to analyze them to figure out where bottlenecks may be.

Leave a Comment
  • Please add 5 and 8 and type the answer here:
  • Post
Page 1 of 1 (5 items)