This post shows a simple way to write code that takes advantage of multiple processors. You will see that LINQ queries can allow you to side step the difficult tasks normally involved in writing multi-threaded code. To get started, all you need is a little basic knowledge of how to write simple LINQ queries.
The code shown in this post uses a pre-release version of PLINQ called the Microsoft Parallel Extensions to .NET Framework 3.5. When PLINQ finally ships, it will run only on .NET 4.0 or later. The version I'm using that runs on top of 3.5 is for evaluation purposes only. There will never be a shipping version that runs on .NET 3.5.
This LINQ provider is being created at Microsoft by the Parallel Computing team; it is not the work of the C# team that created LINQ to Objects and LINQ to SQL. Here is the website for the Parallel Computing team:
http://msdn.microsoft.com/en-us/concurrency/
At the time of this writing, these extensions were available only in pre-release form. You could download them either as Visual Studio 2008 compatible extensions to .NET 3.5, or as part of the pre-release version of Visual Studio 2010. Since the download sites might change over the coming months, I suggest that you find these resources by going to the Parallel Computing site, or to the Visual Studio site:
http://msdn.microsoft.com/en-us/vs2008
Parallel LINQ, or PLINQ, is only a small part of the Parallel Extensions to the .NET Framework. It is, however, an important part. Since it is a simple and natural extension of the LINQ syntax, I think developers familiar with that technology will find it easy to use.
Consider this code:
var list = Enumerable.Range(1, 10000); var q = from x in list.AsParallel() where x < 3300 select x; foreach (var x in q) { Console.WriteLine(x); }
These lines look nearly identical to the code you have seen in many simple LINQ samples. The only significant difference is the call to AsParallel at the end of the first line. Though we have often used type inference to hide the return type of a LINQ query, I'm going to pause and take a second look at this instance. Rather than returning IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:
IParallelEnumerable<int> q = from x in list.AsParallel() etc….
In the near future, PLINQ queries of this type will probably return ParallelQuery<int>. Because this product is still evolving, it might be simplest to use var, at least during the pre-release phase, and let the compiler choose the type. That way you can save typing, avoid problems with anonymous types, and you need not concern yourself about changes in the API as the product develops. It is almost always appropriate to use var to designate the return type of a LINQ query, and there are only special circumstances when you would do otherwise.
Here are the results from this first PLINQ query:
2 1 3 4 6 512 5 7 513 8 12 514 9 13 515 10 14 516 11 15 517 16 72 518 17
The numbers shown here are in a relatively random order because they are being returned from different threads. It is important to remember that the sequence of values returned by LINQ is not always guaranteed to be presented in a particular order. If Order is important in your code, you can add a call to AsOrdered to the query after the call to AsParallel. Alternatively, you could insert a GroupBy clause to establish the desired ordering. Otherwise developers should assume that the ordering from a PLINQ query will be entirely random
Now that you understand the basics of Parallel LINQ, let’s move on to look at a more interesting example. Improved performance is the main reason to write code that can run in parallel. The program shown in this post uses a timer to demonstrate how PLINQ can improve performance in a program.
Performance improvements become more evident when our code has access to more processors. The code I show here runs faster on a two processor machine, but it really starts to come into its own on a four processor machine. Moving up to even more processors yields more powerful results. Here, for instance, are the results showing an improvement of 1.33 times when using two processors, and almost two times when using 4 processors:
2 Processors = 1.44 x improvement: Linear: 00:00:13.15 Parallels: 00:00:09.10 4 Processors = 1.96 x improvement: Linear: 00:00:15.00 Parallel: 00:00:07.68
These tests are being running against pre-release software, so these numbers are almost certain to change before release, and of course different machines will yield different results. Furthermore, the degree of improvement that you see is likely to change depending on the type of algorithm you run, the number of cores on your machine, the architecture of the machine, how many caches there are and how they’re laid out, etc. Though it is rare, some queries show superlinear performance enhancements. In other words, there is a greater than 4x speedup on a 4-core box. An improvement of 2 times, such as the one shown, or even a 3 time improvement, is common.
This sample program is called FakeWeatherData, and it is available for download from the LINQ Farm on Code Gallery. It features a simple LINQ to XML query run against a file with 10,000 records in it. The data I'm querying is not real, but consists of random dates and temperatures generated by a simple algorithm included in the FakeWeatherData program.
The XML file is structured like this:
<?xml version="1.0" encoding="utf-8" ?> <Samples> <Sample> <Year>1973</Year> <Month>May</Month> <Day>15</Day> <Temperature>10</Temperature> </Sample> <Sample> <Year>1970</Year> <Month>Feb</Month> <Day>10</Day> <Temperature>14</Temperature> </Sample> <Sample> <Year>1970</Year> <Month>Jan</Month> <Day>15</Day> <Temperature>11</Temperature> </Sample>
... Many lines of code omitted here
</Samples>
There is also a simple C# class used by the program to encapsulate the data from the XML file:
class WeatherData { public string Year { get; set; } public string Month { get; set; } public string Day { get; set; } public string Temperature { get; set; } }
The parallel version of the query in the program looks like this:
for (int i = 0; i < NUM_REPS; i++) { var list = (from x in doc.Root.Elements("Sample").AsParallel() where x.Element("Year").Value == "1973" && x.Element("Month").Value == "Apr" && x.Element("Day").Value == "15" select new WeatherData { Day = x.Element("Day").Value, Month = x.Element("Month").Value, Temperature = x.Element("Temperature").Value, Year = x.Element("Year").Value }).ToList(); }
Accompanying this code is a similar LINQ query that does not use PLINQ
for (int i = 0; i < NUM_REPS; i++) { var list = (from x in doc.Root.Elements("Sample") where x.Element("Year").Value == "1973" && x.Element("Month").Value == "Apr" && x.Element("Day").Value == "15" select new WeatherData { Day = x.Element("Day").Value, Month = x.Element("Month").Value, Temperature = x.Element("Temperature").Value, Year = x.Element("Year").Value }).ToList(); }
The program queries the data in the XML file first using the Parallel code, then using standard LINQ. By comparing the time it takes each block of code to execute you can get a sense of the relative improvement available through PLINQ. I'll show you how to make such comparisons in just a moment. I will also discuss some tools that will become available to help profile code of this type.
You can see that the PLINQ query contains a call to AsParallel, while the other query does not. Other than that the two queries are identical. The fact that the two queries look so much alike points to a primary strength of PLINQ: very little specialized knowledge is necessary in order to begin using it. This does not mean that the subject is trivial, but only that the barrier to entry is low. This is not the case with most concurrent programming models.
LINQ queries are designed to be read-only, working with immutable data. This is a good model for parallelism, because it makes it unlikely that data will mutate, thereby setting up the potential for a race condition. You should note, however, that PLINQ does nothing to prevent this from happening, it is simply that LINQ is designed to make it unlikely.
Note also that the declarative LINQ programming style ensures that developers specify what they want done, rather than how it should be done. This leaves PLINQ free to ensure that concurrent LINQ queries run in the safest manner possible. If LINQ had been defined more strictly, such that it had to process each element in a certain order, then the PLINQ team would have had a much more difficult task.
The code in both these queries pulls out only the records from the XML file that have their date set to April 15, 1973. Because of deferred execution, the query would not do anything if I did not call ToList(). As a result, I added that call and converted the result into a List<WeatherData>. Though hardly earthshaking in import, these calls ensure that the code actually does something, and thus gives PLINQ scope to take advantage of the multiple processers on your system.
Simple timers are created to measure the difference between the standard LINQ query and the PLINQ query. I've also used a method used in many of Parallel LINQ team's samples for displaying the time elapsed during a test run:
private static void RunTest() { XDocument doc = XDocument.Load("XMLFile1.xml"); Stopwatch sw = new Stopwatch(); sw.Start(); LinqOrdinarie(doc); sw.Stop(); ShowElapsedTime("Ordinaire", sw.Elapsed); sw.Reset(); sw.Start(); ParallelLinq(doc); sw.Stop(); ShowElapsedTime("Parallels", sw.Elapsed); }
private static TimeSpan ShowElapsedTime(string caption, TimeSpan ts) { string elapsedTime = String.Format("{0}: {1:00}:{2:00}:{3:00}.{4:00}", caption, ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10); Console.WriteLine(elapsedTime, "RunTime"); return ts; }
At least with the pre-release version of PLINQ that I've played with, I've found it very useful to set up timers to confirm that PLINQ is actually able to speed up an operation. My record at guessing which code will benefit from running in parallel is not good, and so I find that confirming the effectiveness of the code by explicitly measuring it is worthwhile. You can either use the simple StopWatch class from the System.Diagnostics namespace, as shown here, or else you can use a profiler. Note that a thread aware profiler might ship with some versions of Visual Studio 2010.
I've found that the advantages of concurrent LINQ become more obvious the longer the operation I'm timing lasts. As a result, I've placed the query inside a loop, and added a variable to the program called NUM_REPS. By setting NUM_REPS to a large number, say 500, you can clearly see the benefits that can be accrued when you run LINQ queries in parallel on multiple processors. Note that the first time PLINQ is used, its assembly will need to be loaded, the relevant types will need to be JIT compiled, and new threads will need to be spun up, etc. As a result, many developers see improved performance after they get past the initial warm-up time.
Though it is very easy to get started with PLINQ, there are still complexities inherent in the subject that you need to consider. For instance, PLINQ will sometimes develop a different partitioning scheme for your data depending on whether you are working with an Enumerable or an Array. To learn more about this subject, see the following post from the Parallel Programming team:
http://blogs.msdn.com/pfxteam/archive/2007/12/02/6558579.aspx
The simple PLINQ examples shown in this post should help you get started with this powerful and interesting technology. Parallel LINQ is still in its infancy, but already it provides means of greatly simplifying tasks that are not normally easy to perform.
You've been kicked (a good thing) - Trackback from DotNetKicks.com
I hope that the C# team will consider building this in behind the scenes. It seems to me that adding the parallel functionality to IEnumerable, ToList etc would not only make upgrades easier but further simplify the transition for developers.
I second Adam's comment - why not make it the default?
If your machine only has one processor, it'll be a linear operation, but on multi-core or multi-cpu, you'll get the benefits without having to write code to specifically target it.
Are there any gotchas that means it couldn't be the default? (other than the ParallelQuery<> return type..)
Hi Charlie,
an interesting article (as always). At the PDC 2008, I had a nice talk with the PM of the PLINQ team (who gave a really great presentation on the technology - of which most had already been made presented before, but still, it was nice to get the full overview and thoughts).
When we look at the typical examples today for parallel LINQ, most use fairly long working sets (including those presented in this post). The common scenarios I face in the ECM industry(enterprise content management), the working sets are comprised of fewer items - however, having each item require more complex processing (thus, still a canditate for parallelization).
I think it's important to communicate that parallelization may be of great use, even though you're facing smaller working sets of, say, 16 or 32 items. I overhead a few conversations at the PDC where other developers were unsure whether PLINQ would make a difference.
So, I guess the best answer is: "measure measure measure" (stopwatch approach or profiling - of which the latter usually yields interesting results because the hotspots are not always where you'd think).
With all the cores available, it would be really nice to see the PLINQ library have a lightning fast start-up, so that smaller and smaller working sets (with low processing requirements) can be run in parallel.
Perhaps a future CLR (version 5.0) would be able to make assumptions like this and parallelize when it makes sense. This would remove the requirement of explicit using the .AsParallel() etc. operators (although I personally find it important that this part of the code is explicit).
Charlie Calvert published a blog entry on the subject. It is a nice, quick read on PLinQ. Check it out:
I hope this in behind
I can't believe that's don't work by default, have no sense.
Any plans of adding stuff from the CCR (http://msdn.microsoft.com/en-us/library/bb905450.aspx) in this library?
This statement is incorrect:
Rather than returning IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:
IParallelEnumerable<T> ^is^ an IEnumerable<T> as it derives from IEnumerable<T>.
Hi Guys,
Regarding;
>“I hope that the C# team will consider building this in behind the scenes. It seems to me that adding the parallel functionality to IEnumerable, ToList etc would not only make upgrades easier but further simplify the transition for developers.”
>“I second Adam's comment - why not make it the default?”
There are so many exciting technologies targeting the parallel computing problem space it is not yet clear what the best approach or solution is going to be. Anders has said as much in a number of interviews/posts.
PLINQ may turn out to be the solutions to all parallel programming problems, or they might just replace the CLR with a PCLR! ;)
Cheers,
James
RE: "turn on by default" comment:
If your query mutates shared state (which is easily done) then PLINQ by default would yield incorrect results. Same goes for ordering expectations (if you have any). So, right now you need to opt-in for good reasons – only you know if the semantics will be preserved while gaining the speedup...
RE: CCR
We are looking at bringing a similar programming model into a future version of .NET, but that will not happen in v4.
Cheers
Daniel
Thanks everyone for the great comments here.
James is correct. Anders H. has indeed said that the team is looking at all of these technologies, and will probably incorporate some of them into C# proper, once there has been a chance to see what solutions work best in which circumstances. That might take time.
Daniel Moth also raises good points about the issues surrounding mutable and immutable code and ordering expectations. It is not always possible to reliably detect these issues at compile time within a reasonable period of time, and that sets some restraints on our (and your) ability to detect which code is a good candidate for parallelism.
The comments by Anders Borum about the importance of measuring your code to see if it is yielding the results you expect are very relevant. The startup time issue, as is often the case with .NET code, is one that needs to be watched. The team, of course, is looking at this issue carefully.
Finally, remember that PLINQ sets the entry bar to writing concurrent code very low. This is just an introductory article to help you get started. As the comments shown here demonstrate, however, this is complex subject that deserves careful study. How does Andre Gide's old saying go: "Please don't understand me too quickly...." Use this article to help you get started, but continue to explore this subject as it develops over the coming months.
Its pretty obvious why this isnt made default, really. For example.
int i = 0;
foreach(int x in list)
x = i++;
this simple code would screw up pretty much since the elements doesnt come in a given order.
I am highly anticipating this. I can see this really improving our ability to handling multi-threaded collections. It is not like it is such a hard task without it but having much leaner code is always a good thing.
#.think.in infoDose #12 (15th Dec - 19th Dec)
It's nice to see some examples of LINQ and PLINQ queries....I mean this was a very interesting reading for me...I mean I knew very little about LINQ..but after reading your post I got a lot clearer picture about what LINQ really is..
Erik Cox
http://www.notionsolutions.com