<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx</link><description>We've received several questions on the MSDN Forums for Parallel Extensions about the performance of the Parallel class, and specifically of the loop constructs we provided in the CTP. We're very much aware that the performance of Parallel.For/ForEach</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#9257103</link><pubDate>Tue, 30 Dec 2008 17:14:34 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9257103</guid><dc:creator>Nigel Findlater</dc:creator><description>&lt;p&gt;Hallo, &lt;/p&gt;
&lt;p&gt;I am experimenting with the CTP and tried benchmarking the ParallelFor against a normal for loop:&lt;/p&gt;
&lt;p&gt;NonThreaded Total = 2.66666668666615E+25 In 1828.1601 ms&lt;/p&gt;
&lt;p&gt;Threaded Total = &amp;nbsp; &amp;nbsp;2.66666668666614E+25 In 4516.0957 ms&lt;/p&gt;
&lt;p&gt;NonThreaded Total = 2.66666668666615E+25 In 1765.6589 ms&lt;/p&gt;
&lt;p&gt;Threaded Total = &amp;nbsp; &amp;nbsp;2.66666668666614E+25 In 4781.7258 ms&lt;/p&gt;
&lt;p&gt;Can you see what I am doing wrong, or suggest a better way of doing this?&lt;/p&gt;
&lt;p&gt;Thanks...&lt;/p&gt;
&lt;p&gt;Nigel...&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; public void UnThreadedDemo()&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double a = 10;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double b = 20;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double c = 30;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double Total = 0;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;int Itterations = 200000000;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;DateTime Start = DateTime.Now;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;for (int x = 0; x &amp;lt; Itterations; x++)&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Total += a * x * x + b * x + c;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;DateTime Finish = DateTime.Now;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;TimeSpan delta = Finish.Subtract(Start);&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Console.WriteLine(&amp;quot;NonThreaded Total = &amp;quot; + Total + &amp;quot; In &amp;quot; + delta.TotalMilliseconds + &amp;quot; ms&amp;quot;);&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;static void ParallelFor(&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;int fromInclusive, int toExclusive, Action&amp;lt;int&amp;gt; body)&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;int index = fromInclusive;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Task.Create(delegate&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;int i;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;while ((i = Interlocked.Increment(&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ref index) - 1) &amp;lt; toExclusive)&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;body(i);&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;},&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;TaskCreationOptions.SelfReplicating).Wait();&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;object locker = new object();&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;public void ThreadedDemo()&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double a = 10;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double b = 20;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double c = 30;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double Total = 0;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;int Itterations = 200000000;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;DateTime Start = DateTime.Now;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;//Parallel.For(0, Itterations, x =&amp;gt; { Total += a * x * x + b * x + c; });&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ParallelFor(0, Itterations, x =&amp;gt; { &lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;lock(locker)&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Total += a * x * x + b * x + c; &lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;});&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;DateTime Finish = DateTime.Now;&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;TimeSpan delta = Finish.Subtract(Start);&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Console.WriteLine(&amp;quot;Threaded Total = &amp;quot; + Total + &amp;quot; In &amp;quot; + delta.TotalMilliseconds + &amp;quot; ms&amp;quot;);&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp;}&lt;/p&gt;
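&lt;p&gt;A possible answer to the question above, offered as a sketch: taking a lock on every iteration serializes the loop body, so the threaded version does the same work as the sequential one plus synchronization overhead. Assuming the Parallel.For overload with thread-local state that shipped in .NET 4 (System.Threading.Tasks, which postdates the CTP used above), each worker can keep its own subtotal and take the lock only once when it finishes. The class name LocalSumDemo and the switch from DateTime.Now to Stopwatch are choices made for this sketch only.&lt;/p&gt;
&lt;pre&gt;
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class LocalSumDemo
{
    static void Main()
    {
        double a = 10, b = 20, c = 30;
        int iterations = 200000000;
        double total = 0;
        object locker = new object();

        Stopwatch watch = Stopwatch.StartNew();
        Parallel.For(0, iterations,
            // Each worker task starts with its own private subtotal.
            () => 0.0,
            // The body only touches the private subtotal, so no lock is needed here.
            (x, loopState, subtotal) => subtotal + a * x * x + b * x + c,
            // Runs once per worker task: merge the subtotal under the lock.
            subtotal =>
            {
                lock (locker) { total += subtotal; }
            });
        watch.Stop();

        Console.WriteLine("Threaded Total = " + total + " In "
            + watch.Elapsed.TotalMilliseconds + " ms");
    }
}
```
&lt;/pre&gt;
&lt;p&gt;Because the subtotals are combined in a different order than in the sequential loop, the last digits of the double result may differ slightly, much as the two totals in the output above already do.&lt;/p&gt;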
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9257103" width="1" height="1"&gt;</description></item><item><title>More on self-replicating tasks</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8353392</link><pubDate>Thu, 03 Apr 2008 13:26:47 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8353392</guid><dc:creator>Pedram Rezaei's Ramblings</dc:creator><description>&lt;p&gt;Some more stuff to remember when dealing with self-replicating tasks. (See my earlier post for an introduction&lt;/p&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8353392" width="1" height="1"&gt;</description></item><item><title>Catching up from -30C. Blew, CHESS, .NET Framework Treemap&amp;#8217;s and More &amp;laquo; Tales from a Trading Desk</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8336729</link><pubDate>Wed, 26 Mar 2008 01:58:31 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8336729</guid><dc:creator>Catching up from -30C. Blew, CHESS, .NET Framework Treemap’s and More « Tales from a Trading Desk</dc:creator><description>&lt;p&gt;PingBack from &lt;a rel="nofollow" target="_new" href="http://mdavey.wordpress.com/2008/03/25/catching-up-from-30c-blew-chess-net-framework-treemaps-and-more/"&gt;http://mdavey.wordpress.com/2008/03/25/catching-up-from-30c-blew-chess-net-framework-treemaps-and-more/&lt;/a&gt;&lt;/p&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8336729" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8236957</link><pubDate>Sun, 16 Mar 2008 00:46:25 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8236957</guid><dc:creator>Stephen Toub - MSFT</dc:creator><description>&lt;p&gt;Gordon, absolutely. &amp;nbsp;In addition to interlocked/delegate invocation overhead and load balancing issues and the like, there are certainly other forms of overhead that we need to factor in when deciding on the best general implementation for our loops, and one of these factors is locality. &amp;nbsp;Not just for disk I/O, but also for things like cache and memory. &amp;nbsp;You should definitely see this improve in future releases. &amp;nbsp;Of course, there are always counter examples one can create where locality of index may translate into poor memory/cache locality; we want to optimize for the most common use cases (which is why feedback from folks like yourself at this stage of the game is so important), but still provide a mechanism for folks to get the performance they need in other cases, even if it means writing a bit more code on top of lower-level primitives we provide (like replicating tasks).&lt;/p&gt;
&lt;p&gt;I'm glad you found the sample useful! &amp;nbsp;I know from your forum posts that you were interested in thread-local state as well, which this sample doesn't provide. &amp;nbsp;I'll add that to the list of future blog posts to write ;)&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8236957" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8236760</link><pubDate>Sat, 15 Mar 2008 23:59:15 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8236760</guid><dc:creator>Gordon Watts</dc:creator><description>&lt;p&gt;Hi,&lt;/p&gt;
&lt;p&gt; &amp;nbsp;One other thing I find myself dealing with in my Parallel.For. The system will perform the best as long as the various threads are working on data that is close together in the index variable. For example, as index is incremented it causes some data to be read in. If there is a lot of skipping around (because the STRIPE size is large), then that will cause unnecessary I/O.&lt;/p&gt;
&lt;p&gt; &amp;nbsp;Another edge case to consider.&lt;/p&gt;
&lt;p&gt; &amp;nbsp;This was also a very nice sample of how to code up my own loops, which I might do to see if I can further tune the behavior.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8236760" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8233096</link><pubDate>Sat, 15 Mar 2008 19:32:32 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8233096</guid><dc:creator>Stephen Toub - MSFT</dc:creator><description>&lt;P&gt;Thanks, nobody. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think the confusion here is stemming from the fact that none of the examples in this blog post create a delegate for each iteration: only one delegate is instantiated and provided to the various Parallel* methods. &amp;nbsp;So the issue isn't the creation of one delegate per iteration, it's the invocation of one delegate per iteration, which both the Parallel.For method and your suggested ForXXX method incur.&lt;/P&gt;
&lt;P&gt;Regarding load balancing, even with small body sizes there's still a need for load balancing. &amp;nbsp;There are lots of other things happening on your computer while your application using Parallel.For is running; the OS is doing processing, the shell is doing processing, your virus scanner may kick in, network packets are being received and analyzed, and so on (my laptop right now has 85 processes running). &amp;nbsp;All of those require processors, the same processors that are being used to run the work of your parallel loop. &amp;nbsp;If one of those processors is chosen to do even slightly more extraneous work than other processors (which is extremely likely), load balancing becomes important, so that one processor doesn't sit there twiddling its thumbs without work to do waiting on the other processors to finish their work associated with the parallel loop.&lt;/P&gt;
&lt;P&gt;In my ForRange example, the user of ForRange doesn't need to create their own range. &amp;nbsp;Rather than the body being passed a single iteration value i, it's passed a range from-&amp;gt;to, and so rather than processing a single iteration, the body just loops from from-&amp;gt;to processing each iteration in that range. &amp;nbsp;If the average range size provided to the delegate is N, that cuts down on the number of delegate invocations to execute iterations by&amp;nbsp;a factor of approximately N&amp;nbsp;(at the expense, as mentioned, of some load balancing opportunities). &amp;nbsp;Internally, the developer for ForRange could choose from a variety of techniques to decide how big the ranges should be. &amp;nbsp;What I've implemented above is a simple "take the next 8 elements" approach, but one could definitely implement more sophisticated logic.&lt;/P&gt;
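&lt;P&gt;The range idea described above can be approximated in released versions of .NET with Partitioner.Create from System.Collections.Concurrent. That is an assumption of this sketch and not the ForRange sample from the post: the body receives a range as a Tuple and loops over it, so there is one delegate invocation per range instead of one per iteration. RangeLoopDemo, count, and squares are illustrative names.&lt;/P&gt;
&lt;pre&gt;
```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class RangeLoopDemo
{
    static void Main()
    {
        int count = 1000000;
        double[] squares = new double[count];

        // Partitioner.Create(from, to) hands out Tuple(int, int) ranges,
        // where Item1 is inclusive and Item2 is exclusive.
        Parallel.ForEach(Partitioner.Create(0, count), range =>
        {
            // One delegate invocation processes every index in the range.
            for (int i = range.Item1; i != range.Item2; i++)
            {
                squares[i] = (double)i * i;
            }
        });

        Console.WriteLine(squares[count - 1]);
    }
}
```
&lt;/pre&gt;
&lt;P&gt;Partitioner.Create also has an overload that takes a rangeSize argument, so Partitioner.Create(0, count, 8) would roughly mirror the fixed chunk of 8 elements mentioned above.&lt;/P&gt;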
&lt;P&gt;Does that help?&lt;/P&gt;
&lt;P&gt;Thanks, again, for your suggestions! &amp;nbsp;We appreciate them, so please keep it up :)&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8233096" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8198350</link><pubDate>Fri, 14 Mar 2008 11:23:40 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8198350</guid><dc:creator>nobody</dc:creator><description>&lt;p&gt;&amp;gt;Is this a performance suggestion, or is there a use &lt;/p&gt;
&lt;p&gt;&amp;gt;case where you actually need the iteration space &lt;/p&gt;
&lt;p&gt;&amp;gt;divided evenly like this?&lt;/p&gt;
&lt;p&gt;It is a performance suggestion to minimize delegate creation and interlocking for cases where the loop body does a trivial amount of work.&lt;/p&gt;
&lt;p&gt;&amp;gt; whereas the ForXXX you showed (just like Parallel.For) &lt;/p&gt;
&lt;p&gt;&amp;gt;requires a delegate invocation for each iteration.&lt;/p&gt;
&lt;p&gt;It looks like I wasn't clear. &amp;nbsp;The example was meant to suggest that ForXXX would create only four delegates, one for each thread, not one for each iteration. &amp;nbsp;Since the method body is trivial there really isn't a need for load balancing, though it could be tweaked a bit: maybe NUM_THREAD * N partitions are created, where N is very small, say 2 or 4 (so NUM_THREAD * N delegates would be created). &amp;nbsp;The idea was to remove the need for users to define their own ranges, which is how I understand your ForRange example. &amp;nbsp;The ForXXX variant would calculate the ranges automatically based on the number of threads being used.&lt;/p&gt;
&lt;p&gt;This variant would come in handy when you are doing simple operations on elements of a large array.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8198350" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8190331</link><pubDate>Fri, 14 Mar 2008 02:43:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8190331</guid><dc:creator>Stephen Toub - MSFT</dc:creator><description>&lt;p&gt;nobody:&lt;/p&gt;
&lt;p&gt;Thanks for the suggestion! &amp;nbsp;Is this a performance suggestion, or is there a use case where you actually need the iteration space divided evenly like this? &amp;nbsp;Note that while the implementation I showed previously is similar to the implementation in the CTP, the current implementations we're working on are different from all of these, optimizing at the same time for issues such as overhead, load balancing, etc. Underlying partitioning aside, your ForXXX example and the ParallelForRange example I showed differ in that the ParallelForRange limits the number of delegate invocations to one per chunk, whereas the ForXXX you showed (just like Parallel.For) requires a delegate invocation for each iteration.&lt;/p&gt;
&lt;p&gt;Frank:&lt;/p&gt;
&lt;p&gt;Also thanks for the suggestion. &amp;nbsp;This is similar to CallbackMayRunLong and WT_EXECUTELONGFUNCTION in the Windows thread pools, and we have been and are considering implementing hints somewhat similar to this for Task and the higher-level abstractions built on top of it.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8190331" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8182093</link><pubDate>Thu, 13 Mar 2008 20:07:38 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8182093</guid><dc:creator>Frank Levine</dc:creator><description>&lt;p&gt;I think adding an overloaded version of Parallel.For that would allow the user to give the TPL a hint on how it should divide up the tasks would work too:&lt;/p&gt;
&lt;p&gt;enum PfxHint&lt;/p&gt;
&lt;p&gt;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp;SHORT_RUNTIME,&lt;/p&gt;
&lt;p&gt; &amp;nbsp;LONG_RUNTIME,&lt;/p&gt;
&lt;p&gt; &amp;nbsp;UNKNOWN_RUNTIME&lt;/p&gt;
&lt;p&gt; &amp;nbsp;... others? ...&lt;/p&gt;
&lt;p&gt;}&lt;/p&gt;
&lt;p&gt;then you could do something like:&lt;/p&gt;
&lt;p&gt;int max = (int)1E9;&lt;/p&gt;
&lt;p&gt;int[] squares = new int[max];&lt;/p&gt;
&lt;p&gt;Parallel.For(0, max, PfxHint.SHORT_RUNTIME,&lt;/p&gt;
&lt;p&gt; &amp;nbsp;n =&amp;gt; { squares[n] = n*n; } );&lt;/p&gt;
&lt;p&gt;or, you could do this...&lt;/p&gt;
&lt;p&gt;Func&amp;lt;int, bool&amp;gt; SomeNonUniformFunction = ...;&lt;/p&gt;
&lt;p&gt;bool[] answers = new bool[max];&lt;/p&gt;
&lt;p&gt;Parallel.For(0, max, PfxHint.UNKNOWN_RUNTIME,&lt;/p&gt;
&lt;p&gt;n =&amp;gt; &lt;/p&gt;
&lt;p&gt; &amp;nbsp;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; &amp;nbsp;answers[n] = SomeNonUniformFunction(n);&lt;/p&gt;
&lt;p&gt; &amp;nbsp;}&lt;/p&gt;
&lt;p&gt; );&lt;/p&gt;
&lt;p&gt;Using a construct like this might give the TPL enough information to make a smart decision about how to partition up the work. &amp;nbsp;If you expect short runtimes for each iteration, then TPL can use the Parallel.ForRange() without too much fear of messing up the load balance. &amp;nbsp;If the runtimes are long, or you simply don't know, then you might want to accept the overhead of more, but smaller, tasks to keep the load balanced.&lt;/p&gt;
&lt;p&gt;As a developer, I think I would have enough insight into the loop that I'm writing to be able to provide a reasonable hint. &amp;nbsp;If not, then I would call the standard Parallel.For() and let TPL use the default behavior.&lt;/p&gt;
&lt;p&gt;-Frank&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8182093" width="1" height="1"&gt;</description></item><item><title>re: Parallel loop performance</title><link>http://blogs.msdn.com/b/pfxteam/archive/2008/03/12/8179013.aspx#8180527</link><pubDate>Thu, 13 Mar 2008 16:33:40 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8180527</guid><dc:creator>nobody</dc:creator><description>&lt;p&gt;It would be great if TPL had a Parallel.For variant that just broke up for loop into NUM_THREAD blocks and then each thread worked on one of the blocks. &amp;nbsp;This would work well when the loop body did a trivial amount of work.&lt;/p&gt;
&lt;p&gt;For example (poor example I know):&lt;/p&gt;
&lt;p&gt;long n = (long)1e10 + 1;&lt;/p&gt;
&lt;p&gt;...&lt;/p&gt;
&lt;p&gt;Parallel.ForXXX(0, n, i =&amp;gt;&lt;/p&gt;
&lt;p&gt;{&lt;/p&gt;
&lt;p&gt; &amp;nbsp; y[i] = x[i]*someScalar + y[i];&lt;/p&gt;
&lt;p&gt;});&lt;/p&gt;
&lt;p&gt;Assuming a quad processor with NUM_THREAD set to 4, threads 1-3 process 2.5e9 element blocks and thread four gets 2.5e9+1 elements. &amp;nbsp;This would be similar to your last example, but would require no thought on the part of the TPL user.&lt;/p&gt;
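&lt;p&gt;The block partitioning suggested above can be sketched on top of the ordinary Parallel.For from .NET 4. That is an assumption, since ForXXX is only a proposed name and no such method exists: the outer loop runs once per block, and each block walks its contiguous slice with a plain for loop. Environment.ProcessorCount stands in for NUM_THREAD, n is shrunk so the arrays fit in memory, and the scheduler remains free to run several blocks on the same thread.&lt;/p&gt;
&lt;pre&gt;
```csharp
using System;
using System.Threading.Tasks;

static class BlockLoopDemo
{
    static void Main()
    {
        int n = 10000001;
        double someScalar = 2.0;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i != n; i++) { x[i] = i; y[i] = 1.0; }

        int blocks = Environment.ProcessorCount;   // stands in for NUM_THREAD
        int blockSize = n / blocks;

        // One delegate invocation per block instead of one per element.
        Parallel.For(0, blocks, block =>
        {
            int start = block * blockSize;
            // The last block also takes the remainder of the division.
            int end = (block == blocks - 1) ? n : start + blockSize;
            for (int i = start; i != end; i++)
            {
                y[i] = x[i] * someScalar + y[i];
            }
        });

        Console.WriteLine(y[n - 1]);
    }
}
```
&lt;/pre&gt;
&lt;p&gt;With equal-sized blocks there is nothing left to rebalance if one core is busy with other work, which is the trade-off discussed in the replies to this suggestion.&lt;/p&gt;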
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8180527" width="1" height="1"&gt;</description></item></channel></rss>