<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx</link><description>This is the first in a series of posts about the improvements we are making to the CLR thread pool for CLR 4.0 (which will ship with Visual Studio 2010). This post will cover changes to the queuing infrastructure in the thread pool, which aim to enable</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>CLR 4.0 ThreadPool Improvements: Part 1 | ASP NET Hosting</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9565262</link><pubDate>Thu, 23 Apr 2009 21:45:48 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9565262</guid><dc:creator>CLR 4.0 ThreadPool Improvements: Part 1 | ASP NET Hosting</dc:creator><description>&lt;p&gt;PingBack from &lt;a rel="nofollow" target="_new" href="http://asp-net-hosting.simplynetdev.com/clr-40-threadpool-improvements-part-1/"&gt;http://asp-net-hosting.simplynetdev.com/clr-40-threadpool-improvements-part-1/&lt;/a&gt;&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9565383</link><pubDate>Thu, 23 Apr 2009 23:01:51 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9565383</guid><dc:creator>bmeardon</dc:creator><description>&lt;p&gt;Eric,&lt;/p&gt;
&lt;p&gt;This is really fantastic and useful information. &amp;nbsp;I did read your reply on the PFX forum in regards to using Parallel.For instead of repeated calls to QUWI, but I don't think that will work. &amp;nbsp;Let me describe the situation a bit more...&lt;/p&gt;
&lt;p&gt;I'm trying to process a very high number of messages per second that arrive through a socket. &amp;nbsp;As each message is received, I need to invoke a callback (passing it the message). &amp;nbsp;The callbacks need to be invoked in the exact order the messages are received. &amp;nbsp;I currently accomplish this by using a FifoExecutor class (Stephen Toub's implementation), which queues the work items up, grabs a thread from the pool (using UQUWI) and sits in a loop dequeuing each work item and executing its callback until the queue is empty. &amp;nbsp;This clearly involves a lot of enqueuing and dequeuing. &amp;nbsp;Do you have any suggestions on how I might leverage the new CLR 4.0 ThreadPool and TPL to improve this scenario's performance?&lt;/p&gt;
&lt;p&gt;Thanks,&lt;/p&gt;
&lt;p&gt;Brandon&lt;/p&gt;
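&lt;p&gt;For readers unfamiliar with the pattern, here is a minimal sketch of the FIFO executor idea described above. &amp;nbsp;It is written in Python rather than C# purely for illustration, and it is not Stephen Toub's implementation: the class name FifoExecutor, its submit method, and the lock-guarded deque are all assumptions made for this sketch.&lt;/p&gt;

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class FifoExecutor:
    """Runs submitted callbacks one at a time, in submission order (hypothetical sketch)."""

    def __init__(self, pool):
        self._pool = pool            # shared thread pool that supplies the drain thread
        self._items = deque()        # pending (callback, message) pairs
        self._lock = threading.Lock()
        self._draining = False       # True while a pool thread is running _drain

    def submit(self, callback, message):
        with self._lock:
            self._items.append((callback, message))
            if self._draining:
                return               # the active drain loop will pick this item up
            self._draining = True
        self._pool.submit(self._drain)

    def _drain(self):
        while True:
            with self._lock:
                if not self._items:
                    self._draining = False
                    return
                callback, message = self._items.popleft()
            callback(message)        # run outside the lock so producers are not blocked


received = []
with ThreadPoolExecutor(max_workers=4) as pool:
    executor = FifoExecutor(pool)
    for n in range(1000):
        executor.submit(received.append, n)
# Leaving the with-block waits for any running drain loop to finish.
```

&lt;p&gt;Only one drain loop is ever active, so callbacks run strictly in submission order even though the pool has several threads. &amp;nbsp;Every message still costs one enqueue and one dequeue under the lock, which is the overhead the comment is asking about.&lt;/p&gt;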
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9565694</link><pubDate>Fri, 24 Apr 2009 02:41:32 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9565694</guid><dc:creator>ericeil</dc:creator><description>&lt;p&gt;Brandon,&lt;/p&gt;
&lt;p&gt;It definitely sounds like you may benefit from the improvements in QUWI. &amp;nbsp;It may well be worth trying the naive QUWI solution, which almost certainly will outperform previous CLR versions.&lt;/p&gt;
&lt;p&gt;Otherwise, given your strict FIFO ordering requirements, there may be little that can be done. &amp;nbsp;Optimization often requires reordering. &amp;nbsp;If you are able to deviate a little from the FIFO goal, you may be able to take advantage of the additional performance unlocked by child tasks, or maybe even Parallel.ForEach. &amp;nbsp;I, personally, would try QUWI first, given that this seems to require the least change in your existing code. &amp;nbsp;One of our goals for 4.0 is to provide significant performance improvements even without changes in existing application code.&lt;/p&gt;
&lt;p&gt;I am very interested in hearing about the results. &amp;nbsp;Once 4.0 Beta 1 has shipped, I hope you will have a chance to try it out. &amp;nbsp;Please let me know how this turns out.&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9566781</link><pubDate>Fri, 24 Apr 2009 17:28:37 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9566781</guid><dc:creator>bmeardon</dc:creator><description>&lt;p&gt;Eric,&lt;/p&gt;
&lt;p&gt;I figured that would be the case. &amp;nbsp;Sounds like the queuing operations in the new ThreadPool will be ultra fast with the use of a lock-free queue data structure (presumably using Interlocked methods). &amp;nbsp;It's also nice to hear that this lock-free queue will be publicly available through the TPL's ConcurrentQueue&amp;lt;T&amp;gt; class.&lt;/p&gt;
&lt;p&gt;Given the information you provided about the huge performance increase attainable in the new TPL APIs by removing ordering requirements on the work items, we'll probably look for ways to take advantage of this by parallelizing our work as early as possible.&lt;/p&gt;
&lt;p&gt;One thing that never really got answered from my original post - Is there a way to control how Task objects are created using the new TaskFactory class? &amp;nbsp;If we decide to use the new TPL APIs, it would be nice to be able to return a Task from a pre-allocated pool of our own somehow and get rid of this allocation (I know we'll still have the queue node allocation from the ThreadPool though).&lt;/p&gt;
&lt;p&gt;I'm looking forward to the 4.0 Beta 1 release. &amp;nbsp;I won't waste any time in testing our performance with the new ThreadPool and will be sure to let you know how it goes. &amp;nbsp;Thanks again for your help, Eric.&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;Brandon&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9566848</link><pubDate>Fri, 24 Apr 2009 18:13:48 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9566848</guid><dc:creator>Jordan</dc:creator><description>&lt;p&gt;Eric - Can you give us a clue as to when VS 2010 Beta 1 will be available?&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9567007</link><pubDate>Fri, 24 Apr 2009 19:48:03 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9567007</guid><dc:creator>ericeil</dc:creator><description>&lt;p&gt;Brandon: Task objects are not currently re-usable. &amp;nbsp;I believe this is partly due to the complications it would add to the programming model, but it also has important performance implications - though not what you might think! &amp;nbsp;References to Task objects must be stored in many different places throughout a Task's lifetime, including the ThreadPool's queues. &amp;nbsp;Many of these data structures are lock-free, as you know. &amp;nbsp;Lock-free manipulation of object references is subject to the &amp;quot;ABA&amp;quot; problem (&lt;a rel="nofollow" target="_new" href="http://en.wikipedia.org/wiki/ABA_problem"&gt;http://en.wikipedia.org/wiki/ABA_problem&lt;/a&gt;). &amp;nbsp;This problem turns out to be a non-problem in a garbage-collected environment, so long as you never reuse objects - but if you do reuse them, then you need to deal with this problem. &amp;nbsp;There are techniques for dealing with this, but they inevitably require additional expensive synchronization (and other overhead). &amp;nbsp;So while we do incur some overhead for allocating Task objects, we avoid that synchronization overhead; in the end, it usually works out for the better to not reuse these objects.&lt;/p&gt;
&lt;p&gt;Jordan: I'm sorry, but I don't have any information for you about the Beta 1 release date. &amp;nbsp;I certainly do wish I did.&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9567240</link><pubDate>Fri, 24 Apr 2009 22:57:33 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9567240</guid><dc:creator>bmeardon</dc:creator><description>&lt;p&gt;Eric,&lt;/p&gt;
&lt;p&gt;Thanks again for the insightful information. &amp;nbsp;It's also interesting to know that the ThreadPool actually uses Task objects now as its &amp;quot;work item&amp;quot;.&lt;/p&gt;
&lt;p&gt;Thanks,&lt;/p&gt;
&lt;p&gt;Brandon&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9575270</link><pubDate>Wed, 29 Apr 2009 11:08:57 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9575270</guid><dc:creator>Michael Cederberg</dc:creator><description>&lt;p&gt;When you cover the thread injection algorithm, can you please talk about how it handles scenarios with a high number of tasks scheduled concurrently?&lt;/p&gt;
&lt;p&gt;Our setup is that we have a service that continuously does calculations. If I use the .NET 3.5 thread pool to schedule those calculations, then I very quickly end up with 1000 threads (there will continuously be around 1000 work items scheduled). If I try to reduce the number of threads (using ThreadPool.SetMaxThreads) to roughly the number of cores in my system, then I run the risk of deadlocks. To get around this, we currently use a custom scheduler that schedules tasks on IO completion port threads, but this is not ideal either.&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9576404</link><pubDate>Wed, 29 Apr 2009 23:09:42 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9576404</guid><dc:creator>MDFD</dc:creator><description>&lt;p&gt;Eric, &lt;/p&gt;
&lt;p&gt;I have a few questions please:&lt;/p&gt;
&lt;p&gt;1. What's the limit of Parallel.For loops? I mean, how many iterations can run simultaneously? Does it depend on the amount of RAM or the CPU?&lt;/p&gt;
&lt;p&gt;2. If I need to run 1,000,000 threads in parallel, should I use Parallel.For, or would TaskFactory be better?&lt;/p&gt;
&lt;p&gt;Thanks a lot,&lt;/p&gt;
</description></item><item><title>Interesting Finds: 2009 04.23~04.30</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9577861</link><pubDate>Thu, 30 Apr 2009 05:35:51 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9577861</guid><dc:creator>gOODiDEA.NET</dc:creator><description>&lt;p&gt;.NET C# COM Object for Use In JavaScript / HTML, Including Event Handling Show me the memory: Tool for&lt;/p&gt;
</description></item><item><title>Interesting Finds: 2009 04.23~04.30</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9577870</link><pubDate>Thu, 30 Apr 2009 05:37:51 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9577870</guid><dc:creator>gOODiDEA</dc:creator><description>&lt;p&gt;.NET C# COM Object for Use In JavaScript / HTML, Including Event Handling Show me the memory...&lt;/p&gt;
</description></item><item><title>CLR 4.0: ThreadPool Improvements, Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9578545</link><pubDate>Thu, 30 Apr 2009 08:20:05 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9578545</guid><dc:creator>progg.ru</dc:creator><description>&lt;p&gt;Thank you for submitting this cool story - Trackback from progg.ru&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9580962</link><pubDate>Thu, 30 Apr 2009 21:21:20 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9580962</guid><dc:creator>ericeil</dc:creator><description>&lt;p&gt;MDFD: &amp;nbsp;Parallel.For is limited by the size of the index variable (i.e., an int can only count to 2^31 - 1), but otherwise is not limited (as far as I know) to some number of iterations. &amp;nbsp;If you have 1,000,000 pieces of work to do, it is likely that Parallel.For is what you want. &amp;nbsp;This will not create 1,000,000 threads, but rather will automatically partition the work into manageable chunks which are queued as just a few work items to the thread pool.&lt;/p&gt;
&lt;p&gt;There is also no reason why you cannot queue 1,000,000 Task objects to the thread pool, though it will use considerably more memory, and will have much higher CPU overhead, than Parallel.For - because we will need to allocate, queue, and dequeue each of those million Tasks. &amp;nbsp;But if you have a million completely different tasks (which seems unlikely) then Task is the way to go - Parallel.For is generally only useful if you want to run the same code against many individual pieces of data.&lt;/p&gt;
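&lt;p&gt;To make the partitioning idea concrete, here is a minimal Python sketch of running a million iterations as a handful of work items. &amp;nbsp;It is not how Parallel.For is implemented: the helper names chunk_ranges and parallel_for, and the fixed split into one equal chunk per worker, are assumptions made for illustration.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(start, stop, chunks):
    """Split [start, stop) into at most `chunks` contiguous, near-equal ranges."""
    size, extra = divmod(stop - start, chunks)
    ranges = []
    low = start
    for i in range(chunks):
        high = low + size + (1 if extra > i else 0)   # first `extra` chunks get one more
        if high > low:
            ranges.append((low, high))
        low = high
    return ranges


def parallel_for(start, stop, body, workers=4):
    """Run body(i) for every i in [start, stop), queuing one work item per chunk."""
    def run_chunk(bounds):
        low, high = bounds
        for i in range(low, high):
            body(i)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a chunk
        list(pool.map(run_chunk, chunk_ranges(start, stop, workers)))


squares = [0] * 1_000_000

def store_square(i):
    squares[i] = i * i

parallel_for(0, len(squares), store_square)   # 1,000,000 iterations, 4 work items
```

&lt;p&gt;Only four work items are allocated, queued, and dequeued here, versus a million if each iteration were its own task.&lt;/p&gt;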
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9581428</link><pubDate>Fri, 01 May 2009 01:38:34 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9581428</guid><dc:creator>miridfd</dc:creator><description>&lt;p&gt;Thanks for your reply!&lt;/p&gt;
&lt;p&gt;I ran the same code against only (!) 1000 pieces of data, but the CPU reached 100% usage and got stuck. The whole process was blocked and never terminated. (I've run several tests, even with fewer iterations, e.g. 500; sometimes it got stuck, sometimes not.)&lt;/p&gt;
&lt;p&gt;I must note that the computation in the code (for each thread) is not complicated at all. &lt;/p&gt;
&lt;p&gt;Computer's configuration:&lt;/p&gt;
&lt;p&gt;OS Vista&lt;/p&gt;
&lt;p&gt;Processor: Intel(R) Pentium(R) D CPU 3.40GHz 3.39GHz&lt;/p&gt;
&lt;p&gt;Ram: &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 4.00 GB&lt;/p&gt;
&lt;p&gt;System type:32-bit OS&lt;/p&gt;
&lt;p&gt;What do you think is the reason? When I ran the code sequentially (by regular &amp;quot;for&amp;quot; statement), everything was all right.&lt;/p&gt;
&lt;p&gt;Thank you very much&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9581489</link><pubDate>Fri, 01 May 2009 02:21:13 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9581489</guid><dc:creator>ericeil</dc:creator><description>&lt;p&gt;miridfd: I'd need to understand better what exactly you're trying to do, before I could say anything about why it's not behaving the way you expect. &amp;nbsp;The best thing would probably be to post a sample that exhibits this behavior on the TPL discussion forum: &amp;nbsp;&lt;a rel="nofollow" target="_new" href="http://social.msdn.microsoft.com/Forums/en-US/parallelextensions"&gt;http://social.msdn.microsoft.com/Forums/en-US/parallelextensions&lt;/a&gt;.&lt;/p&gt;
</description></item><item><title>ThreadPool improvements in CLR v4.0</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9589419</link><pubDate>Tue, 05 May 2009 21:28:22 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9589419</guid><dc:creator>CLR Team Blog</dc:creator><description>&lt;p&gt;Eric Eilebrecht, a developer on our team, has just started a multi-part series on ThreadPool improvements&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9604033</link><pubDate>Mon, 11 May 2009 21:53:22 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9604033</guid><dc:creator>bmeardon</dc:creator><description>&lt;p&gt;Eric,&lt;/p&gt;
&lt;p&gt;You mentioned that the new queue used by the ThreadPool is more &amp;quot;GC friendly&amp;quot; by allocating smaller nodes in blocks. &amp;nbsp;I was curious about this, as I've been experimenting a bit with my own lock-free stack (since it's a bit easier than a queue).&lt;/p&gt;
&lt;p&gt;I've been trying to completely avoid allocating any new nodes on push operations, so I have my stack maintain a pool of pre-allocated nodes. &amp;nbsp;This pool is basically implemented as a separate stack of nodes with a separate top reference. &amp;nbsp;Each push operation gets a node from the pool and each pop operation returns a node. &amp;nbsp;The drawback is that I need to perform the same type of CAS operation to get/return nodes from/to the node pool to ensure that another thread hasn't swooped in and changed my top node reference. &amp;nbsp;This results in an algorithm that under purely sequential conditions is almost twice as slow as allocating a new node for each push. &amp;nbsp;Under concurrent situations with high contention on the stack, it is also more likely to encounter situations where other threads have changed either the node top reference or the item top reference. &amp;nbsp;I imagine that these situations are what you were referring to regarding the performance implications of reusing Task objects.&lt;/p&gt;
&lt;p&gt;I'd like to get rid of the node pooling in my implementation to avoid the double CAS looping, but I don't want to just allocate a node object in the managed heap for each call to push. &amp;nbsp;Nodes allocated in this fashion that stay in the stack for any lengthy period of time stand a good chance of getting promoted up GC generations, causing large GC interference. &amp;nbsp;So I'm curious as to what you guys are doing to make your nodes in the ThreadPool queues so GC friendly and if the new TPL ConcurrentQueue/Stack will make use of this GC friendly node allocation mechanism.&lt;/p&gt;
&lt;p&gt;Thanks again for all of your help.&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;Brandon&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9607839</link><pubDate>Tue, 12 May 2009 21:02:25 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9607839</guid><dc:creator>ericeil</dc:creator><description>&lt;p&gt;Brandon,&lt;/p&gt;
&lt;p&gt;The main thing we did to improve the GC-friendliness of the new queues was to reduce the size of the nodes. &amp;nbsp;They weren't particularly large to begin with (just a few object references) but that meant that removing just a couple of those references could make a huge difference. &amp;nbsp;Smaller objects mean less GC pressure, so the GC doesn't have to run as often. &amp;nbsp;This helps with the problem you mention: the less frequently the GC runs, the fewer objects need to be promoted.&lt;/p&gt;
&lt;p&gt;We also found that arrays of objects seem to be scanned much faster than linked lists. &amp;nbsp;This probably has a lot to do with data locality - a pass through a bunch of contiguous references is going to make much better use of the CPU's cache than hopping all over the heap visiting lots of individual nodes.&lt;/p&gt;
&lt;p&gt;My advice for approaching this sort of optimization problem is to follow a methodology something like this:&lt;/p&gt;
&lt;p&gt;1) Implement something simple. &amp;nbsp;Simple things usually perform better, all else being equal.&lt;/p&gt;
&lt;p&gt;2) Construct some test cases you believe are representative of real-world usage. &amp;nbsp;Overly simplistic tests often lead you to optimize only for those cases, which can sometimes actually lead to worse performance in the field.&lt;/p&gt;
&lt;p&gt;3) Run your tests under a sampling profiler. &amp;nbsp;The VM functions involved in garbage collection are clearly named such that you can pretty easily tell which phase of the GC is taking the most time. &amp;nbsp;I was surprised to learn that in my tests we were spending much more time in the &amp;quot;mark&amp;quot; phase than I would have thought, which led me to consider locality a little more carefully.&lt;/p&gt;
&lt;p&gt;4) Make some changes, and try again.&lt;/p&gt;
&lt;p&gt;Anyway, that's how I did the thread pool work. &amp;nbsp;Nearly every line of code is influenced in some way by profiler feedback. &amp;nbsp;You do have to be very careful, though, not to optimize for the wrong cases. &amp;nbsp;Good test cases are key here, but also it's important to not take the profiler output too literally. &amp;nbsp;Rather than saying &amp;quot;hey, method Foo takes 50% of the time, I should optimize the heck out of that,&amp;quot; instead use the profiler to update your mental model of the problem and then reconsider the design from that point of view. &amp;nbsp;Often Foo took more time because of a bunch of little things happening elsewhere in the system, and all the hacking on Foo in the world won't help.&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9608262</link><pubDate>Wed, 13 May 2009 00:44:46 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9608262</guid><dc:creator>bmeardon</dc:creator><description>&lt;p&gt;Eric,&lt;/p&gt;
&lt;p&gt;Thanks for the feedback. &amp;nbsp;Since my last post, I tried a new design that seems to exhibit superior performance to my pooling approach. &amp;nbsp;It also seems to be very similar to what you've implemented in the ThreadPool queue. &amp;nbsp;I have basically taken the approach you've outlined and thought I'd present my design here in hopes that you may be able to point out some pitfalls you ran into (since the designs seem so similar).&lt;/p&gt;
&lt;p&gt;1) The first thing I do is allocate an array of references to my node class objects (which only consist of the item they contain and a ref to the next node). &amp;nbsp;I then create a new node object for each item in the array.&lt;/p&gt;
&lt;p&gt;2) On a push operation, I first &amp;quot;allocate&amp;quot; a node from my array, which I accomplish by simply returning the node at the next index in my array (there's a CAS on the next node index). &amp;nbsp;I then set my top node to the node allocated from my array using another CAS loop.&lt;/p&gt;
&lt;p&gt;3) In the event that I run out of nodes during a push operation (current index is past the length of the array), I allocate a new array and set of nodes.&lt;/p&gt;
&lt;p&gt;4) The pop operation is just a very simple return of the top node and a CAS loop to set it to the next node. &amp;nbsp;The node is not returned to any sort of a pool, but will remain referenced in the node array until the node array is discarded.&lt;/p&gt;
&lt;p&gt;5) When a CAS operation fails in any of my operations, I fall back to a spin wait whose iteration count increases exponentially up to a maximum.&lt;/p&gt;
&lt;p&gt;I did a lot of performance testing on this version vs. using the BCL Stack&amp;lt;T&amp;gt; class wrapped in lock statements. &amp;nbsp;I set up scenarios on boxes with 2, 4, and 8 CPUs. &amp;nbsp;Basically, I would have N (one half the CPUs on the box) producer threads pushing items onto the stack and N (the other half of the CPUs) consumer threads popping items off. &amp;nbsp;I was surprised by just how much my setting of the number of spin iterations per failed CAS impacted the performance. &amp;nbsp;If I had this set too low, the BCL locked Stack&amp;lt;T&amp;gt; implementation would outperform my fancy lock-free implementation. &amp;nbsp;In the end, I found that a max spin iteration count of around 20K per CPU worked best and gave me roughly 2-3 times the throughput of the BCL locking implementation. &amp;nbsp;My spin wait actually starts at around 4K iterations for a 4 CPU box and works its way up to 80K under very rare circumstances.&lt;/p&gt;
&lt;p&gt;One area where I think there may be some opportunity for improvement is how I allocate the array of nodes and initialize it with new nodes. &amp;nbsp;For my performance tests, I allocated enough nodes up front so that no push operations would need to allocate another block. &amp;nbsp;When running the test with 30 million items, this allocation takes quite a bit of time.&lt;/p&gt;
&lt;p&gt;Anyway, let me know what you think. &amp;nbsp;Thanks for your help.&lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;Brandon&lt;/p&gt;
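&lt;p&gt;The spin-wait policy from step 5 can be sketched roughly as follows, in Python for illustration only. &amp;nbsp;The 4K starting count and 80K cap are the 4 CPU figures quoted above; the doubling factor and the names spin_iterations and retry_with_backoff are assumptions, since the comment says only that the count grows exponentially.&lt;/p&gt;

```python
def spin_iterations(failed_attempts, start=4_000, cap=80_000, factor=2):
    """Spin count to use after `failed_attempts` consecutive failed CAS attempts (1-based)."""
    return min(cap, start * factor ** (failed_attempts - 1))


def retry_with_backoff(try_once, **policy):
    """Call try_once() until it returns True, busy-waiting longer after each failure."""
    failures = 0
    while not try_once():
        failures += 1
        for _ in range(spin_iterations(failures, **policy)):
            pass                     # busy-wait; stands in for a real spin-wait primitive
    return failures


schedule = [spin_iterations(n) for n in range(1, 8)]
# [4000, 8000, 16000, 32000, 64000, 80000, 80000]
```

&lt;p&gt;With these assumed parameters the cap is reached on the sixth consecutive failure, which fits the observation that 80K is hit only rarely.&lt;/p&gt;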
</description></item><item><title>.NET 4 Beta 1 is now available, with parallelism!</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9633013</link><pubDate>Thu, 21 May 2009 01:58:34 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9633013</guid><dc:creator>Parallel Programming with .NET</dc:creator><description>&lt;p&gt;We’re very excited that the .NET Framework 4 Beta is now available for public download, as .NET 4 has&lt;/p&gt;
</description></item><item><title>Visual Studio 2010 Beta 1 is available to everyone!!!</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9633611</link><pubDate>Thu, 21 May 2009 13:23:27 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9633611</guid><dc:creator>Développement parallèle</dc:creator><description>&lt;p&gt;Visual Studio 2010 Beta 1 has been available to MSDN subscribers for a few days, but now&lt;/p&gt;
</description></item><item><title>Erika Parsons and Eric Eilebrecht : CLR 4 - Inside the Thread Pool</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9679172</link><pubDate>Mon, 01 Jun 2009 20:52:50 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9679172</guid><dc:creator>ComponentGear.com Feed</dc:creator><description>&lt;p&gt;General purpose thread pools are more complicated to get right than you may think. In CLR 4 (the next&lt;/p&gt;
</description></item><item><title>ThreadPool on Channel 9</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9679785</link><pubDate>Mon, 01 Jun 2009 21:22:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9679785</guid><dc:creator>Eric Eilebrecht's blog</dc:creator><description>&lt;p&gt;Charles from Channel 9 stopped by my office a couple of weeks ago to chat with Erika Parsons and I about&lt;/p&gt;
</description></item><item><title>CLR 4 – Inside the ThreadPool</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9680364</link><pubDate>Mon, 01 Jun 2009 22:07:27 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9680364</guid><dc:creator>Parallel Programming with .NET</dc:creator><description>&lt;p&gt;As we’ve mentioned previously, the .NET ThreadPool has undergone some serious renovations in .NET 4,&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9852358</link><pubDate>Wed, 29 Jul 2009 20:40:19 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9852358</guid><dc:creator>Gal Ratner</dc:creator><description>&lt;p&gt;This is great stuff! Can't wait till it comes out with VS 2010&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9878399</link><pubDate>Fri, 21 Aug 2009 12:47:05 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9878399</guid><dc:creator>Tal</dc:creator><description>&lt;p&gt;Hi Eric!&lt;/p&gt;
&lt;p&gt;I have started to use the Microsoft Parallel Extensions Jun08 and I have a few comments:&lt;/p&gt;
&lt;p&gt;To queue a work item to the thread pool (or even to create a Task), the state passed to the delegate must be typed as Object (for Task it's Action&amp;lt;object&amp;gt;) rather than a generic &amp;quot;T&amp;quot; for a custom type.&lt;/p&gt;
&lt;p&gt;The answer seems simple: just box and unbox the state. But the boxing and unboxing cost a lot of performance.&lt;/p&gt;
&lt;p&gt;Why not provide the functions ThreadPool.QueueUserWorkItem&amp;lt;T&amp;gt;(WaitCallback callBack, T state) and Task.Create&amp;lt;T&amp;gt;(Action&amp;lt;T&amp;gt; action, T state)?&lt;/p&gt;
&lt;p&gt;And why can't you create a delegate for them that takes no argument? I know it's possible to work around this now, but you know that costs performance too...&lt;/p&gt;
&lt;p&gt;One more thing that also bothers me:&lt;/p&gt;
&lt;p&gt;Why is it impossible to create a Task and start it later?&lt;/p&gt;
&lt;p&gt;Apart from this, I want to say that you've done a good job, and I can't wait for CLR 4.0!&lt;/p&gt;
</description></item><item><title>re: CLR 4.0 ThreadPool Improvements: Part 1</title><link>http://blogs.msdn.com/ericeil/archive/2009/04/23/clr-4-0-threadpool-improvements-part-1.aspx#9882555</link><pubDate>Mon, 24 Aug 2009 13:37:49 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9882555</guid><dc:creator>Tal</dc:creator><description>&lt;p&gt;I read more about the thread pool (and the comments here), and you might want to consider my suggestion:&lt;/p&gt;
&lt;p&gt;You told MDFD that queuing a lot of tasks (1,000,000) to the thread pool uses a lot of memory.&lt;/p&gt;
&lt;p&gt;I know you told him to use Parallel.For instead (and I agree), but think about how much memory is wasted.&lt;/p&gt;
&lt;p&gt;There are MTA threads and STA threads; look at the difference in memory use:&lt;/p&gt;
&lt;p&gt;MTA - uses a lot of memory (because there are many of them...)&lt;/p&gt;
&lt;p&gt;STA - uses little memory (because there is only one, the main thread)&lt;/p&gt;
&lt;p&gt;The thread pool accumulates a lot of tasks (which means a lot of memory), usually queued by the STA main thread, but it doesn't really matter whether it's STA or MTA.&lt;/p&gt;
&lt;p&gt;The point is that you might consider adding an overload of QUWI that blocks the current thread until there is space in the thread pool, and only then lets the current thread continue.&lt;/p&gt;
&lt;p&gt;That way the thread pool would no longer be overloaded with tasks, because queuing can block the current thread!&lt;/p&gt;
&lt;p&gt;Think how much memory is wasted in the 1,000,000-task case!&lt;/p&gt;
&lt;p&gt;Another thing that bugs me about Parallel.For is that it waits until all the tasks are done.&lt;/p&gt;
&lt;p&gt;Why???&lt;/p&gt;
&lt;p&gt;I think the answer is that on a quad-core processor, say, it waits because we want four free slots, one for each core (length/processors iterations per core).&lt;/p&gt;
&lt;p&gt;But if the task we are waiting for is heavy, that's a waste of time!&lt;/p&gt;
&lt;p&gt;I suggest that it start immediately when there is space for a task. When another space becomes available, split the For loop task into 2 tasks in this fashion.&lt;/p&gt;
&lt;p&gt;Here is an example of the process on a quad-core machine:&lt;/p&gt;
&lt;p&gt;1. The user asks to run a For loop (1,000 iterations).&lt;/p&gt;
&lt;p&gt;2. The program waits until there is free space for a task.&lt;/p&gt;
&lt;p&gt;3. Another task finishes, so the For loop begins with i=0 and a max of 1,000.&lt;/p&gt;
&lt;p&gt;4. Another task finishes while the For loop is currently at 600 (not yet started), so the program divides the remaining 400 in a ratio of 1:3: the max of the current loop drops to 700, and we create a task that runs from 700 to 1,000.&lt;/p&gt;
&lt;p&gt;5. Another task finishes (no matter whether it's a loop task or a simple task), so the 700 to 1,000 loop is divided in a ratio of 1:2.&lt;/p&gt;
&lt;p&gt;6. The same thing again, with a ratio of 1:1.&lt;/p&gt;
&lt;p&gt;7. When one of the loop tasks finishes, we keep dividing in the same way.&lt;/p&gt;
&lt;p&gt;How simple is this!&lt;/p&gt;
&lt;p&gt;Isn't it?&lt;/p&gt;
&lt;p&gt;I hope my point was clear and that I haven't misunderstood anything.&lt;/p&gt;
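&lt;p&gt;The blocking variant of QUWI suggested above can be approximated on top of an ordinary pool with a counting semaphore. The following is a minimal Python sketch of that idea, not an existing .NET API: the class name BoundedSubmitter, its limit parameter, and the peak_outstanding counter are all assumptions made for illustration.&lt;/p&gt;

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedSubmitter:
    """Blocks submit() while `limit` work items are queued or running (hypothetical sketch)."""

    def __init__(self, pool, limit):
        self._pool = pool
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._outstanding = 0
        self.peak_outstanding = 0    # highest number of queued-or-running items seen

    def submit(self, fn, *args):
        self._slots.acquire()        # blocks the caller while the pool is "full"
        with self._lock:
            self._outstanding += 1
            self.peak_outstanding = max(self.peak_outstanding, self._outstanding)
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._finished(None)     # submission failed, so give the slot back
            raise
        future.add_done_callback(self._finished)
        return future

    def _finished(self, _future):
        with self._lock:
            self._outstanding -= 1
        self._slots.release()        # wakes one blocked submitter, if any


def work(n):
    return sum(range(n))


with ThreadPoolExecutor(max_workers=4) as pool:
    submitter = BoundedSubmitter(pool, limit=8)
    futures = [submitter.submit(work, 10_000) for _ in range(200)]
results = [f.result() for f in futures]
```

&lt;p&gt;At most 8 work items exist at any moment, so memory stays bounded no matter how many are submitted in total; the cost is that the submitting thread stalls whenever the pool falls behind.&lt;/p&gt;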
</description></item></channel></rss>