<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Developer Division Performance Engineering blog</title><link>http://blogs.msdn.com/b/ddperf/</link><description>News and commentary on developing scalable Windows applications (with Visual Studio)</description><dc:language>en-US</dc:language><generator>Telligent Community 5.6.583.14036 (Build: 5.6.583.14036)</generator><item><title>Help Make Visual Studio Faster</title><link>http://blogs.msdn.com/b/ddperf/archive/2011/05/05/help-make-visual-studio-faster.aspx</link><pubDate>Fri, 06 May 2011 06:45:55 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10161686</guid><dc:creator>David Berg</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10161686</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2011/05/05/help-make-visual-studio-faster.aspx#comments</comments><description>&lt;p&gt;One of the most difficult things about our job is trying to decipher why Visual Studio is slow for a customer.&amp;nbsp; Often it starts with a vague complaint (e.g. 
"Visual Studio is sluggish") which we then have to narrow down to a particular action that's slow, and try and get a profile.&amp;nbsp; Then we have to look through a large profile and figure out which code is running slower than it should be and why.&amp;nbsp; Sometimes the problem is CPU intensive, sometimes it's disk or network, sometimes it's a different program altogether that just happens to be slow.&amp;nbsp; I know the process is just as frustrating for customers, who have to try and figure out just where it is slow and get us a profile.&amp;nbsp; And of course, if VS is just a little bit slow, then it's often much easier to live with it than go through the hassle of trying to help us isolate it.&lt;/p&gt;
&lt;p&gt;That's why I'm pleased to announce that we now have a better approach.&amp;nbsp; &lt;a href="http://blogs.msdn.com/b/visualstudio/archive/2011/05/02/perfwatson.aspx" title="VS Blog - Perf Watson"&gt;Visual Studio Perf Watson&lt;/a&gt;&amp;nbsp;is now available for download and use with VS2010 SP1.&amp;nbsp; If you download and install Perf Watson it does everything for you (and us).&amp;nbsp; Here's how it works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Perf Watson monitors VS to make sure it is responsive.&lt;/li&gt;
&lt;li&gt;If VS goes unresponsive for more than 2 seconds, Perf Watson captures the call stack.&lt;/li&gt;
&lt;li&gt;Perf Watson then times how long it takes for&amp;nbsp;VS to become responsive again.&lt;/li&gt;
&lt;li&gt;Perf Watson then takes the stack information along with the total time of the hang and sends it to Microsoft.&lt;/li&gt;
&lt;li&gt;We take the data and load it into a database.&lt;/li&gt;
&lt;li&gt;We analyze the data to see which call stacks are causing the longest hangs that impact the most customers, and then we log bugs on those.&lt;/li&gt;
&lt;li&gt;Since we have call stacks, the bugs tend to be very actionable.&lt;/li&gt;
&lt;li&gt;Since the data comes from real customers, with hit counts and severity information, we have no trouble prioritizing these performance hangs appropriately - we know exactly what they're costing you.&lt;/li&gt;
&lt;/ol&gt;
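&lt;p&gt;The monitoring steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Perf Watson's actual implementation: the &lt;code&gt;HangMonitor&lt;/code&gt; class, its method names, and the heartbeat mechanism are all assumptions, and only the 2 second threshold comes from the list above.&lt;/p&gt;
&lt;pre&gt;
```python
HANG_THRESHOLD_SECONDS = 2.0  # from the post: unresponsive for more than 2 seconds


class HangMonitor:
    """Hypothetical watchdog; every name here is illustrative, not Perf Watson's."""

    def __init__(self, threshold=HANG_THRESHOLD_SECONDS):
        self.threshold = threshold
        self.last_heartbeat = 0.0
        self.pending_stack = None   # stack captured for the hang in progress
        self.reports = []           # finished (stack, hang_seconds) pairs

    def poll(self, now, capture_stack):
        """Called periodically by the monitor; captures one stack per hang."""
        silent_for = now - self.last_heartbeat
        if silent_for > self.threshold and self.pending_stack is None:
            self.pending_stack = capture_stack()

    def heartbeat(self, now):
        """Called by the UI thread whenever it processes a message."""
        if self.pending_stack is not None:
            # Responsive again: record the total length of the hang.
            self.reports.append((self.pending_stack, now - self.last_heartbeat))
            self.pending_stack = None
        self.last_heartbeat = now


monitor = HangMonitor()
monitor.heartbeat(0.0)
monitor.poll(1.0, lambda: "stack A")   # 1 s of silence: under the threshold
monitor.poll(2.5, lambda: "stack B")   # 2.5 s of silence: stack captured
monitor.heartbeat(6.0)                 # responsive again at t = 6 s
print(monitor.reports)                 # [('stack B', 6.0)]
```
&lt;/pre&gt;
&lt;p&gt;Timestamps are passed in explicitly so the logic can be exercised without real threads or sleeps. A real monitor would presumably call something like &lt;code&gt;poll&lt;/code&gt; from a timer on a separate thread and upload the reports afterwards.&lt;/p&gt;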
&lt;p&gt;We've been using Perf Watson internally for a while now, and we've been able to identify and fix a lot of problems just based on our internal product usage.&amp;nbsp; We've also seen some nice correlation between issues raised by Perf Watson and issues found through other analysis.&amp;nbsp; This correlation helps us know that we're on the right track and gives extra weight to getting these issues resolved.&lt;/p&gt;
&lt;p&gt;But our usage patterns aren't the same as yours.&amp;nbsp; Everyone uses VS a little bit differently.&amp;nbsp; That's why we're excited to put Perf Watson in your hands so that we can get an accurate picture of the performance issues you're running into, with strong metrics and data to back up the work we need to do.&lt;/p&gt;
&lt;p&gt;I hope you'll decide to download and install Perf Watson.&amp;nbsp; &lt;/p&gt;
&lt;p&gt;Regards,&lt;/p&gt;
&lt;p&gt;David Berg&lt;br /&gt;DDPE&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10161686" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Perf+Watson/">Perf Watson</category></item><item><title>Performance Troubleshooting Article and VS2010 SP1 Change</title><link>http://blogs.msdn.com/b/ddperf/archive/2011/03/01/visual-studio-troubleshooting.aspx</link><pubDate>Wed, 02 Mar 2011 00:48:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10135762</guid><dc:creator>David Berg</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10135762</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2011/03/01/visual-studio-troubleshooting.aspx#comments</comments><description>&lt;p&gt;Jason Zander just posted an article on &lt;a href="http://blogs.msdn.com/b/jasonz/archive/2011/03/03/performance-troubleshooting-article-and-vs2010-sp1-change.aspx"&gt;Performance Troubleshooting Article and VS2010 SP1 Change&lt;/a&gt;, where he talks about some changes we made in SP1 and links to an article on &lt;a href="http://msdn.microsoft.com/en-us/vstudio/ff716700"&gt;Visual Studio (Performance) Troubleshooting&lt;/a&gt;.&amp;nbsp; Check it out and let us know if it helps, and what other type of information would be useful.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10135762" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Virtual+Memory/">Virtual Memory</category></item><item><title>Visual Studio 2010 Survey</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/10/10/visual-studio-2010-survey.aspx</link><pubDate>Sun, 10 Oct 2010 14:18:38 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10073795</guid><dc:creator>David Berg</dc:creator><slash:comments>6</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10073795</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/10/10/visual-studio-2010-survey.aspx#comments</comments><description>&lt;p&gt;We'd like to know what you think about Visual Studio 2010.&amp;nbsp; &lt;/p&gt;
&lt;p&gt;We are especially interested in hearing about your experience with regards to &lt;strong&gt;performance&lt;/strong&gt;, &lt;strong&gt;reliability&lt;/strong&gt;,&lt;strong&gt; &lt;/strong&gt;and&lt;strong&gt; quality&lt;/strong&gt;.&amp;nbsp; The more details you share with us in this survey, the better we can understand your&amp;nbsp;experience and apply what we learn into future versions.&lt;/p&gt;
&lt;p&gt;The survey is very short and should take you no more than a few minutes. You can get&amp;nbsp;started by clicking on&amp;nbsp;&lt;a href="http://go.microsoft.com/fwlink/?LinkId=203459" title="Visual Studio 2010 Survey"&gt;&lt;strong&gt;&lt;span style="color: #366df4;"&gt;Visual Studio 2010 Survey&lt;/span&gt;&lt;/strong&gt;&lt;/a&gt;.&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;It's important that we hear from you whether you love us or hate us.&amp;nbsp; The more people who respond, the better we can understand how we're doing and whether or not we're doing the right things to make a difference.&lt;/p&gt;
&lt;p&gt;Regards and thanks for your time,&lt;/p&gt;
&lt;p&gt;David Berg&lt;br /&gt;Developer Division Performance Engineering Team&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10073795" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Quality/">Quality</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Survey/">Survey</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Reliability/">Reliability</category></item><item><title>VS2010 Performance and Bad Video Drivers/Hardware - Redux</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/09/16/vs2010-performance-and-bad-video-drivers-hardware-redux.aspx</link><pubDate>Fri, 17 Sep 2010 06:19:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10063790</guid><dc:creator>David Berg</dc:creator><slash:comments>2</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10063790</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/09/16/vs2010-performance-and-bad-video-drivers-hardware-redux.aspx#comments</comments><description>&lt;p&gt;&lt;span style="font-size: x-small;"&gt;Since we shipped Visual Studio 2010 we've continued to have a small but notable series of complaints about performance that we've been able to attribute to bugs in video drivers and GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;The issue first came up back during &lt;/span&gt;&lt;a href="http://blogs.msdn.com/b/ddperf/archive/2009/10/29/vs2010-performance-and-bad-video-drivers-hardware.aspx"&gt;&lt;span style="font-size: x-small;"&gt;VS 2010 beta in October of 2009&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: x-small;"&gt;.&amp;nbsp; Since then we've learned that while old, buggy drivers are the usual cause, some newer drivers and GPUs aren't as good at supporting VS's UI as we'd like.&amp;nbsp; (This is also an issue with VMs and VM hosts as Video Virtualization technology isn't very good and requires CPU level TLB virtualization support for decent performance, which most CPUs don't have.)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;Fortunately, the software rendering inside WPF is pretty good, so the easy fix here is to force WPF to ignore the GPU and use software rendering (I've tested this on my own system, and I found that WPF's software rendering was actually &lt;span style="text-decoration: underline;"&gt;slightly&lt;/span&gt; faster than GPU based rendering on my high end CPU with a mid-range graphics card - your mileage may vary).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;But first, if you're seeing slow / broken screen updates you should verify you have the latest display drivers for your system.&amp;nbsp; (See "&lt;/span&gt;&lt;a href="http://support.microsoft.com/kb/963021"&gt;&lt;span style="font-size: x-small;"&gt;Guidelines for troubleshooting graphics issues in WPF applications&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: x-small;"&gt;" for more information.)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;If that doesn't fix it, then there are three ways to force WPF to use software rendering.&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;First and preferred, the final RTM version of VS2010 includes a &lt;a href="http://support.microsoft.com/kb/2023207"&gt;UI for forcing hardware rendering off&lt;/a&gt; - for just VS.&amp;nbsp; With VS2010 open, go to Tools | Options, then select Environment | General (as shown below).&amp;nbsp; Then uncheck "Automatically adjust visual experience..." and "Use hardware graphics acceleration..."&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;&lt;img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/CommunityServer-Blogs-Components-WeblogFiles/00-00-01-00-79/0513.Turn-off-HW-acceleration.png" border="0" /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;That should be sufficient.&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;However, if you want to &lt;/span&gt;&lt;span style="font-size: x-small;"&gt;force software rendering mode for ALL WPF applications (not just VS), you have a second option.&amp;nbsp; Change (or add) one registry key:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;&amp;nbsp;[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;"DisableHWAcceleration"=dword:00000001&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;Note that this key probably won't exist, and you'll probably need to create it.&amp;nbsp; To turn hardware acceleration back on, just change the "1" to a "0".&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;A third alternative is to adjust the hardware acceleration options from the display control panel.&amp;nbsp; However, we don't recommend this option as it impacts the entire machine, the details vary by manufacturer, and the exact impact of all the different options is untested.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;If you try any of these - let us know how it works for you (&lt;/span&gt;&lt;a href="mailto:DevPerf@Microsoft.com"&gt;&lt;span style="font-size: x-small;"&gt;DevPerf@Microsoft.com&lt;/span&gt;&lt;/a&gt;&lt;span style="font-size: x-small;"&gt;).&amp;nbsp; If it does improve performance, be sure to let us know how much and attach a DXDIAG output so we'll know which video driver / hardware configurations aren't working well.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;Regards,&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;David Berg&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: x-small;"&gt;Developer Division Performance Engineering Team&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10063790" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/WPF/">WPF</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Video/">Video</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Display+Drivers/">Display Drivers</category></item><item><title>Visual Studio 2010 runs faster when the Windows Automation API 3.0 is installed</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/08/16/visual-studio-2010-runs-faster-when-the-windows-automation-api-3-0-is-installed.aspx</link><pubDate>Mon, 16 Aug 2010 13:08:34 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10050496</guid><dc:creator>David Berg</dc:creator><slash:comments>11</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10050496</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/08/16/visual-studio-2010-runs-faster-when-the-windows-automation-api-3-0-is-installed.aspx#comments</comments><description>&lt;p&gt;If you're running Visual Studio 2010 on XP or Vista you may benefit from installing this upgrade: &lt;a href="http://support.microsoft.com/kb/981741"&gt;http://support.microsoft.com/kb/981741&lt;/a&gt;&lt;/p&gt;
&lt;p style="padding-left: 30px;"&gt;Applications that use Windows Automation APIs can significantly decrease Microsoft Visual Studio IntelliSense performance if Windows Automation API 3.0 is not installed. For example, the Windows pen and touch services can significantly decrease Visual Studio IntelliSense performance if Windows Automation API 3.0 is not installed. This article describes how to install the Windows Automation API 3.0 update. This update is available as a stand-alone download for 32-bit editions of Windows XP and for Windows Server 2003. This update is not available for 64-bit editions of Windows XP. The Windows Automation API is a component of the platform update for Windows Vista and of the platform update for Windows Server 2008. &lt;/p&gt;
&lt;p&gt;Note that if you're running Windows XP SP2 you'll be told the patch isn't applicable; that's because you need to upgrade to XP SP3 first.&amp;nbsp; If you're running Vista or Windows Server 2008 you may already have the upgrade, since it's part of the automatic updates.&amp;nbsp; If not, you can download the upgrade at the link above.&amp;nbsp; Windows 7 and Windows Server 2008 R2 ship with Windows Automation API 3.0, so no upgrade is required.&lt;/p&gt;
&lt;p&gt;The problem is that earlier versions of the Windows Automation API try to read the entire contents of list boxes that we post to the screen, which blocks us from virtualizing them.&amp;nbsp; This is particularly a problem with IntelliSense as the number of items in an IntelliSense list is HUGE and it pops up on every character you type.&lt;/p&gt;
&lt;p&gt;The API is typically activated when using accessibility devices (such as screen readers), pen / tablet computers,&amp;nbsp;or touch devices, but some software activates it anyway (including iPhone synchronization).&amp;nbsp; Once activated it affects the entire system.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10050496" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/WPF/">WPF</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Intellisense/">Intellisense</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Windows+Automation+API/">Windows Automation API</category></item><item><title>Are you a candidate to run Visual Studio 2010 on a 64-bit OS?</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/04/29/your-visual-studio-2010-dream-machine.aspx</link><pubDate>Thu, 29 Apr 2010 07:56:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10004608</guid><dc:creator>David Berg</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=10004608</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/04/29/your-visual-studio-2010-dream-machine.aspx#comments</comments><description>&lt;P&gt;Brian Harry's just posted an article on configuring an ideal Visual Studio development machine.&amp;nbsp; You can read about it here: &lt;A href="http://blogs.msdn.com/bharry/archive/2010/04/29/your-visual-studio-2010-dream-machine.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2010/04/29/your-visual-studio-2010-dream-machine.aspx"&gt;http://blogs.msdn.com/bharry/archive/2010/04/29/your-visual-studio-2010-dream-machine.aspx&lt;/A&gt;. By the way, if you scroll down and peruse the comments that customers have posted there, you will see recommendations for several other configuration options. 
Some of these go well beyond the simpler &amp;amp; less expensive ones Brian discussed. If you rely on Visual Studio in your daily work, you may want to give serious consideration to some of these advanced configuration options.&lt;/P&gt;
&lt;P&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;One of the topics Brian addresses in his blog is the benefit of running Visual Studio on a 64-bit OS. We’d like to drill into that topic a little deeper in this post.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual Studio’s devenv.exe process still runs as a 32-bit process. Under a 32-bit OS, a 32-bit process can only address up to 2 GB of private virtual memory. (For the sake of simplicity, we are going to ignore the &lt;B style="mso-bidi-font-weight: normal"&gt;/3 GB&lt;/B&gt; boot option for 32-bit Windows machines.) The remaining 2 GB of the 4 GB virtual address space is reserved for system addresses. This 2 GB max is an architectural limit on the size of a 32-bit process. All the code and data that gets loaded has to fit in this 2 GB virtual memory space. This may seem a little strange, but as developers working on Visual Studio, but we consider your code – and forms, XAML, DBML, DGML, PDBs for debugging, TFS Work items, and whatever else your Solution loads – as the data Visual Studio has to load &amp;amp; process. The problem arises when Visual Studio needs to load some combination of our code and your data that exceeds this 2 GB architectural limit.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Actually, memory fragmentation issues prevent a 32-bit process like devenv.exe from ever reaching its 2 GB architectural limit. Due to fragmentation, when a 32-bit process address space starts to grow into the 1.7 – 1.8 GB range, the risk that a virtual memory allocation request will fail starts to increase sharply. When a virtual memory allocation request fails, Visual Studio encounters an End of Memory error and crashes. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Obviously, this is a scenario you want to avoid. The cleanest way to get around the problem is to run Visual Studio on a 64-bit OS. On a 64-bit version of Windows, the private area of a 32-bit process expands to encompass the full 4 GB virtual memory addressing range. Due to fragmentation, you can’t quite get to the 4 GB upper limit either, but the effect of the change is to allow devenv.exe to address twice as much private virtual memory. Please don’t take this as a challenge, but&amp;nbsp;we have yet to see a customer scenario that exhausts the full 4 GB address range. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;So, if you think you could be a good candidate for running Visual Studio 2010 under a 64-bit OS, here is what to look for.&amp;nbsp;We will run through a scenario we ran recently on the final RTM version of Visual Studio 2010 Ultimate. The test project is a smallish Web solution with a handful of simple ASP.NET Pages that use LINQ to query the MS SQL Server AdventureWorks demo database. This is a version of an app&amp;nbsp;one of us&amp;nbsp;built last year originally for stress testing some Visual Studio components he was working on. We would characterize this app as “borderline realistic.” For example, the AdventureWorks database has fifty or so inter-related Tables, and its SaleOrderDetail Table contains in excess of 120,000 rows. So it is not a trivial “Hello World” type of app, but the web forms themselves are pretty basic, lacking many of the UI components that you are likely to put into a real-world e-commerce application. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Table 1 shows the main steps in my test scenario and the amount of virtual memory Visual Studio consumed at the end of end of each step. The tool&amp;nbsp;we recommend for measuring virtual memory usage at the process level is &lt;/FONT&gt;&lt;A href="http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;VMMap&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;, one of the free Sysinternals utilities. With VMMap, I can see the overall virtual memory usage of the devenv.exe process, broken down according to different types of allocations, which is something&amp;nbsp;we will drill into in a moment.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Here is a summary of the steps of the scenario&amp;nbsp;we ran and the amount of virtual memory allocated by the devenv.exe process at the end of each step.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;
&lt;TABLE style="BORDER-COLLAPSE: collapse; mso-yfti-tbllook: 1184; mso-padding-alt: 0in 0in 0in 0in" class=MsoNormalTable border=0 cellSpacing=0 cellPadding=0 class="MsoNormalTable"&gt;
&lt;TBODY&gt;
&lt;TR style="mso-yfti-irow: 0; mso-yfti-firstrow: yes"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: black 1pt solid; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;SPAN style="FONT-SIZE: 12pt"&gt;&lt;FONT face=Calibri&gt;Step&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: black 1pt solid; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;SPAN style="FONT-SIZE: 12pt"&gt;&lt;FONT face=Calibri&gt;Scenario&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: black 1pt solid; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;SPAN style="FONT-SIZE: 12pt"&gt;&lt;FONT face=Calibri&gt;VM Usage&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 1"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;1&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Open VS with empty solution&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;300&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 2"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;2&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Open VS with web solution:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpFirst&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Solution Explorer, Team Explorer, and Server Explorer&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;&amp;nbsp;Connect to TFS&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;One .cs file opened in the VS Editor&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;One simple web form opened in Split mode&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;One (.dbml) Data Designer opened&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l1 level1 lfo1" class=MsoListParagraphCxSpLast&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;One TFS Work Item opened&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;910&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 3"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;3&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Step 2, plus F5 to Debug Solution, and run to a breakpoint&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-list: l0 level1 lfo2" class=MsoListParagraph&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Intellitrace is active at its low (default) setting&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;975&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 4"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;4&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Step 3, run to Breakpoint; plus step 10x and then run to 2&lt;SUP&gt;nd&lt;/SUP&gt; Breakpoint&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;1060&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 5; mso-yfti-lastrow: yes"&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: black 1pt solid; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 45.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=61&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;5&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 369pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=492&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Step 4, with Resharper installed&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD style="BORDER-BOTTOM: black 1pt solid; BORDER-LEFT: #f0f0f0; PADDING-BOTTOM: 0in; BACKGROUND-COLOR: transparent; PADDING-LEFT: 5.4pt; WIDTH: 63.9pt; PADDING-RIGHT: 5.4pt; BORDER-TOP: #f0f0f0; BORDER-RIGHT: black 1pt solid; PADDING-TOP: 0in" vAlign=top width=85&gt;
&lt;P style="TEXT-ALIGN: center; MARGIN: 0in 0in 0pt" class=MsoNormal align=center&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;&amp;gt; 1300&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;FONT face=Calibri&gt;Table 1. Virtual memory usage of Visual Studio at the end of each scenario step.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;In Step 5,&amp;nbsp;we repeated the full scenario, but this time with Resharper, a popular Visual Studio add-in, installed. As Figure 1 shows, at that point, the virtual memory footprint of Visual Studio 2010 exceeds 1.3 GB. Committed bytes exceeds 1.2 GB. While that is not a particularly dangerous amount of virtual memory for Visual Studio to consume, it is still enough to start to make you wary. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;IMG style="WIDTH: 531px; HEIGHT: 363px" title="VMMap screen shot of VS 2010" alt="VMMap screen shot of VS 2010" align=left src="http://rtsdpg.bay.livefilestore.com/y1pGdYZF0QneXOC46VEipluJdVwW1bXzhlItJvCdr6yPqwTB1Q4c-iq1v8vjJRE_5hTL2IjmPH6-pl-X1bgfoqa6wnNg-HF4Djl/VS%20vm%20usage%20with%20resharper%20Screen%20Snaper%20Image.jpg" width=531 height=363 mce_src="http://rtsdpg.bay.livefilestore.com/y1pGdYZF0QneXOC46VEipluJdVwW1bXzhlItJvCdr6yPqwTB1Q4c-iq1v8vjJRE_5hTL2IjmPH6-pl-X1bgfoqa6wnNg-HF4Djl/VS%20vm%20usage%20with%20resharper%20Screen%20Snaper%20Image.jpg"&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;Figure 1. VMMAP report on Visual Studio’s virtual memory usage for Step 5 in the scenario.&amp;nbsp; (&lt;A title="VMMap report for Visual Studio's devenv process" href="http://rtsdpg.bay.livefilestore.com/y1pGdYZF0QneXOC46VEipluJdVwW1bXzhlItJvCdr6yPqwTB1Q4c-iq1v8vjJRE_5hTL2IjmPH6-pl-X1bgfoqa6wnNg-HF4Djl/VS%20vm%20usage%20with%20resharper%20Screen%20Snaper%20Image.jpg" target=_blank mce_href="http://rtsdpg.bay.livefilestore.com/y1pGdYZF0QneXOC46VEipluJdVwW1bXzhlItJvCdr6yPqwTB1Q4c-iq1v8vjJRE_5hTL2IjmPH6-pl-X1bgfoqa6wnNg-HF4Djl/VS%20vm%20usage%20with%20resharper%20Screen%20Snaper%20Image.jpg"&gt;Click&lt;/A&gt; for full size image.)&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/FONT&gt;&lt;FONT size=3&gt;From the standpoint of Visual Studio responsiveness, it is worth noting that the high water mark for the devenv.exe working set was just under 600 MB in Step 5. So this is a scenario that can still readily fit in a machine with the minimum of 1 GB of RAM, and should perform quite well on a machine with 2 GB of RAM.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Drilling into the virtual memory footprint using the VMMap report shown in Figure 1, we see that the largest single contributor is Image files, that is, the code that Visual Studio is loading. For the scenario in Step 5, Visual Studio needs to load 775 MB of executable code.&amp;nbsp; We zoomed the VMMap detail report into the Image data and then sorted by Image file size. You can see that many of the image files that Visual Studio loads are quite large, in excess of 10 MB. Meanwhile, private data areas accounted for only about 220 MB of virtual memory.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;One thing to note about the Image files that get loaded&lt;SPAN style="COLOR: #1f497d"&gt; &lt;/SPAN&gt;in Visual Studio: the more components of the IDE you use, the more code gets loaded. And once loaded, Visual Studio does not have a mechanism to unload components that you are no longer using. The Image File footprint just keeps on growing throughout your Visual Studio session. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;So, take this simple scenario and apply it to a very large project or solution and exercise a few more Visual Studio components like architectural diagrams and performance profiling, and you may quickly be up against the architectural limit of a 2 GB private process virtual address space. As Visual Studio virtual memory usage approaches that limit, the product becomes unstable on a 32-bit OS.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;As Brian’s blog discusses, to avoid possible instability problems, the workaround is to install and run Visual Studio on a 64-bit version of the OS. Hopefully, this discussion will help you understand better whether you are a good candidate to run Visual Studio on a 64-bit version of the OS. Running VMMap to gain a more detailed look at Visual Studio’s virtual memory usage in your specific environment can also be useful.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;-- David Berg and Mark Friedman&lt;o:p&gt;&amp;nbsp;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;o:p&gt;posted May 3, 2010&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10004608" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Hardware/">Hardware</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category></item><item><title>Measuring Processor Utilization and Queuing Delays in Windows applications</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/04/04/measuring-processor-utilization-and-queuing-delays-in-windows-applications.aspx</link><pubDate>Sun, 04 Apr 2010 18:20:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9990328</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9990328</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/04/04/measuring-processor-utilization-and-queuing-delays-in-windows-applications.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Continuing my answer to the mail I received recently from Uriel Carrasquilla… &lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Uri’s note, reprinted in the previous &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2010/01/27/statistical-process-control-techniques-in-performance-monitoring-and-alerting.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2010/01/27/statistical-process-control-techniques-in-performance-monitoring-and-alerting.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;post&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;, refers to an "issue" associated with the current technique for measuring processor utilization in Windows. As my reply mentioned, these are documented and well-understood issues. At the core is the methodology used to calculate processor utilization that was originally designed 20 years ago for Windows NT. Since one of the original goals of Windows NT was to be hardware independent, the measurement methodology was also designed so that it was not dependent on any specific set of hardware measurement features. And therein lies a tale.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;This methodology was amply documented in my &lt;I style="mso-bidi-font-style: normal"&gt;Windows 2003 Server Performance Guide&lt;/I&gt;, published in the Windows 2003 Server Resource Kit. As you know, I wrote that book before I came to work at Microsoft as a full-time employee, but the Windows Server performance team that helped me at the time was quite open to documenting the existing facility, warts and all.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Briefly summarizing what I wrote there (and which still applies today), the % Processor Time counters in Windows are measurements derived using a sampling technique. The OS Scheduler samples the state of the CPU once per system clock tick, driven by a high priority timer-based interrupt. Currently, the clock tick interval the OS Scheduler uses is usually 15.6 ms. (The precise value that the OS uses between timer interrupts is available by calling &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/ms724394(VS.85).aspx%20" mce_href="http://msdn.microsoft.com/en-us/library/ms724394(VS.85).aspx%20"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;GetSystemTimeAdjustment()&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt; function.) If the processor is running the Idle loop when the quantum interrupt occurs, it is recorded as an Idle Time sample. If the processor is running some application thread, that is recorded as a busy sample. Busy samples are accumulated continuously at both the thread and process level. When a clock interrupt occurs, the Scheduler performs a number of other tasks, including adjusting the dispatching priority of threads that are currently executing, stopping the progress of any thread that has exceeded its time slice, as well as performing its CPU accounting. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
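&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To make the sampling idea concrete, here is a minimal Python sketch of a tick-based sampler. It is a toy model, not Windows Scheduler code: the function name, the two workloads, and the 5 ms burst length are all assumptions made up for illustration. Only the 15.6 ms tick interval comes from the discussion above.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
# Toy model of tick-based CPU sampling; a sketch, not Windows scheduler code.
TICK_US = 15_600  # 15.6 ms clock tick from the text, in microseconds


def sampled_vs_actual(busy_intervals, duration_us, tick_us=TICK_US):
    """Return (sampled busy fraction, actual busy fraction).

    busy_intervals holds half-open (start_us, end_us) spans in which the
    processor runs a thread; the rest of the time it runs the Idle loop.
    """
    ticks = range(tick_us, duration_us + 1, tick_us)
    # A tick is a busy sample when it lands inside any busy span.
    busy_samples = sum(
        1 for t in ticks
        if any(t in range(start, end) for start, end in busy_intervals)
    )
    actual_busy_us = sum(end - start for start, end in busy_intervals)
    return busy_samples / len(ticks), actual_busy_us / duration_us


DURATION_US = 1_000 * TICK_US  # 1,000 ticks, about 15.6 seconds

# Workload A (hypothetical): one long burst covering the first half.
steady = [(0, DURATION_US // 2)]

# Workload B (hypothetical): a 5 ms burst that starts 1 ms after every
# tick, so no tick ever lands inside a burst.
phased = [(k * TICK_US + 1_000, k * TICK_US + 6_000) for k in range(1_000)]

print(sampled_vs_actual(steady, DURATION_US))
print(sampled_vs_actual(phased, DURATION_US))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Under these assumptions, the sampled figure for workload A is within one sample of the true 50%, while workload B is recorded as completely idle even though it uses roughly 32% of the processor. Real workloads are rarely this perfectly phased, but the sketch shows why activity that is synchronized with the clock tick can be badly mis-sampled.&lt;/FONT&gt;&lt;/P&gt;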
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;The &lt;I style="mso-bidi-font-style: normal"&gt;System\Processor Queue Length&lt;/I&gt; counter in Perfmon is an &lt;I style="mso-bidi-font-style: normal"&gt;instantaneous&lt;/I&gt; counter that reflects the current number of Ready threads waiting in the OS Scheduler queue. When a performance monitoring application such as Perfmon requests the Processor Queue counter, there is a measurement function that traverses the Scheduler Ready Queue and counts the number of threads waiting for an available processor. Thus, the &lt;I style="mso-bidi-font-style: normal"&gt;System\Processor Queue Length&lt;/I&gt; counter represents one sampled observation, and needs to be interpreted with that in mind. (If memory serves, the data collection process that Charles’ analysis relies upon gathers samples of this measurement several times per minute, &amp;amp; his servers are not idle by design. Which basically means I think what he is doing is just fine.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;What I believe Uri is referring to with regard to this processor Queue Length metric not being “correct” is anomalies in this measurement that are due to the kind of phased behavior you can sometimes see on an otherwise idle system. Even on a mostly idle system, a sizable number of threads can be waiting on the same clock interrupt (typically, polling the system state once per second), one of which also happens to be the Perfmon measurement thread, also cycling once per second. These sleeping threads tend to clump together so that they get woken up &lt;I style="mso-bidi-font-style: normal"&gt;at the exact same time&lt;/I&gt; by the timer interrupt. (As I mentioned, this happens mainly when the machine is idling with little or no real work to do.) These awakened threads then flood the OS dispatching queue at exactly the same time. If one of these threads is the Perfmon measurement thread that gathers the Processor Queue Length measurement, you can see how this “clumping” behavior could distort the measurements. The Perfmon measurement thread executes at an elevated priority level of 15, so it is scheduled for execution ahead of any other User mode threads that were also awakened by the same Scheduler clock tick. The effect is that at the precise time when the Processor ready queue length is measured, there are likely to be a fair number of Ready Threads. Compared to the modeling assumption where processor scheduling is subject to random arrivals, one observes a disproportionate number of Ready Threads waiting for service, even (or especially) when the processor itself is not very busy overall.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
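&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The size of this distortion can be illustrated with a small Python sketch. It is a deliberately simplified model: a single processor, ten polling threads, and a 1 ms burst per thread are all hypothetical numbers chosen for the example, and the time the measurement thread itself spends running is ignored.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
# Toy model of the "clumping" effect; every number here is hypothetical.
PERIOD_US = 1_000_000  # each polling thread wakes once per second
N_THREADS = 10         # polling threads released by the same clock tick
RUN_US = 1_000         # CPU burst each thread needs per wake-up


def waiting_threads(dispatched):
    """Ready threads still queued after `dispatched` of them got the CPU."""
    return max(0, N_THREADS - dispatched)


# The measurement thread runs first (priority 15 in the text), so it
# counts the queue before any polling thread has been dispatched.
monitor_observation = waiting_threads(0)

# Time-weighted average: while polling thread i runs, i + 1 threads have
# been dispatched and the rest are still waiting.
queued_time_us = sum(waiting_threads(i + 1) * RUN_US for i in range(N_THREADS))
time_average = queued_time_us / PERIOD_US

utilization = N_THREADS * RUN_US / PERIOD_US

print(monitor_observation, time_average, utilization)
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In this model the measurement thread reports a queue length of 10 on a processor that is only 1% busy, while the time-weighted average queue length over the full second is 0.045.&lt;/FONT&gt;&lt;/P&gt;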
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;This anomaly is best characterized as a &lt;I style="mso-bidi-font-style: normal"&gt;low-utilization effect&lt;/I&gt; that perturbs the measurement when the machine is loafing. It generally ceases to be an issue when processor utilization climbs or there are more available processors on the machine. But this bunching of timer-based interrupts remains a serious concern, for instance, whenever Windows is running as a guest virtual machine under VMware or Hyper-V. (Please don’t get me started on that topic.) &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Another interesting side discussion is how this clumping of timer-based interrupts interacts with power management, but I do not intend to go there either.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;To summarize, the CPU utilization measurements at the system, process and thread level in Windows are based on a sampling methodology. Similarly, the processor queue length is also sampled. Like any sampling approach, the data gathered is subject to typical sampling errors, including &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpFirst&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;accumulating a sufficient number of sample observations to be able to make a reliable statistical inference about the underlying population, and&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 10pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpLast&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;ensuring that there aren’t systematic sources of sampling error that cause sub-classes of the underlying population to be markedly under- or over-sampled.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;So, these CPU measurements face familiar issues with sampling size and the potential for systematic sampling bias, as well as the usual difficulty in ensuring that the sample data is &lt;I style="mso-bidi-font-style: normal"&gt;representative&lt;/I&gt; of the underlying population (something known as &lt;I style="mso-bidi-font-style: normal"&gt;non-sampling error&lt;/I&gt;). For example, the interpretation of the CPU utilization data that Perfmon gathers at the process and thread level is subject to limitations based on a small sample size for collection intervals less than, for example, 15 seconds. The &lt;I style="mso-bidi-font-style: normal"&gt;Performance Guide&lt;/I&gt; has a more detailed discussion of these issues, if you are interested.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;FONT face=Calibri&gt;&lt;o:p&gt;
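&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;As a rough, back-of-the-envelope illustration of the sample-size point, the Python sketch below counts how many 15.6 ms clock ticks fit into a collection interval and estimates the standard error of a sampled utilization figure. It assumes the samples are independent and uses the usual binomial formula, which is exactly the assumption that systematic bias violates, so treat the results as a best case. The 10% utilization value is hypothetical.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import math

TICK_MS = 15.6  # clock tick interval quoted in the text


def samples_in(interval_s, tick_ms=TICK_MS):
    """Number of whole clock ticks (samples) in a collection interval."""
    return int(interval_s * 1000 // tick_ms)


def standard_error(p, n):
    """Standard error of a sampled proportion p over n independent samples."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical thread that is truly busy 10% of the time.
for interval_s in (1, 15, 60):
    n = samples_in(interval_s)
    print(interval_s, n, round(standard_error(0.10, n), 4))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;With these assumptions, a 1-second interval yields only 64 samples and a standard error of nearly 4 percentage points, whereas a 15-second interval yields 961 samples and a standard error just under 1 percentage point.&lt;/FONT&gt;&lt;/P&gt;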
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd&gt;&lt;FONT face=Cambria&gt;&lt;FONT size=3&gt;Event-driven measurement approaches.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;What exactly Windows is going to do about this, I couldn't say -- I work in a different part of the company -- but I have consistently lobbied for a more accurate, event-driven approach to gathering CPU measurements. Efficient power management, for example, strongly argues for an event-driven approach. You do not want the OS to wake up periodically on an idle machine that could otherwise be powered down just to perform its CPU usage accounting duties. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;It is encouraging that recent versions of Windows have taken major steps in this direction, supporting a more accurate event-driven approach using instrumentation added to the Scheduler to measure CPU utilization. The improvements in this area have not gotten much notice, which is something I will try to rectify a bit here. Windows exploits a machine timing facility that is present in both x86 and x64 hardware, namely, the &lt;B style="mso-bidi-font-weight: normal"&gt;rdtsc&lt;/B&gt; Read TimeStamp Counter instruction. Moreover, these improvements position the Windows OS so it can replace its legacy CPU measurement facility with something more reliable and accurate sometime in the near future. Unfortunately, converting all existing features in Windows, including Perfmon and Task Manager, to support the new measurements is a big job, not without its complications and not always as straightforward as one would hope. (One of the complications is using &lt;B style="mso-bidi-font-weight: normal"&gt;rdtsc&lt;/B&gt; on older hardware where the hardware tick rate changes when there is a power management event. There are also issues of clock drift across multiprocessor cores when they are not re-synchronized periodically. I do not want to take the time to discuss these issues in detail here.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
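&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;For contrast with the sampling approach, here is a minimal Python sketch of what event-driven accounting looks like in principle: each context switch records a timestamp, and the elapsed time between switches is charged to the thread that was running. This is only an illustration of the idea, not the actual Scheduler instrumentation. The function name, the event list, and the timestamp values are hypothetical, and plain integers stand in for &lt;B style="mso-bidi-font-weight: normal"&gt;rdtsc&lt;/B&gt; readings.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import defaultdict


def account_cpu(switch_events, end_ts):
    """Charge elapsed timestamp units to whichever thread was running.

    switch_events is a time-ordered list of (timestamp, thread_name)
    pairs, one per context switch; "idle" stands for the Idle loop.
    """
    usage = defaultdict(int)
    following = switch_events[1:] + [(end_ts, None)]
    for (ts, thread), (next_ts, _) in zip(switch_events, following):
        usage[thread] += next_ts - ts
    return dict(usage)


# Hypothetical trace: a worker runs two 5,000-unit bursts between ticks.
events = [
    (0, "idle"),
    (1_000, "worker"),
    (6_000, "idle"),
    (16_600, "worker"),
    (21_600, "idle"),
]
usage = account_cpu(events, end_ts=31_200)
print(usage)
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Because every switch is timestamped, the worker is charged exactly the 10,000 units it ran, no matter how its bursts line up with the periodic clock tick.&lt;/FONT&gt;&lt;/P&gt;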
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;To see this new processor measurement facility at work, access the new Resource Monitor application (resmon.exe) that is available beginning in Vista and Windows Server 2008. Resource Monitor can be launched directly from the command line, or from either Performance Monitor or Task Manager. In case you are not familiar with it, here is a screen shot that shows Resource Monitor in action on a Windows Server 2008 R2 machine, calculating CPU utilization over the last 60 seconds of operation, breaking out that utilization by process. The CPU utilization measurements that ResMon calculates are based on new OS Scheduler instrumentation. These measurements are very accurate, about as good as it gets from a vantage point inside the OS.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 402px; HEIGHT: 302px" title="Windows Resource Monitor" alt="Windows Resource Monitor" align=left src="http://5l3vgw.bay.livefilestore.com/y1pX1wNrsIvjzPmgE1gAtYmHpWhyV51wumpGCzUzdb1hVKHtsSIMsvqf1LAaymfwIr8WhHs612ZOfbnAc_zwq4MHRNI1EIGW7ug/Win7%20Resource%20Monitor.jpg" width=402 height=302 mce_src="http://5l3vgw.bay.livefilestore.com/y1pX1wNrsIvjzPmgE1gAtYmHpWhyV51wumpGCzUzdb1hVKHtsSIMsvqf1LAaymfwIr8WhHs612ZOfbnAc_zwq4MHRNI1EIGW7ug/Win7%20Resource%20Monitor.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;Figure 1. The Windows 7 Resource Monitor application.&amp;nbsp; (&lt;A title="Windows Resource Manager" href="http://public.bay.livefilestore.com/y1pmGNsLRE7KXC71flyu1tDbdviKnxOsXXi4gjqrnPoelib2crxOBZa-nLKVN7LyRX7GZoevqdD1wcHt3_ZwnhNUQ/Win7%20Resource%20Monitor.jpg" target=_blank mce_href="http://public.bay.livefilestore.com/y1pmGNsLRE7KXC71flyu1tDbdviKnxOsXXi4gjqrnPoelib2crxOBZa-nLKVN7LyRX7GZoevqdD1wcHt3_ZwnhNUQ/Win7%20Resource%20Monitor.jpg"&gt;Click&lt;/A&gt; for full size image.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT size=3&gt;&lt;FONT size=2&gt;&lt;o:p&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;The Resource Monitor measures CPU busy in real time using event-oriented measurement data gathered by the OS Scheduler each time a context switch occurs. A context switch occurs in Windows whenever the processor switches the processor execution context to run a different thread. Context switches also occur as a result of high priority Interrupt Service Routines (ISRs), as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. Starting in version 6 of Windows (Vista and Windows Server 2008), the OS Scheduler began issuing &lt;/FONT&gt;&lt;FONT size=3&gt;&lt;STRONG&gt;rdtsc&lt;/STRONG&gt; instructions to get the internal processor clock each time a context switch occurs. (Note: The Windows OS Scheduler not only orders ready threads in its dispatcher queue by priority, it also lets a higher priority thread preemptively interrupt the execution of a lower priority one. This is known as &lt;I style="mso-bidi-font-style: normal"&gt;Preemptive Scheduling with Priority Queuing&lt;/I&gt;. Once executing, a thread is also subject to a maximum time-slice interval, sometimes referred to as the &lt;I style="mso-bidi-font-style: normal"&gt;quantum&lt;/I&gt;. When the Scheduler determines that a thread’s time-slice has expired, the thread is also subject to preemption in favor of another thread from the ready queue of equal priority.&lt;/FONT&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;) &lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Meanwhile, the hardware manufacturers have improved the performance of the &lt;STRONG&gt;rdtsc&lt;/STRONG&gt; instruction, making it considerably more efficient for the Scheduler to gather these processor utilization measurements. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;You will notice in the example screen shot shown here in Figure 1 that Resource Monitor has calculated that System Interrupts (from ISR and DPC routines) account for most of the processor utilization during the last 60-second interval. The machine being monitored is mainly doing file I/O, so this is expected. Although the Resource Monitor display doesn’t say so explicitly, it is worth noting that this is all work being performed in kernel mode.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Conceptually, a context switch event is something like &lt;STRONG&gt;switch&lt;/STRONG&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;oldThreadId,&lt;/I&gt; &lt;I style="mso-bidi-font-style: normal"&gt;newThreadId&lt;/I&gt;), with an &lt;STRONG&gt;rdtsc&lt;/STRONG&gt; time stamp identifying when the context switch occurred. The Context Switch event also provides the old thread’s Wait Reason code, which helps you to understand why the sequence of thread scheduling events occurred. For reference, a Windows context switch is defined &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/ms682105(VS.85).aspx"&gt;&lt;FONT color=#0000ff size=3&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;, while the contents of the ETW (Event Tracing for Windows) context switch event record are defined &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/aa964744(VS.85).aspx"&gt;&lt;FONT size=3&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;. Also, see the series of articles by Insung Park and Alex Bendetov that were published in MSDN Magazine entitled “&lt;A href="http://msdn.microsoft.com/en-us/magazine/ee412263.aspx"&gt;&lt;SPAN style="COLOR: windowtext; TEXT-DECORATION: none; text-underline: none"&gt;&lt;FONT size=3&gt;Core OS Events in Windows 7&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;FONT size=3&gt;” for additional background and perspective. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;(Note: You have to hunt a bit to find the current values of the KWait_Reason enumeration. Officially, the KWait_Reason enumeration is published in the Wdm.h header file available in the Windows Driver Kit (WDK). Unfortunately, in MSDN, ordinarily the most authoritative place to find something like this, the available documentation tends to lag recent releases of Windows. For instance, the version of the enum provided to the .NET developer available here is considerably out of date. The explain text in Perfmon for the Thread\Thread Wait Reason counter is also not current. If you do not have access to the WDK, try either ProcessHacker or NirSoft instead for more up-to-date documentation.) &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;The same calculation that the Resource Monitor in Windows 7 uses can be performed after the fact using event data from ETW. In their article, Insung and Alex write, “In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization.” This is the technique used in the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/performance/default.aspx"&gt;&lt;FONT color=#0000ff size=3&gt;Windows Performance Toolkit&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt; (WPT, which is also known as xperf), for example, to calculate its CPU utilization statistics. (For reference, there is a discussion that illustrates using the WPT to analyze ISR and DPC event data in a previous blog entry entitled “Mainstream NUMA and the TCP/IP stack” posted &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;&lt;FONT color=#0000ff size=3&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
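&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;To make the accounting idea concrete, here is a minimal Python sketch of the calculation for a single logical processor. It is an illustration only, not the code that Resource Monitor or xperf actually runs. The event layout (a tuple of timestamp, old thread id, new thread id), the function names, and the use of thread id 0 for the idle thread are all assumptions made for the example, and ISR and DPC time is ignored to keep it short.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import defaultdict

IDLE_TID = 0  # assumption for this sketch: thread id 0 stands for the idle thread


def cpu_time_by_thread(switches, first_tid, start_ts, end_ts):
    """Charge each interval between context switches to the thread that was running.

    switches: list of (timestamp, old_tid, new_tid) tuples for ONE logical
    processor, sorted by timestamp; timestamps are in clock ticks.
    (Hypothetical layout, not the real ETW record format.)
    """
    busy = defaultdict(int)
    running, since = first_tid, start_ts
    for ts, old_tid, new_tid in switches:
        if old_tid != running:
            raise ValueError("trace is missing a context switch event")
        busy[running] += ts - since   # the old thread ran from `since` up to `ts`
        running, since = new_tid, ts
    busy[running] += end_ts - since   # close out the final interval
    return dict(busy)


def utilization(busy, start_ts, end_ts):
    """Fraction of the interval spent running anything other than the idle thread."""
    total = end_ts - start_ts
    idle = busy.get(IDLE_TID, 0)
    return (total - idle) / total


events = [(100, 0, 6920), (400, 6920, 4664), (700, 4664, 0)]
busy = cpu_time_by_thread(events, first_tid=0, start_ts=0, end_ts=1000)
print(busy)                        # ticks charged to each thread id
print(utilization(busy, 0, 1000))  # fraction of the interval that was not idle
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Because every tick between two context switches is charged to exactly one thread, the per-thread totals always add up to the length of the interval, which is what distinguishes this event-driven approach from one based on sampling.&lt;/FONT&gt;&lt;/P&gt;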
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;The new Concurrency Visualizer in the Visual Studio 2010 Profiler, which my colleague Hazim Shafi discusses in &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/magazine/ee336027.aspx"&gt;&lt;FONT color=#0000ff size=3&gt;a recent MSDN Magazine article &lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;entitled “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010”, also consumes Context Switch events to calculate processor utilization for the application being profiled. An interesting twist in the Concurrency Visualizer’s CPU Utilization View is that the view pivots based on the application you are profiling, which is likely how a developer engaged in a performance investigation wants to see things. Based on the sequence of context switch trace events, the Concurrency Visualizer calculates processor utilization by the process, aggregates it for the current selection window, and displays it in the CPU Utilization View. In the CPU Utilization View, all other processor activity for processes (other than the one being profiled) is lumped together under a category called “Other Processes.” System processes and the “Idle process,” which is a bookkeeping mechanism, not an actual process that is dispatched, are also broken out separately. See Hazim’s article for more details. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Beyond its CPU utilization calculations, the Concurrency Visualizer’s primary focus is on being able to reconstruct the sequence of events that impact an application’s execution progress. The Concurrency Visualizer’s Threads View is the main display showing an application’s execution path. The view here is of execution progress on a thread by thread basis. For each thread in your application, the Concurrency Visualizer shows the precise sequence of context switch events that occurred. These OS Scheduler events reflect that thread’s execution state. See Figure 2 for an example of this view.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Figure 2 shows the execution path of six application threads: a Main thread, a generic worker thread, and 4 CLR worker threads that the application created by instantiating a &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/3dasc8as(VS.100).aspx"&gt;&lt;FONT size=3&gt;ThreadPool&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt; object. (There were originally more threads than this, but I have hidden some that were inactive over the entire run.) For each thread, the execution state of the thread (whether it is running or whether it is blocked) is indicated over time. The upper part of the display is a timeline that shows the execution state of each thread over time. The execution progress display for each thread is constructed horizontally from left to right from rectangles that indicate the start and end of a particular thread state. An interval when the thread was running is shown in green. An interval when the thread is sleeping is shown in blue. A ready thread that is blocked from executing because a higher priority thread is running is shown in yellow. (This state is labeled “preemption.”) A thread in a synchronization delay waiting on a lock is shown in red.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;On the lower left of the display is a Visible Timeline Profile. This summarizes the state of all threads that are visible within the selected time window. In the screen shot in Figure 2, I have zoomed into a window that is approximately 150 milliseconds wide. During that interval, the threads shown were in a state where they were actively executing instructions only 11% of the time. For 25% of the time interval, threads were blocked waiting on a lock. There is a tabbed display at the lower right. If you click on the “Profile Report” tab, a histogram displays that summarizes the execution state of each individual thread over the time window. In the screen shot, I have clicked on the “Current stack” tab, which displays the call stack associated with the ETW context switch event. If the thread is blocked, the call stack indicates where in the code the thread will resume execution once it unblocks. We will drill into that call stack in a moment.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Note: The Threads View also displays call stacks from processor utilization samples that ETW gathers on a system-wide basis once per millisecond. These are available during any periods when the thread is executing instructions (and ETW execution sampling is active).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;One of the other ETW events that the Concurrency Visualizer analyzes is the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/dd765158(VS.85).aspx"&gt;&lt;FONT size=3&gt;ReadyThread&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt; event. The interval between a ReadyThread event and a subsequent Context Switch that signals that a ready Thread is being dispatched measures CPU queue time delay directly. Using event data, it is possible to measure CPU queuing delays to a degree of precision that far exceeds anything that can be done using performance counters.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
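&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;The pairing of a ReadyThread event with the Context Switch that later dispatches the same thread can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Concurrency Visualizer’s implementation: the tuple layout, the "ready" and "switch_in" labels, and the function name are hypothetical, and timestamps are taken to be in microseconds.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def ready_queue_delays(events):
    """Pair each ReadyThread event with the next context switch that dispatches that thread.

    events: list of (timestamp_us, kind, tid) tuples sorted by timestamp, where
    kind is "ready" (the thread became ready to run) or "switch_in" (the thread
    was dispatched on a processor). Hypothetical layout for illustration.
    Returns a list of (tid, delay_us) in dispatch order.
    """
    ready_at = {}   # tid mapped to the time the thread entered the ready queue
    delays = []
    for ts, kind, tid in events:
        if kind == "ready":
            # Keep the earliest ready time if the thread is readied twice.
            ready_at.setdefault(tid, ts)
        elif kind == "switch_in" and tid in ready_at:
            delays.append((tid, ts - ready_at.pop(tid)))
    return delays


trace = [
    (1000, "ready", 6920),
    (1200, "ready", 6420),
    (1750, "switch_in", 6920),
    (1800, "switch_in", 6420),
]
delays = ready_queue_delays(trace)
print(delays)  # one (thread id, microseconds spent waiting for a processor) pair per dispatch
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Each delay is measured from two time-stamped events rather than inferred from a sampled queue length, which is why the result can be so much more precise than a performance counter.&lt;/FONT&gt;&lt;/P&gt;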
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;The Concurrency Visualizer screen shot in Figure 2 illustrates the calculation of CPU queue time delay. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Thread 6920, which happens to be a CLR thread pool worker thread, is shown at a point in time where it was preempted by a higher priority task. The specific delay that I zoomed in on in the screen shot is preemption due to the scheduling of a high priority LPC or ISR – note this category also encompasses assorted APCs and DPCs. In this specific example, execution of Thread 6920 was delayed for 0.7718 milliseconds. According to the trace, that is the amount of time between Thread 6920 being preempted by a high priority system routine and a subsequent context switch where the ready thread was finally re-dispatched. &lt;/FONT&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;IMG style="WIDTH: 491px; HEIGHT: 413px" title="Thread 6920 preempted by an ISR" alt="Thread 6920 preempted by an ISR" align=left src="http://public.bay.livefilestore.com/y1psA60dtVg4Nh_gGSu9X72_Y7sgeRdJKr_37IOLqa11r7AR_ibr5K6zOqkqdGk4op643-na84TRUpksKJZiuSjBg/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20by%20a%20higher%20priority%20ISS%20or%20DPC%20event.jpg" width=491 height=413 mce_src="http://public.bay.livefilestore.com/y1psA60dtVg4Nh_gGSu9X72_Y7sgeRdJKr_37IOLqa11r7AR_ibr5K6zOqkqdGk4op643-na84TRUpksKJZiuSjBg/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20by%20a%20higher%20priority%20ISS%20or%20DPC%20event.jpg"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;Figure 2. Screen shot of the Concurrency Visualizer illustrating thread preemption by a higher priority system routine.&amp;nbsp; (&lt;A title="Thread 6920 preempted by an ISR" href="http://public.bay.livefilestore.com/y1psA60dtVg4Nh_gGSu9X72_Y7sgeRdJKr_37IOLqa11r7AR_ibr5K6zOqkqdGk4op643-na84TRUpksKJZiuSjBg/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20by%20a%20higher%20priority%20ISS%20or%20DPC%20event.jpg" target=_blank mce_href="http://public.bay.livefilestore.com/y1psA60dtVg4Nh_gGSu9X72_Y7sgeRdJKr_37IOLqa11r7AR_ibr5K6zOqkqdGk4op643-na84TRUpksKJZiuSjBg/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20by%20a%20higher%20priority%20ISS%20or%20DPC%20event.jpg"&gt;Click&lt;/A&gt; for full size image.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The tool also displays the call stack of the preempted thread. The call stack indicates that the CLR’s garbage collector (GC) was running at the time that thread execution was preempted. From the call stack, it looks like the GC is sweeping the Large Object Heap (LOH), trying to free up some previously allocated virtual memory. This is not an opportune time to get preempted. You can see that one of the other CLR worker threads, Thread 6420, is also delayed. Notice from the color coding that Thread 6420 is delayed waiting on a lock. Presumably, one of the other active CLR worker threads in the parent process holds the lock that Thread 6420 is waiting for. &lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;This is one of those “Aha” moments. If you click on the synchronization delay that Thread 6420 is experiencing, as illustrated in Figure 3, you can see that the lock that Thread 6420 is trying to acquire is, in fact, currently held by Thread 6920. Clicking on the tab that says “Current Stack” (not shown) indicates that the duration of the synchronization delay that Thread 6420 suffered in this specific instance of lock contention was about 250 milliseconds. &lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The scenario here shows one CLR worker thread blocked on a lock that is held by another CLR worker thread, which in turn finds itself being delayed due to preemptions from higher priority Interrupt processing. We can see that whatever high priority work preempted Thread 6920 has the side effect of also delaying Thread 6420, since 6420 was waiting on a lock that Thread 6920 just happened to be holding at the time. The tool in Figure 3 displays the Unblocking stack from Thread 6920 which shows the original memory allocation from the Dictionary.Resize() method call being satisfied, releasing a global GC lock. When Thread 6920 resumes execution following its preemption, the GC operation completes, releasing the global GC lock. Thread 6920 continues to execute for another 25 microseconds or so, before it is preempted because its time slice expired. Even as Thread 6920 blocks, Thread 6420 continues to wait while a different CLR thread pool thread (4664) begins to execute instead. Finally, after another 25-microsecond delay, Thread 6420 resumes execution. For a brief period both 6420 and 4664 execute in parallel from approximately the 7640 to 7650 microsecond milestones. (However, they are subject to frequent preemptions during that period of overlapped execution.) Welcome to the wonderful world of&amp;nbsp;&lt;I style="mso-bidi-font-style: normal"&gt;indeterminacy&lt;/I&gt; in concurrent programming.&lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 541px; HEIGHT: 413px" title="Thread 6420 delayed waiting on a lock" alt="Thread 6420 delayed waiting on a lock" align=left src="http://public.bay.livefilestore.com/y1pwyW9xNPqEHc-r2iqHj2NluFVeNqc4-NSgJoJ83e9GzhMlfifaSquwuZHN4qZqWPEhgR5vp5ou6ipEtVsJFO8NA/Concurrency%20visualizer%20screen%20shot%20showing%20a%20different%20thread%20blocked%20by%20a%20GC%20lock%20jpg.jpg" width=541 height=413 mce_src="http://public.bay.livefilestore.com/y1pwyW9xNPqEHc-r2iqHj2NluFVeNqc4-NSgJoJ83e9GzhMlfifaSquwuZHN4qZqWPEhgR5vp5ou6ipEtVsJFO8NA/Concurrency%20visualizer%20screen%20shot%20showing%20a%20different%20thread%20blocked%20by%20a%20GC%20lock%20jpg.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;Figure 3. CLR Worker Thread 6420 blocked because it is waiting on a GC lock that happens to be held by Thread 6920, which is subject to preemption by higher priority system routines. (&lt;A title="Thread 6420 delayed waiting on a lock" href="http://public.bay.livefilestore.com/y1pwyW9xNPqEHc-r2iqHj2NluFVeNqc4-NSgJoJ83e9GzhMlfifaSquwuZHN4qZqWPEhgR5vp5ou6ipEtVsJFO8NA/Concurrency%20visualizer%20screen%20shot%20showing%20a%20different%20thread%20blocked%20by%20a%20GC%20lock%20jpg.jpg" target=_blank mce_href="http://public.bay.livefilestore.com/y1pwyW9xNPqEHc-r2iqHj2NluFVeNqc4-NSgJoJ83e9GzhMlfifaSquwuZHN4qZqWPEhgR5vp5ou6ipEtVsJFO8NA/Concurrency%20visualizer%20screen%20shot%20showing%20a%20different%20thread%20blocked%20by%20a%20GC%20lock%20jpg.jpg"&gt;Click&lt;/A&gt; for full size image.)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;I won’t take the time now to go into what this little concurrent CLR thread pool application is doing. Suffice it to say that I wrote it to illustrate some of the performance issues developers can face trying to do parallel programming, which is the other topic I have been blogging about. (I should note that the test program puts the worker threads to sleep periodically to simulate synchronous I/O waits.) As I started to run the test app I developed under the Concurrency Visualizer, I was able to see blocking issues like this one where the Common Language Runtime introduced synchronization and locking considerations that are otherwise opaque to the developer. I eventually tweaked the test app into an especially ghoulish version I call the LockNestMonster program to shine an even brighter light on these issues. (More about this later when I resume blogging about concurrent programming in .NET.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd&gt;&lt;FONT face=Cambria&gt;&lt;FONT size=3&gt;Time-slicing. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The Concurrency Visualizer also breaks out preemption due to the expiration of a thread’s time-slice, the duration of a time slice being one of the few tuning adjustments available in the OS. For the record, I normally recommend that system administrators &lt;I style="mso-bidi-font-style: normal"&gt;&lt;U&gt;not&lt;/U&gt;&lt;/I&gt; fiddle with this tuning knob, unless they have a whole lot of extra time on their hands. (This older &lt;A href="http://support.microsoft.com/kb/111405"&gt;&lt;FONT color=#0000ff&gt;KB article&lt;/FONT&gt;&lt;/A&gt; provides some flavor for what is involved.) For those of you that cannot resist the temptation, the &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Concurrency Visualizer Threads View is the first Windows performance tool that can help you determine if changing the OS default time-slice value is doing your application any good, or harm, for that matter. &lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;In Figure 4, I clicked on the large yellow block on the right hand side of the execution time bar graph for Thread 6920 indicating another long delay. As in Figure 3, I have hidden all but the three active CLR thread pool threads. Using a combination of zooming to a point of interest in the event stream and filtering out extraneous threads, Figure 4 shows that the Concurrency Visualizer computes an execution time profile for just those events that are visible in the current window. Overall, the three active CLR worker threads are only able to execute 18% of the time, while they are delayed by synchronization 9% of the time and subject to preemption 39% of the time. (You can click on the Profile Report tab in the lower right portion of the display and see a profile report by thread.) &lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption mce_keep="true"&gt;&lt;IMG style="WIDTH: 541px; HEIGHT: 413px" title="Time slice quantum expiration" alt="Time slice quantum expiration" src="http://public.bay.livefilestore.com/y1pIUdtMiovFTJ3M9eUrDmJduTMusQTD04M89k_gUT3CcA3oIUDxbEdWNEGt9qkpsdtNY_pvkH94GbvQ1DttYt-_A/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20due%20to%20quantum%20expiration.jpg" width=541 height=413 mce_src="http://public.bay.livefilestore.com/y1pIUdtMiovFTJ3M9eUrDmJduTMusQTD04M89k_gUT3CcA3oIUDxbEdWNEGt9qkpsdtNY_pvkH94GbvQ1DttYt-_A/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20due%20to%20quantum%20expiration.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;Figure 4. Using the Concurrency Visualizer to drill into thread preemption delays. (&lt;A title="Preemption due to quantum time-slice expiration" href="http://public.bay.livefilestore.com/y1pIUdtMiovFTJ3M9eUrDmJduTMusQTD04M89k_gUT3CcA3oIUDxbEdWNEGt9qkpsdtNY_pvkH94GbvQ1DttYt-_A/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20due%20to%20quantum%20expiration.jpg" target=_blank mce_href="http://public.bay.livefilestore.com/y1pIUdtMiovFTJ3M9eUrDmJduTMusQTD04M89k_gUT3CcA3oIUDxbEdWNEGt9qkpsdtNY_pvkH94GbvQ1DttYt-_A/Concurrency%20visualizer%20screen%20shot%20showing%20preemption%20due%20to%20quantum%20expiration.jpg"&gt;Click&lt;/A&gt; for full size image.)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;At the point indicated in the tool, the time-slice quantum for Thread 6920 expired and the Scheduler preempted the executing thread in favor of some other ready thread. Looking at the visualization, it should be apparent that the ready thread the Scheduler chose to execute next was another CLR thread pool worker thread, namely Thread 4664, which then blocked Thread 6920 from continuing. The tool reports that a &lt;I style="mso-bidi-font-style: normal"&gt;context switch&lt;/I&gt;(6920, 4664) occurred, and that, as a result of being preempted, Thread 6920 was delayed for about 275 milliseconds before it resumed execution.&lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;As illustrated in this example, the Concurrency Visualizer uses the ETW-based event data from a profiling run to construct a state machine that reflects the precise execution state of each application thread over the time interval being monitored. It goes considerably beyond calculating processor queue time at the thread level. It understands how to weave the sequence of Ready Thread and Context Switch events together to create this execution time profile. It summarizes the profiling data, calculating the precise amount of time each thread is delayed by synchronous IO, page faults (i.e., Memory Management overhead), processor contention, preemption by priority work, and lock contention over the profiling interval. Furthermore, it analyzes the call stacks gathered at each Context Switch event, looking for signatures that identify the specific blocking reason. And, specifically, to help with lock contention issues, which are otherwise often very difficult to identify, it also identifies the thread that ultimately unblocks the thread that was found waiting to acquire a lock. &lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
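&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Once each thread’s history has been reduced to a list of state intervals, a summary like the Visible Timeline Profile is a matter of clipping those intervals to the selected window and adding them up. The Python sketch below shows one way that could be done. The interval tuples, the state names, and the function name are assumptions made for illustration, and the real tool’s classification of blocking reasons is far richer than this.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import defaultdict


def timeline_profile(intervals, win_start, win_end):
    """Summarize thread states over a time window, as percentages.

    intervals: list of (tid, state, start, end) tuples, one per state
    interval, in any consistent time unit (hypothetical layout).
    Only the part of each interval that falls inside the window is counted.
    """
    time_in_state = defaultdict(int)
    for tid, state, start, end in intervals:
        # Clip the interval to the window; an interval outside it contributes 0.
        overlap = max(0, min(end, win_end) - max(start, win_start))
        time_in_state[state] += overlap
    total = sum(time_in_state.values())
    if total == 0:
        return {}
    return {state: 100.0 * t / total for state, t in time_in_state.items()}


intervals = [
    (6920, "execution", 0, 20),
    (6920, "preemption", 20, 100),
    (6420, "synchronization", 0, 50),
    (6420, "execution", 50, 100),
    (4664, "sleep", 0, 150),
]
profile = timeline_profile(intervals, 0, 100)
for state, pct in profile.items():
    print(state, round(pct, 1))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Changing the window bounds passed to the function is the equivalent of zooming in the Threads View: the same intervals yield a different profile for each selection.&lt;/P&gt;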
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;I will discuss another new facility for capturing CPU time accurately at the thread level as your program executes in the next blog post in this series.&lt;o:p&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MyCaption mce_keep="true"&gt;-- Mark Friedman&lt;/P&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9990328" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+Profiler/">Visual Studio Profiler</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category></item><item><title>Statistical Process Control Techniques in Performance Monitoring and Alerting</title><link>http://blogs.msdn.com/b/ddperf/archive/2010/01/27/statistical-process-control-techniques-in-performance-monitoring-and-alerting.aspx</link><pubDate>Wed, 27 Jan 2010 01:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9953831</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9953831</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2010/01/27/statistical-process-control-techniques-in-performance-monitoring-and-alerting.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Being focused on the upcoming release of Visual Studio 2010 for the past six months or so, I, unfortunately, have been neglecting to blog about it. 
Before I get back to the series of blog posts I started on parallel programming, I thought I’d first answer the mail.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&lt;/FONT&gt;&lt;/o:p&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Concerning a recent presentation by Charles Loboz, my colleague at Microsoft, at CMG09 in Texas (DEC 2009), Uriel Carrasquilla, a very knowledgeable and resourceful performance analyst at NCCI in Florida, writes, &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;&lt;EM&gt;“Mr. Loboz indicated that the theoretical calculations were based on Windows reporting of CPU busy and CPU queue length. My results indicate that the CPU queue length reported by Microsoft can't be correct.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;I found that other CMG researchers came up with the same conclusion.&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;&lt;EM&gt;“I used similar ideas for my Linux, AIX and Sun Solaris data as reported by SAR, and Mr. Loboz ideas work like a charm.&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;&lt;EM&gt;“Question:&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;are you aware of this problem with Microsoft performance reporting? Is anybody working on this issue?”&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Uri,&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;The short answer is, &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;“The derivation and interpretation of the System\Processor Queue Length performance counter is well-documented in the &lt;I style="mso-bidi-font-style: normal"&gt;Windows 2003 Server Performance Guide&lt;/I&gt;, published back in the Windows 2003 Server Resource Kit. I believe the Processor Queue Length performance counter continues to be a very useful metric to track, as Charles and his team that is responsible for capacity planning for the many of the Microsoft online properties do.”&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 10pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraph&gt;&lt;SPAN style="FONT-FAMILY: Wingdings; mso-fareast-font-family: Wingdings; mso-bidi-font-family: Wingdings"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;n&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Mark&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;I will post a more expansive answer soon, allowing me to expound a little on a question that gets asked quite frequently, namely, “How are measures of CPU utilization in Windows derived and how can they be interpreted?”&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;First, though, I’d like to mention some of the work Charles Loboz and his team have been doing in the context of capacity planning to support some of the massive applications Microsoft provisions and supports. Consider an application like Hotmail that supports something in the neighborhood of 500 million mailboxes (give or take a couple hundred million) and a customer base that is global in scale. That is an order of magnitude larger than the largest corporate entity responsible for a single e-mail or messaging infrastructure. (My guess is that the largest corporate entity responsible for a single e-mail infrastructure is the US Department of Defense. Although, it might be the US Army instead since the different service branches probably operate separate infrastructures.) Performance monitoring and capacity planning on the scale of Hotmail or Search is certainly unprecedented. Do you think performance and capacity planning are important in an application the size of Hotmail. The answer is, “You bet.” The investment in hardware and power consumption alone justifies the capacity planning effort.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;I had an opportunity to see some of the material that Charles was working on back in the summer &amp;amp; gave him some feedback on measurements and what valid inferences can be drawn from them. I haven’t read the final published version, but, I am certainly in sympathy with the approach he has adopted. (BTW, people like Uri that attended the recent CMG Conference have access to Charles’ paper, but no one else at the moment. As soon as Charles posts it somewhere publicly, I will point this blog entry to it.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Although the scale Charles has to deal with is something new, the approach isn’t. I remember I also sent Charles a pointer to Igor Trobin's work, which I believe is very complementary. Igor writes an interesting blog called “&lt;/FONT&gt;&lt;A href="http://itrubin.blogspot.com/" mce_href="http://itrubin.blogspot.com/"&gt;&lt;FONT size=3 face=Calibri&gt;System Management by Exception&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.” In addition, Jeff Buzen and Annie Shum published a very influential paper on this subject called “MASF: Multivariate Adaptive Statistical Filtering” back in 1995. (Igor’s papers on the subject and the original Buzen and Shum paper are all available at &lt;/FONT&gt;&lt;A href="http://www.cmg.org/" mce_href="http://www.cmg.org/"&gt;&lt;FONT size=3 face=Calibri&gt;www.cmg.org&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.) My colleague &lt;/FONT&gt;&lt;A href="http://bezsys.blogspot.com/" mce_href="http://bezsys.blogspot.com/"&gt;&lt;FONT size=3 face=Calibri&gt;Boris Zibitsker&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt; has also made a substantial contribution to what I consider a very useful approach, namely applying statistical process control (SPC) techniques to mine for gold within the enormous amounts of performance data that IT organizations routinely gather. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;For perspective, Carnegie Mellon’s Software Engineering Institute (SEI) is usually credited with the original application of SPC techniques to software engineering. Len Bass at SEI wrote an excellent book entitled &lt;/FONT&gt;&lt;A href="http://www.amazon.com/Software-Architecture-Practice-2nd-Bass/dp/0321154959/ref=ntt_at_ep_dpi_1" mce_href="http://www.amazon.com/Software-Architecture-Practice-2nd-Bass/dp/0321154959/ref=ntt_at_ep_dpi_1"&gt;&lt;FONT size=3 face=Calibri&gt;Software Architecture in Practice&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; that embraces a broader perspective on quality in software development that I share. Len’s work on software quality metrics is close to my current interests here in Developer Division, especially around the potential value of scenario-driven development processes. (More on scenarios in the next blog post. Len’s submitted to a brief interview on Channel 9 recently that is posted &lt;/FONT&gt;&lt;A href="http://channel9vip.orcsweb.com/posts/mattdeacon/Talking-Architects-with-Len-Bass/" mce_href="http://channel9vip.orcsweb.com/posts/mattdeacon/Talking-Architects-with-Len-Bass/"&gt;&lt;FONT size=3 face=Calibri&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;.) &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Within the application life-cycle, performance, unfortunately, is considered one of the &lt;I style="mso-bidi-font-style: normal"&gt;non-functional&lt;/I&gt; requirements associated with a system specification, which often means it is relegated to a secondary role during the much of the application life-cycle. In the specification process, getting the business requirements and translating them into system specifications correctly is the most pressing problem for developers of Line of Business applications. Performance is one of those aspects of software quality that often doesn’t get expressed during the software development life cycle until very late in the process when design flaws that lead to scalability problems are very expensive to fix. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;Len Bass’s suggestion is that the requirements definition of a scenario should include a response time specification that can then be monitored throughout the development life cycle, just like any other set of requirements. That is the approach that we advocate using here in the Microsoft Developer Division for the software products that we built here. In developing Visual Studio 2010, for example, we made major commitments to performance requirements and regularly conduct automated acceptance testing against those requirements. However, you can also see from the many recent blogs on VS 2010 performance coming from the Microsoft Developer Division that we have not exactly gotten this down to a science yet.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;The Len Bass and SEI approach is informed by experience building real-time control systems to fly airplanes, for example, where performance goals absolutely have to be met or the system cannot function as designed. &lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/B&gt;The performance requirements for real-time control systems applications are fundamentally easy to specify. If the computer system doesn’t recognize a condition and respond to it in time, the plane is going to crash. Bass makes the case for that&lt;STRONG&gt; &lt;/STRONG&gt;system performance being one of those important Quality Attributes that needs to be addressed at the outset of the development life cycle, beginning with the architectural specification and continuing through the design, development and QA processes, to the delivered software’s operational phase, where it finally becomes the focus of performance analysts and capacity planners like Charles Loboz, Uriel Carrasquilla,&amp;nbsp;and Igor Torbin. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;What if you need to specify performance requirements for your LOB application, but don’t know where to start? Consider these two approaches:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 38.55pt; mso-list: l1 level1 lfo2; mso-add-space: auto" class=MsoListParagraphCxSpFirst&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3 face=Calibri&gt;Research in human factors engineering has generated a set of performance requirements for specific types of human-computer interactions in order to promote usability and improve customer satisfaction. &lt;/FONT&gt;&lt;A href="http://stevenseow.com/" mce_href="http://stevenseow.com/"&gt;&lt;FONT size=3 face=Calibri&gt;Steve Seow&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;, another colleague here in Microsoft, has an excellent, concise book on this topic called “&lt;/FONT&gt;&lt;A href="http://www.amazon.com/Designing-Engineering-Time-Psychology-Perception/dp/0321509188/" mce_href="http://www.amazon.com/Designing-Engineering-Time-Psychology-Perception/dp/0321509188/"&gt;&lt;FONT size=3 face=Calibri&gt;Designing and Engineering Time&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3&gt;,” complete with application responsiveness guidelines to help improve customer satisfaction. If you are in a position to design a new application from scratch from First Principles, Steve’s book will be an invaluable guide.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 10pt 38.55pt; mso-list: l1 level1 lfo2; mso-add-space: auto" class=MsoListParagraphCxSpLast&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;·&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;If the application currently exists in some form or another, measure its current performance. When you deliver the next version of the application, any significant decrease in performance from one release to the next will be perceived as an irritant and received negatively by existing users. In other words, measure the scenario of interest on the current system &amp;amp; use that as a &lt;EM&gt;baseline&lt;/EM&gt; that you won’t regress in a subsequent version. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt 2.55pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;If you have to start somewhere, measuring current levels of performance around key scenarios and using them as a baseline gives you a place to start, at least. My experience is that the current level of performance sets expectations that the next version of the application must meet if you want your customers to be satisfied. In this context, Steve Seow's book&amp;nbsp;cites psychological research into how much of a response time difference is necessary to be perceived as a difference. (About 20% in either direction makes a difference.) This reminds me of &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/Gregory_Bateson" mce_href="http://en.wikipedia.org/wiki/Gregory_Bateson"&gt;&lt;FONT size=3 face=Calibri&gt;Gregory Bateson&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;’s adage that “information is a difference that &lt;I style="mso-bidi-font-style: normal"&gt;makes&lt;/I&gt; a difference.” &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt 2.55pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;I do think that over time, humans adapt themselves to the response times they experience, such that, eventually, the response times of the new version become the new baseline. In other words, our positive or negative perception tends to atrophy over time. For example, consider the last time you acquired a new desktop or portable computer that was noticeably faster than its predecessor. How long was it before that rush of enthusiasm for the fast, new machine started to diminish? About 30 days, in my experience.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Twenty-five years ago, I was in a similar position to Charles, responsible for performance and capacity planning at a large telecommunications company for maybe 20 IBM mainframe computers, which was considered a whole lot of machines to keep track of back in those days. We used a product called MICS (full disclosure, I was a developer on MICS for a brief period in the mid-80s) to warehouse the performance data we were gathering from these machines and the SAS language for statistical reporting. Subsequently, at Landmark Systems, I designed a “management by exception” feature for our monitoring products that our customers loved based on very simple statistical process control techniques. Today, for Charles’ team that needs to monitor performance on 100,000s of servers, these statistical techniques are the only viable approach.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;But, of course, Uri is correct. You do have to choose the right metrics. I believe Charles has. I will discuss the CPU utilization metrics in Windows &lt;A title="next post in series" href="http://blogs.msdn.com/ddperf/archive/2010/04/04/measuring-processor-utilization-and-queuing-delays-in-windows-applications.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2010/04/04/measuring-processor-utilization-and-queuing-delays-in-windows-applications.aspx"&gt;in my next post&lt;/A&gt;.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;&lt;/FONT&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;-- Mark Friedman&lt;o:p&gt;&amp;nbsp;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9953831" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+Engineering/">Performance Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Control+Engineering/">Control Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+testing/">Performance testing</category></item><item><title>Looking at Virtual Memory Usage</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/12/08/looking-at-virtual-memory-usage.aspx</link><pubDate>Tue, 08 Dec 2009 15:43:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9934093</guid><dc:creator>David Berg</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9934093</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/12/08/looking-at-virtual-memory-usage.aspx#comments</comments><description>&lt;P&gt;Brian Harry is continuing a great series of posts on VS2010 performance, you can read the latest in that series &lt;A class="" title="BHarry's blog - Virtual Memory Usahe" href="http://blogs.msdn.com/bharry/archive/2009/12/08/looking-at-virtual-memory-usage.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2009/12/08/looking-at-virtual-memory-usage.aspx"&gt;here&lt;/A&gt;; where Brian talks about the issues we've been seeing around Virtual Memory Exhaustion and what we're doing to address it.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9934093" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Virtual+Memory/">Virtual Memory</category></item><item><title>Improvements in Intellisense post Beta 2</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/12/04/improvements-in-intellisense-post-beta-2.aspx</link><pubDate>Sat, 05 Dec 2009 06:06:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9932898</guid><dc:creator>David Berg</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9932898</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/12/04/improvements-in-intellisense-post-beta-2.aspx#comments</comments><description>&lt;P&gt;&lt;A class="" title="bharry - Improvements in Intellisense Post Beta 2" href="http://blogs.msdn.com/bharry/archive/2009/12/04/improvements-in-intellisense-post-beta-2.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2009/12/04/improvements-in-intellisense-post-beta-2.aspx"&gt;Brian Harry has posted a discussion of Intellisense performance improvements&lt;/A&gt; in VS2010 that we've made since Beta 2, including bothe before and after videos.&amp;nbsp; He also touches a little on the massive performance effort we're making as a division to address the performance issues identified by our external and internal customers.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9932898" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Intellisense/">Intellisense</category></item><item><title>Improving the Start-up Performance of the WPF and Silverlight Designer in Visual Studio 2010 Beta 2</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/11/02/improving-the-start-up-performance-of-the-wpf-and-silverlight-designer-in-visual-studio-2010-beta-2.aspx</link><pubDate>Mon, 02 Nov 2009 18:34:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9916292</guid><dc:creator>David Berg</dc:creator><slash:comments>2</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9916292</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/11/02/improving-the-start-up-performance-of-the-wpf-and-silverlight-designer-in-visual-studio-2010-beta-2.aspx#comments</comments><description>&lt;P&gt;I wanted to let you know about a last minute change that went into VS 2010 Beta 2 that you can use to improve the startup performance for the WPF and Silverlight Designer.&amp;nbsp; The change went in late and it was a little risky, so we decided to leave it off until we had a chance to do some more testing with it.&amp;nbsp; You can turn the change on yourself via a registry key.&amp;nbsp; We expect the change will be on all the time in the final product, so changing the registry key is strictly a Beta 2 issue.&lt;/P&gt;
&lt;P&gt;You can read more about it here: &lt;A href="http://social.msdn.microsoft.com/Forums/en-US/vswpfdesigner/thread/4511d43f-c134-4329-a970-e374252a620e"&gt;http://social.msdn.microsoft.com/Forums/en-US/vswpfdesigner/thread/4511d43f-c134-4329-a970-e374252a620e&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If you have any feedback on how this works (or doesn't work) for you, please let me know.&amp;nbsp; You can contact me at &lt;A href="mailto:DevPerf@Microsoft.com"&gt;DevPerf@Microsoft.com&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Dave&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9916292" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Beta/">Beta</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/WPF/">WPF</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Sliverlight/">Sliverlight</category></item><item><title>VS2010 Performance and Bad Video Drivers/Hardware</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/10/29/vs2010-performance-and-bad-video-drivers-hardware.aspx</link><pubDate>Fri, 30 Oct 2009 05:49:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9915126</guid><dc:creator>David Berg</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9915126</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/10/29/vs2010-performance-and-bad-video-drivers-hardware.aspx#comments</comments><description>&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;[Note, this post is superceeded by a newer post &lt;a href="http://blogs.msdn.com/b/ddperf/archive/2010/09/16/vs2010-performance-and-bad-video-drivers-hardware-redux.aspx"&gt;here&lt;/a&gt;.]&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;We&amp;rsquo;ve received a few performance complaints around Visual Studio 2010 (Beta 2) performance that can be traced to old video drivers or GPU virtualization issues.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;If you&amp;rsquo;re seeing slow / broken screen updates verify you have the latest drivers for your system. If this doesn&amp;rsquo;t resolve your rendering issues, you may be able to work around the problem by forcing software emulation mode by changing one registry key:&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt 0.5in; PADDING-LEFT: 30px"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: terminal,monaco;"&gt;&lt;span style="font-size: x-small;"&gt;[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 10pt 0.5in; padding-left: 30px;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: terminal,monaco;"&gt;&lt;span style="font-size: x-small;"&gt;"DisableHWAcceleration"=dword:00000001&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;As you can probably guess, this can be undone with:&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 10pt 0.5in; padding-left: 30px;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: terminal,monaco;"&gt;&lt;span style="font-size: x-small;"&gt;[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="margin: 0in 0in 10pt 0.5in; padding-left: 30px;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: terminal,monaco;"&gt;&lt;span style="font-size: x-small;"&gt;"DisableHWAcceleration"=dword:00000000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;If you have these issues, please, let me know by either posting here or preferably by e-mailing &lt;/span&gt;&lt;a href="mailto:DevPerf@Microsoft.com"&gt;&lt;span style="font-family: Calibri; color: #0000ff; font-size: small;"&gt;DevPerf@Microsoft.com&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt; with details about the problems you are/were seeing.&lt;span style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/span&gt;(When you e-mail, please run DXDIAG first and attach the DXDIAG.TXT file.&lt;span style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/span&gt;This will give us a lot of information about your system, including what drivers you&amp;rsquo;re running.)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;We&amp;rsquo;re very interested in finding any hardware/software incompatibilities we might have missed and getting them cleaned up. &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;If you do find you need to use software emulation, and then get new drivers or hardware later, don&amp;rsquo;t forget to switch software emulation back off so you can benefit from the improved performance.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Calibri;"&gt;Note that this will impact all WPF applications on the system, not just Visual Studio.&lt;span style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;o:p&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/o:p&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;Regards,&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal" style="MARGIN: 0in 0in 10pt"&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;Dave Berg&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: Calibri; font-size: small;"&gt;Developer Division Performance Team&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9915126" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Beta/">Beta</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category></item><item><title>Tell us about VS2010 Beta2</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/10/29/tell-us-about-vs2010-beta2.aspx</link><pubDate>Thu, 29 Oct 2009 21:49:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9914979</guid><dc:creator>David Berg</dc:creator><slash:comments>2</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9914979</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/10/29/tell-us-about-vs2010-beta2.aspx#comments</comments><description>&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Last week we &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jasonz/archive/2009/10/19/announcing-vs2010-net-framework-beta-2.aspx"&gt;&lt;FONT face="Times New Roman" size=3&gt;shipped Beta 2&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face="Times New Roman" size=3&gt; for broad distribution.&amp;nbsp; Many of you have already sent us comments and improvement suggestions (thanks!)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;At this point we are down to our final set of bug fixing, perf tuning, etc.&amp;nbsp; We’re very interested in your feedback so we can take action on it before we ship the final version.&amp;nbsp; To help make it easy, you can &lt;/FONT&gt;&lt;A href="https://mscuillume.smdisp.net/Collector/Survey.ashx?Name=D10G1"&gt;&lt;FONT face="Times New Roman" size=3&gt;take this simple survey&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face="Times New Roman" size=3&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-ALIGN: center" align=center&gt;&lt;A href="https://mscuillume.smdisp.net/Collector/Survey.ashx?Name=D10G1"&gt;&lt;SPAN style="TEXT-DECORATION: none; text-underline: none"&gt;&lt;FONT face="Times New Roman" size=3&gt;&lt;IMG id=_x0000_i1025 title=image height=92 alt=image src="http://blogs.msdn.com/blogfiles/jasonz/WindowsLiveWriter/VS2010Beta2FeedbackSurvey_C49A/image_3.png" width=240 border=0&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;One thing in particular we are hearing a lot of feedback on is performance.&amp;nbsp; We are working hard on the next round of perf improvements.&amp;nbsp; You can supply your feedback through the survey.&amp;nbsp; When you give us your feedback, the more actionable you can make it the better.&amp;nbsp; We need to know what operations you are doing (like editing, debugging, etc), what kind of hardware you have (CPU, RAM, disk), and your hosting scenario (main machine, running in VM, terminal server, etc).&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Thanks in advance for your feedback!&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Dave&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;P.S. If you'd like to talk to me directly about your performance experience, you can reply here or e-mail us at &lt;A href="mailto:DevPerf@Microsoft.com"&gt;DevPerf@Microsoft.com&lt;/A&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9914979" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+2010/">Visual Studio 2010</category></item><item><title>Parallel Scalability Isn’t Child’s Play, Part 3: The Problem with Fine-Grained Parallelism</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx</link><pubDate>Tue, 09 Jun 2009 23:17:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9718379</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9718379</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;A title="Part 2 in this series" href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx"&gt;&lt;FONT size=3&gt;In the last blog entry in this series&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;, I introduced the model for parallel program scalability proposed by Neil Gunther, which I praised for being a realistic antidote to more optimistic, but 
better known, formulas. Gunther’s model adds a new parameter to the more familiar Amdahl’s law. The additional parameter&lt;I style="mso-bidi-font-style: normal"&gt; k&lt;/I&gt;, representing &lt;I style="mso-bidi-font-style: normal"&gt;coherence&lt;/I&gt;-related delays, enables Gunther’s formula to model behavior where the performance of a parallel program can actually degrade at higher and higher levels of parallelization.&amp;nbsp;&lt;/FONT&gt;&lt;FONT size=3&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Although I don’t know that the coherence delay factor that Gunther’s formula adds fully addresses the range and depth of the performance issues surrounding fine-grained parallelism, it is certainly one of the key factors Gunther’s law expresses that earlier formulations do not.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;
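&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;As a reminder of the shape of the model, Gunther’s formula is commonly written as &lt;I style="mso-bidi-font-style: normal"&gt;C&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt;) = &lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; / (1 + &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; - 1) + &lt;I style="mso-bidi-font-style: normal"&gt;kN&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; - 1)), where &lt;I style="mso-bidi-font-style: normal"&gt;C&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt;) is the capacity (speedup) relative to a single processor, &lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; is the number of processors, &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt; is the serial (contention) fraction familiar from Amdahl’s law, and &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt; is the coherence delay parameter. The symbol &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt; is chosen here only for illustration; Gunther’s own notation uses Greek letters. With &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt; set to zero, the expression reduces to Amdahl’s law, &lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; / (1 + &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; - 1)), which levels off but never declines. With any positive &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt;, the &lt;I style="mso-bidi-font-style: normal"&gt;kN&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; - 1) term grows with the square of &lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt;, so &lt;I style="mso-bidi-font-style: normal"&gt;C&lt;/I&gt;(&lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt;) reaches a maximum near &lt;I style="mso-bidi-font-style: normal"&gt;N&lt;/I&gt; = sqrt((1 - &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt;) / &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt;) and falls off after that.&lt;/FONT&gt;&lt;/P&gt;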
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Developers experienced in building parallel programs recognize that Gunther’s formula echoes an inconvenient truth, namely, that the task of achieving performance gains using parallel programming techniques is often quite arduous. For example, in a recent blog entry entitled “&lt;/FONT&gt;&lt;A href="http://software.intel.com/en-us/articles/when-to-say-no-to-parallelism/"&gt;&lt;FONT size=3 face=Calibri&gt;When to Say No to Parallelism&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;,” Sanjiv Shah, a colleague at Intel, expressed similar sentiments. One very good piece of advice Sanjiv gives is that you should not even be thinking about parallelism until you have an efficient single-threaded version of your program debugged and running. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Let’s continue, for a moment, in the same vein as “&lt;/FONT&gt;&lt;A href="http://software.intel.com/en-us/articles/when-to-say-no-to-parallelism/"&gt;&lt;FONT size=3 face=Calibri&gt;When to Say No to Parallelism&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.” Let’s look at the major sources of coherence-related delays in various kinds of parallel programs, how and why they occur, and what, if anything, can be done about them. Ultimately, I will try to tie this discussion into one about tools, especially some great new tools in Visual Studio Team System (see &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Hazim Shafi’s blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; for details) that help you understand contention in your multi-threaded apps. When you use these new tools to gather and analyze the thread contention data for your app, it helps when you understand some of the common patterns to look for. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The first aspect of Gunther’s coherence delays that we will look at is the bare &lt;I style="mso-bidi-font-style: normal"&gt;minimum&lt;/I&gt; additional cost that a parallel program running multiple threads must pay, compared to running the same program single-threaded. To simplify matters, we will look at the best possible prospect for parallel programming, an algorithm that is both &lt;I style="mso-bidi-font-style: normal"&gt;embarrassingly parallel&lt;/I&gt; and easy to partition into roughly equal-sized subprogram chunks. There are two basic costs to consider: one that is paid up front for initialization of the parallel runtime environment, and one that is paid incrementally each time one of the parallel tasks executes. It is also worth noting that these are unavoidable costs. I will lump both costs into an &lt;I style="mso-bidi-font-style: normal"&gt;overhead&lt;/I&gt; category associated with Gunther’s coherence delay factor &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt;. The embarrassingly parallel programs we will consider&amp;nbsp;here will incur these minimum processor overhead penalties when they are transformed to execute in parallel.&lt;/FONT&gt;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;Fine-grained parallelism&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To frame this part of the discussion, let’s also consider the characterization of workloads and their amenability to parallelization into either &lt;I style="mso-bidi-font-style: normal"&gt;fine-grained&lt;/I&gt; or &lt;I style="mso-bidi-font-style: normal"&gt;coarse-grained&lt;/I&gt; ones. This distinction implicitly recognizes the impact of coherency delay factors on scalability. With fine-grained parallelism, the overhead of setting up the parallel runtime &amp;amp; executing the tasks in parallel can easily exceed the benefits. By definition, the initialization overhead of spinning up multiple threads and dispatching them is not nearly so big an issue when the program lends itself to coarse-grained parallelism. Plus, when you are looking at a very long running process with many opportunities to exploit parallelism, it is important to understand you should only have to incur the setup cost once. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Coarse-grained parallelism occurs when each parallel worker thread is assigned to computing a relatively long running function:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 338px; HEIGHT: 419px" title="coarse-grained parallelism" alt="coarse-grained parallelism" src="http://5l3vgw.bay.livefilestore.com/y1pa53nIchv7Sk5zdNifva6i3kLD1Eu2MsrdAIOayN3r2OUL9XBpRy72kD_wGhKM1uYeb9TjLCGfyiz9E3hg5bqtg/Coarse-grained%20parallelism%20Fork-Join%20flowchart.jpg" width=338 height=419 mce_src="http://5l3vgw.bay.livefilestore.com/y1pa53nIchv7Sk5zdNifva6i3kLD1Eu2MsrdAIOayN3r2OUL9XBpRy72kD_wGhKM1uYeb9TjLCGfyiz9E3hg5bqtg/Coarse-grained%20parallelism%20Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;STRONG&gt;Figure 5. Coarse-grained parallelism.&lt;/STRONG&gt;&lt;/P&gt;&lt;?xml:namespace prefix = o /&gt;&lt;o:wrapblock&gt;&lt;?xml:namespace prefix = v ns = "urn:schemas-microsoft-com:vml" /&gt;&lt;v:shapetype id=_x0000_t75 coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;v:stroke joinstyle="miter"&gt;&lt;/v:stroke&gt;&lt;v:formulas&gt;&lt;v:f eqn="if lineDrawn pixelLineWidth 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 1 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum 0 0 @1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @2 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 0 1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @6 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @8 21600 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @10 21600 0"&gt;&lt;/v:f&gt;&lt;/v:formulas&gt;&lt;v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"&gt;&lt;/v:path&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:lock v:ext="edit" aspectratio="t"&gt;&lt;/o:lock&gt;&lt;/v:shapetype&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;while fine-grained parallelism looks more like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 338px; HEIGHT: 300px" title="fine-grained parallelism" alt="fine-grained parallelism" src="http://5l3vgw.bay.livefilestore.com/y1pdnte4FSJ18igb7DEZZ2veO2esTg2BykZz6FavZ-37bbyLejUIQlFBc4mboJUbzN3jUp3uq-FV8MA2RGunfMe8Q/Fine-grained%20parallelism%20Fork-Join%20flowchart.jpg" width=338 height=300 mce_src="http://5l3vgw.bay.livefilestore.com/y1pdnte4FSJ18igb7DEZZ2veO2esTg2BykZz6FavZ-37bbyLejUIQlFBc4mboJUbzN3jUp3uq-FV8MA2RGunfMe8Q/Fine-grained%20parallelism%20Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;Figure 6. Fine-grained parallelism.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The difference, of course, lies in how long, relatively speaking, the worker thread processing the unit of work executes.&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The rationale for the fine-grained/coarse-grained distinction is its significance to performance. We identify those parallel algorithms that execute in the worker thread long enough to recover the costs of setting up and running the parallel environment as coarse-grained. The benefits of running such programs in parallel are much easier to realize. On the other hand, the one foolproof way to identify fine-grained parallelism is to find embarrassingly parallel programs with very short execution time spans for each parallel task. When executing fine-grained parallel programs, there is a very high risk of slowing down the performance of the program, instead of improving it. (If this sounds like a bit of circular reasoning, it most surely is.) &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Now let’s drill into these costs. In the .NET Framework, setting up a parallel execution environment is usually associated with the ThreadPool object. (If you are not very familiar with how to set up and use a ThreadPool in .NET, you might want to read &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/0ka9477y.aspx"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;this&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; bit of documentation first, which shows some simple C# examples. If you really want to understand all the ins and outs of the .NET ThreadPool, you should consider picking up a copy of Joe Duffy’s very thorough and authoritative book, &lt;/FONT&gt;&lt;A href="http://www.amazon.com/Concurrent-Programming-Windows-Microsoft-Development/dp/032143482X/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241140965&amp;amp;sr=1-1"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Concurrent Programming in Windows&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.) &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;BR style="mso-ignore: vglayout" clear=all&gt;&lt;FONT size=3 face=Calibri&gt;The thing about using the ThreadPool object in .NET is that you don’t need to write a lot of code on your own to get it up and running. With very little coding effort, you can be running a parallel program. In the .NET Framework, there are newer programming constructs in the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/concurrency/default.aspx"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Task Parallel Library&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; that are designed to make it easier for developers to express parallelism and exploit multi-core and many-core computers. But underneath the covers, many of the new TPL constructs are still using the CLR ThreadPool. So whatever overheads are associated with this older, less elegant approach still apply to any of the newer parallel programming constructs.&lt;/FONT&gt;&lt;/P&gt;
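&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To give a feel for the newer constructs, here is a minimal sketch of a parallel loop written with Parallel.For. It assumes .NET Framework 4, where the Task Parallel Library lives in the System.Threading.Tasks namespace. The class name ParallelForSketch, the results array, the task count and the summing loop that stands in for real work are all made up for illustration.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt"&gt;
using System;
using System.Threading.Tasks;

public static class ParallelForSketch   // hypothetical name
{
    public static void Main()
    {
        const int tasks = 8;                 // illustrative value
        long[] results = new long[tasks];    // one slot per index, so no locking is needed

        // Parallel.For runs the loop body for each index, possibly on several
        // ThreadPool threads, and returns only after every iteration has finished.
        Parallel.For(0, tasks, j =&amp;gt;
        {
            long sum = 0;
            for (int i = 0; i &amp;lt;= 1000; i++) sum += i;   // stand-in for real work
            results[j] = sum;
        });

        Console.WriteLine(results[0]);
    }
}
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The fork and the join are both hidden inside the single Parallel.For call, which is exactly why it is easy to forget that the setup and dispatch costs are still being paid.&lt;/FONT&gt;&lt;/P&gt;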
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;The ThreadPool in .NET&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The basic pattern for using the ThreadPool is to call the QueueUserWorkItem() method on the ThreadPool class, passing a delegate that performs the actual processing of the request, along with a set of parameters that are wrapped into a single Object. The parameters delineate the unit of work that is being requested. Typically, you also pass to the delegate a &lt;A title="ManualResetEvent reference" href="http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent.aspx" mce_href="http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent.aspx"&gt;ManualResetEvent&lt;/A&gt;, which is known as a &lt;EM&gt;synchronization primitive&lt;/EM&gt;. This event is used by the delegate to signal the Main task that the worker thread is finished processing the Work Item request.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Structurally, you have to write:&lt;/FONT&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a class that wraps the Work Item parameter list,&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a C# delegate that runs in the worker thread to process the Work Item request,&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a dispatcher routine that queues work items for the thread pool delegate to process, and&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;finally, an event handler to get control when the worker thread completes.&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;&lt;/OL&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;For example, to implement the Fork/Join pattern in .NET using the built-in ThreadPool Object, create (1) a wrapper for the parameter list:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: blue"&gt;class&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;private&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; _thisevent;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; thisevent&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;get&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; { &lt;SPAN style="COLOR: blue"&gt;return&lt;/SPAN&gt; _thisevent; } &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="COLOR: blue"&gt;public&lt;/SPAN&gt; WorkerThreadParms(&lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; signalwhendone, …,)&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-tab-count: 1"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-tab-count: 1"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;_thisevent = signalwhendone;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-tab-count: 2"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;o:p&gt;&amp;nbsp;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;(2) the ThreadPool delegate that unwraps the parameter list, performs some work, then signals the main thread when it is done:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: blue"&gt;void&lt;/SPAN&gt; ThreadPoolDelegate(&lt;SPAN style="COLOR: #2b91af"&gt;Object&lt;/SPAN&gt; parm)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;WorkerThreadParms&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; p = (&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt;) parm;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; signal = p.thisevent;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;… &lt;SPAN style="COLOR: #00b050"&gt;//Do some work here&lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: 0.5in; MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;signal.Set(); &lt;SPAN style="COLOR: #00b050"&gt;//Signal main task when done&lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;(3) a (simple) dispatcher loop:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;ManualResetEvent&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;[] thisevent = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt;[tasks];&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;for&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; (&lt;SPAN style="COLOR: blue"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; tasks; j++)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;thisevent[j] = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt;(&lt;SPAN style="COLOR: blue"&gt;false&lt;/SPAN&gt;);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt; p = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt;(thisevent[j],…);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThread&lt;/SPAN&gt; worker = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThread&lt;/SPAN&gt;();&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;ThreadPool&lt;/SPAN&gt;.QueueUserWorkItem (worker.ThreadPoolDelegate,p);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;followed by (4) a WaitForMultipleObjects in the Main thread:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;ManualResetEvent&lt;/SPAN&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;.WaitAll(thisevent);&lt;/SPAN&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-SIZE: 9pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3 face=Calibri&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;You can see there isn’t very much code for you to write to get up and running in parallel and start taking advantage of all those multi-core processor resources. You should take Sanjiv Khan’s advice and write &amp;amp; debug the delegate code you intend to parallelize by testing it in a single threaded mode first. Once the single threaded program is debugged and optimized, you can easily restructure the program to run in parallel by encapsulating that processing in your worker thread delegate following this simple recipe.&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Even though there isn’t very much code for you to write, there are overhead considerations that you need to be aware of. Let’s look at the simplest case where the program is embarrassingly parallel (as discussed in the previous blog entry). This allows us to ignore complications introduced by serialization and locking (which we will get to later). These overheads include &lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpFirst&gt;&lt;SPAN style="mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3 face=Calibri&gt;(1)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3 face=Calibri&gt;work done in the Common Language Runtime (CLR) on your behalf to spin up the worker threads in the ThreadPool initially, and &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 10pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpLast&gt;&lt;SPAN style="mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3 face=Calibri&gt;(2)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3 face=Calibri&gt;the additional cost when running the parallel program associated with queuing work items, dispatching them to a thread pool thread to process, and signaling the main dispatcher thread when done. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In the case of the initialization work, this is something that only needs to be done once. The other costs accrue each time you need to dispatch a worker thread. For parallelism to be an effective scaling strategy, it is necessary to amortize this overhead cost over the life of the parallel threads. The worker thread delegates need to execute long enough that there is a benefit to executing in parallel. And this is for a best case for parallelism where the underlying program is both embarrassingly parallel and easy to partition into roughly equivalent work requests. &lt;/FONT&gt;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;A Parallel.For example.&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Using the new Parallel.For construct in the Task Parallel Library, by the way, there is even less code you have to write. All you need to code the Parallel.For is write is the worker thread delegate, because the TPL library handles the remaining boilerplate tasks. However, the underlying overhead considerations are almost identical.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The challenge of speeding up programs that have fine-grained parallelism grows in tandem with making it easier for developers to write concurrent programs. Take a look at the following example of parallelization in C# using the new Parallel.For construct taken verbatim from a Microsoft white paper entitled “&lt;A title="Taking Parallelism Mainstream" href="http://download.microsoft.com/download/D/5/9/D597F62A-0BEE-4CE7-965B-099D705CFAEE/Taking%20Parallelism%20Mainstream%20Microsoft%20February%202009.docx" mce_href="http://download.microsoft.com/download/D/5/9/D597F62A-0BEE-4CE7-965B-099D705CFAEE/Taking%20Parallelism%20Mainstream%20Microsoft%20February%202009.docx"&gt;Taking Parallelism Mainstream&lt;/A&gt;.” The white paper describes some of the new language facilities in the Task Parallel Library that make it easier for developers to write parallel programs. These new language facilities include Parallel For loops, Parallel LINQ, Parallel Invoke, Futures and Continuations, and messaging using asynchronous agents. To the extent that these parallel computing initiatives succeed, they will generate the need for better performance analysis tools to understand the performance of concurrent programs because not everyone who implements these new constructs is going to see impressive speed-up of his or her applications. Some developers will even see the retrograde performance predicted by Gunther’s scalability formula.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Here’s the C# program that illustrates one of the new parallel programming constructs:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;IEnumerable&amp;lt;StockQuote&amp;gt; Query(IEnumerable&amp;lt;StockQuote&amp;gt; stocks) {&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;var results = new ConcurrentBag&amp;lt;StockQuote&amp;gt;();&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Parallel.ForEach (stocks, stock =&amp;gt; {&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;if (stock.MarketCap &amp;gt; 100000000000.0 &amp;amp;&amp;amp;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;stock.ChangePct &amp;lt; 0.025 &amp;amp;&amp;amp;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;stock.Volume&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&amp;gt; 1.05 * stock.VolumeMavg3M) {&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;results.Add(stock);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;});&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;return results;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The example uses a &lt;I style="mso-bidi-font-style: normal"&gt;Parallel.ForEach&lt;/I&gt; enumeration loop, along with one of the new concurrent collection classes, the &lt;I style="mso-bidi-font-style: normal"&gt;ConcurrentBag&lt;/I&gt;, to evaluate stock prices based on some set of selection criteria. The problem, the white paper author notes, is one that is considered “embarrassingly parallel” because it is easily decomposed into independent sub-problems that can be executed concurrently. The new C# Task Parallel Library language features provide an elegant way to express this parallelism. Underneath the Parallel.For expression is a run-time library that understands how to partition the body of the parallel For loop into multiple work items, and dispatches them to separate worker threads that are then scheduled to execute concurrently. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The .NET Task Parallel Library (TPL) provides the run-time machinery to turn this program into a parallel program. At run-time, it automatically parallelizes the lambda expression that the &lt;I style="mso-bidi-font-style: normal"&gt;Parallel.For&lt;/I&gt; construct references. In this example, the Task Parallel Library takes the If Statement in the lambda expression and queues it up to run in parallel using the concurrent runtime library in .NET. The concurrent runtime library creates a thread pool and then delegates the processing of the lambda expression to these worker threads. The concurrent runtime attempts to allocate and schedule an optimal number of worker threads to this parallel task.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;While TPL makes it easier to express parallelism in your programs and eliminates most of the grunt work in setting up the runtime environment associated with parallel threads, it cannot guarantee that running this program in parallel on a machine with four or eight processors will actually speed up its execution time. That is the essence of the challenge of fine-grained parallelism. There is overhead associated with queuing work items to the thread pool the parallel run-time manages. This is overhead that the serial version of the program does not encounter. Only when the amount of work done inside the lambda expression executes for a long enough time does the benefit of parallelizing the lambda expression exceeds this cost, which must be amortized over each concurrent execution of the inner body of the Parallel.For loop. Note that this particular set of overheads is unavoidable. When you are dealing with fine-grained parallelism, the overhead of setting up the parallel run-time environment alone often exceeds the potential benefit of executing in parallel, notwithstanding other possible sources of contention-related delays that could further slow down execution time.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;So, it is important to realize that a code sample like the one I have taken here from the parallel programming white paper was chosen to illustrate the expressive power of the new language constructs. The example was intended to show the &lt;EM&gt;pattern&lt;/EM&gt; that developers should adhere to -- I am certain it wasn't intended to illustrate something specific you would actual do. When you are taking advantage of these new parallel programming language extensions, you need to be aware of the fine-grained:coarse-grained distinction.&amp;nbsp;This is emphatically not an example of a program that you will necessarily speed up by running it in parallel. It will take considerably more anlysis to figure out if parallelism is the right solution here. Speeding up a serial program by running portions of it in parallel isn’t always easy – even when that program has sections that are “embarrassingly parallel.”&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-spacerun: yes"&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;What we’d like to be able to do is to estimate the actual performance improvement we can expect of this parallel program, compared to its original serial version. “Embarrassingly parallel” is another way of saying that, once parallelized, the serial portion of this program is expected to be quite small. However, this is also an example of fine-grained parallelism because the amount of code associated with the lambda expression that is passed to worker thread delegate to execute is also quite small. An experienced developer at this point should be asking whether the overhead of creating these working threads and dispatching them might, in fact, be greater than the benefit of executing this task in parallel. This relatively fixed overhead is especially important when the amount of work performed by each delegate is quite small – there is only a limited opportunity to amortize this overhead to initiate and manage the multithreaded operation across the execution time of each of the worker threads. It is extremely important to understand this in the case of fine-grained parallelism.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Next, we will turn to coarse-grained parallelism, where the odds are much better that you may be able to speed up program execution substantially using concurrent programming techniques. In the next blog entry, I will start to tackle more promising examples. The analysis of the performance costs associated with parallel programming tasks will become more complex. I will try to illustrate this analysis using a concrete programming example that will simulate coarse-grained parallelism. As we look at how this simple programming example scales on a multi-core machine, it will bring us face-to-face with the pitfalls even experienced developers can expect to encounter when they attempt to parallelize their existing serial applications. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;We will also start to look into the analysis tools that we are available in the next version of the Visual Studio Profiler that greatly help with understanding the performance of your .NET parallel program. In the meantime, if you'd like to get a head start on these new tools in the Visual Studio Profiler , be sure to check out &lt;A href="http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Hazim Shafi’s blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; for more details.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/o:wrapblock&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9718379" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Beta/">Beta</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+Profiler/">Visual Studio Profiler</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio+Team+Developer/">Visual Studio Team Developer</category></item><item><title>Are we taking advantage of Parallelism?</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/05/02/are-we-taking-advantage-of-parallelism.aspx</link><pubDate>Sun, 03 May 2009 01:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9584046</guid><dc:creator>Sunny 
Egbo</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9584046</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/05/02/are-we-taking-advantage-of-parallelism.aspx#comments</comments><description>&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Recently, a colleague of mine, Mark Friedman, posted a blog titled “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Parallel Scalability Isn’t Child’s Play&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;” in which he reviewed the merits of Amdahl’s Law vs. Gunther’s Law for determining the practical limits to parallelization. I would not argue with the premise of Mark’s blog that parallelism is not child’s play. However, I do have alternate views of the use of Amdahl’s Law and Gunther’s Law that I posted on his blog. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;I think my views and comments on Mark’s blog warrant another blog to explain them fully.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Speaking of child’s play: my 10-year old son recently made a two-part movie titled “&lt;I style="mso-bidi-font-style: normal"&gt;the Way&lt;/I&gt;” and “&lt;I style="mso-bidi-font-style: normal"&gt;the Way Back&lt;/I&gt;” complete with a full storyline, multiple sound tracks and narrations. He put these movies together with only the help of his eight-year old sister, using sample movie clips and stock photographs he found on his computer hard drive. He asked me for help getting his two masterpieces onto a DVD capable of playing on the average home DVD player. Also, he asked about the length of a typical movie playing in movie theaters around the U.S. (approximately 2 hours) and how much these movies cost at the movie theater (approximately $12 for adults and $8 for children, minus the popcorn). Based on my answers, he determined that he will charge 25 cents for people to watch his movies, because he wanted everyone to attend. I wanted to ask him how much he would charge someone who decided to watch only one of the clips. However, I didn’t because I did not want to lose a price haggling war with a 10-year old. Besides, it would be terrible if you cannot find your way back.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case, his movies were quite impressive. The most technologically savvy thing I did as a 10-year old kid was to build a telephone line with tomato soup cans and a string. Movie making was out of reach for me; but now it is child’s play.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, parallelism is not child’s play. However, I hold out hope that in the future the typical computer program would be written with parallelism in mind. Is parallelism ever going to be child’s play in the future the way movie making is today? &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Parallelism exists everywhere: &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;instruction level, memory level, loop level and task level parallelism, etc. Also, parallelism has been with us for quite some time now. For the past several decades, hardware engineers have quietly been busy solving problems in parallel to improve processor and system level performance. However, for the past four or more years, hardware designers have encountered the twin brick walls created by memory speed and power. These walls have forced CPU architects and hardware designers to go multi-core in a major way. The doubling of the CPU frequency every 18 months, that was true for many decades, are no longer practical and have come to an abrupt end. Although, hardware performance continues to improve as my colleagues and I pointed out in our blog “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Investigating a Pleasant Surprise&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face="Times New Roman"&gt;,&lt;/FONT&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;” the pace of CPU frequency increases has slowed considerably. Instead, hardware designers have been doubling the number of cores available on a single CPU socket every couple of years. &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;To get the same level of performance that was previously possible, software engineers would now need to step up to the plate—to write software in a parallel and scalable fashion. They would need tools and frameworks that allow them to think about their problems, identify opportunities for parallelism and to analyze their solutions correctly and efficiently. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;I am a big fan of Amdahl Law as an analysis framework. However, I do not subscribe to the narrow view that Amdahl’s Law applies only to parallelism, as most people who write about it seem to imply. I prefer the broader treatment of the Law by Hennessy and Patterson in their famous book “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1" mce_href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Computer Architecture: A Quantitative Approach&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;”—where Amdahl’s Law is used to estimate the opportunities between competing designs. Amdahl’s Law is very powerful for showing the areas that will likely yield the most fruitful performance gains. In my performance design, tuning and optimization work, I use Amdahl Law for prioritizing the areas of opportunities to focus my efforts to gain performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Amdahl’s Law is not the limit to either absolute performance or parallelism as many authors seem to suggest. Gunther’s and Gustafson’s Laws are helpful for putting Amdahl’s Law in perspective. However, like Amdahl’s Law, these laws are not fundamental limits. The use of these three laws to estimate the level of parallelism that is possible is very flawed. Specifically, the use of these laws as fundamental limits can obfuscate the level of parallelism and performance inherent in typical computing problems. These laws gloss over a number of important points and practical aspects of obtaining parallelism in general purpose computing, including that:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;1.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Many user tasks are non-monolithic and can be solved in a distributed fashion. Background tasks (e.g., virus scans) that often block single processor execution can now be done in a way that improves user experiences. The key is to identify unnecessary dependencies that would allow these tasks to proceed in parallel with other tasks in a multi-core computer.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;2.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some algorithms that have inefficient sequential solutions surprisingly have efficient parallel solutions. This fact should be comforting to fans of algorithms. For example, many applications require matrix multiplication, which turns out to be easily parallelizable. Although the best sequential algorithm for matrix multiplication has a time complexity of O(n&lt;SUP&gt;2.376&lt;/SUP&gt;), a straight-forward parallel solution has an asymptotic time complexity of O(log n) using n&lt;SUP&gt;2.376&lt;/SUP&gt; processor.&amp;nbsp;In other words, we can readily find a parallel solution for matrix multiplication that improves its runtime as more and more processor cores become available. Of course, you might have difficulty conceiving of n&lt;SUP&gt;2.376&lt;/SUP&gt; processors in a system--as a colleague mentioned recently. However, this is just another way of saying that matrix multiplication will benefit with more and more processors.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;3.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some poor sequential algorithms can be easily parallelized to execute in less time than their sequential solutions. Also, we know that&amp;nbsp;some algorithms that have the best asymptotic time complexities achieve&amp;nbsp;their speed&amp;nbsp;by introducing&amp;nbsp;data dependencies that make parallelization&amp;nbsp;difficult and that the best asymptotic time complexity does not necessarily translate to the best runtime in real life. Hence, at some point, the benefit of the simpler parallelization of some&amp;nbsp;poor sequential algorithms that have little data dependencies&amp;nbsp;can outweigh the benefit of&amp;nbsp;more efficient sequential counterparts that have data dependencies. Hence, when considering parallel solutions it is not always necessary to start with the sequential solution with the best time complexity [also, see comment about Fortune and Wylie below].&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;4.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;The real-world performance of applications is not determined exclusively by the asymptotic time complexity of algorithms. Because of the increasing gap between CPU and memory speed, memory accesses are increasingly dominating the performance of applications running on modern CPUs. Although the gap can be mitigated with large caches, every cache miss takes hundreds of CPU cycles to complete. Even a modest overlap in these memory accesses (Memory Level Parallelism) can improve application performance in noticeable ways.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-SIZE: 11pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Over the years, there have been efforts to classify computationally intractable problems. Many decision problems (i.e., Yes/No) and their optimization counterparts have been categorized into NP-Complete and NP-Hard sets, respectively. The Travelling Salesman (TSP), Online Bin-Packing and 3-Dimensional Matching problems are three famous examples of NP-Complete problems. In a similar fashion, problems that are difficult to parallelize have been categorized into the P-Complete set, or the set of problems that are known to be inherently sequential. As you can imagine, sorting is not P-Complete. Likewise, Matrix Multiplication is not in the P-Complete set. Processor scheduling can be done in O(log n) time units using &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processors, so it is not P-Complete either. In an ultimate twist of irony, many NP-Hard problems have heuristic solutions that can be executed in parallel to approximate the real solutions. Hence, the natural inclination to think that NP-Complete problems cannot be parallelized is not borne out in practice.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;As it turns out, the real limit to parallelism seems not to be defined by Amdahl’s Law, Gunther’s Law, Gustafson’s Law or NP-Completeness, but by the P-Complete set. It appears that parallelizable problems are related to the asymptotic space complexity of their sequential solutions. According to Fortune and Wylie’s Parallel Processing Thesis, any problem that can be solved with a poly-logarithmic space complexity can be parallelized efficiently. Because of&amp;nbsp;the time-space trade-off of algorithms, this&amp;nbsp;implies that the sequential algorithm that achieves this&amp;nbsp;space complexity is not necessarily the&amp;nbsp;algorithm with&amp;nbsp;the best asymptotic time complexity. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case,&amp;nbsp;because one can evaluate problems on multiple levels beyond algorithms (e.g., at the instruction, memory and data access, loop and task levels), the set of problems that can be parallelized appears to be quite large. The question is how to identify and take advantage of the parallelization opportunities that may be inherently available, and to do so in an efficient and scalable manner. How can we parallelize loops? How do we&amp;nbsp;overlap high latency activities such as accesses to physical memory or I/O to amortize the cost of those activities? How do we minimize synchronizations? How do we partition tasks to eliminate bottlenecks&amp;nbsp;from the&amp;nbsp;critical paths? How do we dispatch work efficiently to improve system utilization, throughput and latency? What areas of our application can benefit from what sets of efforts? These are some of the questions that allow for scalable designs.&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, the tools to identify parallelism and scalability opportunities are very limited. The programming languages that allow programmers to express parallelism in a natural way are completely lacking. The tools to analyze and troubleshoot parallel implementations are limited as well. Debugging parallel implementations is particularly hard. I suspect that with some industry focus and incremental progress, we could continue to make parallelism more accessible to average programmers over the next few years. Even so, we are many years away.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size=3&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;W&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;hat are some of the fundamental limits preventing such tools from being built? As Mark said on his blog, &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;achieving improved scalability using parallel programming techniques is certainly very challenging. But can parallel programming be made less challenging with intuitive tools that expose parallel solutions in a natural way and allow programmers to exploit them? &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Can programming languages and tools improve to a point where a typical 10-year-old will be able to write a parallel program as easily as they can put together a multi-track movie today?&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Sunny Egbo&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9584046" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+Engineering/">Performance Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Hardware/">Hardware</category></item><item><title>Parallel Scalability Isn’t Child’s Play, Part 2: Amdahl’s Law vs. 
Gunther’s Law</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx</link><pubDate>Wed, 29 Apr 2009 07:51:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9575026</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>4</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9575026</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;&lt;A title="Link to Parallel Scalability Part 1" href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx"&gt;Part 1 of this series of blog entries&lt;/A&gt; discussed results from simulating the performance of a massively parallel SIMD application on several alternative multi-core architectures. These results were reported by researchers at Sandia Labs and publicized in a press release. Neil Gunther, my colleague from the Computer Measurement Group (CMG), referred to the Sandia findings as evidence supporting his &lt;I style="mso-bidi-font-style: normal"&gt;universal scalability law&lt;/I&gt;. This blog entry investigates Gunther’s model of parallel programming scalability, which, unfortunately, is not as well known as it should be. Gunther’s insight is especially useful in the current computing landscape, which is actively embracing parallel computing using multi-core workstations &amp;amp; servers.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Gunther’s scalability formula for parallel processing is a useful antidote to any overly optimistic expectations developers might have about the gains to be had from applying parallel programming techniques. Where Amdahl’s law can be used to establish a theoretical &lt;I style="mso-bidi-font-style: normal"&gt;upper limit&lt;/I&gt; to the speed-up that parallel programming techniques can provide, Gunther’s law can also model the retrograde performance that we frequently observe when parallel computing is used. In other words, Gunther’s scalability formula encapsulates the behavior we frequently observe where adding more and more processors to a parallel processing workload can result in &lt;I style="mso-bidi-font-style: normal"&gt;degraded&lt;/I&gt; performance. It is a more realistic model for people who adopt parallel programming techniques to enhance the scalability of their applications on multi-core hardware. So, without in any way trying to diminish enthusiasm for the entire enterprise, it is crucial to understand that achieving improved scalability using parallel programming techniques can be very challenging.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;As I discussed &lt;/FONT&gt;&lt;A href="http://www.cmg.org/measureit/issues/mit44/m_44_18.html"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;in a review of Gunther’s last book&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;, Gunther’s law adds another parameter to the well-known Amdahl’s Law. Gunther calls this parameter &lt;I style="mso-bidi-font-style: normal"&gt;coherence&lt;/I&gt;. Parallel programs have additional costs associated with maintaining the “coherence” of shared data structures, memory locations that are accessed and updated by threads executing in parallel. By incorporating these coherence-related delays, Gunther’s formula is able to model the retrograde performance that all too frequently is observed empirically. The blue line marked “Conventional” in the chart Sandia Labs published (&lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Figure 1&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; in the earlier blog) is a scalability curve that Gunther correctly cites as consistent with his model.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Let’s drill into the mathematics here for a moment. What Gunther calls his Universal Scalability Law is an extension to the familiar multiprocessor scalability formula first suggested by Gene Amdahl. In &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/Amdahl%27s_law"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Amdahl's law&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt; is the proportion of a program that can be parallelized, leaving &lt;I style="mso-bidi-font-style: normal"&gt;1 − p&lt;/I&gt; to represent the part of the program that cannot be parallelized and remains serial. Amdahl’s insight was that the &lt;I style="mso-bidi-font-style: normal"&gt;1 − p &lt;/I&gt;amount of time spent in the serial execution portion of the program creates an upper bound on how much its performance can be improved when parallelized. &lt;/FONT&gt;&lt;/P&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:wrapblock&gt;&lt;?xml:namespace prefix = v ns = "urn:schemas-microsoft-com:vml" /&gt;&lt;v:shapetype id=_x0000_t75 coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;v:stroke joinstyle="miter"&gt;&lt;/v:stroke&gt;&lt;v:formulas&gt;&lt;v:f eqn="if lineDrawn pixelLineWidth 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 1 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum 0 0 @1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @2 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 0 1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @6 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @8 21600 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @10 21600 0"&gt;&lt;/v:f&gt;&lt;/v:formulas&gt;&lt;v:path 
o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"&gt;&lt;/v:path&gt;&lt;o:lock v:ext="edit" aspectratio="t"&gt;&lt;/o:lock&gt;&lt;/v:shapetype&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;v:shape style="Z-INDEX: 251668480; POSITION: absolute; MARGIN-TOP: 72.45pt; WIDTH: 276.5pt; HEIGHT: 234.2pt; MARGIN-LEFT: 0px; mso-position-horizontal: center" id=_x0000_s1026 type="#_x0000_t75"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;&lt;v:imagedata src="file:///C:\Users\markfr\AppData\Local\Temp\msohtmlclip1\01\clip_image001.wmz" o:title=""&gt;&lt;/v:imagedata&gt;&lt;?xml:namespace prefix = w ns = "urn:schemas-microsoft-com:office:word" /&gt;&lt;w:wrap type="topAndBottom"&gt;&lt;/w:wrap&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/v:shape&gt;&lt;/o:wrapblock&gt;&lt;BR style="mso-ignore: vglayout" clear=all&gt;&lt;FONT size=3 face=Calibri&gt;Consider a sequential program that we want to speed up using parallel programming techniques. An old-fashioned way to think about this is to identify some portion of the program, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt;, that can be executed in parallel, and then implement a &lt;B style="mso-bidi-font-weight: normal"&gt;Fork()&lt;/B&gt; to spawn parallel tasks, followed by a &lt;B style="mso-bidi-font-weight: normal"&gt;Join()&lt;/B&gt; to unify the processing and carry on sequentially afterwards. Conceptually, something like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 365px; HEIGHT: 335px" title="Fork Join" alt="Fork Join" src="http://5l3vgw.bay.livefilestore.com/y1prpgpboRXSzjQVnNYycq6GJuJ8R8HJIlojyRLhYinSz8MbLbSRl-3NN9tSD_qBRNoLp4SGLDZHIzUL0yvuqRj9GczCqOudgK_/Fork-Join%20flowchart.jpg" width=365 height=335 mce_src="http://5l3vgw.bay.livefilestore.com/y1prpgpboRXSzjQVnNYycq6GJuJ8R8HJIlojyRLhYinSz8MbLbSRl-3NN9tSD_qBRNoLp4SGLDZHIzUL0yvuqRj9GczCqOudgK_/Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-size: 11.0pt"&gt;Figure 3.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; COLOR: #002060; FONT-SIZE: 10pt; FONT-WEIGHT: normal; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-bidi-font-weight: bold; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-size: 11.0pt"&gt; Parallel processing using a Fork/Join.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;SPAN&gt;&lt;o:p&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Amdahl’s law simply observes that your ability to speed up this program using parallelism is a function of the proportion of the time, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt;, spent in the parallel portion of the program, compared to &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt;, the time spent in the serial parts of the program. (Note that &lt;I style="mso-bidi-font-style: normal"&gt;p + s = 1&lt;/I&gt;, in this formulation.) Amdahl’s observation was meant as a direct challenge to hardware architects who were advocating building parallel computing hardware. It was also easy for those advocates of parallel computing approaches to dismiss Amdahl’s remarks since Dr. Amdahl was clearly invested in trying to build faster processors, no matter the cost.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Advocates of parallel computing, of course, are not blind to the hazards of the parallel processing approach. Scalability of the underlying hardware is one challenge. An even bigger challenge is writing multi-threaded programs. For starters, it is often far more difficult to conceptualize a parallel solution than a serial one. (We can speculate that this may simply be a function of the way our minds tend to work.) Parallel programs are also notoriously more difficult to debug. When you are debugging a multi-threaded program running in parallel on parallel hardware, events don’t always occur in the exact same sequence. This is known as &lt;I style="mso-bidi-font-style: normal"&gt;non-determinism&lt;/I&gt;, and it often leads to huge problems for developers because, for instance, it may be very difficult to reproduce the exact timing sequence that exposes an error in your logic.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Furthermore, once you manage to get your programs to run correctly in a parallel processing mode, the performance wins from doing so are not a given. In the course of celebrating the performance wins they do get, developers can sometimes lose sight of how difficult it was to achieve those gains. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Notwithstanding the difficulties that need to be overcome, compelling reasons to look at parallel computation remain, including trying to solve problems that simply won’t fit inside the largest computers we can build. Today, there is renewed interest in parallel programming because it is difficult for hardware designers to make processors run at higher and higher clock speeds using current semiconductor fabrication technology without consuming excessive amounts of power and generating excessive amounts of heat that must be dissipated. Power and cooling considerations are driving parallel computing today for portables, desktops, and servers.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Comparing Gunther’s formula to Amdahl’s law. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Meanwhile, Amdahl’s original insight remains relevant today. From Amdahl’s law, we understand that, no matter what degree of parallelism is achieved, the execution time of a program’s serial portion is a practical upper bound on the performance of its parallel counterpart. As an example, Figure 4 plots the scalability curve using Amdahl’s law where p = 0.9, when just 10% of the program remains serial. Notice that Amdahl’s law predicts the performance of a parallel program will level off as more and more processors are added: with p = 0.9, the speed-up can never exceed 1/(1 − p) = 10, however many processors are used. As you can see, Amdahl’s law shows diminishing returns from increasing the level of parallelism, so the parallel approach becomes less and less cost-effective as more and more processors are added.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 334px; HEIGHT: 286px" title="Amdahl's Law vs. Gunther's Law" alt="Amdahl's Law vs. Gunther's Law" src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg" width=334 height=286 mce_src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt;Figure 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt; &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;A comparison of Amdahl’s Law to Gunther’s Universal Scalability Model&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt;&lt;o:p&gt;
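&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The leveling-off of the Amdahl curve can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the usual statement of Amdahl’s law, speed-up = 1 / ((1 − p) + p / N) for N processors, and the function names are made up for illustration.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speed-up predicted by Amdahl's law: 1 / ((1 - p) + p / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on the speed-up as the processor count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 0.9  # 90% of the program is parallel, 10% stays serial
    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(f"{n:4d} processors: speed-up {amdahl_speedup(p, n):5.2f}")
    print(f"limit: {amdahl_limit(p):.2f}")
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;With p = 0.9, each doubling of the processor count buys less than the one before, and the printed speed-ups approach but never reach the limit of 1 / (1 − p) = 10.&lt;/FONT&gt;&lt;/P&gt;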
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Given that Amdahl was mainly acting as an advocate for building faster serial CPUs, something that he wanted to do anyway, his is by no means the last word on the subject. Researchers in numerical computing like the ones at Sandia Labs were encouraged a few years later by a paper from one of their own. John Gustafson of Sandia Labs published a well-known paper in 1988 entitled “&lt;/FONT&gt;&lt;A href="http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html"&gt;&lt;FONT color=#0000ff size=3&gt;Reevaluating Amdahl's Law&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;” that adopted a much more optimistic stance toward parallel programming. The essence of Gustafson’s argument is that when parallel processing resources become available, programmers will jigger their software to take advantage of them:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0.3in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-SIZE: 9pt"&gt;&lt;FONT color=#000000&gt;One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice, &lt;I&gt;the problem size scales with the number of processors&lt;/I&gt;. When given a more powerful processor, the problem generally expands to make use of the increased facilities. Users have control over such things as grid resolution, number of timesteps, difference operator complexity, and other parameters that are usually adjusted to allow the program to be run in some desired amount of time. Hence, it may be most realistic to assume that &lt;I&gt;run time, not problem size&lt;/I&gt;, is constant.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Gustafson’s counter-argument does not refute Amdahl’s law so much as suggest there might be creative ways to work around it. It encouraged parallel programming researchers to keep plugging away, pursuing creative ways to sidestep Amdahl’s law. Microsoft’s Herb Sutter, &lt;/FONT&gt;&lt;A href="http://www.ddj.com/architect/205900309"&gt;&lt;FONT color=#0000ff size=3&gt;in his popular Dr. Dobb’s Journal column back in January 2008&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;, cited Gustafson’s Law favorably to offer similar encouragement to software developers today who need to re-fashion their code to take advantage of parallel processing in the many-core, multi-core era. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Gunther’s augmentation of Amdahl’s law is grounded empirically, providing a more realistic assessment of scalability using parallel programming technology. Gunther’s formula is similar, but adds another parameter to Amdahl’s law, κ,&lt;SPAN style="mso-fareast-font-family: 'Times New Roman'; mso-fareast-theme-font: minor-fareast"&gt; &lt;/SPAN&gt;that represents something called &lt;I style="mso-bidi-font-style: normal"&gt;coherency&lt;/I&gt; delay: &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;v:shapetype id=_x0000_t75 coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;FONT color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;EM&gt;&lt;STRONG&gt;C(p) = p/(1+s(p-1) + kp(p-1))&lt;/STRONG&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;/v:shapetype&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;v:shapetype coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;FONT color=#000000&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;To show how the two formulas behave, in Figure 4 above, Amdahl’s law is compared to Gunther’s law for a program with the same 10% serial portion. Note that in the formula above, &lt;EM&gt;p&lt;/EM&gt; denotes the number of processors rather than the parallel proportion; &lt;EM&gt;s&lt;/EM&gt; is the serial fraction and &lt;EM&gt;k&lt;/EM&gt; the coherency delay factor. I set the coherency delay factor in Gunther’s formula to 0.001. When a coherency delay is also modeled, notice that parallel scalability is no longer monotonically increasing as processors are added. When we allow for some amount of coherency delay, there comes a point when overall performance levels off and ultimately begins to &lt;EM&gt;decrease&lt;/EM&gt;. Gunther’s formula not only models the performance of a parallel program that encounters diminishing returns from increased levels of parallelism, it also highlights the performance degradation that can occur when the communication and coordination-related delays introduced by multiple threads needing to synchronize access to shared data structures become excessive.&lt;/P&gt;
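&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The retrograde behavior described here can be sketched in Python. This is a minimal sketch, assuming the formula exactly as written above, with p read as the number of processors, s as the serial fraction and k as the coherency delay factor. The function names are hypothetical, and the peak location is not taken from the text: it follows from setting the derivative of C(p) to zero.&lt;/P&gt;
&lt;PRE&gt;
```python
import math


def usl_capacity(processors: int, serial_fraction: float, coherency: float) -> float:
    """Relative capacity C(p) = p / (1 + s(p - 1) + k p (p - 1))."""
    p = processors
    return p / (1.0 + serial_fraction * (p - 1) + coherency * p * (p - 1))


def usl_peak(serial_fraction: float, coherency: float) -> float:
    """Processor count where C(p) is largest: sqrt((1 - s) / k), for k above zero."""
    return math.sqrt((1.0 - serial_fraction) / coherency)


if __name__ == "__main__":
    s, k = 0.1, 0.001  # 10% serial portion, coherency delay factor of 0.001
    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        amdahl = usl_capacity(n, s, 0.0)  # k = 0 reduces the formula to Amdahl's law
        gunther = usl_capacity(n, s, k)
        print(f"{n:4d} processors: Amdahl {amdahl:5.2f}  Gunther {gunther:5.2f}")
    print(f"Gunther curve peaks near {usl_peak(s, k):.0f} processors")
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;With s = 0.1 and k = 0.001, the modeled capacity peaks at about 30 processors and declines beyond that, while setting k to zero recovers the Amdahl curve, which keeps rising toward its limit.&lt;/P&gt;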
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Gunther’s formula lumps all the delays associated with communication and coordination among threads that require access to shared data structures into one factor &lt;EM&gt;k&lt;/EM&gt; that he calls &lt;EM&gt;coherence&lt;/EM&gt;. Unfortunately, Gunther himself provides little help in telling us how to estimate &lt;EM&gt;k&lt;/EM&gt;, the crucial coherency delay factor, beforehand, or measure it after the fact. Presumably, &lt;EM&gt;k&lt;/EM&gt; includes delays associated with accessing critical sections of code that are protected by shared locks, as well as instruction execution delays in the hardware associated with maintaining the “coherence” of shared data kept in processor caches that are accessed and updated by concurrently running threads. There are also additional “overheads” associated with spinning up multiple worker threads, queuing up work items for them to process, controlling their execution, and coordinating their ultimate completion. These are new to the parallel processing environment and are all absent from the serial version of the same program.&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;As a practical developer trying to understand the behavior of my parallel application, I would find Gunther’s formula much more useful if it helped me identify the sources of the coherency delays that are impacting its scalability. It would also be useful if Gunther’s insight could guide me as I work to eliminate or reduce these obstacles to scalability. That is the main subject of &lt;A title="forward pointer" href="http://blogs.msdn.com/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx"&gt;the next blog entry in this series&lt;/A&gt;. &lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;/FONT&gt;&lt;/v:shapetype&gt;&lt;/SPAN&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9575026" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category></item><item><title>Parallel Scalability Isn’t Child’s Play</title><link>http://blogs.msdn.com/b/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx</link><pubDate>Mon, 16 Mar 2009 20:39:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9481780</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>9</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9481780</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In &lt;A title="Neil Gunther's blog" href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html" mce_href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html"&gt;a recent blog entry&lt;/A&gt;, Dr. Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned about unrealistic expectations being raised with regard to the performance of parallel programs on current multi-core hardware. 
Neil’s blog entry highlighted a dismal parallel programming experience publicized &lt;/FONT&gt;&lt;A title="Sandia Labs multi-core press release" href="http://www.sandia.gov/news/resources/releases/2009/multicore.html" mce_href="http://www.sandia.gov/news/resources/releases/2009/multicore.html"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;in a recent press release&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; from the Sandia Labs in Albuquerque, New Mexico. Sandia Labs is a research facility operated by the U.S. Department of Energy. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;According to the press release, scientists at Sandia Labs &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0.2in 10pt" class=MsoNormal&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;FONT face=Calibri&gt;“simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.” They concluded that this retrograde speed-up was due to deficiencies in “memory bandwidth as well as contention between processors over the memory bus available to each processor.”&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Holy cow. The Lab’s scientists, who are heavily invested in parallel programming on supercomputers, simulated running programs on sixteen cores encapsulating “key algorithms for deriving knowledge from large data sets” that gave no better performance than running the same program on two cores. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Please note that these are simulated performance results, because 16-core machines of the type being simulated don’t currently exist. Indeed, I would not expect that 16-core machines of the type being simulated would ever exist. Which leads me to wonder what the point of this Sandia Labs exercise was.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Of course, for developers experienced in parallel programming, this result actually isn’t in itself all that surprising. Quite frequently, experienced developers find that running their multi-threaded application on massively parallel hardware does not scale well with the hardware capabilities. This was apparently the case for the applications the Sandia Labs folks simulated. So what? Should we just give up in our quest for parallel program scalability? &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Before drilling into Dr. Gunther’s specific interest in this disclosure, it is worth looking into the Sandia Labs finding in a bit more detail. For instance, did anyone, besides me, wonder what applications were being simulated?&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In theory, “deriving knowledge from large data sets” is a category of computing problem that readily lends itself to a solution using an &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/SIMD"&gt;&lt;FONT size=3 face=Calibri&gt;SIMD&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; (&lt;B style="mso-bidi-font-weight: normal"&gt;S&lt;/B&gt;ingle &lt;B style="mso-bidi-font-weight: normal"&gt;I&lt;/B&gt;nstruction, &lt;B style="mso-bidi-font-weight: normal"&gt;M&lt;/B&gt;ultiple &lt;B style="mso-bidi-font-weight: normal"&gt;D&lt;/B&gt;ata) approach. The canonical example of an SIMD approach to “deriving knowledge from large data sets” is a database Search function conducted in parallel where the data set of interest is partitioned across &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processing units and their locally attached disks. For example, when the Thinking Machines CM-1 supercomputer publicly debuted in the mid-80s, the company demonstrated its capabilities using a parallel search of a database that was partitioned across all 64K nodes of the machine, which was based on the Connection Machine originally designed by MIT whiz kid Danny Hillis. Parallel search when executed across a partitioned dataset should scale linearly, or close enough for government work (pun intended). &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Whenever a problem lends itself to an SIMD approach (also known as “divide and conquer”), linear scalability of the SIMD algorithm does require first partitioning the data being accessed and then proceeding to process that data in parallel. I am sure the point of the Sandia Labs press release was not to disparage the SIMD approach to parallel processing; after all, that is a tried-and-true technique that they have used with great success over the years. &lt;/FONT&gt;&lt;/P&gt;
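&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The partition-then-search pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CM-1 demonstration or anything Sandia Labs ran: the names search_partition and parallel_search, the round-robin partitioning, and the worker count are all assumptions made for the example. It uses a thread pool to keep the sketch short; in CPython, threads show the structure of the approach rather than a real speed-up for CPU-bound predicates.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(chunk, predicate):
    """Scan one partition and return the records that match."""
    return [record for record in chunk if predicate(record)]


def parallel_search(records, predicate, workers=4):
    """Partition the records, search each partition independently, merge the hits.

    Hypothetical helper for illustration only. Partitions are formed
    round-robin so that each worker receives a similar number of records.
    """
    chunks = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker touches only its own partition; no data is shared.
        partial_hits = pool.map(search_partition, chunks, [predicate] * workers)
        merged = [hit for part in partial_hits for hit in part]
    # Round-robin partitioning scrambles the order, so sort before returning.
    return sorted(merged)


if __name__ == "__main__":
    data = list(range(50))
    print(parallel_search(data, lambda r: r % 7 == 0))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Because no worker reads another worker’s partition, the only serial steps in this sketch are the initial partitioning and the final merge, which is why this style of search is expected to scale nearly linearly.&lt;/FONT&gt;&lt;/P&gt;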
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;On the contrary, it appears to be a critique of an approach to building parallel processing hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;where you would increase the number of processing cores on the chip (just because you can with the most current semiconductor fabrication technology) without scaling the memory bandwidth proportionally. Since that is not what is happening hardware-wise, it strikes me that this implied criticism of the multi-core hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;strategy Intel and AMD are pursuing is slaying a non-existent dragon. Both Intel and AMD recognize that memory bus bandwidth is a significant potential bottleneck in their multi-core products, and, as a result, both manufacturers are attempting to scale memory bandwidth proportional to the amount of processing power they deliver on a chip.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;So then what is all the fuss about? The Sandia Labs “news” starts to look like something the blogosphere is latching onto on an otherwise slow day for tech news, raising an alarm &amp;amp; potentially misleading naïve readers about what the conventional wisdom in multiprocessor chip architecture would be if anyone were actually trying to build multi-core microprocessors that way.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;Building a better multicore processor.&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT size=3 face=Calibri&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The point of the Sandia Labs press release publicizing these simulation results appears to be to suggest what they consider better approaches to packaging multi-core processors on a single socket. They released the following chart that makes this point (reproduced here in Figure 1):&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 450px; HEIGHT: 353px" title="Sandia Labs multicore simulation results" alt="Sandia Labs multicore simulation results" src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg" width=450 height=353 mce_src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;FONT color=#4f81bd&gt;&lt;SPAN style="mso-no-proof: yes"&gt;Figure 1&lt;/SPAN&gt;. Sandia Labs simulation showing performance of their application vs. the number of processors.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Exactly what the Sandia Labs folks are reporting here is a little sketchy. Presumably, the simulations are based on observing the behavior of some of their key programs where they were able to measure performance running on “conventional” multi-core processors, perhaps something like the quad-core machine I recently installed for my desktop that uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the results that Sandia Labs published are may turn on this crucial point.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This Sandia Labs simulation then extrapolates out to 16 cores per socket (and beyond), simulating the manufacturer adding more cores to the die, apparently &lt;I style="mso-bidi-font-style: normal"&gt;leaving the memory architecture fundamentally unchanged&lt;/I&gt; as they moved to more cores. The Sandia Labs chart in Figure 1 is labeled to indicate that the memory bandwidth was held constant at 10 GB/sec. &lt;/FONT&gt;&lt;/P&gt;
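&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;To see why holding the memory bandwidth constant matters, consider a back-of-the-envelope model. The sketch below is not the Sandia Labs simulation; it assumes, hypothetically, that each core needs a fixed 2.5 GB/sec of memory traffic to run at full speed, and that the only limit is the 10 GB/sec shared bus labeled in Figure 1. The function names and the 2.5 GB/sec demand figure are invented for illustration.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def per_core_bandwidth(total_gb_per_sec, cores):
    """Share of a fixed memory bus available to each core."""
    return total_gb_per_sec / cores


def bandwidth_bound_speedup(cores, total_gb_per_sec, demand_gb_per_sec_per_core):
    """Speed-up over one core when a shared bus caps total throughput.

    Assumed model: every core wants the same memory traffic, so the bus
    saturates once cores * demand reaches the total bandwidth. Adding
    cores beyond that point yields no further speed-up.
    """
    saturation_point = total_gb_per_sec / demand_gb_per_sec_per_core
    return min(cores, saturation_point)


if __name__ == "__main__":
    BUS_GB_PER_SEC = 10.0     # held constant, as labeled in Figure 1
    DEMAND_GB_PER_SEC = 2.5   # hypothetical per-core demand, an assumption
    for cores in (1, 2, 4, 8, 16):
        share = per_core_bandwidth(BUS_GB_PER_SEC, cores)
        speedup = bandwidth_bound_speedup(cores, BUS_GB_PER_SEC, DEMAND_GB_PER_SEC)
        print(cores, "cores:", share, "GB/sec per core, speed-up", speedup)
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Under these assumptions the speed-up stops growing at four cores, while the share of the bus available to each core keeps shrinking as cores are added. A model this simple produces only a plateau; the actual decline that Sandia Labs reports beyond eight cores would require modeling contention as well.&lt;/FONT&gt;&lt;/P&gt;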
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This is more than a little suspicious. Hardware manufacturers like Intel and AMD understand clearly that the memory bus has to scale with the number of processors. The AMD &lt;/FONT&gt;&lt;A title="HyperTransport specifications" href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency" mce_href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency"&gt;&lt;FONT size=3&gt;HyperTransport&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; bus architecture is quite explicit about this, and the latest spec for HyperTransport version 3.1 has an aggregate bandwidth in excess of 50 GB/sec.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Meanwhile, the memory bus capacity on the latest &lt;/FONT&gt;&lt;A title="Nhealem architecture announcement" href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx"&gt;&lt;FONT size=3&gt;Nehalem&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;-class processors from Intel has been boosted significantly. Alternatively, it is when you cannot scale the memory bus with processor capacity that machines with &lt;/FONT&gt;&lt;A title="Blogging about NUMA 2008" href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;&lt;FONT size=3&gt;NUMA&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; architectures become more attractive. The AMD processors use HyperTransport links&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;in a ring topology that implicitly leads to NUMA-characteristics. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;In Intel’s approach to NUMA scalability, some small number of processors share a common memory bus, forming a &lt;I style="mso-bidi-font-style: normal"&gt;node&lt;/I&gt;. Current Nehalem machines (also known as the Core i7 architecture) have four cores sharing a common path to memory; unlike earlier Intel designs, Nehalem has no Front-side bus (FSB) and instead attaches DDR3 DRAM directly through an integrated memory controller. The physical layout of this chip is shown in the photograph in Figure 2. I wasn’t able to find a speed rating for the Nehalem memory interface on Intel’s web site or elsewhere, other than ballpark estimates that put its speed in the range of 30-40 GB/sec. The QuickPath Interconnect (QPI) links that are used to link memory controllers support 25 GB/sec transfers, which is probably a safe lower bound on the capacity of the memory interface. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;&lt;IMG style="WIDTH: 526px; HEIGHT: 363px" title="Core i7 4-way multiprocessor photo" alt="Core i7 4-way multiprocessor photo" src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg" width=526 height=363 mce_src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg"&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;Figure 2. Aerial photograph showing the layout of the 4-way Core i7 (Nehalem) microprocessor chip.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The &lt;/FONT&gt;&lt;A href="https://cfwebprod.sandia.gov/cfdocs/CCIM/docs/pim-mpi.pdf"&gt;&lt;FONT color=#0000ff size=3&gt;PIM architecture&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;, whose scalability curves are close to ideal for the Sandia Labs workloads, is, probably not coincidentally, a processor architecture championed at Sandia Labs. The idea behind PIM machines &lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-ascii-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;(&lt;U&gt;P&lt;/U&gt;rocessor &lt;U&gt;I&lt;/U&gt;n &lt;U&gt;M&lt;/U&gt;emory) &lt;/SPAN&gt;is that the processor (or processors) is embedded into the memory chip itself, which is a pretty interesting approach to solving the “memory wall” that limits performance in today’s dominant computer architectures. Instead of loading up the microprocessor socket with more and more cores, which is the professed hardware roadmap at Intel &amp;amp; AMD, integrating memory into the socket is an intriguing alternative. Such machines, if anyone were to build them, would obviously have NUMA performance characteristics.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The debate is a bit academic for my taste, however, until these PIM architecture machines are a reality. For PIM architecture machines to ever get traction, either the microprocessor manufacturers would have to start building DRAM chips or the DRAM manufacturers would have to start building microprocessors. The way the semiconductor fabrication business is stratified today, that does not appear to be very likely in the near future.&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#000000&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;So, in the end, the point of the Sandia Labs press release appears to be to publicize the multiprocessor hardware direction espoused mainly by Sandia Labs’ own researchers. Frankly, there have been lots and lots of different architectural approaches to parallel processing over the years, and it doesn’t look like any one approach is optimal for all computing situations. You ought to be able to pick another parallel programming workload to simulate in Figure 1 and get an entirely different ranking of the approaches.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Still, the Sandia Labs simulation data are interesting mainly for what they say about how difficult it is going to be for developers to write parallel programs that scale well on multi-core machines. No, achieving parallel scalability isn’t child’s play for hardware manufacturers. Nor is it for software developers attempting to take advantage of parallel processing hardware, which is the subject I will start to drill into next time.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;A title="Continue to Part 2" href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx"&gt;Continue to Part 2....&lt;/A&gt;.&lt;/P&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9481780" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Hardware/">Hardware</category></item><item><title>Visual Studio 2010 Hardware Requirements</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/12/23/visual-studio-2010-hardware-requirements.aspx</link><pubDate>Wed, 24 Dec 2008 10:56:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9251566</guid><dc:creator>David Berg</dc:creator><slash:comments>23</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9251566</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/12/23/visual-studio-2010-hardware-requirements.aspx#comments</comments><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Soma’s been talking about the upcoming Visual Studio 2010 release on his &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/somasegar/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;, which means I’m starting to get questions about what type of hardware you’re going to need to run VS2010 
on.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Unfortunately, I can’t give you an official answer yet (other than to say, it depends on what you’re doing – obviously building small apps with one of the Express versions of Visual Studio won’t require the same resources as a multi-million line app using full blown Visual Studio Team System with lots of third party add-ins).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;What I can do is help put some of the things we’ve said about Visual Studio 2010 into context, to maybe help you make some better hardware decisions today:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;1)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Memory – we’re trying to make VS2010 as frugal as we can here in order to run in as little memory as possible; however, we’re also adding a lot of functionality, and systems with more memory do tend to perform much better.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;So the general rule of buying systems still applies – spring for as much memory as you can afford.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;It’s hard to have too much memory; at a minimum, you want to make sure that you’re not paging.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;That said, there’s very little benefit to making a text editor 64 bit (and lots of reasons not to), so anything over 4GB is likely to be wasted (unless you’re running or writing apps that need more).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; mso-add-space: auto"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;2)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;CPU – modern CPUs with their larger caches and tuned instruction pipelines tend to perform much better than ones from just a few years ago (see our &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you’re going to do multi-threaded programming, you’ll want at least a dual core processor (and with the new &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/concurrency/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Parallel Computing&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; support in VS 2010, you &lt;U&gt;will&lt;/U&gt; want to do multi-threaded programming).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;3)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;GPU – VS2010 will leverage WPF heavily to create richer editing and visualizations, so a decent GPU that supports at least DX9 is highly recommended (DX10 is preferred, but requires Vista).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;4)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Disk – If you’re building a large project or working with a large database, a large high-speed disk is pretty important.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;For large projects, you can often benefit by spreading your work across multiple disk spindles.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;At an extreme, putting your tools on one drive, your source code on another, and your object files on a third drive allows the three major sources of disk IO in building a project to be carried out independently of each other.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you have to use a slower disk (e.g. in a notebook) then be sure to get lots of memory.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Also keep in mind that modern hard drives tend to have more built-in caching, so the same speed drive bought recently will likely outperform one bought a few years ago.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;So now that I’ve given you my thoughts on what hardware Visual Studio 2010 will need, what are your thoughts? &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;What kind of hardware are you developing on today, and what do you expect to be using in the next couple years?&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;What are your expectations on how we should be leveraging your hardware to create a productive development environment?&lt;/FONT&gt;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9251566" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Hardware/">Hardware</category></item><item><title>PDC2008 preConference Workshop</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/10/22/pdc2008-preconference-workshop.aspx</link><pubDate>Wed, 22 Oct 2008 18:36:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9011219</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=9011219</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/10/22/pdc2008-preconference-workshop.aspx#comments</comments><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Over the past several weeks, I have been working overtime developing a presentation on web application performance to be given at the upcoming Professional Developer’s Conference (PDC), which is next week in Los Angeles. 
This is partly why I have been remiss about blogging this month. At least, that is my excuse, and I am sticking to it.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The presentation is entitled “Performance by design using the .NET Framework” and I am presenting jointly with two colleagues in the Developer Division, Rico Mariani and Vance Morrison. It is one of ten PreConference sessions that are scheduled to run all day on Sunday. My portion of the session is an extended discussion of optimization &amp;amp; scaling strategies for web applications. The scope encompasses ASP.NET, AJAX, Silverlight, WPF &amp;amp; WCF. Information about the upcoming event is &lt;/FONT&gt;&lt;A href="http://www.microsoftpdc.com/Agenda/Preconference.aspx#performance-by-design-using-the-net-framework" mce_href="http://www.microsoftpdc.com/Agenda/Preconference.aspx#performance-by-design-using-the-net-framework"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;I have attended several PDCs in the past as a customer, and found them to be amazing events. Before the days of widespread blogging, the “Ask the Experts” sessions at the PDC were often the only way to get an authoritative answer to your question. The actual Conference sessions emphasize imminently arriving technology and future directions, aimed at the professional developer who needs to be able to anticipate and plan. The technical sessions run the gamut from Windows 7, the Windows Live Cloud computing initiatives, IE8, Surface and Windows for Workflow. There will be previews of the next version of the .NET Framework, Visual Studio, and the Visual Studio Team System. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;This is the first time I will be on the other side of the podium for the event. In our preCon session, Rico, Vance and I will focus on facilities available in the Framework today, including the best practices and tools we recommend to help you design &amp;amp; build an application that meets its performance and scalability requirements. The intended audience is experienced .NET developers. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;If you are reading this blog &amp;amp; coming to my session, be sure to say hello. I’d like to get the chance to meet you in person. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;-- Mark Friedman&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9011219" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+Engineering/">Performance Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+testing/">Performance testing</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/PDC2008/">PDC2008</category></item><item><title>Mainstream NUMA and the TCP/IP stack: Final Thoughts</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx</link><pubDate>Fri, 19 Sep 2008 00:18:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8957878</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>5</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8957878</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx#comments</comments><description>&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;This is a continuation of Part IV of this article posted &lt;A class="" title=Link-back-to-Part4 href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx"&gt;&lt;FONT 
color=#666666&gt;here&lt;/FONT&gt;&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Note that a final version of a white paper tying this series of five blog entries together and a PowerPoint presentation on the subject are attached.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;For many years, the effort to improve network performance on Windows and other platforms focused on reducing the host processing requirements associated with the need to service frequent interrupts from the NIC. In the many-core era where the clock speeds of processors are constrained by power considerations, this strategy is inadequate to the growing host processing requirements that accompany high-speed networking. It is necessary to augment technologies like interrupt moderation and TCP Offload Engine that improve the efficiency of network I/O with an approach that allows TCP/IP Receive packets to be processed in parallel across multiple CPUs. Together, MSI-X and RSS are technologies that enable host processing of TCP/IP packets to scale in the many-core world, albeit not without some compromises with the prevailing model of networking using isolated, layered components.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="COLOR: black"&gt;&lt;FONT face=Calibri&gt;Using MSI-X and RSS, for example, the Intel 82598 10 Gigabit Ethernet Controller mentioned earlier can be mapped to a maximum of 16 processor cores that could then be devoted to networking I/O interrupt handling. This is still not enough processing capacity to handle the theoretical maximum load that equation 3 predicts for a 10 Gb Ethernet card, but it does represent a substantial scalability improvement.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With this understanding of what MSI-X and RSS accomplish, let’s return for a moment to our NUMA server machine shown in Figure 6 below.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;IMG title="NUMA server with multiple RSS queues" style="WIDTH: 364px; HEIGHT: 633px" height=633 alt="NUMA server with multiple RSS queues" src="http://5l3vgw.bay.livefilestore.com/y1pP1tl3lheOVmfXixoNk6WzdhcLnXhAbVSW28AD1IJ3YyHN1ZbYhAQygJHF1fesNHfPK3ehJ6yE4w/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg" width=364 mce_src="http://5l3vgw.bay.livefilestore.com/y1pP1tl3lheOVmfXixoNk6WzdhcLnXhAbVSW28AD1IJ3YyHN1ZbYhAQygJHF1fesNHfPK3ehJ6yE4w/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With MSI-X and Receive-Side Scaling, CPU 0 on node A and CPU 1 on node B are both enabled for processing network interrupts. Since RSS schedules the NDIS DPC to run on the same processor as the ISR, even at moderate networking loads, CPU 0 and 1 for all practical purposes become dedicated to the processing of high priority networking interrupts. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Numerous economies of scale accrue using this approach. The same RSS process that sends all Receive packets from a single TCP connection to a specific CPU for processing improves the efficiency of that processing. The instruction execution rate of the TCP/IP protocol stack is enhanced significantly through this scheduling mechanism that enforces localization. Ultimately, TCP/IP application data buffers need to be allocated from local node memory and processed by threads confined to that node. Recently used data and instructions that networking ISRs and DPCs issue tend to reside in the dedicated cache (or caches) associated with the processor devoted to network I/O. Or, at the very least, they migrate to the last level cache that is shared by all the processors on the same NUMA node.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Ultimately, of course, the TCP layer hands data from the network I/O to an application layer that is ready to receive and process it. The implications of RSS for the application threads that process TCP receive packets and build responses for TCP/IP to send back to network clients ought to be obvious, but I will spell them out anyway. For optimal performance, these application processing threads also need to be directed to run on the same NUMA node where the TCP Receive packet was processed. This localization of the application’s threads should, of course, be subject to other load balancing considerations to prevent the ideal node from becoming severely over-committed while other CPUs on other nodes are idling or under-utilized. The performance penalty for an application thread that must run on a different node than the one that processed the original TCP/IP Receive packet is considerable because it must access the data payload of the request remotely. Networked applications need to understand these performance and capacity considerations and schedule their threads accordingly to balance the work across NUMA nodes optimally.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Consider the ASP.NET application threads that process incoming HTTP Requests and generate HTTP Response messages. If the HTTP Request packet is processed by CPU 0 on node A in a NUMA machine, the Request packet payload is allocated in node A local memory. The ASP.NET application thread running in User mode that processes that incoming HTTP Request will run much more efficiently if it is scheduled to run on one of the other processors on node A, where it can access the payload and build the Response message using local node memory. &lt;/FONT&gt;&lt;/P&gt;
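&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The scheduling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing Windows or ASP.NET facility: the class name, the CPU-to-node map and the queue depth limit are all hypothetical, and the sketch assumes the application somehow knows which CPU processed the Receive packet. Each request is queued for workers on the node that received it, unless that node's queue is already full, in which case the request spills to the least loaded node.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;

```python
from collections import deque


class NodeAwareDispatcher:
    """Route each request to a worker queue on the NUMA node that received it.

    Hypothetical sketch: cpu_to_node and queue_depth are illustrative inputs,
    not values exposed by any real Windows API.
    """

    def __init__(self, cpu_to_node, queue_depth):
        self.cpu_to_node = dict(cpu_to_node)
        self.queue_depth = queue_depth
        # Sort the node names so that tie-breaking below is deterministic.
        nodes = sorted(set(self.cpu_to_node.values()))
        self.queues = {node: deque() for node in nodes}

    def has_room(self, node):
        # True while the node's queue holds fewer than queue_depth requests.
        return len(self.queues[node]) in range(self.queue_depth)

    def dispatch(self, request, receive_cpu):
        """Queue the request and return the node chosen to process it."""
        home = self.cpu_to_node[receive_cpu]
        if self.has_room(home):
            # Preferred case: payload and worker thread share local memory.
            target = home
        else:
            # Home node is over-committed: spill to the least loaded node.
            target = min(self.queues, key=lambda n: len(self.queues[n]))
        self.queues[target].append(request)
        return target
```

&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With CPU 0 mapped to node A and CPU 1 mapped to node B, as in Figure 6, requests received on CPU 0 stay on node A until its queue reaches the depth limit, and only then spill over to node B, where they pay the remote memory access penalty.&lt;/FONT&gt;&lt;/P&gt;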
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;There is currently no mechanism in Windows for kernel mode drivers like ndis.sys and http.sys to communicate to the application layers above them and specify the NUMA node on which a packet was originally processed. Communicating that information to the application layer is another grievous violation of the principle of isolation in the network protocol stack, but it is a necessary step to improve the performance of networking applications in the many-core era where even moderately sized server machines have NUMA characteristics.&lt;/FONT&gt;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;BR&gt;&lt;FONT face=Cambria color=#4f81bd size=2&gt;Links.&lt;/FONT&gt;&lt;/H3&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb’s Journal, March 1, 2005. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.ddj.com/architect/184405990" mce_href="http://www.ddj.com/architect/184405990"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://www.ddj.com/architect/184405990&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;NTttcp performance testing tool: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.microsoft.com/whdc/device/network/TCP_tool.mspx" mce_href="http://www.microsoft.com/whdc/device/network/TCP_tool.mspx"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://www.microsoft.com/whdc/device/network/TCP_tool.mspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Windows Performance Toolkit (WPT, aka xperf): &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/cc305218.aspx" mce_href="http://msdn.microsoft.com/en-us/library/cc305218.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/cc305218.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;David Kanter, “The Common System Interface: Intel's Future Interconnect,” &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print" mce_href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Windows NUMA support: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/aa363804.aspx" mce_href="http://msdn.microsoft.com/en-us/library/aa363804.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/aa363804.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;/SPAN&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Intel white paper: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.intel.com/technology/ioacceleration/306484.pdf" mce_href="http://www.intel.com/technology/ioacceleration/306484.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;Accelerating High-Speed Networking with Intel® I/O Acceleration Technology&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Mark B. Friedman, “&lt;/FONT&gt;&lt;A class="" title=SANCapacityPlanningLink name=SANCapacityPlanningLink&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf" mce_href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;An Introduction to SAN Capacity Planning&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;,” &lt;I style="mso-bidi-font-style: normal"&gt;Proceedings&lt;/I&gt;, Computer Measurement Group, Dec. 2001.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Jeffrey Mogul’s “TCP offload is a dumb idea whose time has come,” &lt;I style="mso-bidi-font-style: normal"&gt;Proceedings&lt;/I&gt; of the 9th conference on Hot Topics in Operating Systems - Volume 9, 2003. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748" mce_href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;/SPAN&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Dell Computer Corporation, “&lt;/FONT&gt;&lt;A class="" title=DellTCPOffloadwhitepaper name=DellTCPOffloadwhitepaper&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf" mce_href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;Boosting Data Transfer with TCP Offload Engine Technology&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;.”&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, KB 951037, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://support.microsoft.com/kb/951037" mce_href="http://support.microsoft.com/kb/951037"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://support.microsoft.com/kb/951037&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, Windows Driver Development Kit (DDK) documentation, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/cc264906.aspx" mce_href="http://msdn.microsoft.com/en-us/library/cc264906.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/cc264906.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, KB 927168, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://support.microsoft.com/kb/927168" mce_href="http://support.microsoft.com/kb/927168"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://support.microsoft.com/kb/927168&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation,&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;NDIS 6.0 Receive-Side Scaling documentation, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/ms795609.aspx" mce_href="http://msdn.microsoft.com/en-us/library/ms795609.aspx"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt; mso-ascii-font-family: Calibri; mso-hansi-font-family: Calibri; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://msdn.microsoft.com/en-us/library/ms795609.aspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 
9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8957878" width="1" height="1"&gt;</description><enclosure url="http://cid-12a53f90793d2c8b.skydrive.live.com/self.aspx/DDPEBlogImages/Presentations%20and%20Papers/Mainstream%20NUMA%20and%20the%20TCP%20|5CMG%20paper%208220%20draft|6.docx" length="24241" type="text/html; charset=utf-8" /><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+Engineering/">Performance Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/-NET/">.NET</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category></item><item><title>Mainstream NUMA and the TCP/IP stack, Part IV: Parallelizing TCP/IP</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx</link><pubDate>Tue, 09 Sep 2008 02:35:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8935373</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>2</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8935373</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx#comments</comments><description>&lt;FONT face=Calibri&gt;
&lt;P&gt;This is a continuation of Part III of this article posted &lt;A class="" title=Link-back-to-Part3 href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx"&gt;&lt;FONT color=#666666&gt;here&lt;/FONT&gt;&lt;/A&gt;.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In the many-core era, the host processor overhead associated with processing TCP/IP interrupts is not a capacity problem, since CPU cycles on the host computer are plentiful and becoming more plentiful all the time. The problem is that the individual processors themselves are not fast enough, nor are they growing much faster. To craft a solution that works in the many-core era, there is a clear need to enhance the hardware and software in the TCP/IP protocol stack to run in parallel across multiple processors and take advantage of the available capacity. There are two hardware and software technologies that are associated with that capability today:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Extended Message-Signaled Interrupts (MSI-X): a hardware technology that allows the NIC to support multiple interrupt vectors, enabling multiple processor cores to handle interrupts from the NIC simultaneously.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Receive-Side Scaling (RSS): the protocol used in the NDIS driver software to manage multiple interrupt vectors and communicate to the hardware to ensure that session-oriented TCP packets are delivered in sequence to a processor-specific interrupt queue. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;MSI-X and RSS work together to allow the processing of TCP/IP Receive packets to scale in parallel across multiple processor cores.&lt;/FONT&gt;&lt;/P&gt;
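&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;As a rough illustration of how RSS keeps every packet of one TCP connection on the same processor, the Python sketch below hashes the connection's address and port 4-tuple and uses the result to pick a CPU. It is a simplification under stated assumptions: RSS-capable NICs compute a Toeplitz hash and look the result up in an indirection table, whereas this sketch substitutes CRC-32 and a simple modulo. The function names and the list of RSS-enabled CPUs are hypothetical.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;

```python
import ipaddress
import zlib


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Pack a TCP/IPv4 connection 4-tuple into 12 bytes."""
    return (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + src_port.to_bytes(2, "big")
        + dst_port.to_bytes(2, "big")
    )


def select_cpu(src_ip, src_port, dst_ip, dst_port, rss_cpus):
    """Pick the CPU whose interrupt queue receives this connection's packets.

    CRC-32 is a stand-in for the hash a real NIC computes; the modulo is a
    stand-in for the RSS indirection table.
    """
    flow_hash = zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port))
    return rss_cpus[flow_hash % len(rss_cpus)]
```

&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Because the hash depends only on the 4-tuple, every Receive packet for a given connection maps to the same CPU, which is what preserves in-sequence delivery, while different connections are spread across the RSS-enabled CPUs.&lt;/FONT&gt;&lt;/P&gt;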
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Extended Message-Signaled Interrupts (MSI-X)&lt;/I&gt;. MSI-X is an architectural change that allows a device to send interrupts to be processed on multiple CPUs. Historically, on the Intel architecture, devices were limited to sending interrupts to a single target. Concentrating all hardware interrupts on a single processor boosts the instruction execution rate of the Interrupt Service Routine (ISR) by increasing the chances of a processor cache warm start. In the many-core era, limiting a device to interrupting a single processor is a severe capacity constraint. MSI-X capabilities allow NICs to scale on many-core processors. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;One key feature of Windows’ support for MSI-X devices is the ability to specify a policy that automatically assigns MSI-X interrupts to CPUs based on the OS’s understanding of the underlying NUMA topology of the machine. An NDIS driver that supports MSI-X devices can specify an &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors&lt;/I&gt; policy that automatically distributes interrupts across an optimal set of eligible processors. On some NUMA machines, the performance of the device connection is affected by the underlying topology of the multi-node connections. For instance, certain device-to-processor node connections may be low latency local ones, while others are higher latency remote connections. For performance reasons, you want NIC interrupts to be processed on nodes that are connected locally and access local memory on that node exclusively. For optimal scalability, you then want to balance device interrupts across all the NUMA nodes that are interconnected. The &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors &lt;/I&gt;policy understands these performance considerations, and distributes the device interrupts to the right set of processors automatically.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Figure 6 illustrates one way the &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors&lt;/I&gt; policy could be used to distribute interrupts from the NIC across nodes in a simple NUMA machine. A server with two quad-core sockets is shown, with each socket connected to a block of local RAM. Memory accesses from a processor core to local RAM are considerably faster than an access to remote memory attached via a bridge to the other multi-core socket. An optimal configuration is to process TCP/IP interrupts on CPU 0 on the first node and on CPU 1 on the second node, as depicted, balancing the networking I/O load across nodes. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;IMG title="NUMA machine with two RSS queues" style="WIDTH: 364px; HEIGHT: 633px" height=633 alt="NUMA machine with two RSS queues" src="http://5l3vgw.bay.livefilestore.com/y1plB-YgF_mL03SwgvEEnctbltcS7spfhLdNbX9F-mfjqSgfnbCfwdqGeSijVe4EpzZB3v3PL7MFqc/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg" width=364 mce_src="http://5l3vgw.bay.livefilestore.com/y1plB-YgF_mL03SwgvEEnctbltcS7spfhLdNbX9F-mfjqSgfnbCfwdqGeSijVe4EpzZB3v3PL7MFqc/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg"&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt;Figure 6.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt; Two NUMA nodes in a Windows Server machine configured to use MSI-X and RSS to process TCP/IP Receive packets across multiple processors.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;While Receive-Side Scaling (RSS) does not require MSI-X, the two technologies normally go hand-in-hand. We restrict the RSS discussion here to the manner in which MSI-X devices are supported, which is both the simplest and most common case.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;&lt;I style="mso-bidi-font-style: normal"&gt;&lt;SPAN style="COLOR: black"&gt;Receive-Side Scaling (RSS)&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN style="COLOR: black"&gt;. &lt;/SPAN&gt;RSS complements the Windows support for MSI-X. It allows the workload associated with processing network interrupts to be spread across multiple CPUs. With RSS, the DPC routine, which, as we have seen, performs the bulk of the host processing, is scheduled to run on the same CPU where the interrupt service routine (ISR) just ran. Concentrating all the work associated with network interrupt processing on the same CPU improves instruction execution rates because data associated with the packet is likely to remain in the processor caches. It also dramatically reduces delays spent in unproductive spin lock code associated with serialization. Optimistic, non-blocking, per-processor locking strategies are effective under these circumstances. By default under RSS, even the Send processing associated with an ACK message is performed on the same CPU where the Receive was processed, to take advantage of the same performance considerations.&lt;SPAN style="COLOR: black"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;One complication that RSS is forced to address arises when network interrupts are distributed across multiple CPUs. If packets are distributed randomly across multiple CPUs, this can conflict with the important function of the TCP protocol that guarantees delivery of data in sequence to the application. Suppose packets for a group of TCP connections are processed across two CPUs, one of which is lightly loaded while the other is heavily loaded. Packets that arrive later but land on the lightly loaded CPU could easily be processed before earlier packets still waiting on the heavily loaded one. Receiving packets out of order in TCP can trigger Fast Retransmits, for example, which could both degrade the network and delay the application, not to mention the serialization delays incurred before TCP can safely notify the application layer that Request data is available for processing.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Given this complication, RSS distributes connections, not individual packets. RSS has a mechanism that sends all the packets associated with any one TCP connection to the same processor. This preserves the order of delivery of received data packets, which avoids needless requests for TCP retransmits. Crucially, the processor associated with the specific connection must be communicated to the NIC, which must arrange Received packets into the correct message queues accordingly, prior to signaling the host processor by raising an interrupt. This coordination, of course, is another violation of the isolation principle of the layered networking stack. It is worth noting that nasty side effects can arise as a result of this willful violation of the layered networking architecture; see, for example, &lt;/FONT&gt;&lt;A href="http://support.microsoft.com/kb/927168"&gt;&lt;FONT color=#0000ff size=2&gt;KB927168&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=2&gt; &lt;/FONT&gt;&lt;FONT size=2&gt;documenting a conflict between RSS and Internet Connection Sharing on Vista that was later fixed in WS2008 and Vista SP1.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;To achieve good performance, however, it is absolutely necessary for the NIC to schedule all the packets for the same TCP session to the same host processor. It can only do that by peeking into the TCP header and finding the port indicator, which it then uses to calculate the right CPU to deliver the packet to. This calculation is based on a hash table that is passed to the NIC by the NDIS driver software. RSS even includes a capability to dynamically adjust the load across the CPUs that are enabled for processing NIC interrupts. The protocol stack in Windows can re-balance the interrupt load by modifying the hash table passed to the NIC that is used in determining the proper CPU. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;This mechanism can be used in case some CPUs remain overloaded for an extended period of time because, for example, some TCP connections are more chatty and persistent than others. &lt;/FONT&gt;&lt;/P&gt;
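&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;As a rough illustration of the table-driven mapping described above, the following Python sketch hashes a connection identifier and uses the result to index a table of CPU numbers. It is a sketch under stated assumptions, not the actual NDIS or NIC implementation: the function names (make_table, cpu_for, rebalance), the 128-entry table size, and the use of CRC32 as a stand-in for the hash the hardware really computes are all hypothetical choices made for illustration.&lt;/FONT&gt;&lt;/P&gt;

```python
import zlib


def make_table(cpus, size=128):
    # Hypothetical table: slot i names the CPU that handles hash bucket i.
    return [cpus[i % len(cpus)] for i in range(size)]


def cpu_for(conn, table):
    # conn is (src_ip, src_port, dst_ip, dst_port).
    # Assumption: CRC32 stands in for whatever hash the NIC really uses.
    key = "|".join(str(part) for part in conn).encode("ascii")
    return table[zlib.crc32(key) % len(table)]


def rebalance(table, busy_cpu, idle_cpu, slots_to_move):
    # Host-side rebalancing: repoint some buckets from a busy CPU to an
    # idle one and return the modified table to hand back to the NIC.
    new_table = list(table)
    moved = 0
    for i, cpu in enumerate(new_table):
        if moved == slots_to_move:
            break
        if cpu == busy_cpu:
            new_table[i] = idle_cpu
            moved += 1
    return new_table
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Because the CPU is derived only from the connection identifier, every packet of a given connection lands on the same CPU until the host supplies a modified table.&lt;/FONT&gt;&lt;/P&gt;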
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Speaking of maintaining a balanced system, long-running tasks such as large file copies associated with a single ftp, SMB or media server session present inherent difficulties under RSS. The general problem is that the throughput of any one session is ultimately limited by host processor speed. With many-core processors, it is important to figure out how to use parallel data divide-and-conquer techniques to break long serial operations into smaller sub-tasks that can be executed concurrently. Providing the capability to spread long, data-intensive operations across multiple TCP sessions is one possible approach.&lt;/FONT&gt;&lt;/P&gt;
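&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;The first step in spreading one long transfer across several TCP sessions is deciding which bytes each session carries. The Python sketch below shows one simple way to do that partitioning. The helper name split_ranges and the even-split policy are assumptions made for illustration; the text above does not prescribe any particular scheme.&lt;/FONT&gt;&lt;/P&gt;

```python
def split_ranges(total_bytes, sessions):
    # Hypothetical helper: divide a transfer of total_bytes into at most
    # `sessions` contiguous (offset, length) ranges of near-equal size,
    # one range per TCP session.
    base, extra = divmod(total_bytes, sessions)
    ranges = []
    offset = 0
    for i in range(sessions):
        # The first `extra` ranges each carry one additional byte.
        length = base + (1 if i in range(extra) else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Each range would then travel over its own connection, so RSS is free to place the sessions on different CPUs instead of funneling the whole copy through one.&lt;/FONT&gt;&lt;/P&gt;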
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;For further technical details on RSS, see &lt;A href="http://msdn.microsoft.com/en-us/library/ms795609.aspx"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-ascii-font-family: Calibri; mso-hansi-font-family: Calibri; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff&gt;http://msdn.microsoft.com/en-us/library/ms795609.aspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;. One interesting aspect of the RSS specification is that the DPC, not the ISR, is responsible for re-enabling the processor for more interrupts from the NIC. This prevents the NIC from sending any more Receive packets to the processor until the previous set has been completely processed. This effectively acts as both a serialization mechanism and a form of interrupt moderation that adaptively adjusts the delay between interrupts based on the specific processing load at the CPU.&lt;/FONT&gt;&lt;/P&gt;
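&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;The moderating effect of having the DPC re-enable interrupts can be seen in a toy model. The Python sketch below is a deliberately simplified simulation, not driver code: the function name count_interrupts, the single-CPU model, and the fixed per-packet processing cost are all assumptions made for illustration.&lt;/FONT&gt;&lt;/P&gt;

```python
def count_interrupts(arrival_times, cost_per_packet):
    # arrival_times: sorted packet arrival times (arbitrary time units).
    # cost_per_packet: assumed fixed DPC processing time per packet.
    interrupts = 0
    i = 0
    n = len(arrival_times)
    while i != n:
        # Interrupts are enabled and a packet arrives: the NIC interrupts,
        # the ISR masks further NIC interrupts and queues the DPC.
        interrupts += 1
        clock = arrival_times[i]
        # The DPC drains every packet that has arrived by the time it is
        # ready for the next one; none of these raises an interrupt.
        while i != n and clock >= arrival_times[i]:
            clock += cost_per_packet
            i += 1
        # Nothing left to process: the DPC re-enables NIC interrupts.
    return interrupts
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;In this model, widely spaced packets each cost one interrupt, while a burst that arrives faster than the DPC can drain it is absorbed by a single interrupt, which is the adaptive behavior described above.&lt;/FONT&gt;&lt;/P&gt;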
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;This blog entry is continued &lt;A class="" title="Link to Part V" href="http://blogs.msdn.com/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx"&gt;here&lt;/A&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8935373" width="1" height="1"&gt;</description></item><item><title>Performance improvements in Service Pack 1 for VS 2008 and .NET FX 3.5</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/08/13/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx</link><pubDate>Wed, 13 Aug 2008 14:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8860575</guid><dc:creator>David Berg</dc:creator><slash:comments>6</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8860575</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/08/13/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx#comments</comments><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;We just announced the release of &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/vstudio/products/cc533447.aspx" mce_href="http://msdn.microsoft.com/en-us/vstudio/products/cc533447.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Service Pack 1 for VS 2008 and .NET FX 3.5&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;A major push for this release was continuing to enhance performance and reliability, as Soma noted in his &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/somasegar/archive/2008/08/11/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx" mce_href="http://blogs.msdn.com/somasegar/archive/2008/08/11/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx"&gt;&lt;FONT face=Calibri size=3&gt;most recent blog entry&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;. 
&lt;/FONT&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;I want to take a minute to drill into the major performance improvements you will find in this release of Visual Studio.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Framework Performance Enhancements&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.NET FX (CLR):&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;New .NET Framework Client Profile - a smaller .NET Framework redist optimized for .NET client applications.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;The new redist weighs in at around 28 MB, enabling a smaller, faster, more reliable installation experience for .NET client applications on machines that do not already have the .NET Framework installed. The framework was refactored so that it now includes system core libraries and components (including LINQ), language support, XML, Windows Forms, WPF, Deployment, Web Services remoting and serialization, data access, and a few others.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;See the &lt;/FONT&gt;&lt;FONT face=Calibri size=3&gt;BCL Team &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bclteam/archive/2008/05/21/net-framework-client-profile-justin-van-patten.aspx" mce_href="http://blogs.msdn.com/bclteam/archive/2008/05/21/net-framework-client-profile-justin-van-patten.aspx"&gt;&lt;SPAN style="mso-comment-continuation: 2"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN class=MsoCommentReference&gt;&lt;SPAN style="FONT-SIZE: 8pt; LINE-HEIGHT: 115%"&gt;&lt;SPAN style="mso-special-character: comment"&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; for the full list and more 
details.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Client applications should also see an improvement in cold startup scenarios, especially for graphics-rich WPF-based apps. We also made improvements to the working set of Ngen’d images, which also helps cold startup scenarios.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Support for Address Space Layout Randomization (ASLR) on Vista and WS 2008.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;ASLR uses fast kernel mode virtual base address relocation to improve both memory layout and security.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;WPF:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Cold startup up to 40% faster, depending on the scenario and application size, without the need to modify any of your code.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Additional support for text and graphics to deliver better performance. For example, effects like DropShadow and Blur were initially implemented using software rendering; with SP1 these are now implemented using hardware acceleration.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Faster text rendering, mostly when used in specific scenarios such as VisualBrushes, DrawingBrushes, and Viewport2DVisual3D.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Scrolling improvements with Container Recycling.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved working set using TreeView virtualization &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;A much improved WriteableBitmap that enables real-time bitmap updates from a software surface.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Jossef Goldberg’s &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jgoldb/default.aspx" mce_href="http://blogs.msdn.com/jgoldb/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; is a great source of information on WPF performance tips and tricks.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;His detailed list of SP1 performance improvements is posted &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jgoldb/archive/2008/05/15/what-s-new-for-performance-in-wpf-in-net-3-5-sp1.aspx" mce_href="http://blogs.msdn.com/jgoldb/archive/2008/05/15/what-s-new-for-performance-in-wpf-in-net-3-5-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;WCF:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Support for asynchronous HTTP module/handlers on IIS 7.0.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Supports better thread management and improved throughput for systems with heavy backend processing requirements. (See Wenlong’s &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/wenlong/archive/2008/08/13/orcas-sp1-improvement-asynchronous-wcf-http-module-handler-for-iis7-for-better-server-scalability.aspx" mce_href="http://blogs.msdn.com/wenlong/archive/2008/08/13/orcas-sp1-improvement-asynchronous-wcf-http-module-handler-for-iis7-for-better-server-scalability.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; for the technical details.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Windows Forms:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;General performance improvements, mostly due to underlying improvements in the CLR.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Data handling:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved throughput in ADO.NET scenarios (2x+ requests/second for some scenarios).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Performance improvements in XLINQ over XML containing many small elements.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Visual Studio Performance Enhancements&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual Web Developer:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved typing performance in the designer on complex pages (especially with the MultiView control) by 100x.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Fixed the issues with &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/webdevtools/archive/2008/06/18/faster-switch-to-design-view-in-vs-2008-sp1-rtm.aspx" mce_href="http://blogs.msdn.com/webdevtools/archive/2008/06/18/faster-switch-to-design-view-in-vs-2008-sp1-rtm.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Switching to Design View&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Web Sites is up to 10x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Building Web Sites is up to 3x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Web Forms is up to 2x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;General performance improvements in startup and shutdown.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Plus lots of new features and fixes (see the &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/webdevtools/archive/2008/08/11/web-development-updates-in-visual-studio-2008-sp1.aspx" mce_href="http://blogs.msdn.com/webdevtools/archive/2008/08/11/web-development-updates-in-visual-studio-2008-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;team blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual Basic .NET:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Performance improvements in IntelliSense and in listing errors.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improvements in compiler and build throughput (most notably for projects with large amounts of XML comments in a single file).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual C#:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Up to 2x improvement in bringing up IntelliSense with a large number of types.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;XAML Editing:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved designer startup and form load time.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Debugging:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improvements in symbol and source downloading and the ability to cancel out of symbol download from a slow symbol server.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Fix to a performance problem in the debugger when you are stepping through source code that is downloaded from Microsoft Reference Source Server that was caused by downloading the source files again for each breakpoint. Previously released as &lt;/FONT&gt;&lt;A href="http://support.microsoft.com/kb/944899" mce_href="http://support.microsoft.com/kb/944899"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;KB944899&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;. &lt;SPAN style="COLOR: red"&gt;(Please &lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en" mce_href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en"&gt;uninstall this KB&lt;/A&gt; before installing the SP.)&lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;XML Editing:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Loading XML is up to 3x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved editing performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Team Foundation Server:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;In this Service Pack, a large part of the focus was to improve the performance and scalability of Team Foundation Server. Key changes include faster synchronization with Active Directory, improved check-in concurrency, a faster way to create source tree branches, online index rebuilding for less maintenance downtime, and better support for checking in very large sets of code.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;TFS also increased the number of projects a server can support. You should see better scalability on the server, as well as a better client experience when connecting to a server that hosts a large number of projects.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Source Controlled Solutions is up to 2.5x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Deleting files is up to 2x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved Work Item performance (loading, saving, querying).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved UI navigation performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved performance working with TFS work items in Excel and Project.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved performance and reliability of the Visual SourceSafe migration tool.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;See Brian Harry’s blog for more about the &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bharry/archive/2008/08/11/vs-vsts-tfs-net-3-5-sp1-is-shipping.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2008/08/11/vs-vsts-tfs-net-3-5-sp1-is-shipping.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Service Pack Release&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; and &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bharry/archive/2008/04/28/team-foundation-server-2008-sp1.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2008/04/28/team-foundation-server-2008-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Team Foundation Server improvements&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; and scalability.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Other:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;And, of course, there are lots of new features, including the new ADO.NET Entity Framework, ADO.NET Data Services, support for SQL Server 2008’s new features, updated components for Visual Basic and Visual C++ (including an MFC-based Office 2007 Ribbon), and new designer capabilities that improve performance indirectly by improving developer productivity.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Some of these performance fixes were previously released as hot fixes (see our &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/05/12/vs2008-sp1-and-net-fx-beta-performance-improvements.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/05/12/vs2008-sp1-and-net-fx-beta-performance-improvements.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog on the beta&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you installed some of those hot fixes, you may need to &lt;/FONT&gt;&lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en" mce_href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en"&gt;&lt;FONT face=Calibri size=3&gt;remove them&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; before installing the Service Pack.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;See the release notes on the download page for more information.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Should you encounter any performance problems we’ve missed, please continue to let us know here on the blog or by e-mail to &lt;/FONT&gt;&lt;A href="mailto:devperf@Microsoft.com" mce_href="mailto:devperf@Microsoft.com"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;devperf@Microsoft.com&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8860575" width="1" height="1"&gt;</description></item><item><title>Mainstream NUMA and the TCP/IP stack, Part III: A look back at older strategies to scale high-speed networking</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx</link><pubDate>Wed, 06 Aug 2008 02:04:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8835243</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>2</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8835243</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx#comments</comments><description>&lt;FONT face=Calibri&gt;
&lt;P&gt;This is a continuation of Part II of this article posted &lt;A class="" title=Link-back-to-Part2 href="http://blogs.msdn.com/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;/FONT&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;By necessity, both the hardware and the software devoted to processing network traffic need to evolve in the many-core era to become multiprocessor-oriented. On servers that have NUMA architectures, that multiprocessing support needs to acquire a NUMA flavoring. The technology that allows network interrupts to be processed concurrently across multiple processors includes support for &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;multiple Descriptor Queues in the networking hardware, &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Extended Message Signaled Interrupts (MSI-X) to allow hardware interrupts to be serviced concurrently on more than one processor, and&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;the software support in Windows known as Receive-Side Scaling (or RSS). &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With ccNUMA architecture machines becoming more mainstream, it is clear that multi-processor support should also include being NUMA-aware. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;For an idea of how fast the NICs are getting, a typical 1 Gb Ethernet card supports 4 transmit and 4 receive interrupt queues and can spread interrupts across as many as 4 host processors for load balancing under RSS. A 10 Gb Ethernet card necessarily supports even higher levels of parallelism. For example, the dual-ported Intel 82598 10 Gigabit Ethernet Controller provides 32 transmit queues and 64 receive queues per port, which can be mapped to a maximum of 16 processor cores. Note that this increase in parallel processing capacity (16 processors versus 4) is only a 4x improvement over recent 1 Gb Ethernet cards, which is probably inadequate to fully exploit a tenfold increase in bandwidth.&lt;/FONT&gt;&lt;/P&gt;
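&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;As a rough illustration of this kind of load balancing, the following Python sketch maps each TCP connection to one of 4 processors by hashing its address and port 4-tuple, so that every packet belonging to a connection is handled on the same CPU. It is a minimal sketch under stated assumptions, not the Windows implementation: the names pick_cpu and spread are hypothetical, the client addresses are made up, and CRC32 stands in for the Toeplitz hash and indirection table that RSS itself specifies.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import zlib

NUM_CPUS = 4  # the figure quoted for a typical 1 Gb Ethernet card under RSS


def pick_cpu(src_ip, src_port, dst_ip, dst_port, num_cpus=NUM_CPUS):
    """Map a connection's 4-tuple to a CPU index.

    Assumption: CRC32 is only a stand-in hash for illustration; RSS itself
    specifies a Toeplitz hash combined with an indirection table.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode("ascii")
    return zlib.crc32(key) % num_cpus


def spread(flows, num_cpus=NUM_CPUS):
    """Count how many connections land on each CPU."""
    counts = [0] * num_cpus
    for flow in flows:
        counts[pick_cpu(*flow, num_cpus=num_cpus)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical clients, all connecting to port 80 on one server.
    flows = [("10.0.0.%d" % i, 40000 + i, "10.0.1.1", 80) for i in range(1, 101)]
    print(spread(flows))
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Because the CPU is chosen from the connection identity alone, packets within a connection stay in order on one processor while different connections can be serviced in parallel.&lt;/FONT&gt;&lt;/P&gt;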
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Let’s consider briefly some of the ideas to improve TCP/IP performance that have been implemented in the recent past. The strategies discussed here either increase the efficiency of host computer processing of TCP/IP packets or attempt to offload some of these software functions onto networking hardware. These strategies have proven effective, but they do not offer enough capacity relief to keep pace with the steady advance of networking speeds. The way out of the current mismatch between high-speed networks and the host processing requirements they generate is a parallel processing approach where TCP/IP interrupts are distributed across multiple CPUs. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Interestingly, some of the changes outlined here to streamline TCP/IP host processing fly in the face of a major precedent. The layered architecture of the networking stack is widely regarded as one of the storied accomplishments of software engineering. Several of the efforts to improve the performance of host processing of TCP/IP interrupts involve shattering the strict isolation of components that the layered networking model advocates.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In theory, at least, the layered approach simplifies complex software that is designed to function smoothly in diverse environments. Layering also allows development of the components to proceed independently and in parallel. In principle, each layer defines and adheres to a standard set of services, or &lt;I style="mso-bidi-font-style: normal"&gt;interfaces&lt;/I&gt;, that it provides to the component in the layer immediately above it. An upper layer communicates with the layer below it using only this predefined set of abstract interfaces. (The set of services provided and consumed by two adjacent layers, in effect, defines a &lt;I style="mso-bidi-font-style: normal"&gt;contract&lt;/I&gt;.) Furthermore, in the design of the networking protocol, components are isolated from one another. It is an article of faith among software engineers that layered architectures, when properly defined and implemented, greatly contribute to the robustness and reliability of the software built using those design principles.&lt;/FONT&gt;&lt;/P&gt;
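&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The contract idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class names and the text markers standing in for headers are hypothetical, and real protocol headers are binary structures. What it shows is that each layer calls nothing but the send interface of the layer directly beneath it.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
class LinkLayer:
    """Lowest layer in this sketch: frames the payload and records it."""

    def __init__(self):
        self.wire = []  # frames "transmitted" so far

    def send(self, payload):
        self.wire.append(b"ETH|" + payload)


class NetworkLayer:
    """Adds a stand-in network header, then hands off to the layer below."""

    def __init__(self, lower):
        self._lower = lower  # the only thing this layer knows about

    def send(self, payload):
        self._lower.send(b"IP|" + payload)


class TransportLayer:
    """Adds a stand-in transport header, then hands off to the layer below."""

    def __init__(self, lower):
        self._lower = lower

    def send(self, payload):
        self._lower.send(b"TCP|" + payload)


if __name__ == "__main__":
    link = LinkLayer()
    stack = TransportLayer(NetworkLayer(link))
    stack.send(b"hello")
    print(link.wire[0])
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Any layer here could be replaced by another object with the same send method without touching the layers above it, which is the isolation the layered model is meant to guarantee.&lt;/FONT&gt;&lt;/P&gt;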
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;
&lt;TABLE class=MsoNormalTable style="BORDER-RIGHT: medium none; BORDER-TOP: medium none; BACKGROUND: #f9f9f9; BORDER-LEFT: medium none; BORDER-BOTTOM: medium none; BORDER-COLLAPSE: collapse; mso-border-alt: solid #AAAAAA .5pt; mso-yfti-tbllook: 1184" cellSpacing=0 cellPadding=0 border=1 class="MsoNormalTable"&gt;
&lt;TBODY&gt;
&lt;TR style="HEIGHT: 18.7pt; mso-yfti-irow: 0; mso-yfti-firstrow: yes"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #aaaaaa 1pt solid; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; WIDTH: 384.9pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; HEIGHT: 18.7pt; mso-border-alt: solid #AAAAAA .5pt" width=513 colSpan=4&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="COLOR: black; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; mso-bidi-font-size: 10.0pt"&gt;&lt;FONT size=3&gt;TCP/IP Layers&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 1"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt"&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Data unit&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt"&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Layer&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Function&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Example&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 2"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Data&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;5. &lt;/SPAN&gt;&lt;A title="Application layer" href="http://en.wikipedia.org/wiki/Application_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Application&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Application-specific&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN lang=FR style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; mso-ansi-language: FR"&gt;HTTP, SMTP, RPC, SOAP, etc.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 3"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Segment&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;4. &lt;/SPAN&gt;&lt;A title="Transport layer" href="http://en.wikipedia.org/wiki/Transport_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Transport&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;End-to-end connections (sessions) and reliable delivery&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;TCP/UDP&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 4"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Packet/Datagram&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;3. &lt;/SPAN&gt;&lt;A title="Network layer" href="http://en.wikipedia.org/wiki/Network_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Network&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;A title="Logical address" href="http://en.wikipedia.org/wiki/Logical_address"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Logical addressing&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt; and routing; segmentation &amp;amp; re-assembly&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;IP&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 5"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Frame&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;2. &lt;/SPAN&gt;&lt;A title="Data link layer" href="http://en.wikipedia.org/wiki/Data_link_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Data link&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Physical addressing (MAC)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Ethernet, ATM&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 6; mso-yfti-lastrow: yes"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Bit&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;1. &lt;/SPAN&gt;&lt;A title="Physical layer" href="http://en.wikipedia.org/wiki/Physical_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Physical&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Media, signal and binary transmission&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Optical fiber, coax, twisted pair&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt;Table 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt; The TCP/IP layered networking model.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt;&lt;o:p&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Table 4 is a standard representation of the layered networking model used in TCP/IP, which has gained almost universal acceptance in computer-to-computer communications. Take the ubiquitous IP layer, for instance. IP implements a Best Effort service model to deliver packets from one station to another using routing. By design, it is connectionless, session-less and unreliable. Delivery of packets to the correct destination is not guaranteed, but IP does take a “best effort” approach to accomplish this. For applications that require reliable delivery, the higher-level TCP Host protocol guarantees that packets are delivered reliably and in order to the designated application layer above it. It does this using a session-oriented protocol that preserves the state of the message-passing session between packets. TCP has also evolved complicated, performance-oriented flow and congestion control mechanisms that are beyond the scope of the current discussion.&lt;/P&gt;
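&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;To make the division of labor concrete, here is a minimal Python sketch of the in-order delivery service that TCP layers on top of best-effort IP. It is a deliberate simplification and not how any real stack is written: it numbers whole segments rather than bytes, and the function name deliver_in_order and its data layout are assumptions made for illustration.&lt;/P&gt;
&lt;PRE&gt;
```python
def deliver_in_order(arrivals):
    """Return payloads in sequence order from segments that may arrive
    out of order or more than once.

    arrivals: iterable of (sequence_number, payload) pairs in arrival order.
    Simplification: sequence numbers count segments starting at 0, whereas
    real TCP sequence numbers count bytes.
    """
    seen = set()      # sequence numbers already received
    pending = {}      # received segments still waiting for an earlier one
    delivered = []    # payloads handed to the application, in order
    next_seq = 0      # next sequence number the application expects

    for seq, payload in arrivals:
        if seq in seen:
            continue  # a retransmitted duplicate; drop it
        seen.add(seq)
        pending[seq] = payload
        # Hand over every segment that is now contiguous with what was delivered.
        while next_seq in pending:
            delivered.append(pending.pop(next_seq))
            next_seq += 1
    return delivered
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;A segment that arrives ahead of a missing one is held back until the gap is filled, which is exactly the per-connection state that IP, being stateless, does not keep.&lt;/P&gt;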
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The layering approach to networking introduces one additional and crucial design constraint. When one station is transferring a message to another, only components at the same level in the protocol stack can exchange data and communicate with each other. For example, only the TCP component in the receiver is supposed to be able to understand and process information placed into the TCP packet header by the sender. However, both the TCP Offload Engine and Receive-Side Scaling utilize knowledge of what is going on in the upper-layer TCP protocol down in the Data Link (or MAC) layer in the receiver, a serious violation of the principle that the layers in the protocol stack remain totally isolated from each other. Apparently, this is a case where serious performance issues trump pure design principles, and the evolution of TCP/IP has always been sensitive to practical issues of scaling. It is not that you absolutely cannot violate the contract that governs the ways layers communicate, but it is something that should be done very thoughtfully so that your once clean interfaces don’t start to look like Swiss cheese. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;A crucial factor that works to encourage breaking with precedent is that the protocols from the TCP layer down to the hardware all adhere very strictly to the standards in order to promote interoperability. This strict compliance has the effect of hardening the services and interfaces between these layers in cement. This rigidity actually reduces the risk of side effects when a lower level component presumptuously usurps a service that architecturally is defined as the responsibility of some higher level.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Another factor is also at work. Within a layer, components that conform to the same contract layer can, in theory, be freely substituted for each other. This principle of the layered approach is supposed to promote development of a profusion of components that implement different sets of services, but still adhere to the strict requirements of the standard. In fact, the need for interoperability severely limits the proliferation of components that can be freely substituted for each other. Ethernet is almost always the hardware used at the bottom of the stack due to its superior cost/performance. Ethernet is always followed by IP, which is then usually followed by TCP. UDP can be freely substituted for TCP at the host processing layer, but only when the TCP services that ensure reliable delivery of packets and flow control can be dispensed with. In practice, there is very little variety among the components you will see operating in every networking protocol stack. TCP/IP over Ethernet, being ubiquitous, achieves the highest possible degree of interconnectivity and interoperability.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;With the TCP/IP stack so pervasive and so stable and dominant, it then becomes possible to think the unthinkable. It becomes difficult to resist the temptation to violate the principle of isolation if you can demonstrate a big enough performance win. Having the Ethernet layer peek into the TCP packet headers and optimize their processing is acceptable when the violation of this sacrosanct principle of layering yields sufficient performance or scalability improvements. &lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt" mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT face=Cambria color=#4f81bd size=3&gt;Recent performance improvements to TCP/IP host processing. &lt;/FONT&gt;&lt;/H3&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Before we drill into the current set of architectural changes to the networking stack, let’s briefly explore some of the more successful strategies that have been tried in the past for reducing the host computer processing requirements associated with TCP/IP interrupt processing. Stateless processing functions associated with the IP layer were some of the earliest identified that could be performed on the NIC to eliminate some amount of host processing. These offloaded functions include Checksum and segmentation for large Sends, both of which were supported in the Windows 2000 timeframe. Because the IP protocol is stateless and connectionless, there are virtually no side effects to performing these functions on the NIC, even if it does violate the principle of strict isolation between layers in the protocol stack.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Another set of performance improvements that have been implemented recently do potentially generate serious side effects that must be handled rather delicately. These include: &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;interrupt moderation&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;jumbo frames&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;TCP Offload engine (TOE)&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;We will drill into these three approaches next, discussing some of the potential side effects, performance trade-offs, and other issues they raise. It is probably also worthwhile to mention netDMA, which is the Windows support for Intel’s I/O Acceleration Technology (I/OAT), in this context. I/OAT makes targeted improvements in the processor memory architecture to improve the efficiency of NIC-to-memory transfers. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Each of these approaches has worked to a degree, but none has produced enough of a breakthrough in performance to address the underlying condition, the growing mismatch between host processing requirements and network bandwidth. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;As noted earlier, the CPU load associated with processing Ethernet packets with TCP/IP at a server is a long-standing and persistent performance problem that has escaped a satisfactory solution in the past. For many years, the thrust of conventional solutions to the problem was straightforward – namely, any means possible for reducing the number of interrupts that the host computer needs to process. Two of the more effective approaches to reducing the number of interrupts are to use some form of &lt;I style="mso-bidi-font-style: normal"&gt;interrupt moderation&lt;/I&gt; or so-called &lt;I style="mso-bidi-font-style: normal"&gt;jumbo frames&lt;/I&gt;, basically larger packets than the Ethernet standard supports. Both approaches are effective to a degree, but also have serious built-in limitations and drawbacks.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Interrupt moderation&lt;/I&gt; on the NIC is widely used today to reduce the host interrupt processing rate. It is successful, but only to the extent of addressing the processing load associated with each interrupt, which, as indicated in the protocol overhead measurements discussed earlier, is relatively minor. A NIC that supports interrupt moderation can delay the host interrupt for up to a specified period of time with the hope that the NIC will receive additional network packets to process during the delay. Then, instead of each packet causing an interrupt, the host processor can process multiple packets in a single interrupt. In the measurements reported in Part 1, interrupt moderation was used to cut the host processor interrupt rate in half. When you consider, as we have earlier, the potential rate of network interrupts that a 10Gb Ethernet card can drive, some form of interrupt moderation on the NIC becomes essential for the smooth operation of the host processor.&lt;/P&gt;
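&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The effect of the moderation delay can be illustrated with a small model. The Python sketch below is a hypothetical simplification, not the algorithm of any particular NIC: the function name count_interrupts, the microsecond units, and the rule that the delay timer starts at the first packet not yet serviced are all assumptions made for illustration. It also assumes the delay is not negative.&lt;/P&gt;
&lt;PRE&gt;
```python
from bisect import bisect_right


def count_interrupts(arrival_times_us, max_delay_us):
    """Count host interrupts when the NIC may hold each interrupt for up to
    max_delay_us microseconds (assumed to be zero or positive).

    Hypothetical model: the first packet not yet serviced starts a timer, and
    every packet that arrives before the timer expires is handed to the host
    in that same interrupt.
    """
    times = sorted(arrival_times_us)
    interrupts = 0
    i = 0
    while i != len(times):
        interrupts += 1
        # Skip past every packet that arrives within the delay window
        # opened by packet i; those packets share this interrupt.
        i = bisect_right(times, times[i] + max_delay_us)
    return interrupts


# Eight packets arriving 10 microseconds apart (made-up traffic).
arrivals = [0, 10, 20, 30, 40, 50, 60, 70]
print(count_interrupts(arrivals, 0))   # no moderation: one interrupt per packet
print(count_interrupts(arrivals, 10))  # a 10 microsecond delay pairs packets up
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;With this made-up traffic the two calls print 8 and 4. A delay as long as the packet inter-arrival time halves the interrupt count, at the price of holding the first packet of each pair for the full delay.&lt;/P&gt;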
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Interrupt moderation helps, but not enough to relieve the bottleneck at the host CPU. The host processing associated with the TCP/IP protocol appears to scale as a function of &lt;I style="mso-bidi-font-style: normal"&gt;both&lt;/I&gt; the number of interrupts and the amount of data being transferred between the NIC and the host computer. As the average size of data payloads increases, the processing bottleneck shifts to memory latency. See, for example, the bottleneck analysis presented in the Intel white paper “&lt;A href="http://www.intel.com/technology/ioacceleration/306484.pdf"&gt;&lt;FONT color=#0000ff&gt;Accelerating High-Speed Networking with Intel® I/O Acceleration Technology&lt;/FONT&gt;&lt;/A&gt;.” &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Interrupt moderation should be used cautiously in situations where the lowest possible network latency is required, such as two communicating infrastructure servers connected to the same high-speed networking backbone. It also has to be implemented carefully to ensure it does not interfere with the TCP congestion control functions that try to measure round trip time (RTT). &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Jumbo frames&lt;/I&gt;. Sending data across the wire in so-called &lt;I style="mso-bidi-font-style: normal"&gt;jumbo frames&lt;/I&gt; also significantly reduces the number of host interrupts. And there is little question that the size of the Ethernet MTU is sub-optimal for many networking transmission workloads. Consider the relatively large data payloads that routinely need to be transferred between a back-end database machine and the clusters of front-end and middle tier machines in a typical clustered, multi-tier web service application today. Using jumbo frames of, say, 9K payloads on the high speed network backbone linking these servers leads to a 6:1 reduction in the number of host processor interrupts required to transfer sizable blocks of data. When servers are connected to a Storage Area Network (SAN) using iSCSI, even larger frames are desirable.&lt;/P&gt;
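&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The 6:1 figure is simple arithmetic, sketched below in Python under three stated assumptions: a 9K payload is taken to be exactly 9,000 bytes, every frame causes one host interrupt (no interrupt moderation), and protocol header overhead is ignored. The function name frames_needed and the 900,000-byte transfer size are hypothetical.&lt;/P&gt;
&lt;PRE&gt;
```python
def frames_needed(transfer_bytes, payload_bytes):
    """Number of frames required to carry transfer_bytes of data.

    Uses ceiling division, because a partly filled final frame still
    has to be sent as a frame of its own.
    """
    return -(-transfer_bytes // payload_bytes)


STANDARD_PAYLOAD = 1500  # standard Ethernet MTU, in bytes
JUMBO_PAYLOAD = 9000     # assumed size of a 9K jumbo frame payload

transfer = 900_000       # hypothetical block of data to move
standard_frames = frames_needed(transfer, STANDARD_PAYLOAD)
jumbo_frames = frames_needed(transfer, JUMBO_PAYLOAD)
print(standard_frames, jumbo_frames, standard_frames / jumbo_frames)
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;For this transfer the sketch prints 600 frames versus 100, a ratio of 6.0. For transfers that are not an exact multiple of both payload sizes, the ratio comes out slightly below 6 because of the partly filled final frame.&lt;/P&gt;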
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;In fact, jumbo frames appear to be such a simple, effective solution within the confines of the data center that it naturally leads to consideration of what other aspects of the TCP/IP protocol that are sub-optimal in that environment could also be modified. For example, when there is frequent high speed communication between very reliable components, the TCP/IP requirement to acknowledge positively the receipt of every packet is overkill, and it is very tempting to break with the standard and relax that requirement. The superior cost/performance of high speed Ethernet-based networking makes it very tempting to consider as an alternative interconnect technology to use with both SANs and High Performance Computing (HPC) clusters. In both these cases there are alternative linkage technologies that outperform TCP/IP but are also considerably more expensive. For a further discussion of this issue in the context of SAN performance, see “&lt;A href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf"&gt;&lt;FONT color=#0000ff&gt;An Introduction to SAN Capacity Planning&lt;/FONT&gt;&lt;/A&gt;.” And for the HPC flavor of this same discussion, see Jeffrey Mogul’s “&lt;A href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748"&gt;&lt;FONT color=#0000ff&gt;TCP offload is a dumb idea whose time has come&lt;/FONT&gt;&lt;/A&gt;.”&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Unfortunately, using non-standard jumbo frames introduces a significant compatibility problem that severely limits the effectiveness of the solution. The great majority of network clients will reject frames larger than the standard Ethernet MTU of 1500 bytes. In effect, you can send jumbo frames between specific host computers that are equipped to handle them on a dedicated backbone segment readily enough, but you cannot reliably send them to just any machine connected using the IP internetworking layer. So implementing jumbo frames requires more complicated routing schemes. TCP/IP RFC 2923 section 2.1, which is supported in Windows XP SP3, Vista, and Windows Server 2008, allows two TCP peers to negotiate the largest size MTU that can be transmitted between them. But the connectionless and stateless IP routing mechanism means that no single packet transmitted between station A and B need follow the same route twice. Given that the precise route to the destination station is dynamically constructed for each packet, any intermediate router that did not support jumbo frames would reject any non-standard packets it received and prevent successful transmission to the receiver.&lt;/P&gt;
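&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The routing problem can be stated compactly: a frame survives a route only if it fits the smallest MTU of any hop on that route, and because the sender cannot count on a particular route, it has to size frames for the worst one. The Python sketch below illustrates this. The function names path_mtu and safe_frame_size and the hop MTU values are hypothetical and chosen only for illustration.&lt;/P&gt;
&lt;PRE&gt;
```python
def path_mtu(hop_mtus):
    """Largest frame that every hop on a single route can forward."""
    return min(hop_mtus)


def safe_frame_size(routes):
    """Largest frame that survives whichever route a packet happens to take."""
    return min(path_mtu(hops) for hops in routes)


# Hypothetical MTUs, in bytes, for each hop on two routes between A and B.
backbone_only = [9000, 9000, 9000]      # every hop supports jumbo frames
via_legacy_router = [9000, 1500, 9000]  # one hop is limited to the standard MTU

print(path_mtu(backbone_only))
print(path_mtu(via_legacy_router))
print(safe_frame_size([backbone_only, via_legacy_router]))
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;A single standard-MTU hop on any possible route pulls the safe frame size back to 1500 bytes, which is why jumbo frames are practical mainly on dedicated segments where every hop is known.&lt;/P&gt;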
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;TCP Offload Engine&lt;/I&gt;. A TCP Offload Engine (or TOE) is another solution that has been implemented to reduce the host processing required for TCP/IP interrupts. As the name suggests, in this approach, certain TCP/IP protocol functions are performed directly on the NIC, either reducing the amount of processing that must be performed in the host machine, or eliminating host interrupts associated with certain TCP/IP housekeeping operations entirely. Areas where significant performance gains are experienced with TOE include the elimination of expensive memory copy operations, offloading segmentation and reassembly (a function of the IP layer), and offloading some of the TCP housekeeping functions that ensure reliable connections (mainly, ACK processing and TCP retransmission timers). Moving these functions onto the NIC results in a reduction of the total number of interrupts that need to be processed by the host machine. Potential performance benefits associated with TOE are quantified here: “&lt;A href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf"&gt;&lt;FONT color=#0000ff&gt;Boosting Data Transfer with TCP Offload Engine Technology&lt;/FONT&gt;&lt;/A&gt;.” You can see that the potential CPU savings are considerable. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;TCP Offload Engine, however, is a grievous violation of the layered architecture of the networking protocol stack. The TCP Chimney Offload feature that provides TOE support in Windows, for example, required an extensive re-architecture of the TCP/IP stack. TOE introduces many breaking changes. See the KnowledgeBase article entitled “&lt;A href="http://support.microsoft.com/kb/951037"&gt;&lt;FONT color=#0000ff&gt;Information about the TCP Chimney Offload feature in Windows Server 2008&lt;/FONT&gt;&lt;/A&gt;,” which details the many limitations on which networking functions can and cannot safely be offloaded in which computing environments. For instance, any networked machine that enforces an IPsec-based security policy where it is necessary to inspect each individual packet cannot use TOE. Neither is TOE currently compatible with either server virtualization technology or common forms of clustering based on virtual IP addresses. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The modest benefits in many environments and the complexities introduced due to explicit violations of the layered model of the network protocol argue against a general TOE solution. Another strong criticism of the TOE approach is that it merely moves the bottleneck from the host processor to the NIC. As RSS penetrates the market for high-speed networking, I believe that interest in the TOE approach will wane. If you do have a processing bottleneck on the host machine as a result of high-speed networking, with an RSS solution, at least, the bottleneck is visible, and there are inexpensive mechanisms to help deal with it. A processing bottleneck on the NIC is opaque and resists any capacity solution other than to swap in a more expensive card, assuming one exists, and hope that the new one is significantly faster and more powerful than the old one. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Intel I/OAT&lt;/I&gt;. Intel’s I/OAT introduces memory architecture improvements that give the NIC access to a dedicated DMA (direct memory access) engine for copying data between host memory and the NIC. These architectural changes are known as the Intel QuickData Technology DMA subsystem. With both interrupt moderation and the TCP Offload of IP segmentation and re-assembly, the processor tends to receive fewer interrupts to process, but each interrupt results in larger amounts of data that need to be processed by the host. The networking protocol stack services the initial interrupt from the NIC and examines the Receive data block while it is running in kernel mode. Most data blocks associated with networking I/O subsequently need to be copied into the networking application’s private address space. A performance analysis showed that, especially with larger blocks of data, this second memory-to-memory copy operation was responsible for a very large portion of the host processor load. An Intel white paper &lt;A href="http://www.intel.com/technology/ioacceleration/317106.pdf"&gt;&lt;FONT color=#0000ff&gt;here&lt;/FONT&gt;&lt;/A&gt; describes this analysis in some detail. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Ultimately, the result of this performance analysis was the set of I/OAT architectural improvements that permit this second memory-to-memory operation to be performed by a DMA provider engine (located on the Northbridge chip set currently) that requires no additional host processor bandwidth. The memory copy operation occupies the memory controller, but does not consume Front Side Bus bandwidth, which also frees up the host processor to perform other CPU tasks. Interestingly, Windows support for this technology, described in the &lt;A href="http://msdn.microsoft.com/en-us/library/cc264906.aspx"&gt;&lt;FONT color=#0000ff&gt;Driver Development Kit (DDK) documentation&lt;/FONT&gt;&lt;/A&gt;, is actually very general, but to date the only netDMA client available is the tcpip.sys kernel mode driver that processes networking interrupts. It ought to be possible for disk I/O controllers to also exploit I/OAT architectural improvements sometime in the future. However, data blocks associated with disk I/O, which are cached by default in the system address space in Windows, are not necessarily subject to multiple copy operations, depending on the cache interface used.&lt;/P&gt;
&lt;P style="MARGIN: 10pt 0in 0pt" mce_keep="true"&gt;&lt;A class="" title="Mainstream NUMA and TCP/IP: Part IV" href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx"&gt;Continue to Part IV&lt;/A&gt;.&lt;/P&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8835243" width="1" height="1"&gt;</description></item><item><title>Mainstream NUMA &amp; the TCP/IP stack: Part 2: Programming ccNUMA machines</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx</link><pubDate>Sun, 27 Jul 2008 21:02:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8780016</guid><dc:creator>Mark B Friedman</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8780016</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx#comments</comments><description>&lt;P&gt;This is a continuation of Part I of this article posted &lt;A class="" title="Link-back to Part 1" href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;here&lt;/A&gt;.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In Part 1 of this article, we looked at the capacity issues that are driving architectural changes in the TCP/IP networking stack. While network interfaces are increasing in throughput capacity, processor speeds in the multi-core era are not keeping pace. Meanwhile, the TCP/IP protocol has grown in complexity so that host processing requirements are increasing, too. The only way for networked computers to scale in the multi-core era is to begin distributing networking I/O operations across multiple processors. Since bigger server machines rely on NUMA architectures for scalability, high speed networking is also evolving to exploit machines with NUMA architectures in an optimal fashion.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Machines with NUMA (non-uniform memory access speeds) architectures are usually large scale multiprocessors that are assembled using building blocks, or &lt;I style="mso-bidi-font-style: normal"&gt;nodes&lt;/I&gt;, that each contain some number of CPUs, some amount of RAM, and various other peripheral connections. Nodes are often configured on separate boards, for example, or specific segments of a board. Multiple nodes are then interconnected with high speed links of some sort that permit all the memory that is configured to be available to executing programs. There are many schools of thought on what the best interconnection technology is. Some manufacturers favor tree structures, some favor directory schemes, some favor network-like routing. A key feature of the architecture is that the latency of a memory fetch depends on the physical location of the RAM being accessed. Accessing RAM attached to the local node is faster than a memory fetch to a remote location that is physically located on another node. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Within &lt;A class="" title="Nehalem Hyperlink1" href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&amp;amp;mode=print" mce_href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&amp;amp;mode=print"&gt;one of the new Intel Nehalem many-core microprocessors&lt;/A&gt;, for example, all the processor cores and their logical processors can access local memory at a uniform speed. Figure 3 is a schematic diagram depicting a 4-way Nehalem multiprocessor chip that is connected to a bank of RAM. The configuration of processors and RAM shown in Figure 3 is a building block that is used in creating a larger scale machine by connecting two or more such nodes together. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Quad-core node" style="WIDTH: 362px; HEIGHT: 520px" height=520 alt="Quad-core node" src="http://5l3vgw.bay.livefilestore.com/y1pysaX_fyaHyL_hZhhGIyXP5RhSKILXbj8AXnupeLec_hHtxoKXb6Z48TZYahS02yXpSrpH6b9-mY/Quad-core%20Node%20Drawing.jpg" width=362 mce_src="http://5l3vgw.bay.livefilestore.com/y1pysaX_fyaHyL_hZhhGIyXP5RhSKILXbj8AXnupeLec_hHtxoKXb6Z48TZYahS02yXpSrpH6b9-mY/Quad-core%20Node%20Drawing.jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 3.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;A schematic diagram depicting a NUMA node showing locally-attached RAM and a multi-core socket.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;A two-node NUMA server is illustrated in Figure 4, which shows a direct connection between the memory controller on node A and the memory controller on node B. This is the relatively simple case. A thread executing on node A can access any RAM location on either node, but an access to a local memory address is considerably faster. The latency of an access to a remote memory location is several times higher. (Definitive timings are not available as of this writing because early versions of the hardware are just starting to become available.)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;A class="" title="Two-node NUMA server based on Nehalem" href="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg" mce_href="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg"&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;&lt;IMG title="Two-node NUMA server based on Nehalem" style="WIDTH: 407px; HEIGHT: 1072px" height=1072 alt="Two-node NUMA server based on Nehalem" src="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg" width=407 mce_src="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/B&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;A two-node NUMA server showing a cross-node link that is used when a thread on one node needs to access a remote memory location.&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;As the number of nodes increases, it is no longer feasible for every node to be directly connected to every other node, nor can each bank of RAM that is installed be accessed in a single hop. The specific technology used to link nodes may introduce additional variation in the cost of accessing remote memory. From any one node, it could take longer to access memory on some nodes than others. For instance, some nodes may be accessed in a single hop across a direct link, while other accesses may require multiple hops. Some manufacturers favor routing through a shared directory service, for example. Your mileage may vary.&lt;/FONT&gt;&lt;/P&gt;
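&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To make the cost of remote accesses concrete, the following Python sketch computes the average latency of a memory access from the fraction of accesses served at each hop distance. The latency figures and the function name average_latency_ns are illustrative assumptions only; they are not measurements of any real machine.&lt;/FONT&gt;&lt;/P&gt;
&lt;pre&gt;
```python
import math

# Hypothetical per-access latencies in nanoseconds, keyed by hop count.
# These numbers are placeholders chosen for illustration only.
LATENCY_NS = {0: 60.0, 1: 180.0, 2: 300.0}


def average_latency_ns(access_mix):
    """Return the expected latency for a mix of hop distances.

    access_mix maps a hop count (0 = local node) to the fraction of
    memory accesses served at that distance; the fractions must sum to 1.
    """
    if not math.isclose(sum(access_mix.values()), 1.0):
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_NS[hops] * share for hops, share in access_mix.items())


if __name__ == "__main__":
    # Compare a fully local workload with one that goes remote 20% of the time.
    print("all local:    ", average_latency_ns({0: 1.0}), "ns")
    print("20% one hop:  ", average_latency_ns({0: 0.8, 1: 0.2}), "ns")
    print("20% two hops: ", average_latency_ns({0: 0.8, 2: 0.2}), "ns")
```
&lt;/pre&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Even with made-up numbers, the sketch shows why a modest share of remote references can dominate the average cost of a memory access.&lt;/FONT&gt;&lt;/P&gt;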
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Specifically, in the Intel architecture, manufacturers are supplying a cache coherent flavor of NUMA servers (ccNUMA). Cache coherence is implemented using a snooping protocol to ensure that threads executing on each NUMA node have access to the most current copy of the contents of the distributed memory. Details of the snooping protocol used in Intel ccNUMA machines are discussed &lt;/FONT&gt;&lt;A href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print"&gt;&lt;FONT face=Calibri color=#0000ff&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;AMD has taken a somewhat different tack in building its multi-core processors. For communication on chip between processors, AMD uses a technology known as HyperTransport, which is a dedicated, per-processor 2-way high speed link. Multiple processor cores are then linked on the chip in a ring topology as depicted in Figure 5. The ring topology has the effect of scaling the bus bandwidth that is used as an interconnect linearly with the number of processors. But the architecture leads to NUMA characteristics. A thread executing on CPU 0 can access a local memory location, a remote memory location that is local to CPU 1 at the cost of one hop across the HT link, or a remote memory location that is local to CPU 2 at the cost of two hops across HT links.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="AMD multi-core socket" style="WIDTH: 435px; HEIGHT: 435px" height=435 alt="AMD multi-core socket" src="http://5l3vgw.bay.livefilestore.com/y1pt8apQ0QRaEO0kR9KRDE29WNelvL0WCkG3i6aQTMLuL52t-DmDG1bUcWKUlO_qNaHWOaCGRePA_w/AMD%20multicore%20socket.jpg" width=435 mce_src="http://5l3vgw.bay.livefilestore.com/y1pt8apQ0QRaEO0kR9KRDE29WNelvL0WCkG3i6aQTMLuL52t-DmDG1bUcWKUlO_qNaHWOaCGRePA_w/AMD%20multicore%20socket.jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 5.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;The AMD approach to multi-core processors has NUMA characteristics. A program executing on CPU 0 that accesses RAM that is local to CPU 2 requires two hops across the HyperTransport links that connect the processors in a ring.&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
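&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The hop counts described above can be sketched in a few lines of Python. This is a simplified model that assumes four CPUs on a bidirectional ring and shortest-path routing; the function name ring_hops and the ring size are assumptions made for illustration, and real hardware routing may differ.&lt;/FONT&gt;&lt;/P&gt;
&lt;pre&gt;
```python
def ring_hops(src, dst, num_cpus):
    """Return the fewest link hops between two CPUs on a bidirectional ring."""
    clockwise = (dst - src) % num_cpus
    counter_clockwise = (src - dst) % num_cpus
    return min(clockwise, counter_clockwise)


if __name__ == "__main__":
    NUM_CPUS = 4  # assumed ring size, chosen for illustration
    for dst in range(NUM_CPUS):
        print(f"CPU 0 to memory local to CPU {dst}: "
              f"{ring_hops(0, dst, NUM_CPUS)} hop(s)")
```
&lt;/pre&gt;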
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Historically, application development for NUMA machines meant understanding the performance costs associated with accessing remote memory on a specific hardware platform. Since manufacturers employ different proprietary interconnection schemes in their multi-tiered NUMA machines, application developers are challenged to find the right balance in exploiting a specific proprietary architecture that may then limit the ability to port the application to a different platform in the future. It may be possible to connect nodes in a NUMA machine in an asymmetric configuration, for example, where the performance cost function associated with accessing different memory locations is decidedly irregular.&lt;/FONT&gt;&lt;/P&gt;
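&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;One way to picture an irregular cost function is as a table of relative access costs between nodes. The Python sketch below uses an invented four-node table (the values, the name COST, and the helper nodes_by_cost are all hypothetical) and ranks the nodes from which a thread on a given home node would prefer to obtain memory.&lt;/FONT&gt;&lt;/P&gt;
&lt;pre&gt;
```python
# Hypothetical relative access costs: COST[a][b] is the cost for a thread
# running on node a to read memory attached to node b (10 = local access).
# The values are invented to show an irregular topology.
COST = [
    [10, 16, 22, 22],
    [16, 10, 16, 28],
    [22, 16, 10, 16],
    [22, 28, 16, 10],
]


def nodes_by_cost(home_node):
    """Return node indices ordered from cheapest to most expensive."""
    row = COST[home_node]
    # sorted() is stable, so equally priced nodes keep their index order.
    return sorted(range(len(row)), key=lambda node: row[node])


if __name__ == "__main__":
    for home in range(len(COST)):
        print(f"node {home} prefers memory on nodes {nodes_by_cost(home)}")
```
&lt;/pre&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;An application tuned against one such table would need to be retuned for a machine with a different table, which is the portability concern raised above.&lt;/FONT&gt;&lt;/P&gt;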
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To scale well, a multi-threaded program running on a NUMA machine needs to be aware of the machine environment and understand which memory references are local to the node and which are remote. A thread that was running on one NUMA node and then migrates to another node pays a heavy price every time it has to fetch results from remote memory locations. The difficulty programmers face when trying to develop a scalable, multi-threaded application for a NUMA architecture machine is understanding their memory usage pattern and how it maps to the NUMA topology. When NUMA considerations were confined to expensive, high-end supercomputers, the inherent complexities developers faced in programming them were considered relatively esoteric concerns. However, in the era of many-core processors, NUMA is poised to become a mainstream architecture. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In theory, it is possible to craft an optimal solution when threads and the memory they access are &lt;I style="mso-bidi-font-style: normal"&gt;balanced&lt;/I&gt; across NUMA processing nodes. In order to achieve an optimal balancing of the machine's resources without overloading any of them, programs need to understand the CPU and memory resources that individual tasks executing in parallel require and understand how to best map those resources to the topology of the machine. Then they require a suitable scheduling mechanism to achieve the desired result. As a practical matter, achieving an optimal balance is not easy in the face of variability in the resources required by any of the execution threads, a complication that may then require dynamic adjustments to the scheduling policy in effect. &lt;/FONT&gt;&lt;/P&gt;
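&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;As a rough illustration of balancing, the Python sketch below places tasks on nodes with a simple greedy heuristic: heaviest task first, always onto the currently least-loaded node. It is a static simplification that assumes each task's CPU demand is known in advance and that its memory is allocated on the node where it runs; the function name place_tasks and the demand figures are hypothetical. It is not how any particular scheduler works.&lt;/FONT&gt;&lt;/P&gt;
&lt;pre&gt;
```python
def place_tasks(tasks, num_nodes):
    """Greedily assign tasks to NUMA nodes, heaviest task first.

    tasks maps a task name to its estimated CPU demand (arbitrary units).
    Returns a dict mapping each task name to a node index.
    """
    load = [0.0] * num_nodes
    placement = {}
    for name in sorted(tasks, key=tasks.get, reverse=True):
        # Pick the node with the smallest load so far (lowest index on ties).
        node = min(range(num_nodes), key=load.__getitem__)
        placement[name] = node
        load[node] += tasks[name]
    return placement


if __name__ == "__main__":
    demands = {"parse": 4.0, "index": 3.0, "compress": 2.0, "log": 1.0}
    for task, node in place_tasks(demands, num_nodes=2).items():
        print(f"{task} runs on node {node}")
```
&lt;/pre&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Because the demands are fixed up front, a placement like this would have to be recomputed whenever the actual resource usage of the threads drifts from the estimates.&lt;/FONT&gt;&lt;/P&gt;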
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The Windows OS is already NUMA-aware to a degree and, thus, supports a NUMA programming model. For example, once dispatched, threads have node affinity and tend to stay dispatched on an available processor within a node. Windows OS memory management is also NUMA-aware, maintaining per node allocation pools. The OS not only resists migrating threads to another node, it also tries to ensure that most memory allocations are satisfied locally using per node memory management data structures. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Windows also provides a number of NUMA-oriented APIs that applications can use to keep their threads from migrating off-node and also enable them to direct memory allocations to a specific physical processing node. For more information on the NUMA support in Windows, see the MSDN Help topic “&lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/aa363804.aspx"&gt;&lt;FONT face=Calibri color=#0000ff&gt;NUMA Support&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;.” &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To help application developers deal better with the complexities of NUMA architectures in the future, the Windows NUMA support needs to evolve. One potential approach would be for the OS to attempt to calculate a performance cost function at start-up that it would then expose to driver and application programs when they start up and run. Conceivably, the OS might also need to adjust this performance cost function to respond to configuration changes that occur dynamically, such as any power management event that affects memory latency. These changes would then have to be communicated to NUMA-aware drivers and applications somehow so they could adapt to changing conditions.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;A class="" title=Link-to-Part3 href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx"&gt;Continue to Part III of this article.&lt;/A&gt; &lt;/P&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8780016" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+Engineering/">Performance Engineering</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Scalability/">Scalability</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Parallel+programming/">Parallel programming</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category></item><item><title>Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx</link><pubDate>Thu, 19 Jun 2008 00:33:20 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8618468</guid><dc:creator>Jonathan Hardwick</dc:creator><slash:comments>8</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/ddperf/rsscomments.aspx?WeblogPostID=8618468</wfw:commentRss><comments>http://blogs.msdn.com/b/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#comments</comments><description>&lt;p&gt;This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. 
Along the way we discuss useful benchmarking tools, how to validate results, and why it pays to know exactly what hardware you're running on.&lt;/p&gt;  &lt;p&gt;This all started in our performance test lab. During the development of Visual Studio, each new build undergoes a suite of automated performance tests, running in a lab full of identical machines. These performance tests allow us to track Visual Studio's performance over time, and &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/05/20/visual-studio-performance-testing-noise-is-enemy-1.aspx"&gt;detect performance regressions&lt;/a&gt; (when something gets unexpectedly worse). We recently added a batch of new machines in our lab, and that's when the fun started.&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Pop Quiz: How Much Faster?&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;Old machine: dual-core Intel Pentium D 830 processor, running at 3 GHz, with 1 GB of RAM.&lt;/p&gt;  &lt;p&gt;New machine: quad-core Intel Xeon 5355 processor, running at 2.66 GHz, with 4 GB of RAM. &lt;/p&gt;  &lt;p&gt;Given the differences in the two hardware configurations above, how much faster would you expect the new machine to be when running a Visual Studio performance test? Lower than, same as, twice, three times or four times the performance of the older machine? &lt;/p&gt;  &lt;p&gt;One line of reasoning might look at the relative clock frequencies of the processors on the two machines. This might lead you to expect the newer processor cores to perform slower than the older cores, since their clock frequency is 11% lower. By this reasoning you might conclude that single-threaded applications would perform poorly on the new machine. &lt;/p&gt;  &lt;p&gt;Another line of reasoning would factor in the number of cores in the two systems. Since the new machine has twice the number of cores, you might expect it to have about twice the performance on multi-threaded applications. 
(If you also accounted for the lower clock frequency, you'd end up with a figure of 1.78 times the performance of the old machine.) &lt;/p&gt;  &lt;p&gt;A third approach might estimate the impact of RAM size. We’ve quadrupled the amount of RAM, so maybe any benchmarks that used to page to disk can now execute entirely in memory and hence will be orders of magnitude faster. [We'll cheat here and tell you that our benchmarks are generally not memory constrained]. &lt;/p&gt;  &lt;p&gt;So far, all these options seem plausible. What's your guess? &lt;/p&gt;  &lt;p&gt;What we naively expected to find lay somewhere between the first two lines of reasoning - that the new machines would be 1-2 times faster than the old machines, depending on the particular benchmark.&lt;/p&gt;  &lt;p&gt;What we actually found is that many of our single-threaded CPU-bound benchmarks run about &lt;strong&gt;twice as fast&lt;/strong&gt; on the new machine, while scalable multi-threaded benchmarks run up to &lt;strong&gt;four times as fast&lt;/strong&gt;. This was a pleasant surprise, because it significantly reduces the overall time to run all the benchmarks. But it did leave us wondering why we were getting much greater speedups than our naive explanations would suggest. The rest of this post explores that question.&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Using WinSAT and SPEC to Validate Benchmark Results&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;To make sure this wasn't a fluke result, we used the &lt;a href="http://msdn.microsoft.com/en-us/library/ms737378(VS.85).aspx"&gt;Windows System Assessment Tool&lt;/a&gt; (winsat.exe). This is a built-in tool that can quickly give a representative view of a machine's performance. It is multi-threaded, taking full advantage of all the cores on a machine. 
Here are the WinSAT CPU results: &lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Compression (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;70.5&lt;/td&gt;        &lt;td valign="top" width="95"&gt;262.0&lt;/td&gt;        &lt;td valign="top" width="47"&gt;3.7&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Encryption (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;52.3&lt;/td&gt;        &lt;td valign="top" width="95"&gt;139.3&lt;/td&gt;        &lt;td valign="top" width="47"&gt;2.7&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;We also wanted to validate our results against other real-world benchmarks. For this we turned to the &lt;a href="http://www.spec.org/"&gt;SPEC website&lt;/a&gt;. SPEC produces a series of benchmark suites, plus a very formal process that ensures results are reproducible and can fairly be applied across different manufacturers. More importantly for our purposes, SPEC posts all reported benchmark results on their web site. You won’t always be able to find your exact machine listed, but after using results from a tool like CPU-Z you can generally find results from a machine with the same CPU configuration and clock speed. &lt;/p&gt;  &lt;p&gt;We used the &amp;quot;CINT2006&amp;quot; benchmarks – this is a widely-used benchmark suite concentrating on integer performance. 
We compared results for both CINT2006, which is a good test of single-threaded performance, and CINT2006 Rate, which tests the ability of a system to execute multiple copies of CINT2006, and is therefore a better test of multi-threaded performance. For two representative machines that are similar to our old and new hardware, here are the results:&lt;/p&gt;  &lt;p&gt;&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CINT2006&lt;/td&gt;        &lt;td valign="top" width="89"&gt;9.85&lt;/td&gt;        &lt;td valign="top" width="95"&gt;15.5&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.6&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CINT2006 Rate&lt;/td&gt;        &lt;td valign="top" width="89"&gt;18.0&lt;/td&gt;        &lt;td valign="top" width="95"&gt;44.4&lt;/td&gt;        &lt;td valign="top" width="47"&gt;2.5&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;The WinSAT and SPEC results confirm that the new machines are much faster than our naive expectations, even for benchmarks such as CINT2006 that cannot take advantage of the extra cores. So what were we missing? &lt;/p&gt;  &lt;p&gt;&lt;b&gt;Using CPU-Z to Examine Machine Configurations&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;To answer this, we need a deeper understanding of the configurations of the two systems. &lt;/p&gt;  &lt;p&gt;Unfortunately, finding detailed configuration information isn't always straightforward. 
For example, we know that level two (L2) cache size impacts performance, but Windows doesn't report it, and it's not easy to reboot into the BIOS to take a look at cache size when the machine is located in a remote test lab. This is where machine reporting tools like &lt;a href="http://www.cpuid.com/cpuz.php"&gt;CPU-Z&lt;/a&gt; come in. You can run CPU-Z remotely on an unknown machine and get back a nicely formatted HTML report showing exactly what the hardware is. Here's a deeper look at our old and new systems:&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="408" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="155"&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="118"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="141"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;CPU name&lt;/td&gt;        &lt;td valign="top" width="118"&gt;Pentium D 830          &lt;br /&gt;(“Smithfield”)&lt;/td&gt;        &lt;td valign="top" width="141"&gt;Xeon X5355          &lt;br /&gt;(“Clovertown”)&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;CPU speed&lt;/td&gt;        &lt;td valign="top" width="118"&gt;3.00 GHz&lt;/td&gt;        &lt;td valign="top" width="141"&gt;2.66 GHz&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;Number of cores&lt;/td&gt;        &lt;td valign="top" width="118"&gt;2&lt;/td&gt;        &lt;td valign="top" width="141"&gt;4&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;L1 cache (per core)&lt;/td&gt;        &lt;td valign="top" width="118"&gt;16 KB&lt;/td&gt;        &lt;td valign="top" width="141"&gt;32 KB&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;L2 cache (total)&lt;/td&gt;        &lt;td valign="top" width="118"&gt;2 
MB&lt;/td&gt;        &lt;td valign="top" width="141"&gt;8 MB&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;System RAM&lt;/td&gt;        &lt;td valign="top" width="118"&gt;1 GB DDR2&lt;/td&gt;        &lt;td valign="top" width="141"&gt;4 GB DDR2&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;&lt;b&gt;Using BCDEdit to Disable Cores&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;Now we can try to tease out the relative impacts of the many changes from the old configuration to the new configuration. The first and easiest step is to disable two out of four cores on a new machine, to enable a fairer &amp;quot;apples to apples&amp;quot; comparison of cores between old and new machines.&lt;/p&gt;  &lt;p&gt;To do this we used the Windows BCDEdit tool, which replaces the old method of editing BOOT.INI by hand. Here we were particularly concerned with the order in which cores are disabled. This is important because the 8 MB of L2 cache in the Xeon “Clovertown” processors is divided: two of the four cores share 4 MB, and the other two cores share the other 4 MB. To keep our benchmark comparisons as fair as possible, we wanted to make sure that only one of the L2 caches was in use after disabling two cores. We used CPU-Z again after rebooting to confirm this.&lt;/p&gt;  &lt;p&gt;Now we were in a position to do a fairer “cores to cores” comparison between the old and new machines. 
Here's a summary from WinSAT: &lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New (2 cores)&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Compression (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;70.5&lt;/td&gt;        &lt;td valign="top" width="95"&gt;131.9&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.9&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Encryption (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;52.3&lt;/td&gt;        &lt;td valign="top" width="95"&gt;69.7&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.3&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;Memory Bandwidth (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;4,041&lt;/td&gt;        &lt;td valign="top" width="95"&gt;3,360&lt;/td&gt;        &lt;td valign="top" width="47"&gt;0.8&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;Now we can really see the advantage of the latest processors – on a core-for-core basis, they are 1.3-1.9x faster on the CPU-intensive WinSAT benchmarks, despite having lower clock frequencies.&lt;/p&gt;  &lt;p&gt;Good, now on to the next… wait a second. Look at that memory bandwidth result. Our new machines have &lt;i&gt;less&lt;/i&gt; memory bandwidth than the old machines? That doesn't look right: although memory performance hasn't been keeping pace with CPU speeds, it &lt;i&gt;has&lt;/i&gt; been improving over time. 
Compared to a three-year-old machine, we'd expect these new machines to have slightly better memory bandwidth, and definitely not worse. What gives?&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Memory Channels&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;A primary limiting factor to memory bandwidth is the number of memory channels that are in use. And this turns out to be the problem here: although the new machines have four memory channels and eight memory slots, only two of those slots are filled, because the vendor supplied us with two 2 GB memory modules per machine. This maximizes future expansion potential – we can take the machine up to 16 GB without throwing away any of our initial investment in memory. But in the meantime using two memory slots limits us to two memory channels in use. If instead we had four 1 GB memory modules we'd have four memory channels in use, improving memory interleaving from 2:1 to 4:1 and increasing memory bandwidth. To confirm this, we populated four memory slots on one of the new machines (going from 4 GB to 8 GB) and reran WinSAT:&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;2 channels&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;4 channels&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;Memory Bandwidth (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;3,360&lt;/td&gt;        &lt;td valign="top" width="95"&gt;4,134&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.2&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;&lt;b&gt;Conclusions&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;It's always possible to run more experiments to further isolate and explain 
benchmark results, but after a while you reach a point of diminishing returns. With the results we have so far, we can already draw some useful conclusions. &lt;/p&gt;  &lt;p&gt;The first conclusion is that our naive explanations greatly underestimated just how much better the newer processors are at executing real benchmarks, despite their slower clock speeds. The results from WinSAT and SPEC clearly show this, with core-to-core performance that is 1.3-1.9x faster on the new machines, depending on the benchmark. &lt;/p&gt;  &lt;p&gt;This is perhaps the most important lesson for developers to learn: clock speeds are no longer a good indicator of true performance. Although clock speeds have plateaued, processor designers continue to find ways to make each new generation significantly faster than the last. In our case, the old machines have Pentium D processors (“Smithfield”), while the new machines have Xeon 5-series processors (“Clovertown”).&amp;#160; And while the newer processors have slightly slower clock speeds, their micro-architecture executes more instructions per clock cycle. &lt;/p&gt;  &lt;p&gt;The second conclusion is that it's very hard to perform fair comparisons. The two machines have several configuration differences, including clock frequency, number of cores, core micro-architecture, cache sizes, bus speed, memory size and speed, and so on. We showed an example of isolating the effect of just one of these differences, the number of cores, using the BCDEdit tool. Isolating the effect of every single difference would require much more effort.&lt;/p&gt;  &lt;p&gt;Indeed, some of these differences are interrelated, and it is hard to change one without affecting another. For example, CPU architects make their micro-architecture design decisions based on cache sizes. Now imagine a hypothetical experiment that tried to isolate the effect of L2 cache size by giving each core just 1 MB of cache. 
This would be especially hard on the newer processors, which have been designed on the assumption that they have 2 MB of L2 cache per core&lt;a href="file://tkzaw-pro-13/#_ftn1_6097" name="_ftnref1_6097"&gt;[1]&lt;/a&gt;. In trying to perform a fairer comparison, we would have actually handicapped one system!&lt;/p&gt;  &lt;p&gt;Our final conclusion is that it truly pays to benchmark and compare systems. In our case, the simplest possible benchmark (WinSAT) showed an unexpected memory bandwidth loss, which we then traced back to a machine mis-configuration. So that was the final pleasant surprise: if we hadn't gotten curious about why the new machines were so much faster, we would never have found that they could be faster still!&lt;/p&gt;  &lt;p&gt;David Berg    &lt;br /&gt;Sunny Egbo     &lt;br /&gt;Jonathan Hardwick     &lt;br /&gt;Peter Okonski&lt;/p&gt;  &lt;hr align="left" width="33%" size="1" /&gt;  &lt;p&gt;&lt;a href="file://tkzaw-pro-13/#_ftnref1_6097" name="_ftn1_6097"&gt;[1]&lt;/a&gt; Because two cores share a single 4 MB L2 cache on the Clovertown processors, the exact size of the cache that is used by each core is not fixed at 2 MB per core; the use will vary during program execution. Cache hungry threads might get more of the cache, while less cache hungry threads get less. Even when two cache hungry threads run on the two cores, their memory hotspots are asynchronous; thus, the net effect is that each thread gets more of the cache when they need it and less when they don’t need it.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8618468" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance/">Performance</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Visual+Studio/">Visual Studio</category><category domain="http://blogs.msdn.com/b/ddperf/archive/tags/Performance+testing/">Performance testing</category></item></channel></rss>