<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Developer Division Performance Engineering blog</title><link>http://blogs.msdn.com/ddperf/default.aspx</link><description>News and commentary on developing scalable Windows applications (with Visual Studio)</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Improving the Start-up Performance of the WPF and Silverlight Designer in Visual Studio 2010 Beta 2</title><link>http://blogs.msdn.com/ddperf/archive/2009/11/02/improving-the-start-up-performance-of-the-wpf-and-silverlight-designer-in-visual-studio-2010-beta-2.aspx</link><pubDate>Mon, 02 Nov 2009 18:34:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9916292</guid><dc:creator>David Berg</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9916292.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9916292</wfw:commentRss><description>&lt;P&gt;I wanted to let you know about a last minute change that went into VS 2010 Beta 2 that you can use to improve the startup performance for the WPF and Silverlight Designer.&amp;nbsp; The change went in late and it was a little risky, so we decided to leave it off until we had a chance to do some more testing with it.&amp;nbsp; You can turn the change on yourself via a registry key.&amp;nbsp; We expect the change will be on all the time in the final product, so changing the registry key is strictly a Beta 2 issue.&lt;/P&gt;
&lt;P&gt;You can read more about it here: &lt;A href="http://social.msdn.microsoft.com/Forums/en-US/vswpfdesigner/thread/4511d43f-c134-4329-a970-e374252a620e"&gt;http://social.msdn.microsoft.com/Forums/en-US/vswpfdesigner/thread/4511d43f-c134-4329-a970-e374252a620e&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If you have any feedback on how this works (or doesn't work) for you, please let me know.&amp;nbsp; You can contact me at &lt;A href="mailto:DevPerf@Microsoft.com"&gt;DevPerf@Microsoft.com&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Dave&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9916292" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Beta/default.aspx">Beta</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio+2010/default.aspx">Visual Studio 2010</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/WPF/default.aspx">WPF</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Sliverlight/default.aspx">Sliverlight</category></item><item><title>VS2010 Performance and Bad Video Drivers/Hardware</title><link>http://blogs.msdn.com/ddperf/archive/2009/10/29/vs2010-performance-and-bad-video-drivers-hardware.aspx</link><pubDate>Fri, 30 Oct 2009 05:49:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9915126</guid><dc:creator>David Berg</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9915126.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9915126</wfw:commentRss><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;We’ve received a few performance complaints around Visual Studio 2010 (Beta 2) performance that can be traced to old video drivers or GPU virtualization issues.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;If you’re seeing slow or broken screen updates, verify that you have the latest drivers for your system. If this doesn’t resolve your rendering issues, you may be able to work around the problem by forcing software emulation mode with a single registry key:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt 0.5in"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt 0.5in"&gt;&lt;FONT face=Calibri size=3&gt;"DisableHWAcceleration"=dword:00000001&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;As you can probably guess, this can be undone with:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt 0.5in"&gt;&lt;FONT face=Calibri size=3&gt;[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt 0.5in"&gt;&lt;FONT face=Calibri size=3&gt;"DisableHWAcceleration"=dword:00000000&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;If you have these issues, please, let me know by either posting here or preferably by e-mailing &lt;/FONT&gt;&lt;A href="mailto:DevPerf@Microsoft.com"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;DevPerf@Microsoft.com&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; with details about the problems you are/were seeing.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;(When you e-mail, please run DXDIAG first and attach the DXDIAG.TXT file.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;This will give us a lot of information about your system, including what drivers you’re running.)&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;We’re very interested in finding any hardware/software incompatibilities we might have missed and getting them cleaned up. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;If you do find you need to use software emulation, and then get new drivers or hardware later, don’t forget to switch software emulation back off so you can benefit from the improved performance.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Note that this will impact all WPF applications on the system, not just Visual Studio.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Regards,&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Dave Berg&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Developer Division Performance Team&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9915126" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Beta/default.aspx">Beta</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio+2010/default.aspx">Visual Studio 2010</category></item><item><title>Tell us about VS2010 Beta2</title><link>http://blogs.msdn.com/ddperf/archive/2009/10/29/tell-us-about-vs2010-beta2.aspx</link><pubDate>Thu, 29 Oct 2009 21:49:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9914979</guid><dc:creator>David Berg</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9914979.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9914979</wfw:commentRss><description>&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Last week we &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jasonz/archive/2009/10/19/announcing-vs2010-net-framework-beta-2.aspx"&gt;&lt;FONT face="Times New Roman" size=3&gt;shipped Beta 2&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face="Times New Roman" size=3&gt; for broad distribution.&amp;nbsp; Many of you have already sent us comments and improvement suggestions (thanks!)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;At this point we are down to our final set of bug fixing, perf tuning, etc.&amp;nbsp; We’re very interested in your feedback so we can take action on it before we ship the final version.&amp;nbsp; To help make it easy, you can &lt;/FONT&gt;&lt;A href="https://mscuillume.smdisp.net/Collector/Survey.ashx?Name=D10G1"&gt;&lt;FONT face="Times New Roman" size=3&gt;take this simple survey&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face="Times New Roman" size=3&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-ALIGN: center" align=center&gt;&lt;A href="https://mscuillume.smdisp.net/Collector/Survey.ashx?Name=D10G1"&gt;&lt;SPAN style="TEXT-DECORATION: none; text-underline: none"&gt;&lt;FONT face="Times New Roman" size=3&gt;&lt;IMG id=_x0000_i1025 title=image height=92 alt=image src="http://blogs.msdn.com/blogfiles/jasonz/WindowsLiveWriter/VS2010Beta2FeedbackSurvey_C49A/image_3.png" width=240 border=0&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;One thing in particular we are hearing a lot of feedback on is performance.&amp;nbsp; We are working hard on the next round of perf improvements.&amp;nbsp; You can supply your feedback through the survey.&amp;nbsp; When you give us your feedback, the more actionable you can make it the better.&amp;nbsp; We need to know what operations you are doing (like editing, debugging, etc), what kind of hardware you have (CPU, RAM, disk), and your hosting scenario (main machine, running in VM, terminal server, etc).&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Thanks in advance for your feedback!&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;Dave&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;P.S. If you'd like to talk to me directly about your performance experience, you can reply here or e-mail us at &lt;A href="mailto:DevPerf@Microsoft.com"&gt;DevPerf@Microsoft.com&lt;/A&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=3&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9914979" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio+2010/default.aspx">Visual Studio 2010</category></item><item><title>Parallel Scalability Isn’t Child’s Play, Part 3: The Problem with Fine-Grained Parallelism</title><link>http://blogs.msdn.com/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx</link><pubDate>Tue, 09 Jun 2009 23:17:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9718379</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9718379.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9718379</wfw:commentRss><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;A title="Part 2 in this series" href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx"&gt;&lt;FONT size=3&gt;In the last blog entry in this series&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;, I introduced the model for parallel program scalability proposed by Neil Gunther, which I praised for being a realistic antidote to more optimistic, but better known, formulas. Gunther’s model adds a new parameter to the more familiar Amdahl’s law. 
The additional parameter&lt;I style="mso-bidi-font-style: normal"&gt; k&lt;/I&gt;, representing &lt;I style="mso-bidi-font-style: normal"&gt;coherence&lt;/I&gt;-related delays, enables Gunther’s formula to model behavior where the performance of a parallel program can actually degrade at higher and higher levels of parallelization.&amp;nbsp;&lt;/FONT&gt;&lt;FONT size=3&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;Although I don’t know that the coherence delay factor that Gunther’s formula adds fully addresses the range and depth of the performance issues surrounding fine-grained parallelism, it is certainly one of the key factors Gunther’s law expresses that earlier formulations do not.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;
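&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;For reference, the two formulas can be written side by side. This is only a sketch in the commonly cited form of Gunther’s model: the symbols N (the number of processors or threads) and sigma (the serial, contention-related fraction) are notation assumed here, while k is the coherence parameter discussed above.&lt;/FONT&gt;&lt;/P&gt;

```latex
% Amdahl's law: speedup on N processors with serial fraction \sigma
S_{\mathrm{Amdahl}}(N) = \frac{N}{1 + \sigma (N - 1)}

% Gunther's model: the extra term k N (N - 1) models coherence-related delay
S_{\mathrm{Gunther}}(N) = \frac{N}{1 + \sigma (N - 1) + k N (N - 1)}
```

&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;When k is zero, the second formula reduces to Amdahl’s law. When k is positive, the denominator grows quadratically in N, so the predicted speedup peaks near N = sqrt((1 - sigma) / k) and then declines, which is the degradation at higher levels of parallelization described above.&lt;/FONT&gt;&lt;/P&gt;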
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Developers experienced in building parallel programs recognize that Gunther’s formula echoes an inconvenient truth, namely, that the task of achieving performance gains using parallel programming techniques is often quite arduous. For example, in a recent blog entry entitled “&lt;/FONT&gt;&lt;A href="http://software.intel.com/en-us/articles/when-to-say-no-to-parallelism/"&gt;&lt;FONT size=3 face=Calibri&gt;When to Say No to Parallelism&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;,” Sanjiv Shah, a colleague at Intel, expressed similar sentiments. One very good piece of advice Sanjiv gives is that you should not even be thinking about parallelism until you have an efficient single-threaded version of your program debugged and running. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Let’s continue, for a moment, in the same vein as “&lt;/FONT&gt;&lt;A href="http://software.intel.com/en-us/articles/when-to-say-no-to-parallelism/"&gt;&lt;FONT size=3 face=Calibri&gt;When to Say No to Parallelism&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.” Let’s look at the major sources of coherence-related delays in various kinds of parallel programs, how and why they occur, and what, if anything, can be done about them. Ultimately, I will try to tie this discussion into one about tools, especially some great new tools in Visual Studio Team System (see &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Hazim Shafi’s blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; for details) that help you understand contention in your multi-threaded apps. When you use these new tools to gather and analyze the thread contention data for your app, it helps when you understand some of the common patterns to look for. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The first aspect of Gunther’s coherence delays that we will look at is the bare &lt;I style="mso-bidi-font-style: normal"&gt;minimum&lt;/I&gt; additional cost that a parallel program running multiple threads must pay, compared to running the same program single threaded. To simplify matters, we will look at the best possible prospect for parallel programming, an algorithm that is both &lt;I style="mso-bidi-font-style: normal"&gt;embarrassingly parallel&lt;/I&gt; and easy to partition into roughly equal sized subprogram chunks. There are two basic costs to consider: one that is paid up front for initialization of the parallel runtime environment, and one that is paid incrementally each time one of the parallel tasks executes. It is also worth noting that these are unavoidable costs. I will lump both costs into an &lt;I style="mso-bidi-font-style: normal"&gt;overhead&lt;/I&gt; category associated with Gunther’s coherence delay factor &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt;. The embarrassingly parallel programs we will consider here will incur these minimum processor overhead penalties when they are transformed to execute in parallel.&lt;/FONT&gt;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;Fine-grained parallelism&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To frame this part of the discussion, let’s also consider how workloads are characterized, according to their amenability to parallelization, as either &lt;I style="mso-bidi-font-style: normal"&gt;fine-grained&lt;/I&gt; or &lt;I style="mso-bidi-font-style: normal"&gt;coarse-grained&lt;/I&gt;. This distinction implicitly recognizes the impact of coherency delay factors on scalability. With fine-grained parallelism, the overhead of setting up the parallel runtime and executing the tasks in parallel can easily exceed the benefits. By definition, the initialization overhead of spinning up multiple threads and dispatching them is not nearly so big an issue when the program lends itself to coarse-grained parallelism. Plus, when you are looking at a very long running process with many opportunities to exploit parallelism, it is important to understand that you should only have to incur the setup cost once. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Coarse-grained parallelism occurs when each parallel worker thread is assigned to computing a relatively long running function:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 338px; HEIGHT: 419px" title="coarse-grained parallelism" alt="coarse-grained parallelism" src="http://5l3vgw.bay.livefilestore.com/y1pa53nIchv7Sk5zdNifva6i3kLD1Eu2MsrdAIOayN3r2OUL9XBpRy72kD_wGhKM1uYeb9TjLCGfyiz9E3hg5bqtg/Coarse-grained%20parallelism%20Fork-Join%20flowchart.jpg" width=338 height=419 mce_src="http://5l3vgw.bay.livefilestore.com/y1pa53nIchv7Sk5zdNifva6i3kLD1Eu2MsrdAIOayN3r2OUL9XBpRy72kD_wGhKM1uYeb9TjLCGfyiz9E3hg5bqtg/Coarse-grained%20parallelism%20Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;STRONG&gt;Figure 5. Coarse-grained parallelism.&lt;/STRONG&gt;&lt;/P&gt;&lt;?xml:namespace prefix = o /&gt;&lt;o:wrapblock&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;while fine-grained parallelism looks more like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 338px; HEIGHT: 300px" title="fine-grained parallelism" alt="fine-grained parallelism" src="http://5l3vgw.bay.livefilestore.com/y1pdnte4FSJ18igb7DEZZ2veO2esTg2BykZz6FavZ-37bbyLejUIQlFBc4mboJUbzN3jUp3uq-FV8MA2RGunfMe8Q/Fine-grained%20parallelism%20Fork-Join%20flowchart.jpg" width=338 height=300 mce_src="http://5l3vgw.bay.livefilestore.com/y1pdnte4FSJ18igb7DEZZ2veO2esTg2BykZz6FavZ-37bbyLejUIQlFBc4mboJUbzN3jUp3uq-FV8MA2RGunfMe8Q/Fine-grained%20parallelism%20Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;Figure 6. Fine-grained parallelism.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The difference, of course, lies in how long, relatively speaking, the worker thread processing the unit of work executes.&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The rationale for the fine-grained/coarse-grained distinction is its significance to performance. We identify those parallel algorithms that execute in the worker thread long enough to recover the costs of setting up and running the parallel environment as coarse-grained. The benefits of running such programs in parallel are much easier to realize. On the other hand, the one foolproof way to identify fine-grained parallelism is to find embarrassingly parallel programs with very short execution time spans for each parallel task. When executing fine-grained parallel programs, there is a very high risk of slowing down the performance of the program, instead of improving it. (If this sounds like a bit of circular reasoning, it most surely is.) &lt;/FONT&gt;&lt;/P&gt;
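&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;A rough break-even condition makes the distinction concrete. The sketch below is a simplified model, not a measurement. It assumes n equal-sized tasks of duration t, p worker threads (with p greater than 1), a one-time setup cost T_setup, and a fixed per-task dispatch overhead d, and it ignores any contention between the workers. All of these symbols are introduced here purely for illustration.&lt;/FONT&gt;&lt;/P&gt;

```latex
% Serial time versus idealized parallel time
T_{\mathrm{serial}} = n\,t
\qquad
T_{\mathrm{parallel}} = T_{\mathrm{setup}} + \frac{n}{p}\,(t + d)

% Parallel execution wins only when T_serial > T_parallel, that is, when
t > \frac{p\,T_{\mathrm{setup}}/n + d}{p - 1}
```

&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;When t falls below that threshold, the workload is fine-grained in the sense used here, and under this model running it in parallel makes it slower rather than faster.&lt;/FONT&gt;&lt;/P&gt;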
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Now let’s drill into these costs. In the .NET Framework, setting up a parallel execution environment is usually associated with the ThreadPool object. (If you are not very familiar with how to set up and use a ThreadPool in .NET, you might want to read &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/0ka9477y.aspx"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;this&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; bit of documentation first, which shows some simple C# examples. If you really want to understand all the ins and outs of the .NET ThreadPool, you should consider picking up a copy of Joe Duffy’s very thorough and authoritative book, &lt;/FONT&gt;&lt;A href="http://www.amazon.com/Concurrent-Programming-Windows-Microsoft-Development/dp/032143482X/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241140965&amp;amp;sr=1-1"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Concurrent Programming on Windows&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.) &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;BR style="mso-ignore: vglayout" clear=all&gt;&lt;FONT size=3 face=Calibri&gt;The thing about using the ThreadPool object in .NET is that you don’t need to write a lot of code on your own to get it up and running. With very little coding effort, you can be running a parallel program. In the .NET Framework, there are newer programming constructs in the &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/concurrency/default.aspx"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Task Parallel Library&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; that are designed to make it easier for developers to express parallelism and exploit multi-core and many-core computers. But underneath the covers, many of the new TPL constructs are still using the CLR ThreadPool. So whatever overheads are associated with this older, less elegant approach still apply to any of the newer parallel programming constructs.&lt;/FONT&gt;&lt;/P&gt;
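&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To give a feel for how little code the newer constructs require, here is a minimal sketch using Parallel.For from the Task Parallel Library. The class name ParallelForSketch and the Square helper are hypothetical, made up for this illustration. The loop body is deliberately tiny, so this is exactly the kind of fine-grained work where the overhead of running in parallel can outweigh the benefit.&lt;/FONT&gt;&lt;/P&gt;

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelForSketch
{
    // Hypothetical unit of work, chosen only for illustration.
    static long Square(int i) { return (long)i * i; }

    public static long SumOfSquares(int count)
    {
        long total = 0;
        // Parallel.For splits the range [0, count) into chunks that, by default,
        // are scheduled onto ThreadPool threads (the calling thread may also take part).
        Parallel.For(0, count, i =>
        {
            // Each iteration does very little work, so synchronizing on the
            // shared total is a large share of the cost.
            Interlocked.Add(ref total, Square(i));
        });
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(1000)); // prints 332833500
    }
}
```

&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The result is the same as a plain serial loop would produce. Whether it arrives any faster depends on the per-iteration cost, which is the subject of the rest of this discussion.&lt;/FONT&gt;&lt;/P&gt;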
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;The ThreadPool in .NET&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The basic pattern for using the ThreadPool is to call the QueueUserWorkItem() method on the ThreadPool class, passing a delegate that performs the actual processing of the request, along with a set of parameters that are wrapped into a single Object. The parameters delineate the unit of work that is being requested. Typically, you also pass to the delegate a &lt;A title="ManualResetEvent reference" href="http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent.aspx" mce_href="http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent.aspx"&gt;ManualResetEvent&lt;/A&gt;, which is known as a &lt;EM&gt;synchronization primitive&lt;/EM&gt;. This event is used by the delegate to signal the Main task that the worker thread is finished processing the Work Item request.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Structurally, you have to write:&lt;/FONT&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a class that wraps the Work Item parameter list,&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a C# delegate that runs in the worker thread to process the Work Item request,&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;a dispatcher routine that queues work items for the thread pool delegate to process, and&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;finally, an event handler to get control when the worker thread completes.&lt;/FONT&gt;&lt;/DIV&gt;&lt;/LI&gt;&lt;/OL&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;For example, to implement the Fork/Join pattern in .NET using the built-in ThreadPool Object, create (1) a wrapper for the parameter list:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: blue"&gt;class&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;private&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; _thisevent;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; thisevent&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt 0.5in; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;get&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; { &lt;SPAN style="COLOR: blue"&gt;return&lt;/SPAN&gt; _thisevent; }&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="COLOR: blue"&gt;public&lt;/SPAN&gt; WorkerThreadParms(&lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; signalwhendone, …,)&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-tab-count: 1"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-tab-count: 1"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;_thisevent = signalwhendone;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-tab-count: 2"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: auto auto 0pt; mso-add-space: auto; mso-layout-grid-align: none" class=MsoNormalCxSpMiddle&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;o:p&gt;&amp;nbsp;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;(2) the ThreadPool delegate that unwraps the parameter list, performs some work, then signals the main thread when it is done:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;public&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; &lt;SPAN style="COLOR: blue"&gt;void&lt;/SPAN&gt; ThreadPoolDelegate(&lt;SPAN style="COLOR: #2b91af"&gt;Object&lt;/SPAN&gt; parm)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;WorkerThreadParms&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; p = (&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt;) parm;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt; signal = p.thisevent;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;… &lt;SPAN style="COLOR: #00b050"&gt;//Do some work here&lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;…&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: 0.5in; MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;signal.Set(); &lt;SPAN style="COLOR: #00b050"&gt;//Signal main task when done&lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;(3) a (simple) dispatcher loop:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;ManualResetEvent&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;[] thisevent = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt;[tasks];&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; COLOR: blue; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;for&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt; (&lt;SPAN style="COLOR: blue"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; tasks; j++)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;{&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; TEXT-INDENT: 0.5in; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;thisevent[j] = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;ManualResetEvent&lt;/SPAN&gt;(&lt;SPAN style="COLOR: blue"&gt;false&lt;/SPAN&gt;);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt; p = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThreadParms&lt;/SPAN&gt;(thisevent[j],…);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;WorkerThread&lt;/SPAN&gt; worker = &lt;SPAN style="COLOR: blue"&gt;new&lt;/SPAN&gt; &lt;SPAN style="COLOR: #2b91af"&gt;WorkerThread&lt;/SPAN&gt;();&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="COLOR: #2b91af"&gt;ThreadPool&lt;/SPAN&gt;.QueueUserWorkItem (worker.ThreadPoolDelegate,p);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0in 0pt; mso-layout-grid-align: none" class=MsoNormal&gt;&lt;SPAN style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;followed by (4) a WaitForMultipleObjects in the Main thread:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #2b91af; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;ManualResetEvent&lt;/SPAN&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt; mso-no-proof: yes"&gt;.WaitAll(thisevent);&lt;/SPAN&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-SIZE: 9pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;FONT size=3 face=Calibri&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;You can see there isn’t very much code for you to write to get up and running in parallel and start taking advantage of all those multi-core processor resources. You should take Sanjiv Khan’s advice and write &amp;amp; debug the delegate code you intend to parallelize by testing it in a single threaded mode first. Once the single threaded program is debugged and optimized, you can easily restructure the program to run in parallel by encapsulating that processing in your worker thread delegate following this simple recipe.&lt;/P&gt;
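&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To see the four pieces working together, here is a minimal, self-contained sketch of the recipe. The workload (summing a range of integers) and the Done, Start, Count, and Result members are hypothetical stand-ins for the parts elided with “…” above, not the original code. Keep in mind that WaitHandle.WaitAll accepts at most 64 handles, so this shape only suits a modest number of tasks.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt"&gt;
using System;
using System.Threading;

// Hypothetical parameter class: the event to signal plus an illustrative range to sum.
public class WorkerThreadParms
{
    public readonly ManualResetEvent Done;
    public readonly int Start;
    public readonly int Count;
    public long Result;   // written by the worker, read by the caller after WaitAll

    public WorkerThreadParms(ManualResetEvent signalWhenDone, int start, int count)
    {
        Done = signalWhenDone;
        Start = start;
        Count = count;
    }
}

public static class ThreadPoolRecipe
{
    // The ThreadPool delegate: unwrap the parameters, do the work, signal when done.
    public static void ThreadPoolDelegate(object parm)
    {
        WorkerThreadParms p = (WorkerThreadParms)parm;
        long sum = 0;
        for (int i = p.Start; i &amp;lt; p.Start + p.Count; i++) sum += i;
        p.Result = sum;
        p.Done.Set();
    }

    // The dispatcher loop followed by the wait on all of the events.
    public static long SumInParallel(int tasks, int itemsPerTask)
    {
        ManualResetEvent[] events = new ManualResetEvent[tasks];
        WorkerThreadParms[] parms = new WorkerThreadParms[tasks];
        for (int j = 0; j &amp;lt; tasks; j++)
        {
            events[j] = new ManualResetEvent(false);
            parms[j] = new WorkerThreadParms(events[j], j * itemsPerTask, itemsPerTask);
            ThreadPool.QueueUserWorkItem(ThreadPoolDelegate, parms[j]);
        }
        WaitHandle.WaitAll(events);

        long total = 0;
        foreach (WorkerThreadParms p in parms) total += p.Result;
        return total;
    }

    public static void Main()
    {
        // Four tasks of 1000 integers each: sums 0 through 3999.
        Console.WriteLine(SumInParallel(4, 1000));
    }
}
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Each worker writes only to its own parameter object, so the sketch needs no locking; the results are combined on the main thread after the wait completes.&lt;/FONT&gt;&lt;/P&gt;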
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Even though there isn’t very much code for you to write, there are overhead considerations that you need to be aware of. Let’s look at the simplest case where the program is embarrassingly parallel (as discussed in the previous blog entry). This allows us to ignore complications introduced by serialization and locking (which we will get to later). These overheads include &lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpFirst&gt;&lt;SPAN style="mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3 face=Calibri&gt;(1)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3 face=Calibri&gt;work done in the Common Language Runtime (CLR) on your behalf to spin up the worker threads in the ThreadPool initially, and &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 10pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraphCxSpLast&gt;&lt;SPAN style="mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3 face=Calibri&gt;(2)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3 face=Calibri&gt;the additional cost when running the parallel program associated with queuing work items, dispatching them to a thread pool thread to process, and signaling the main dispatcher thread when done. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In the case of the initialization work, this is something that only needs to be done once. The other costs accrue each time you need to dispatch a worker thread. For parallelism to be an effective scaling strategy, it is necessary to amortize this overhead cost over the life of the parallel threads. The worker thread delegates need to execute long enough that there is a benefit to executing in parallel. And this is for a best case for parallelism where the underlying program is both embarrassingly parallel and easy to partition into roughly equivalent work requests. &lt;/FONT&gt;&lt;/P&gt;
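&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;One way to get a feel for the second category of overhead is to measure it. The following is a rough sketch, not a rigorous benchmark: it queues work items that do nothing and reports the average elapsed time per item. It simplifies the recipe above by using a single shared counter and one event instead of one event per task, and the names DispatchOverhead and MeasureMicrosecondsPerItem are made up for this illustration.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt"&gt;
using System;
using System.Diagnostics;
using System.Threading;

public static class DispatchOverhead
{
    // Queues 'items' empty work items (items must be positive) and returns the
    // average elapsed microseconds per item, measured from the first queue call
    // until the last work item has signaled completion.
    public static double MeasureMicrosecondsPerItem(int items)
    {
        int remaining = items;
        using (ManualResetEvent allDone = new ManualResetEvent(false))
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i &amp;lt; items; i++)
            {
                ThreadPool.QueueUserWorkItem(delegate(object state)
                {
                    // No real work: only queuing, dispatch, and signaling are timed.
                    if (Interlocked.Decrement(ref remaining) == 0) allDone.Set();
                });
            }
            allDone.WaitOne();
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds * 1000.0 / items;
        }
    }

    public static void Main()
    {
        // Warm-up pass so the one-time cost of spinning up pool threads is excluded.
        MeasureMicrosecondsPerItem(1000);
        Console.WriteLine("{0:F2} microseconds per work item",
            MeasureMicrosecondsPerItem(100000));
    }
}
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;If the work inside your delegate takes no longer than the figure this prints on your machine, queuing it to the thread pool is unlikely to pay off.&lt;/FONT&gt;&lt;/P&gt;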
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT color=#4f81bd size=3 face=Cambria&gt;A Parallel.For example.&lt;/FONT&gt;&lt;/H3&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Using the new Parallel.For construct in the Task Parallel Library, by the way, there is even less code you have to write. All you need to write is the worker thread delegate, because the TPL handles the remaining boilerplate tasks. However, the underlying overhead considerations are almost identical.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The challenge of speeding up programs that have fine-grained parallelism grows in tandem with making it easier for developers to write concurrent programs. Take a look at the following example of parallelization in C# using the new Parallel.For construct taken verbatim from a Microsoft white paper entitled “&lt;A title="Taking Parallelism Mainstream" href="http://download.microsoft.com/download/D/5/9/D597F62A-0BEE-4CE7-965B-099D705CFAEE/Taking%20Parallelism%20Mainstream%20Microsoft%20February%202009.docx" mce_href="http://download.microsoft.com/download/D/5/9/D597F62A-0BEE-4CE7-965B-099D705CFAEE/Taking%20Parallelism%20Mainstream%20Microsoft%20February%202009.docx"&gt;Taking Parallelism Mainstream&lt;/A&gt;.” The white paper describes some of the new language facilities in the Task Parallel Library that make it easier for developers to write parallel programs. These new language facilities include Parallel For loops, Parallel LINQ, Parallel Invoke, Futures and Continuations, and messaging using asynchronous agents. To the extent that these parallel computing initiatives succeed, they will generate the need for better performance analysis tools to understand the performance of concurrent programs because not everyone who implements these new constructs is going to see impressive speed-up of his or her applications. Some developers will even see the retrograde performance predicted by Gunther’s scalability formula.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Here’s the C# program that illustrates one of the new parallel programming constructs:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;IEnumerable&amp;lt;StockQuote&amp;gt; Query(IEnumerable&amp;lt;StockQuote&amp;gt; stocks) {&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;var results = new ConcurrentBag&amp;lt;StockQuote&amp;gt;();&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Parallel.ForEach (stocks, stock =&amp;gt; {&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;if (stock.MarketCap &amp;gt; 100000000000.0 &amp;amp;&amp;amp;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;stock.ChangePct &amp;lt; 0.025 &amp;amp;&amp;amp;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;stock.Volume&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&amp;gt; 1.05 * stock.VolumeMavg3M) {&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;results.Add(stock);&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;});&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;return results;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 0pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Courier New'; COLOR: #4f81bd; FONT-SIZE: 9pt; mso-themecolor: accent1"&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;}&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The example uses a &lt;I style="mso-bidi-font-style: normal"&gt;Parallel.ForEach&lt;/I&gt; enumeration loop, along with one of the new concurrent collection classes, the &lt;I style="mso-bidi-font-style: normal"&gt;ConcurrentBag&lt;/I&gt;, to evaluate stock prices based on some set of selection criteria. The problem, the white paper author notes, is one that is considered “embarrassingly parallel” because it is easily decomposed into independent sub-problems that can be executed concurrently. The new C# Task Parallel Library language features provide an elegant way to express this parallelism. Underneath the Parallel.For expression is a run-time library that understands how to partition the body of the parallel For loop into multiple work items, and dispatches them to separate worker threads that are then scheduled to execute concurrently. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The .NET Task Parallel Library (TPL) provides the run-time machinery to turn this program into a parallel program. At run-time, it automatically parallelizes the lambda expression that the &lt;I style="mso-bidi-font-style: normal"&gt;Parallel.For&lt;/I&gt; construct references. In this example, the Task Parallel Library takes the If Statement in the lambda expression and queues it up to run in parallel using the concurrent runtime library in .NET. The concurrent runtime library creates a thread pool and then delegates the processing of the lambda expression to these worker threads. The concurrent runtime attempts to allocate and schedule an optimal number of worker threads to this parallel task.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;While TPL makes it easier to express parallelism in your programs and eliminates most of the grunt work in setting up the runtime environment associated with parallel threads, it cannot guarantee that running this program in parallel on a machine with four or eight processors will actually reduce its execution time. That is the essence of the challenge of fine-grained parallelism. There is overhead associated with queuing work items to the thread pool the parallel run-time manages. This is overhead that the serial version of the program does not encounter. Only when the work done inside the lambda expression runs for long enough does the benefit of parallelizing it exceed this cost, which must be amortized over each concurrent execution of the inner body of the Parallel.For loop. Note that this particular set of overheads is unavoidable. When you are dealing with fine-grained parallelism, the overhead of setting up the parallel run-time environment alone often exceeds the potential benefit of executing in parallel, notwithstanding other possible sources of contention-related delays that could further slow down execution time.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;So, it is important to realize that a code sample like the one I have taken here from the parallel programming white paper was chosen to illustrate the expressive power of the new language constructs. The example was intended to show the &lt;EM&gt;pattern&lt;/EM&gt; that developers should adhere to -- I am certain it wasn't intended to illustrate something specific you would actually do. When you are taking advantage of these new parallel programming language extensions, you need to be aware of the distinction between fine-grained and coarse-grained parallelism.&amp;nbsp;This is emphatically not an example of a program that you will necessarily speed up by running it in parallel. It will take considerably more analysis to figure out if parallelism is the right solution here. Speeding up a serial program by running portions of it in parallel isn’t always easy – even when that program has sections that are “embarrassingly parallel.”&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-spacerun: yes"&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;What we’d like to be able to do is to estimate the actual performance improvement we can expect of this parallel program, compared to its original serial version. “Embarrassingly parallel” is another way of saying that, once parallelized, the serial portion of this program is expected to be quite small. However, this is also an example of fine-grained parallelism because the amount of code associated with the lambda expression that is passed to the worker thread delegate to execute is also quite small. An experienced developer at this point should be asking whether the overhead of creating these worker threads and dispatching them might, in fact, be greater than the benefit of executing this task in parallel. This relatively fixed overhead is especially important when the amount of work performed by each delegate is quite small – there is only a limited opportunity to amortize the overhead of initiating and managing the multithreaded operation across the execution time of each of the worker threads. It is extremely important to understand this in the case of fine-grained parallelism.&lt;/FONT&gt;&lt;/P&gt;
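&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;A simple way to start answering that question is to time both versions on the same input. The sketch below assumes a hypothetical StockQuote class holding just the four fields the query reads, and fills it with synthetic random values, so the timings it prints say nothing about real market data. The names GrainTest, Selected, SerialCount, ParallelCount, and MakeSyntheticQuotes are likewise invented for this illustration. It only shows how the serial loop and the Parallel.ForEach version compare on your own machine.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE style="FONT-FAMILY: 'Courier New'; FONT-SIZE: 9pt"&gt;
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical stand-in for the white paper's StockQuote type.
public class StockQuote
{
    public double MarketCap, ChangePct, Volume, VolumeMavg3M;
}

public static class GrainTest
{
    // The same selection criteria as the white paper example.
    static bool Selected(StockQuote s)
    {
        return s.MarketCap &amp;gt; 100000000000.0 &amp;amp;&amp;amp;
               s.ChangePct &amp;lt; 0.025 &amp;amp;&amp;amp;
               s.Volume &amp;gt; 1.05 * s.VolumeMavg3M;
    }

    public static int SerialCount(List&amp;lt;StockQuote&amp;gt; stocks)
    {
        var results = new List&amp;lt;StockQuote&amp;gt;();
        foreach (var s in stocks) if (Selected(s)) results.Add(s);
        return results.Count;
    }

    public static int ParallelCount(List&amp;lt;StockQuote&amp;gt; stocks)
    {
        var results = new ConcurrentBag&amp;lt;StockQuote&amp;gt;();
        Parallel.ForEach(stocks, s =&amp;gt; { if (Selected(s)) results.Add(s); });
        return results.Count;
    }

    // Synthetic data only: random values with a fixed seed for repeatability.
    public static List&amp;lt;StockQuote&amp;gt; MakeSyntheticQuotes(int n)
    {
        var rng = new Random(42);
        var list = new List&amp;lt;StockQuote&amp;gt;(n);
        for (int i = 0; i &amp;lt; n; i++)
        {
            list.Add(new StockQuote
            {
                MarketCap = rng.NextDouble() * 2e11,
                ChangePct = rng.NextDouble() * 0.05,
                Volume = rng.NextDouble(),
                VolumeMavg3M = rng.NextDouble()
            });
        }
        return list;
    }

    public static void Main()
    {
        var stocks = MakeSyntheticQuotes(1000000);
        ParallelCount(stocks);   // warm-up so thread pool start-up is not timed

        var sw = Stopwatch.StartNew();
        int serialMatches = SerialCount(stocks);
        long serialMs = sw.ElapsedMilliseconds;

        sw.Restart();
        int parallelMatches = ParallelCount(stocks);
        long parallelMs = sw.ElapsedMilliseconds;

        Console.WriteLine("serial: {0} ms ({1} matches), parallel: {2} ms ({3} matches)",
            serialMs, serialMatches, parallelMs, parallelMatches);
    }
}
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Both versions must report the same number of matches. Whether the parallel version is faster is exactly the question that has to be settled by measurement.&lt;/FONT&gt;&lt;/P&gt;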
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Next, we will turn to coarse-grained parallelism, where the odds are much better that you may be able to speed up program execution substantially using concurrent programming techniques. In the next blog entry, I will start to tackle more promising examples. The analysis of the performance costs associated with parallel programming tasks will become more complex. I will try to illustrate this analysis using a concrete programming example that will simulate coarse-grained parallelism. As we look at how this simple programming example scales on a multi-core machine, it will bring us face-to-face with the pitfalls even experienced developers can expect to encounter when they attempt to parallelize their existing serial applications. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;We will also start to look into the analysis tools that are available in the next version of the Visual Studio Profiler that greatly help with understanding the performance of your .NET parallel program. In the meantime, if you'd like to get a head start on these new tools in the Visual Studio Profiler, be sure to check out &lt;A href="http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Hazim Shafi’s blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; for more details.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/o:wrapblock&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9718379" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Beta/default.aspx">Beta</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio+Profiler/default.aspx">Visual Studio Profiler</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio+Team+Developer/default.aspx">Visual Studio Team Developer</category></item><item><title>Are we taking advantage of Parallelism?</title><link>http://blogs.msdn.com/ddperf/archive/2009/05/02/are-we-taking-advantage-of-parallelism.aspx</link><pubDate>Sun, 03 May 2009 01:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9584046</guid><dc:creator>Sunny 
Egbo</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9584046.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9584046</wfw:commentRss><description>&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Recently, a colleague of mine, Mark Friedman, posted a blog titled “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Parallel Scalability Isn’t Child’s Play&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;” in which he reviewed the merits of Amdahl’s Law vs. Gunther’s Law for determining the practical limits to parallelization. I would not argue with the premise of Mark’s blog that parallelism is not child’s play. However, I do have alternate views of the use of Amdahl’s Law and Gunther’s Law that I posted on his blog. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;I think my views and comments on Mark’s blog warrant a blog of their own to explain fully.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Speaking of child’s play: my 10-year-old son recently made a two-part movie titled “&lt;I style="mso-bidi-font-style: normal"&gt;the Way&lt;/I&gt;” and “&lt;I style="mso-bidi-font-style: normal"&gt;the Way Back&lt;/I&gt;” complete with a full storyline, multiple sound tracks and narrations. He put these movies together with only the help of his eight-year-old sister, using sample movie clips and stock photographs he found on his computer hard drive. He asked me for help getting his two masterpieces onto a DVD capable of playing on the average home DVD player. He also asked about the length of a typical movie playing in movie theaters around the U.S. (approximately 2 hours) and how much these movies cost at the movie theater (approximately $12 for adults and $8 for children, minus the popcorn). Based on my answers, he determined that he will charge 25 cents for people to watch his movies, because he wanted everyone to attend. I wanted to ask him how much he would charge someone who decided to watch only one of the clips. However, I didn’t because I did not want to lose a price haggling war with a 10-year-old. Besides, it would be terrible if you could not find your way back.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case, his movies were quite impressive. The most technologically savvy thing I did as a 10-year-old kid was to build a telephone line with tomato soup cans and a string. Movie making was out of reach for me; but now it is child’s play.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, parallelism is not child’s play. However, I hold out hope that in the future the typical computer program will be written with parallelism in mind. Will parallelism ever be child’s play the way movie making is today? &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Parallelism exists everywhere: &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;instruction level, memory level, loop level and task level parallelism, etc. Also, parallelism has been with us for quite some time now. For the past several decades, hardware engineers have quietly been busy solving problems in parallel to improve processor and system level performance. However, for the past four or more years, hardware designers have encountered the twin brick walls created by memory speed and power. These walls have forced CPU architects and hardware designers to go multi-core in a major way. The doubling of CPU frequency every 18 months, which held true for many decades, is no longer practical and has come to an abrupt end. Although hardware performance continues to improve, as my colleagues and I pointed out in our blog “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Investigating a Pleasant Surprise&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face="Times New Roman"&gt;,&lt;/FONT&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;” the pace of CPU frequency increases has slowed considerably. Instead, hardware designers have been doubling the number of cores available on a single CPU socket every couple of years. &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;To keep getting the performance improvements that were previously possible, software engineers now need to step up to the plate and write software in a parallel and scalable fashion. They need tools and frameworks that allow them to think about their problems, identify opportunities for parallelism, and analyze their solutions correctly and efficiently. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;I am a big fan of Amdahl’s Law as an analysis framework. However, I do not subscribe to the narrow view that Amdahl’s Law applies only to parallelism, as most people who write about it seem to imply. I prefer the broader treatment of the Law by Hennessy and Patterson in their famous book “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1" mce_href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Computer Architecture: A Quantitative Approach&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;”, where Amdahl’s Law is used to estimate the gains available from competing designs. Amdahl’s Law is very powerful for showing the areas that will likely yield the most fruitful performance gains. In my performance design, tuning and optimization work, I use Amdahl’s Law to prioritize the areas of opportunity on which to focus my efforts.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
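&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;As a rough illustration of this kind of prioritization, the Python sketch below applies the general form of Amdahl’s Law, overall speed-up = 1 / ((1 − f) + f / s), where f is the fraction of run time affected by an improvement and s is the speed-up of that fraction. The function name and the two candidate optimizations are hypothetical, chosen only for illustration; they are not measurements.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """General form of Amdahl's Law: only part of the run time gets faster."""
    untouched = 1.0 - fraction_enhanced
    return 1.0 / (untouched + fraction_enhanced / speedup_enhanced)


# Hypothetical candidates: (description, fraction of run time, local speed-up)
candidates = [
    ("make a 20% hotspot 10x faster", 0.20, 10.0),
    ("make a 50% hotspot 2x faster", 0.50, 2.0),
]

# Rank the candidates by the overall gain each one would deliver.
ranked = sorted(
    candidates,
    key=lambda item: overall_speedup(item[1], item[2]),
    reverse=True,
)

for description, fraction, local_gain in ranked:
    gain = overall_speedup(fraction, local_gain)
    print(f"{description}: overall speed-up {gain:.2f}x")
```

&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;With these made-up numbers, the modest 2x improvement to the larger hotspot wins (about 1.33x overall versus about 1.22x), even though the other change looks more impressive locally.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;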
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Amdahl’s Law is not the limit to either absolute performance or parallelism as many authors seem to suggest. Gunther’s and Gustafson’s Laws are helpful for putting Amdahl’s Law in perspective. However, like Amdahl’s Law, these laws are not fundamental limits. The use of these three laws to estimate the level of parallelism that is possible is very flawed. Specifically, the use of these laws as fundamental limits can obfuscate the level of parallelism and performance inherent in typical computing problems. These laws gloss over a number of important points and practical aspects of obtaining parallelism in general purpose computing, including that:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;1.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Many user tasks are non-monolithic and can be solved in a distributed fashion. Background tasks (e.g., virus scans) that often block execution on a single processor can now be done in a way that improves the user experience. The key is to identify and remove unnecessary dependencies so that these tasks can proceed in parallel with other tasks on a multi-core computer.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;2.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some algorithms that have inefficient sequential solutions surprisingly have efficient parallel solutions. This fact should be comforting to fans of algorithms. For example, many applications require matrix multiplication, which turns out to be easily parallelizable. Although the best sequential algorithm for matrix multiplication has a time complexity of O(n&lt;SUP&gt;2.376&lt;/SUP&gt;), a straightforward parallel solution has an asymptotic time complexity of O(log n) using n&lt;SUP&gt;2.376&lt;/SUP&gt; processors. In other words, we can readily find a parallel solution for matrix multiplication that improves its runtime as more and more processor cores become available. Of course, you might have difficulty conceiving of n&lt;SUP&gt;2.376&lt;/SUP&gt; processors in a system, as a colleague mentioned recently. However, this is just another way of saying that matrix multiplication benefits from more and more processors.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;3.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some poor sequential algorithms can be easily parallelized to execute in less time than their sequential versions. We also know that some algorithms with the best asymptotic time complexities achieve their speed by introducing data dependencies that make parallelization difficult, and that the best asymptotic time complexity does not necessarily translate to the best runtime in real life. Hence, at some point, the benefit of easily parallelizing a poor sequential algorithm that has few data dependencies can outweigh the benefit of a more efficient sequential counterpart that has many. So, when considering parallel solutions, it is not always necessary to start with the sequential solution with the best time complexity [see also the comment about Fortune and Wyllie below].&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;4.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;The real-world performance of applications is not determined exclusively by the asymptotic time complexity of algorithms. Because of the increasing gap between CPU and memory speed, memory accesses increasingly dominate the performance of applications running on modern CPUs. Although the gap can be mitigated with large caches, every cache miss takes hundreds of CPU cycles to complete. Even a modest overlap in these memory accesses (Memory Level Parallelism) can improve application performance in noticeable ways.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-SIZE: 11pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
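&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;One way to see why the parallel running time of matrix multiplication can be logarithmic is to count parallel steps rather than operations. The Python sketch below is only an illustration under simplifying assumptions: it simulates the textbook scheme that assumes n&lt;SUP&gt;3&lt;/SUP&gt; idealized processors (not the n&lt;SUP&gt;2.376&lt;/SUP&gt; figure quoted in item 2), it ignores communication and memory costs, and the names tree_sum and matmul_with_depth are made up for this example. All products are formed in one parallel step, and each dot product is then summed pairwise in about log&lt;SUB&gt;2&lt;/SUB&gt; n rounds. The code itself runs sequentially; it only counts how many steps a parallel machine would need.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

```python
def tree_sum(values):
    """Add values pairwise; return (total, rounds).

    The additions within one round are independent of each other, so an
    idealized parallel machine could perform a whole round in one time step.
    """
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) != 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # the odd element waits for the next round
        level = paired
        rounds += 1
    return level[0], rounds


def matmul_with_depth(a, b):
    """Multiply two square matrices; return (product, parallel_depth)."""
    n = len(a)
    product = [[0] * n for _ in range(n)]
    depth = 0
    for i in range(n):
        for j in range(n):
            # One parallel step: the n multiplications for this entry are independent.
            terms = [a[i][k] * b[k][j] for k in range(n)]
            product[i][j], rounds = tree_sum(terms)
            depth = max(depth, 1 + rounds)
    return product, depth


for size in (2, 4, 8, 16):
    ones = [[1] * size for _ in range(size)]
    _, steps = matmul_with_depth(ones, ones)
    print(f"n = {size:2d}: {size ** 3} multiplications, {steps} parallel steps")
```

&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In this model the number of multiplications grows as n&lt;SUP&gt;3&lt;/SUP&gt;, while the number of parallel steps grows by only one each time n doubles.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;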
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Over the years, there have been efforts to classify computationally intractable problems. Many decision problems (i.e., Yes/No) and their optimization counterparts have been categorized into the NP-Complete and NP-Hard sets respectively. The Travelling Salesman (TSP), Bin-Packing and 3-Dimensional Matching problems are three famous examples of NP-Complete problems. In a similar fashion, problems that are difficult to parallelize have been categorized into the P-Complete set, the set of problems that are believed to be inherently sequential. As you can imagine, sorting is not P-Complete. Likewise, Matrix Multiplication is not in the P-Complete set. Processor scheduling can be done in O(log n) time units using &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processors, so it is not P-Complete either. In an ultimate twist of irony, many NP-Hard problems have heuristic solutions that can be executed in parallel to approximate the real solutions. Hence, the natural inclination to think that NP-Complete problems cannot be parallelized is not borne out in practice.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;As it turns out, the real limit to parallelism seems to be defined not by Amdahl’s Law, Gunther’s Law, Gustafson’s Law or NP-Completeness, but by the P-Complete set. It appears that whether a problem is parallelizable is related to the asymptotic space complexity of its sequential solutions. According to Fortune and Wyllie’s Parallel Processing Thesis, any problem that can be solved with a poly-logarithmic space complexity can be parallelized efficiently. Because of the time-space trade-off of algorithms, this implies that the sequential algorithm that achieves this space complexity is not necessarily the algorithm with the best asymptotic time complexity. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case, because one can evaluate problems on multiple levels beyond algorithms (e.g., at the instruction, memory and data access, loop and task levels), the set of problems that can be parallelized appears to be quite large. The question is how to identify and take advantage of the parallelization opportunities that may be inherently available, and to do so in an efficient and scalable manner. How can we parallelize loops? How do we overlap high latency activities such as accesses to physical memory or I/O to amortize the cost of those activities? How do we minimize synchronization? How do we partition tasks to eliminate bottlenecks from the critical paths? How do we dispatch work efficiently to improve system utilization, throughput and latency? What areas of our application can benefit from what sets of efforts? These are some of the questions whose answers lead to scalable designs. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, the tools to identify parallelism and scalability opportunities are very limited. Programming languages that allow programmers to express parallelism in a natural way are lacking. The tools to analyze and troubleshoot parallel implementations are limited as well. Debugging parallel implementations is particularly hard. However, I suspect that with some industry focus and incremental progress, we can continue to make parallelism more accessible to average programmers over the next few years. Even so, we are still many years away.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size=3&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;W&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;hat are some of the fundamental limits preventing such tools from being built? As Mark said on his blog, &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;achieving improved scalability using parallel programming techniques is certainly very challenging. But can parallel programming be made less challenging with intuitive tools that expose parallel solutions in a natural way and allow programmers to exploit them? &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Can programming languages and tools improve to a point where a typical 10-year-old will be able to write a parallel program as easily as they can put together a multi-track movie today?&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Sunny Egbo&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9584046" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+Engineering/default.aspx">Performance Engineering</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item><item><title>Parallel Scalability Isn’t Child’s Play, Part 2: Amdahl’s Law vs. 
Gunther’s Law</title><link>http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx</link><pubDate>Wed, 29 Apr 2009 07:51:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9575026</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>4</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9575026.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9575026</wfw:commentRss><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;&lt;A title="Link to Parallel Scalability Part 1" href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx"&gt;Part 1 of this series of blog entries&lt;/A&gt; discussed results from simulating the performance of a massively parallel SIMD application on several alternative multi-core architectures. These results were reported by researchers at Sandia Labs and publicized in a press release. Neil Gunther, my colleague from the Computer Measurement Group (CMG), referred to the Sandia findings as evidence supporting his &lt;I style="mso-bidi-font-style: normal"&gt;universal scalability law&lt;/I&gt;. This blog entry investigates Gunther’s model of parallel programming scalability, which, unfortunately, is not as well known as it should be. Gunther’s insight is especially useful in the current computing landscape, which is actively embracing parallel computing using multi-core workstations &amp;amp; servers.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Gunther’s scalability formula for parallel processing is a useful antidote to any overly optimistic expectations developers might have about the gains to be had from applying parallel programming techniques. Where Amdahl’s law can be used to establish a theoretical &lt;I style="mso-bidi-font-style: normal"&gt;upper limit&lt;/I&gt; to the speed-up that parallel programming techniques can provide, Gunther’s law can also model the retrograde performance that we frequently observe when parallel computing is used. In other words, Gunther’s scalability formula encapsulates the behavior we frequently observe where adding more and more processors to a parallel processing workload can result in &lt;I style="mso-bidi-font-style: normal"&gt;degraded&lt;/I&gt; performance. It is a more realistic model for people who adopt parallel programming techniques to enhance the scalability of their applications on multi-core hardware. So, without in any way trying to diminish enthusiasm for the entire enterprise, it is crucial to understand that achieving improved scalability using parallel programming techniques can be very challenging.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;As I discussed &lt;/FONT&gt;&lt;A href="http://www.cmg.org/measureit/issues/mit44/m_44_18.html"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;in a review of Gunther’s last book&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;, Gunther’s law adds another parameter to the well-known Amdahl’s Law. Gunther calls this parameter &lt;I style="mso-bidi-font-style: normal"&gt;coherence&lt;/I&gt;. Parallel programs have additional costs associated with maintaining the “coherence” of shared data structures, that is, memory locations that are accessed and updated by threads executing in parallel. By incorporating these coherence-related delays, Gunther’s formula is able to model the retrograde performance that all too frequently is observed empirically. The blue line marked “Conventional” in the chart Sandia Labs published (&lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx"&gt;&lt;FONT size=3 face=Calibri&gt;Figure 1&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; in the earlier blog) is a scalability curve that Gunther correctly cites as consistent with his model.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Let’s drill into the mathematics here for a moment. What Gunther calls his Universal Scalability Law is an extension to the familiar multiprocessor scalability formula first suggested by Gene Amdahl. In &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/Amdahl%27s_law"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;Amdahl's law&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt; is the proportion of a program that can be parallelized, leaving &lt;I style="mso-bidi-font-style: normal"&gt;1 − p&lt;/I&gt; to represent the part of the program that cannot be parallelized and remains serial. Amdahl’s insight was that the &lt;I style="mso-bidi-font-style: normal"&gt;1 − p&lt;/I&gt; fraction of time spent in the serial portion of the program creates an upper bound on how much its performance can be improved when parallelized. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Consider a sequential program that we want to speed up using parallel programming techniques. An old-fashioned way to think about this is to identify some portion of the program, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt;, that can be executed in parallel, and then implement a &lt;B style="mso-bidi-font-weight: normal"&gt;Fork()&lt;/B&gt; to spawn parallel tasks, followed by a &lt;B style="mso-bidi-font-weight: normal"&gt;Join()&lt;/B&gt; to unify the processing and carry on sequentially afterwards. Conceptually, something like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 365px; HEIGHT: 335px" title="Fork Join" alt="Fork Join" src="http://5l3vgw.bay.livefilestore.com/y1prpgpboRXSzjQVnNYycq6GJuJ8R8HJIlojyRLhYinSz8MbLbSRl-3NN9tSD_qBRNoLp4SGLDZHIzUL0yvuqRj9GczCqOudgK_/Fork-Join%20flowchart.jpg" width=365 height=335 mce_src="http://5l3vgw.bay.livefilestore.com/y1prpgpboRXSzjQVnNYycq6GJuJ8R8HJIlojyRLhYinSz8MbLbSRl-3NN9tSD_qBRNoLp4SGLDZHIzUL0yvuqRj9GczCqOudgK_/Fork-Join%20flowchart.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-size: 11.0pt"&gt;Figure 3.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; COLOR: #002060; FONT-SIZE: 10pt; FONT-WEIGHT: normal; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-bidi-font-weight: bold; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-size: 11.0pt"&gt; Parallel processing using a Fork/Join.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;SPAN&gt;&lt;o:p&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Amdahl’s law simply observes that your ability to speed up this program using parallelism is a function of the proportion of the time, &lt;I style="mso-bidi-font-style: normal"&gt;p&lt;/I&gt;, spent in the parallel portion of the program, compared to &lt;I style="mso-bidi-font-style: normal"&gt;s&lt;/I&gt;, the time spent in the serial parts of the program. (Note that &lt;I style="mso-bidi-font-style: normal"&gt;p + s = 1&lt;/I&gt;, in this formulation.) Amdahl’s observation was meant as a direct challenge to hardware architects who were advocating building parallel computing hardware. It was also easy for those advocates of parallel computing approaches to dismiss Amdahl’s remarks since Dr. Amdahl was clearly invested in trying to build faster processors, no matter the cost.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Advocates of parallel computing, of course, are not blind to the hazards of the parallel processing approach. Scalability of the underlying hardware is one challenge. An even bigger challenge is writing multi-threaded programs. For starters, it is often far more difficult to conceptualize a parallel solution than a serial one. (We can speculate that this may simply be a function of the way our minds tend to work.) Parallel programs are also notoriously more difficult to debug. When you are debugging a multi-threaded program running in parallel on parallel hardware, events don’t always occur in the exact same sequence. This is known as &lt;I style="mso-bidi-font-style: normal"&gt;non-determinism&lt;/I&gt;, and it often leads to huge problems for developers because, for instance, it may be very difficult to reproduce the exact timing sequence that exposes an error in your logic.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Furthermore, once you manage to get your programs to run correctly in a parallel processing mode, the performance wins from doing so are not a given. In the course of celebrating the performance wins they do get, developers can sometimes lose sight of how difficult it was to achieve those gains. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Notwithstanding the difficulties that need to be overcome, compelling reasons to look at parallel computation remain, including trying to solve problems that simply won’t fit inside the largest computers we can build. Today, there is renewed interest in parallel programming because, with current semiconductor fabrication technology, it is difficult for hardware designers to make processors run at higher and higher clock speeds without consuming excessive amounts of power and generating excessive amounts of heat that must be dissipated. Power and cooling considerations are driving parallel computing today for portables, desktops, and servers.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Comparing Gunther’s formula to Amdahl’s law. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Meanwhile, Amdahl’s original insight remains relevant today. From Amdahl’s law, we understand that, no matter what degree of parallelism is achieved, the execution time of a program’s serial portion sets a practical upper bound on the speed-up its parallel counterpart can achieve. As an example, Figure 4 plots the scalability curve using Amdahl’s law where p = 0.9, that is, when just 10% of the program remains serial. Notice that Amdahl’s law predicts that the performance of a parallel program will level off as more and more processors are added. As you can see, Amdahl’s law shows diminishing returns from increasing the level of parallelism, so the parallel approach becomes less and less cost-effective as more and more processors are added.&lt;/FONT&gt;&lt;/P&gt;
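&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;To make the levelling-off concrete, here is a minimal Python sketch that evaluates Amdahl’s law, speed-up = 1 / ((1 − p) + p / N), for p = 0.9. The function name and the list of processor counts are chosen for illustration only, and N is used for the processor count simply to keep it distinct from the parallel fraction p.&lt;/FONT&gt;&lt;/P&gt;

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speed-up predicted by Amdahl's law for a given number of processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


PARALLEL_FRACTION = 0.9  # 10% of the program remains serial

for processors in (1, 2, 4, 8, 16, 64, 1024):
    speedup = amdahl_speedup(PARALLEL_FRACTION, processors)
    # Efficiency is the share of the ideal linear speed-up actually obtained.
    efficiency = speedup / processors
    print(f"N = {processors:4d}: speed-up {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;No matter how large N gets, the predicted speed-up stays below 1 / (1 − p) = 10, while the efficiency per processor keeps falling.&lt;/FONT&gt;&lt;/P&gt;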
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;IMG style="WIDTH: 334px; HEIGHT: 286px" title="Amdahl's Law vs. Gunther's Law" alt="Amdahl's Law vs. Gunther's Law" src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg" width=334 height=286 mce_src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg"&gt;&lt;IMG style="WIDTH: 0px; HEIGHT: 0px" title="Amdahl's Law &amp;amp; Gunther's Law" alt="Amdahl's Law &amp;amp; Gunther's Law" src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg" mce_src="http://5l3vgw.bay.livefilestore.com/y1pN7Q-dLDfRpTM8sHTBUMnORaHYkqs1fnOq57tgpYfyBjjbKKjQ8XZRTTazyt9cNaxt2X31QRdackdQL_gF0tcM9PTxce5Hw8-/Amdahl%20vs%20Gunther%20laws.jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt;Figure 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt; &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;A comparison of Amdahl’s Law to Gunther’s Universal Scalability Model&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="LINE-HEIGHT: 115%; COLOR: #002060; FONT-SIZE: 10pt; mso-bidi-font-size: 11.0pt"&gt;&lt;o:p&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Given that Amdahl was mainly acting as an advocate for building faster serial CPUs, something that he wanted to do anyway, his is by no means the last word on the subject. Researchers in numerical computing like the ones at Sandia Labs were encouraged a few years later by a paper from one of their own. John Gustafson of Sandia Labs published a well-known paper in 1988 entitled “&lt;/FONT&gt;&lt;A href="http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html"&gt;&lt;FONT color=#0000ff size=3&gt;Reevaluating Amdahl's Law&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;” that adopted a much more optimistic stance toward parallel programming. The essence of Gustafson’s argument is that when parallel processing resources become available, programmers will jigger their software to take advantage of them:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0.3in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-SIZE: 9pt"&gt;&lt;FONT color=#000000&gt;One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice, &lt;I&gt;the problem size scales with the number of processors&lt;/I&gt;. When given a more powerful processor, the problem generally expands to make use of the increased facilities. Users have control over such things as grid resolution, number of timesteps, difference operator complexity, and other parameters that are usually adjusted to allow the program to be run in some desired amount of time. Hence, it may be most realistic to assume that &lt;I&gt;run time, not problem size&lt;/I&gt;, is constant.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Gustafson’s counter-argument does not refute Amdahl’s law so much as suggest there might be creative ways to work around it. It encouraged parallel programming researchers to keep plugging away, looking for ways to sidestep Amdahl’s law. Microsoft’s Herb Sutter, &lt;/FONT&gt;&lt;A href="http://www.ddj.com/architect/205900309"&gt;&lt;FONT color=#0000ff size=3&gt;in his popular Dr. Dobb’s Journal column back in January 2008&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;, cited Gustafson’s Law favorably to offer similar encouragement to software developers today who need to re-fashion their code to take advantage of parallel processing in the multi-core and many-core era. &lt;/FONT&gt;&lt;/P&gt;
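&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;To make the contrast concrete, here is a minimal Python sketch that puts the fixed-size (Amdahl) and fixed-time (Gustafson) views side by side. It assumes the commonly cited form of Gustafson’s scaled speedup, p - s(p - 1), where s is the serial fraction measured on the parallel run rather than on the serial one. The function names and the 10% serial fraction are illustrative choices of mine, not values taken from Gustafson’s paper.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def amdahl_speedup(p, s):
    """Fixed-size problem: s is the serial fraction of the one-processor run."""
    return p / (1 + s * (p - 1))


def gustafson_speedup(p, s):
    """Fixed-time problem: s is the serial fraction of the p-processor run."""
    return p - s * (p - 1)


if __name__ == "__main__":
    s = 0.10  # illustrative serial fraction (an assumption, not measured data)
    for p in (1, 4, 16, 64, 256, 1024):
        print(f"p={p:5d}  Amdahl={amdahl_speedup(p, s):7.2f}  "
              f"Gustafson={gustafson_speedup(p, s):8.2f}")
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The two serial fractions are not the same quantity, so the columns are not directly comparable; the point is only that letting the problem grow with the machine keeps the scaled speedup close to linear, while the fixed-size speedup can never exceed 1/s.&lt;/FONT&gt;&lt;/P&gt;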
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Gunther’s augmentation of Amdahl’s law is grounded empirically, providing a more realistic assessment of scalability using parallel programming technology. Gunther’s formula is similar, but adds another parameter to Amdahl’s law, κ&lt;SPAN style="mso-fareast-font-family: 'Times New Roman'; mso-fareast-theme-font: minor-fareast"&gt; &lt;/SPAN&gt;(written &lt;I style="mso-bidi-font-style: normal"&gt;k&lt;/I&gt; in the formula below), that represents something called &lt;I style="mso-bidi-font-style: normal"&gt;coherency&lt;/I&gt; delay: &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;v:shapetype id=_x0000_t75 coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;FONT color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;EM&gt;&lt;STRONG&gt;C(p) = p/(1+s(p-1) + kp(p-1))&lt;/STRONG&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;/v:shapetype&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;v:shapetype coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f"&gt;&lt;FONT color=#000000&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;To show how the two formulas behave, in Figure 4 above, Amdahl’s law is compared to Gunther’s law for a program with the same 10% serial portion. I set the coherency delay factor in Gunther’s formula to 0.001. When a coherency delay is also modeled, notice that parallel scalability is no longer monotonically increasing as processors are added. When we allow for some amount of coherency delay, there comes a point when overall performance levels off and ultimately begins to &lt;EM&gt;decrease&lt;/EM&gt;. Gunther’s formula not only models the performance of a parallel program that encounters diminishing returns from increased levels of parallelism, it also highlights the performance degradation that can occur when the communication and coordination-related delays introduced by multiple threads needing to synchronize access to shared data structures become excessive.&lt;/P&gt;
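&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The behavior described above is easy to reproduce numerically. The following Python sketch evaluates both formulas for a 10% serial portion and a coherency delay factor of 0.001, the values used for the figure. The function names are my own, and the peak location comes from setting the derivative of C(p) to zero, which gives p = sqrt((1 - s)/k); treat it as an illustration of the formula rather than as Gunther’s own code.&lt;/P&gt;
&lt;PRE&gt;
```python
import math


def amdahl(p, s):
    """Amdahl's law: speedup on p processors with serial fraction s."""
    return p / (1 + s * (p - 1))


def gunther(p, s, k):
    """Gunther's formula as quoted above: C(p) = p/(1 + s(p-1) + kp(p-1))."""
    return p / (1 + s * (p - 1) + k * p * (p - 1))


def peak_processors(s, k):
    """Processor count at which Gunther's curve peaks (where dC/dp = 0)."""
    return math.sqrt((1 - s) / k)


if __name__ == "__main__":
    s, k = 0.10, 0.001  # serial portion and coherency delay factor from the text
    for p in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"p={p:4d}  Amdahl={amdahl(p, s):6.2f}  Gunther={gunther(p, s, k):6.2f}")
    print("Gunther's curve peaks near p =", round(peak_processors(s, k)))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;With these parameters the Gunther curve peaks at about 30 processors, with a capacity of roughly 6.3, and declines from there, while the Amdahl curve keeps creeping toward its ceiling of 1/s = 10.&lt;/P&gt;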
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;Gunther’s formula lumps all the delays associated with communication and coordination among threads that require access to shared data structures into one factor &lt;EM&gt;k&lt;/EM&gt; that he calls &lt;EM&gt;coherence&lt;/EM&gt;. Unfortunately, Gunther himself provides little help in telling us how to estimate &lt;EM&gt;k,&lt;/EM&gt; the crucial coherency delay factor, beforehand, or measure it after the fact. Presumably, &lt;EM&gt;k&lt;/EM&gt; includes delays associated with accessing critical sections of code that are protected by shared locks, as well as instruction execution delays in the hardware associated with maintaining the “coherence” of shared data kept in processor caches that are accessed and updated by concurrently running threads. There are also additional “overheads” associated with spinning up multiple worker threads, queuing up work items for them to process, controlling their execution, and coordinating their ultimate completion. These overheads are new to the parallel processing environment and are all absent from the serial version of the same program.&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;As a practical developer trying to understand the behavior of my parallel application, I would personally find Gunther’s formula much more useful if it helped me identify the sources of the coherency delays that my parallel programs encounter and that limit their scalability. It would also be useful if Gunther’s insight could guide me as I work to eliminate or reduce these obstacles to scalability. That is the main subject of &lt;A title="forward pointer" href="http://blogs.msdn.com/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/06/09/parallel-scalability-isn-t-child-s-play-part-3-the-problem-with-fine-grained-parallelism.aspx"&gt;the next blog entry in this series&lt;/A&gt;. &lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;/FONT&gt;&lt;/v:shapetype&gt;&lt;/SPAN&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9575026" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category></item><item><title>Parallel Scalability Isn’t Child’s Play</title><link>http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx</link><pubDate>Mon, 16 Mar 2009 20:39:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9481780</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>9</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9481780.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9481780</wfw:commentRss><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In &lt;A title="Neil Gunther's blog" href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html" mce_href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html"&gt;a recent blog entry&lt;/A&gt;, Dr. Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned about unrealistic expectations being raised with regard to the performance of parallel programs on current multi-core hardware. 
Neil’s blog entry highlighted a dismal parallel programming experience publicized &lt;/FONT&gt;&lt;A title="Sandia Labs multi-core press release" href="http://www.sandia.gov/news/resources/releases/2009/multicore.html" mce_href="http://www.sandia.gov/news/resources/releases/2009/multicore.html"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;in a recent press release&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; from the Sandia Labs in Albuquerque, New Mexico. Sandia Labs is a research facility operated by the U.S. Department of Energy. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;According to the press release, scientists at Sandia Labs &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0.2in 10pt" class=MsoNormal&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;FONT face=Calibri&gt;“simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.” They concluded that this retrograde speed-up was due to deficiencies in “memory bandwidth as well as contention between processors over the memory bus available to each processor.”&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Holy cow. The Lab’s scientists, who are heavily invested in parallel programming on supercomputers, simulated running programs on sixteen cores encapsulating “key algorithms for deriving knowledge from large data sets” that gave no better performance than running the same program on two cores. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Please note that these are simulated performance results, because 16-core machines of the type being simulated don’t currently exist. Indeed, I would not expect that 16-core machines of the type being simulated would ever exist. Which leads me to wonder what the point of this Sandia Labs exercise was.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Of course, for developers experienced in parallel programming, this result actually isn’t in itself all that surprising. Quite frequently, experienced developers find that running their multi-threaded application on massively parallel hardware does not scale well with the hardware capabilities. This was apparently the case for the applications the Sandia Labs folks simulated. So what? Should we just give up in our quest for parallel program scalability? &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Before drilling into Dr. Gunther’s specific interest in this disclosure, it is worth looking into the Sandia Labs finding in a bit more detail. For instance, did anyone, besides me, wonder what applications were being simulated?&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In theory, “deriving knowledge from large data sets” is a category of computing program that readily lends itself to a solution using an &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/SIMD"&gt;&lt;FONT size=3 face=Calibri&gt;SIMD&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; (&lt;B style="mso-bidi-font-weight: normal"&gt;S&lt;/B&gt;ingle &lt;B style="mso-bidi-font-weight: normal"&gt;I&lt;/B&gt;nstruction, &lt;B style="mso-bidi-font-weight: normal"&gt;M&lt;/B&gt;ultiple &lt;B style="mso-bidi-font-weight: normal"&gt;D&lt;/B&gt;ata) approach. The canonical example of an SIMD approach to “deriving knowledge from large data sets” is a database Search function conducted in parallel where the data set of interest is partitioned across &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processing units and their locally attached disks. For example, when the Thinking Machines CM-1 supercomputer publicly debuted in the mid-80s, the company demonstrated its capabilities using a parallel search of a database that was partitioned across all 64K nodes of the machine, which was based on the Connection Machine originally designed by MIT whiz kid Danny Hillis. Parallel search when executed across a partitioned dataset should scale linearly, or close enough for government work (pun intended). &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Whenever a problem lends itself to an SIMD approach (also known as “divide and conquer”), linear scalability of the SIMD algorithm does require first partitioning the data being accessed and then proceeding to process that data in parallel. I am sure the point of the Sandia Labs press release was not to disparage the SIMD approach to parallel processing; after all, that is a tried-and-true technique that they have used with great success over the years. &lt;/FONT&gt;&lt;/P&gt;
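&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The partition-then-process pattern described above can be sketched in a few lines of Python. This is only an illustration of the structure, using names of my own invention (partition, search_slice, parallel_search) and an in-memory list standing in for a partitioned database. Python threads will not deliver a linear speed-up on CPU-bound work, so read it as a picture of how the work is divided, not as a benchmark.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n):
    """Split records into n contiguous slices of nearly equal size."""
    size, extra = divmod(len(records), n)
    bounds = [i * size + min(i, extra) for i in range(n + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(n)]


def search_slice(chunk, predicate):
    """Scan one partition; nothing is shared with the other workers."""
    return [record for record in chunk if predicate(record)]


def parallel_search(records, predicate, workers=4):
    """Partition first, then search every partition concurrently."""
    chunks = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(search_slice, chunks, [predicate] * workers))
    # Slices are contiguous and map() preserves order, so hits stay in order.
    return [hit for part in parts for hit in part]


if __name__ == "__main__":
    data = list(range(100))
    print(parallel_search(data, lambda r: r % 7 == 0, workers=4))
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Because each worker touches only its own slice, there are no locks and no shared state until the final merge, which is what gives this kind of algorithm its near-linear scaling on hardware that can feed every worker.&lt;/FONT&gt;&lt;/P&gt;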
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;On the contrary, it appears to be a critique of an approach to building parallel processing hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;where you would increase the number of processing cores on the chip (just because you can with the most current semiconductor fabrication technology) without scaling the memory bandwidth proportionally. Since that is not what is happening hardware-wise, it strikes me that this implied criticism of the multi-core hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;strategy Intel and AMD are pursuing is slaying a non-existent dragon. Both Intel and AMD recognize that memory bus bandwidth is a significant potential bottleneck in their multi-core products, and, as a result, both manufacturers are attempting to scale memory bandwidth proportional to the amount of processing power they deliver on a chip.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;So then what is all the fuss about? The Sandia Labs “news” starts to look like something the blogosphere is latching onto on an otherwise slow day for tech news, raising an alarm &amp;amp; potentially misleading naïve readers about what the conventional wisdom in multiprocessor chip architecture would be if anyone were actually trying to build multi-core microprocessors that way.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;Building a better multicore processor.&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT size=3 face=Calibri&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The point of the Sandia Labs press release publicizing these simulation results appears to be to suggest what they consider better approaches to packaging multi-core processors on a single socket. They released the following chart that makes this point (reproduced here in Figure 1):&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 450px; HEIGHT: 353px" title="Sandia Labs multicore simulation results" alt="Sandia Labs multicore simulation results" src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg" width=450 height=353 mce_src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;FONT color=#4f81bd&gt;&lt;SPAN style="mso-no-proof: yes"&gt;Figure 1&lt;/SPAN&gt;. Sandia Labs simulation showing performance of their application vs. the number of processors.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Exactly what the Sandia Labs folks are reporting here is a little sketchy. Presumably, the simulations are based on observing the behavior of some of their key programs where they were able to measure performance running on “conventional” multi-core processors, perhaps something like the quad-core machine I recently installed for my desktop that uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the results that Sandia Labs published are may turn on this crucial point.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This Sandia Labs simulation then extrapolates out to 16 cores per socket (and beyond), simulating the manufacturer adding more cores to the die, apparently &lt;I style="mso-bidi-font-style: normal"&gt;leaving the memory architecture fundamentally unchanged&lt;/I&gt; as they moved to more cores. The Sandia Labs chart in Figure 1 is labeled to indicate that the memory bandwidth was held constant at 10 GB/sec. &lt;/FONT&gt;&lt;/P&gt;
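&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;A little arithmetic shows why holding the bus at 10 GB/sec is so punishing. The Python sketch below divides the fixed bandwidth evenly among the cores and caps the speed-up once the bus is saturated. The even split and the 2.5 GB/sec that each core is assumed to need are hypothetical figures chosen purely for illustration; they are not numbers from the Sandia Labs simulation.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def per_core_share(bus_gb_per_sec, cores):
    """Even share of one socket's memory bandwidth among its cores."""
    return bus_gb_per_sec / cores


def bandwidth_bound_speedup(cores, bus_gb_per_sec, demand_gb_per_sec):
    """Upper bound on speed-up when every core needs demand_gb_per_sec."""
    return min(cores, bus_gb_per_sec / demand_gb_per_sec)


if __name__ == "__main__":
    BUS = 10.0    # GB/sec, the value the Sandia chart holds constant
    DEMAND = 2.5  # GB/sec per core: a hypothetical figure, for illustration only
    for cores in (1, 2, 4, 8, 16, 32):
        share = per_core_share(BUS, cores)
        bound = bandwidth_bound_speedup(cores, BUS, DEMAND)
        print(f"{cores:3d} cores: {share:6.3f} GB/sec per core, speed-up bound {bound:4.1f}")
```
&lt;/PRE&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Under these assumptions the bound flattens at four cores. This simple model only caps the speed-up; it does not reproduce the outright decline beyond eight cores that Sandia Labs reports and attributes to contention on the memory bus.&lt;/FONT&gt;&lt;/P&gt;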
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This is more than a little suspicious. Hardware manufacturers like Intel and AMD understand clearly that the memory bus has to scale with the number of processors. The AMD &lt;/FONT&gt;&lt;A title="HyperTransport specifications" href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency" mce_href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency"&gt;&lt;FONT size=3&gt;HyperTransport&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; bus architecture is quite explicit about this, and the latest spec, HyperTransport version 3.1, has an aggregate bandwidth in excess of 50 GB/sec.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Meanwhile, the memory bus capacity on the latest &lt;/FONT&gt;&lt;A title="Nehalem architecture announcement" href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx"&gt;&lt;FONT size=3&gt;Nehalem&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;-class processors from Intel has been boosted significantly. Alternatively, it is when you cannot scale the memory bus with processor capacity that machines with &lt;/FONT&gt;&lt;A title="Blogging about NUMA 2008" href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;&lt;FONT size=3&gt;NUMA&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; architectures become more attractive. The AMD processors use HyperTransport links in a ring topology that implicitly leads to NUMA characteristics. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;In Intel’s approach to NUMA scalability, some small number of processors share a common memory bus, forming a &lt;I style="mso-bidi-font-style: normal"&gt;node&lt;/I&gt;. Current Nehalem machines (also known as the Core i7 architecture) have four cores sharing a single memory interface; Nehalem replaces the older front-side bus (FSB) with a memory controller integrated on the chip. The physical layout of this chip is shown in Figure 2, with four cores connected to DDR3 DRAM through the integrated memory controller. I wasn’t able to find a bandwidth rating for this memory interface on Intel’s web site or elsewhere, other than ballpark estimates that put it in the range of 30-40 GB/sec. The QuickPath Interconnect (QPI) links that are used to link memory controllers support transfers of about 25 GB/sec, which is probably a safe lower bound on the capacity of the memory interface. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;&lt;IMG style="WIDTH: 526px; HEIGHT: 363px" title="Core i7 4-way multiprocessor photo" alt="Core i7 4-way multiprocessor photo" src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg" width=526 height=363 mce_src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg"&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;Figure 2. Aerial photograph showing the layout of the 4-way Core i7 (Nehalem) microprocessor chip.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The &lt;/FONT&gt;&lt;A href="https://cfwebprod.sandia.gov/cfdocs/CCIM/docs/pim-mpi.pdf"&gt;&lt;FONT color=#0000ff size=3&gt;PIM architecture&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;, whose scalability curves are close to ideal for the Sandia Labs workloads, is, probably not coincidentally, a processor architecture championed at Sandia Labs. The idea behind PIM machines &lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-ascii-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;(&lt;U&gt;P&lt;/U&gt;rocessor &lt;U&gt;I&lt;/U&gt;n &lt;U&gt;M&lt;/U&gt;emory) &lt;/SPAN&gt;is that the processor (or processors) is embedded into the memory chip itself, which is a pretty interesting approach to solving the “memory wall” that limits performance in today’s dominant computer architectures. Instead of loading up the microprocessor socket with more and more cores, which is the professed hardware roadmap at Intel &amp;amp; AMD, integrating memory into the socket is an intriguing alternative. Such machines, if anyone were to build them, would obviously have NUMA performance characteristics.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The debate is a bit academic for my taste, however, until these PIM architecture machines are a reality. For PIM architecture machines to ever get traction, either the microprocessor manufacturers would have to start building DRAM chips or the DRAM manufacturers would have to start building microprocessors. The way the semiconductor fabrication business is stratified today, that does not appear to be very likely in the near future.&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#000000&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;So, in the end, the point of the Sandia Labs press release appears to be to publicize the multiprocessor hardware direction espoused mainly by Sandia Labs’ own researchers. Frankly, there have been lots and lots of different architectural approaches to parallel processing over the years, and it doesn’t look like any one approach is optimal for all computing situations. You ought to be able to pick another parallel programming workload to simulate in Figure 1 and get an entirely different ranking of the approaches.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Still, the Sandia Labs simulation data are interesting mainly for what they say about how difficult it is going to be for developers to write parallel programs that scale well on multi-core machines. No, achieving parallel scalability isn’t child’s play for hardware manufacturers. Nor is it for software developers attempting to take advantage of parallel processing hardware, which is the subject I will start to drill into next time.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;A title="Continue to Part 2" href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx"&gt;Continue to Part 2....&lt;/A&gt;.&lt;/P&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9481780" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item><item><title>Visual Studio 2010 Hardware Requirements</title><link>http://blogs.msdn.com/ddperf/archive/2008/12/23/visual-studio-2010-hardware-requirements.aspx</link><pubDate>Wed, 24 Dec 2008 10:56:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9251566</guid><dc:creator>David Berg</dc:creator><slash:comments>16</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9251566.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9251566</wfw:commentRss><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Soma’s been talking about the upcoming Visual Studio 2010 release on his &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/somasegar/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;, which means I’m starting to get questions about what type of hardware you’re going to need to run VS2010 on.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Unfortunately, I can’t give you an official answer yet (other than to say, it depends on what you’re doing – obviously building small apps with one of the Express versions of Visual Studio won’t require the same resources as a multi-million line app using full blown Visual Studio Team System with lots of third party add-ins).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;What I can do is help put some of the things we’ve said about Visual Studio 2010 into context, to maybe help you make some better hardware decisions today:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;1)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Memory – we’re trying to make VS2010 as frugal as we can here in order to run in as little memory as possible; however, we’re also adding a lot of functionality, and systems with more memory do tend to perform much better.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;So the general rule of buying systems still applies – spring for as much memory as you can afford.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;It’s hard to have too much memory; at a minimum, you want to make sure that you’re not paging.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;That said, there’s very little benefit to making a text editor 64 bit (and lots of reasons not to), so anything over 4GB is likely to be wasted (unless you’re running or writing apps that need more).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; mso-add-space: auto"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;2)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;CPU – modern CPUs with their larger caches and tuned instruction pipelines tend to perform much better than ones from just a few years ago (see our &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you’re going to do multi-threaded programming, you’ll want at least a dual core processor (and with the new &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/concurrency/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Parallel Computing&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; support in VS 2010, you &lt;U&gt;will&lt;/U&gt; want to do multi-threaded programming).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;3)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;GPU – VS2010 will leverage WPF heavily to create richer editing and visualizations, so a decent GPU that supports at least DX9 is highly recommended (DX10 is preferred, but requires Vista).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;4)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Disk – If you’re building a large project or working with a large database, a large high-speed disk is pretty important.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;For large projects, you can often benefit by spreading your work across multiple disk spindles.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;At an extreme, putting your tools on one drive, your source code on another, and your object files on a third drive allows the three major sources of disk IO in building a project to be carried out independently of each other.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you have to use a slower disk (e.g., in a notebook), then be sure to get lots of memory.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Also keep in mind that modern hard drives tend to have more built-in caching, so a drive of the same speed bought recently will likely outperform one bought a few years ago.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;So now that I’ve given you my thoughts on what hardware Visual Studio 2010 will need, what are your thoughts? &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;What kind of hardware are you developing on today, and what do you expect to be using in the next couple years?&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;What are your expectations on how we should be leveraging your hardware to create a productive development environment?&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9251566" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item><item><title>PDC2008 preConference Workshop</title><link>http://blogs.msdn.com/ddperf/archive/2008/10/22/pdc2008-preconference-workshop.aspx</link><pubDate>Wed, 22 Oct 2008 18:36:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9011219</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9011219.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9011219</wfw:commentRss><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Over the past several weeks, I have been working overtime developing a presentation on web application performance to be given at the upcoming Professional Developer’s Conference (PDC), which is next week in Los Angeles. This is partly why I have been remiss about blogging this month. At least, that is my excuse, and I am sticking to it.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The presentation is entitled “Performance by design using the .NET Framework” and I am presenting jointly with two colleagues in the Developer Division, Rico Mariani and Vance Morrison. It is one of ten PreConference sessions that are scheduled to run all day on Sunday. My portion of the session is an extended discussion of optimization &amp;amp; scaling strategies for web applications. The scope encompasses ASP.NET, AJAX, Silverlight, WPF &amp;amp; WCF. Information about the upcoming event is &lt;/FONT&gt;&lt;A href="http://www.microsoftpdc.com/Agenda/Preconference.aspx#performance-by-design-using-the-net-framework" mce_href="http://www.microsoftpdc.com/Agenda/Preconference.aspx#performance-by-design-using-the-net-framework"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;I have attended several PDCs in the past as a customer, and found them to be amazing events. Before the days of widespread blogging, the “Ask the Experts” sessions at the PDC were often the only way to get an authoritative answer to your question. The actual Conference sessions emphasize imminently arriving technology and future directions, aimed at the professional developer who needs to be able to anticipate and plan. The technical sessions run the gamut from Windows 7 and the Windows Live Cloud computing initiatives to IE8, Surface and Windows for Workflow. There will be previews of the next version of the .NET Framework, Visual Studio, and the Visual Studio Team System. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;This is the first time I will be on the other side of the podium for the event. In our preCon session, Rico, Vance and I will focus on facilities available in the Framework today, including the best practices and tools we recommend to help you design &amp;amp; build an application that meets its performance and scalability requirements. The intended audience is experienced .NET developers. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;If you are reading this blog &amp;amp; coming to my session, be sure to say hello. I’d like to get the chance to meet you in person. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;-- Mark Friedman&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9011219" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+Engineering/default.aspx">Performance Engineering</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+testing/default.aspx">Performance testing</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/PDC2008/default.aspx">PDC2008</category></item><item><title>Mainstream NUMA and the TCP/IP stack: Final Thoughts</title><link>http://blogs.msdn.com/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx</link><pubDate>Fri, 19 Sep 2008 00:18:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8957878</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8957878.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8957878</wfw:commentRss><description>&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;This is a continuation of Part IV of this article posted &lt;A class="" title=Link-back-to-Part4 href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx"&gt;&lt;FONT color=#666666&gt;here&lt;/FONT&gt;&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Note that a final version of a white paper tying this series of five blog entries together (and a PowerPoint presentation on the subject) is attached.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;For many years, the effort to improve network performance on Windows and other platforms focused on reducing the host processing requirements associated with the need to service frequent interrupts from the NIC. In the many-core era where the clock speeds of processors are constrained by power considerations, this strategy is inadequate to the growing host processing requirements that accompany high-speed networking. It is necessary to augment technologies like interrupt moderation and TCP Offload Engine that improve the efficiency of network I/O with an approach that allows TCP/IP Receive packets to be processed in parallel across multiple CPUs. Together, MSI-X and RSS are technologies that enable host processing of TCP/IP packets to scale in the many-core world, albeit not without some compromises with the prevailing model of networking using isolated, layered components.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="COLOR: black"&gt;&lt;FONT face=Calibri&gt;Using MSI-X and RSS, for example, the Intel 82598 10 Gigabit Ethernet Controller mentioned earlier can be mapped to a maximum of 16 processor cores that could then be devoted to networking I/O interrupt handling. This is still not sufficient processing capacity to handle the theoretical maximum load that equation 3 predicts for a 10 Gb Ethernet card, but it does represent a substantial scalability improvement.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;With this understanding of what MSI-X and RSS accomplish, let’s return for a moment to our NUMA server machine shown in Figure 6 below.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;IMG title="NUMA server with multiple RSS queues" style="WIDTH: 364px; HEIGHT: 633px" height=633 alt="NUMA server with multiple RSS queues" src="http://5l3vgw.bay.livefilestore.com/y1pP1tl3lheOVmfXixoNk6WzdhcLnXhAbVSW28AD1IJ3YyHN1ZbYhAQygJHF1fesNHfPK3ehJ6yE4w/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg" width=364 mce_src="http://5l3vgw.bay.livefilestore.com/y1pP1tl3lheOVmfXixoNk6WzdhcLnXhAbVSW28AD1IJ3YyHN1ZbYhAQygJHF1fesNHfPK3ehJ6yE4w/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With MSI-X and Receive-Side Scaling, CPU 0 on node A and CPU 1 on node B are both enabled for processing network interrupts. Since RSS schedules the NDIS DPC to run on the same processor as the ISR, even at moderate networking loads, CPUs 0 and 1 become, for all practical purposes, dedicated to the processing of high priority networking interrupts. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Numerous economies of scale accrue using this approach. The same RSS process that sends all Receive packets from a single TCP connection to a specific CPU for processing improves the efficiency of that processing. The instruction execution rate of the TCP/IP protocol stack is enhanced significantly through this scheduling mechanism that enforces localization. Ultimately, TCP/IP application data buffers need to be allocated from local node memory and processed by threads confined to that node. Recently used data and instructions that networking ISRs and DPCs issue tend to reside in the dedicated cache (or caches) associated with the processor devoted to network I/O. Or, at the very least, they migrate to the last level cache that is shared by all the processors on the same NUMA node.&lt;/FONT&gt;&lt;/P&gt;
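&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The connection-to-CPU mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rss_cpu and the RSS_CPUS list are hypothetical, and CRC32 stands in for the hash. As I understand the NDIS documentation, real RSS implementations compute a Toeplitz hash over the connection fields and look the result up in an indirection table. The point of the sketch is simply that hashing the connection 4-tuple sends every packet of one TCP connection to the same CPU.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import zlib

# Hypothetical set of CPUs enabled for network interrupts:
# CPU 0 on node A and CPU 1 on node B, as in the figure.
RSS_CPUS = [0, 1]


def rss_cpu(src_ip, src_port, dst_ip, dst_port, cpus=RSS_CPUS):
    """Map a TCP connection's 4-tuple to one CPU from the RSS set."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode("ascii")
    # CRC32 is a stand-in hash chosen for brevity, not what RSS hardware uses.
    return cpus[zlib.crc32(key) % len(cpus)]


# Every packet of a given connection hashes to the same CPU.
first = rss_cpu("10.0.0.5", 49152, "10.0.0.1", 80)
second = rss_cpu("10.0.0.5", 49152, "10.0.0.1", 80)
print(first == second)
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Because the mapping depends only on the connection identifiers, the processor that handled earlier packets of a connection is also the one whose caches are most likely to hold that connection’s state.&lt;/FONT&gt;&lt;/P&gt;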
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Ultimately, of course, the TCP layer hands data from the network I/O to an application layer that is ready to receive and process it. The implications of RSS for the application threads that process TCP receive packets and build responses for TCP/IP to send back to network clients ought to be obvious, but I will spell them out anyway. For optimal performance, these application processing threads also need to be directed to run on the same NUMA node where the TCP Receive packet was processed. This localization of the application’s threads should, of course, be subject to other load balancing considerations to prevent the ideal node from becoming severely over-committed while other CPUs on other nodes are idling or under-utilized. The performance penalty for an application thread that must run on a different node than the one that processed the original TCP/IP Receive packet is considerable because it must access the data payload of the request remotely. Networked applications need to understand these performance and capacity considerations and schedule their threads accordingly to balance the work across NUMA nodes optimally.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Consider the ASP.NET application threads that process incoming HTTP Requests and generate HTTP Response messages. If the HTTP Request packet is processed by CPU 0 on node A in a NUMA machine, the Request packet payload is allocated in node A local memory. The ASP.NET application thread running in User mode that processes that incoming HTTP Request will run much more efficiently if it is scheduled to run on one of the other processors on node A, where it can access the payload and build the Response message using local node memory. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;There is currently no mechanism in Windows for kernel mode drivers like ndis.sys and http.sys to communicate to the application layers above them and specify the NUMA node on which a packet was originally processed. Communicating that information to the application layer is another grievous violation of the principle of isolation in the network protocol stack, but it is a necessary step to improve the performance of networking applications in the many-core era where even moderately sized server machines have NUMA characteristics.&lt;/FONT&gt;&lt;/P&gt;
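&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;If an application could learn which node processed a request, the placement policy discussed above (stay local unless the local node is over-committed) might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function pick_node, its inputs, and the max_per_cpu threshold are hypothetical, and it is not a description of any Windows or ASP.NET scheduler.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def pick_node(receive_node, runnable_per_node, cpus_per_node, max_per_cpu=2.0):
    """Choose the NUMA node for the thread that will process a request.

    receive_node: node where the TCP Receive packet was processed
        (hypothetical input; assumes the application can learn it).
    runnable_per_node: dict of node name to count of runnable threads.
    cpus_per_node: dict of node name to number of CPUs on that node.
    max_per_cpu: assumed threshold of runnable threads per CPU above
        which a node is treated as over-committed.
    """
    def load(node):
        return runnable_per_node[node] / cpus_per_node[node]

    # Prefer the local node so the payload is read from local memory.
    if max_per_cpu >= load(receive_node):
        return receive_node
    # Local node is over-committed: pick the least-loaded node and
    # accept the cost of remote memory access.
    return min(runnable_per_node, key=load)


cpus = {"A": 4, "B": 4}
print(pick_node("A", {"A": 4, "B": 1}, cpus))   # local node has headroom
print(pick_node("A", {"A": 12, "B": 1}, cpus))  # local node over-committed
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The single threshold is a deliberate simplification. A real policy would also have to weigh how much remote memory access costs against how long the request would otherwise wait on the busy node.&lt;/FONT&gt;&lt;/P&gt;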
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;BR&gt;&lt;FONT face=Cambria color=#4f81bd size=2&gt;Links.&lt;/FONT&gt;&lt;/H3&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb’s Journal, March 1, 2005. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.ddj.com/architect/184405990" mce_href="http://www.ddj.com/architect/184405990"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://www.ddj.com/architect/184405990&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;NTttcp performance testing tool: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.microsoft.com/whdc/device/network/TCP_tool.mspx" mce_href="http://www.microsoft.com/whdc/device/network/TCP_tool.mspx"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://www.microsoft.com/whdc/device/network/TCP_tool.mspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Windows Performance Toolkit (WPT, aka xperf): &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/cc305218.aspx" mce_href="http://msdn.microsoft.com/en-us/library/cc305218.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/cc305218.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: xperfLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;David Kanter, “The Common System Interface: Intel's Future Interconnect,” &lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print" mce_href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: TheCommonSystemInterface"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Windows NUMA support: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/aa363804.aspx" mce_href="http://msdn.microsoft.com/en-us/library/aa363804.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/aa363804.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-bookmark: WindowsNUMAsupport"&gt;&lt;/SPAN&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Intel white paper: &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.intel.com/technology/ioacceleration/306484.pdf" mce_href="http://www.intel.com/technology/ioacceleration/306484.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;Accelerating High-Speed Networking with Intel® I/O Acceleration Technology&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Mark B. Friedman, “&lt;/FONT&gt;&lt;A class="" title=SANCapacityPlanningLink name=SANCapacityPlanningLink&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf" mce_href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;An Introduction to SAN Capacity Planning&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: SANCapacityPlanningLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;,” &lt;I style="mso-bidi-font-style: normal"&gt;Proceedings&lt;/I&gt;, Computer Measurement Group, Dec. 2001.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Jeffrey Mogul’s “TCP offload is a dumb idea whose time has come,” &lt;I style="mso-bidi-font-style: normal"&gt;Proceedings&lt;/I&gt; of the 9th conference on Hot Topics in Operating Systems - Volume 9, 2003. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748" mce_href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="mso-bookmark: TCPOffloadDumbIdeaLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: IOATwhitepaper"&gt;&lt;/SPAN&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Dell Computer Corporation, “&lt;/FONT&gt;&lt;A class="" title=DellTCPOffloadwhitepaper name=DellTCPOffloadwhitepaper&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf" mce_href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;Boosting Data Transfer with TCP Offload Engine Technology&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: DellTCPOffloadwhitepaper"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;.”&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, KB 951037, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://support.microsoft.com/kb/951037" mce_href="http://support.microsoft.com/kb/951037"&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://support.microsoft.com/kb/951037&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt; &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, Windows Driver Development Kit (DDK) documentation, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/cc264906.aspx" mce_href="http://msdn.microsoft.com/en-us/library/cc264906.aspx"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://msdn.microsoft.com/en-us/library/cc264906.aspx&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation, KB 927168, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://support.microsoft.com/kb/927168" mce_href="http://support.microsoft.com/kb/927168"&gt;&lt;FONT color=#0000ff&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;http://support.microsoft.com/kb/927168&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt"&gt;&lt;FONT face=Calibri&gt;Microsoft Corporation,&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;NDIS 6.0 Receive-Side Scaling documentation, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/ms795609.aspx" mce_href="http://msdn.microsoft.com/en-us/library/ms795609.aspx"&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;SPAN style="mso-bidi-font-size: 9.0pt; mso-ascii-font-family: Calibri; mso-hansi-font-family: Calibri; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT face=Calibri color=#0000ff&gt;http://msdn.microsoft.com/en-us/library/ms795609.aspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="mso-bookmark: NDIS6ReceiveSideScaling"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: KB927168"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: WindowsDDKLink"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: KB951037"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bookmark: ReskitPerfGuidebook"&gt;&lt;/SPAN&gt;&lt;SPAN style="mso-bidi-font-size: 
9.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8957878" width="1" height="1"&gt;</description><enclosure url="http://cid-12a53f90793d2c8b.skydrive.live.com/self.aspx/DDPEBlogImages/Presentations%20and%20Papers/Mainstream%20NUMA%20and%20the%20TCP%20|5CMG%20paper%208220%20draft|6.docx" length="24241" type="text/html; charset=utf-8" /><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+Engineering/default.aspx">Performance Engineering</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category></item><item><title>Mainstream NUMA and the TCP/IP stack, Part IV: Parallelizing TCP/IP</title><link>http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx</link><pubDate>Tue, 09 Sep 2008 02:35:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8935373</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8935373.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8935373</wfw:commentRss><description>&lt;FONT face=Calibri&gt;
&lt;P&gt;This is a continuation of Part III of this article posted &lt;A class="" title=Link-back-to-Part3 href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx"&gt;&lt;FONT color=#666666&gt;here&lt;/FONT&gt;&lt;/A&gt;.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In the many-core era, the host processor overhead associated with processing TCP/IP interrupts is not a capacity problem, since CPU cycles on the host computer are plentiful and becoming more plentiful all the time. The problem is that the individual processors themselves are not fast enough, nor are they growing much faster. To craft a solution that works in the many-core era, there is a clear need to enhance the hardware and software in the TCP/IP protocol stack to run in parallel across multiple processors and take advantage of the available capacity. There are two hardware and software technologies that are associated with that capability today:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Extended Message-Signaled Interrupts (MSI-X): a hardware technology that allows the NIC to support multiple interrupt vectors, enabling multiple processor cores to handle interrupts from the NIC simultaneously.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Receive-Side Scaling (RSS): the protocol used in the NDIS driver software to manage multiple interrupt vectors and communicate to the hardware to ensure that session-oriented TCP packets are delivered in sequence to a processor-specific interrupt queue. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;MSI-X and RSS work together to allow the processing of TCP/IP Receive packets to scale in parallel across multiple processor cores.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Extended Message-Signaled Interrupts (MSI-X)&lt;/I&gt;. MSI-X is an architectural change that allows a device to send interrupts to be processed on multiple CPUs. Historically, on the Intel architecture, devices were limited to sending interrupts to a single target. Concentrating all hardware interrupts on a single processor boosts the instruction execution rate of the Interrupt Service Routine (ISR) by increasing the chances of a processor cache warm start. In the many-core era, however, restricting a device to interrupting a single processor is a severe capacity constraint. MSI-X allows NICs to scale on many-core processors. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;One key feature of Windows’ support for MSI-X devices is the ability to specify a policy that automatically assigns MSI-X interrupts to CPUs based on the OS’s understanding of the underlying NUMA topology of the machine. An NDIS driver that supports MSI-X devices can specify an &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors&lt;/I&gt; policy that automatically distributes interrupts across an optimal set of eligible processors. On some NUMA machines, the performance of the device connection is affected by the underlying topology of the multi-node connections. For instance, certain device-to-processor node connections may be low latency local ones, while others are higher latency remote connections. For performance reasons, you want NIC interrupts to be processed on nodes that are connected locally and access local memory on that node exclusively. For optimal scalability, you then want to balance device interrupts across all the NUMA nodes that are interconnected. The &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors &lt;/I&gt;policy understands these performance considerations, and distributes the device interrupts to the right set of processors automatically.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Figure 6 illustrates one way the &lt;I style="mso-bidi-font-style: normal"&gt;IrqPolicySpreadMessagesAcrossAllProcessors&lt;/I&gt; policy could be used to distribute interrupts from the NIC across nodes in a simple NUMA machine. A server with two quad-core sockets is shown, with each socket connected to a block of local RAM. Memory accesses from a processor core to local RAM are considerably faster than an access to remote memory attached via a bridge to the other multi-core socket. An optimal configuration is to process TCP/IP interrupts on CPU 0 on the first node and on CPU 1 on the second node, as depicted, balancing the networking I/O load across nodes. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;IMG title="NUMA machine with two RSS queues" style="WIDTH: 364px; HEIGHT: 633px" height=633 alt="NUMA machine with two RSS queues" src="http://5l3vgw.bay.livefilestore.com/y1plB-YgF_mL03SwgvEEnctbltcS7spfhLdNbX9F-mfjqSgfnbCfwdqGeSijVe4EpzZB3v3PL7MFqc/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg" width=364 mce_src="http://5l3vgw.bay.livefilestore.com/y1plB-YgF_mL03SwgvEEnctbltcS7spfhLdNbX9F-mfjqSgfnbCfwdqGeSijVe4EpzZB3v3PL7MFqc/Simple%20Two%20Node%20NUMA%20Server%20with%20two%20RSS%20Queues%20(vertical%20orientation).jpg"&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt;Figure 6.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt; Two NUMA nodes in a Windows Server machine configured to use MSI-X and RSS to process TCP/IP Receive packets across multiple processors.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: black; LINE-HEIGHT: 115%"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;While Receive-Side Scaling (RSS) does not require MSI-X, the two technologies normally go hand-in-hand. We restrict the RSS discussion here to the manner in which MSI-X devices are supported, which is both the simplest and most common case.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;&lt;I style="mso-bidi-font-style: normal"&gt;&lt;SPAN style="COLOR: black"&gt;Receive-Side Scaling (RSS)&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN style="COLOR: black"&gt;. &lt;/SPAN&gt;RSS complements the Windows support for MSI-X. It allows the workload associated with processing network interrupts to be spread across multiple CPUs. With RSS, the DPC routine, which as we have seen performs the bulk of the host processing, is scheduled to run on the same CPU where the interrupt service routine (ISR) just ran. Concentrating all the work associated with network interrupt processing on the same CPU improves instruction execution rates because data associated with the packet is likely to remain in the processor caches. It also dramatically reduces delays spent in unproductive spin lock code associated with serialization. Optimistic, non-blocking per-processor locking strategies are effective under these circumstances. By default under RSS, even the Send processing associated with an ACK message is performed on the same CPU that processed the Receive, for the same performance reasons.&lt;SPAN style="COLOR: black"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Distributing network interrupts across multiple CPUs introduces one complication, however, that RSS is forced to address. If packets are distributed randomly across multiple CPUs, this can conflict with the important function of the TCP protocol that guarantees delivery of data in sequence to the application. Suppose packets for a group of TCP connections are processed across two CPUs, one lightly loaded and the other heavily loaded. Newer packets received on the lightly loaded CPU could easily be processed before older packets still queued on the heavily loaded CPU. Receiving packets out of order in TCP can trigger Fast Retransmits, for example, which could degrade the network and delay the application, not to mention the serialization delays incurred before TCP can safely notify the application layer that Request data is available for processing.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Given this complication, RSS distributes connections, not individual packets. RSS has a mechanism that sends all the packets associated with any one TCP connection to the same processor. This preserves the order of delivery of received data packets, which avoids needless requests for TCP retransmits. Crucially, the processor associated with the specific connection must be communicated to the NIC, which must arrange Received packets into the correct message queues accordingly, prior to signaling the host processor by raising an interrupt. This coordination, of course, is another violation of the isolation principle of the layered networking stack. It is worth noting that nasty side effects can arise as a result of this willful violation of the layered networking architecture; see, for example, &lt;/FONT&gt;&lt;A href="http://support.microsoft.com/kb/927168"&gt;&lt;FONT color=#0000ff size=2&gt;KB927168&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=2&gt; &lt;/FONT&gt;&lt;FONT size=2&gt;documenting a conflict between RSS and Internet Connection Sharing on Vista that was later fixed in WS2008 and Vista SP1.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;To achieve good performance, however, it is absolutely necessary for the NIC to schedule all the packets for the same TCP session to the same host processor. It can only do that by peeking into the TCP header and finding the port indicator, which it then uses to calculate the right CPU to deliver the packet to. This calculation is based on a hash table that is passed to the NIC by the NDIS driver software. RSS even includes a capability to dynamically adjust the load across the CPUs that are enabled for processing NIC interrupts. The protocol stack in Windows can re-balance the interrupt load by modifying the hashing table passed to the NIC that is used in determining the proper CPU. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;This mechanism can be used in case some CPUs remain overloaded for an extended period of time because, for example, some TCP connections are more chatty and persistent than others. &lt;/FONT&gt;&lt;/P&gt;
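&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;The hash-based CPU selection and re-balancing described above can be illustrated with a minimal Python sketch. It is not the NDIS implementation. RSS-capable hardware computes a Toeplitz hash over the connection's addresses and ports, while this sketch substitutes zlib.crc32 as a stand-in. The function names, the 128-entry table size and the four-CPU set are all assumptions made for illustration.&lt;/FONT&gt;&lt;/P&gt;

```python
import zlib

TABLE_SIZE = 128  # assumed number of table entries, for illustration only


def flow_hash(src_ip, src_port, dst_ip, dst_port):
    """Hash the fields that identify one TCP connection.

    Stand-in hash: real RSS hardware uses a Toeplitz hash, not CRC32.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key)


def build_table(cpus, size=TABLE_SIZE):
    """Spread the table entries round-robin over the RSS-enabled CPUs."""
    return [cpus[i % len(cpus)] for i in range(size)]


def cpu_for_packet(table, src_ip, src_port, dst_ip, dst_port):
    """Every packet of a connection hashes to the same entry, hence the same CPU."""
    return table[flow_hash(src_ip, src_port, dst_ip, dst_port) % len(table)]


def rebalance(table, busy_cpu, idle_cpu, share=2):
    """Return a new table with every share-th entry of busy_cpu moved to idle_cpu."""
    new_table = list(table)
    seen = 0
    for index, cpu in enumerate(table):
        if cpu == busy_cpu:
            if seen % share == 0:
                new_table[index] = idle_cpu
            seen += 1
    return new_table


if __name__ == "__main__":
    table = build_table([0, 1, 2, 3])
    flow = ("10.0.0.5", 49152, "10.0.0.9", 80)
    print("CPU before re-balancing:", cpu_for_packet(table, *flow))
    table = rebalance(table, busy_cpu=0, idle_cpu=3)
    print("CPU after re-balancing: ", cpu_for_packet(table, *flow))
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Because the lookup depends only on the connection's identifying fields, packets from one connection never spread across CPUs. Re-balancing changes table entries rather than the hash, so a connection moves as a unit when its entry is reassigned.&lt;/FONT&gt;&lt;/P&gt;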
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;Speaking of maintaining a balanced system, long-running tasks such as large file copies associated with a single FTP, SMB or media server session present inherent difficulties under RSS. The general problem is that the throughput of any one session is ultimately limited by the speed of the single host processor that handles it. With many-core processors, it is important to figure out how to use parallel data divide-and-conquer techniques to break long serial operations into smaller sub-tasks that can be executed concurrently. Providing the capability to spread long, data-intensive operations across multiple TCP sessions is one possible approach.&lt;/FONT&gt;&lt;/P&gt;
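&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;As a rough illustration of the divide-and-conquer idea, the Python sketch below splits one long transfer into contiguous byte ranges and processes the ranges concurrently. No real networking is involved: copying a slice of an in-memory buffer stands in for moving one range over its own TCP session. The helper names and the session count are hypothetical.&lt;/FONT&gt;&lt;/P&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_bytes, sessions):
    """Divide [0, total_bytes) into contiguous, nearly equal (start, end) ranges."""
    bounds = [total_bytes * i // sessions for i in range(sessions + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def transfer_range(data, byte_range):
    """Stand-in for sending one range over its own session: copy the slice."""
    start, end = byte_range
    return bytes(data[start:end])


def parallel_transfer(data, sessions=4):
    """Move each range on its own worker, then reassemble the pieces in order."""
    ranges = split_ranges(len(data), sessions)
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        # map() yields results in the order of the input ranges,
        # so the pieces can simply be joined back together.
        pieces = pool.map(lambda r: transfer_range(data, r), ranges)
        return b"".join(pieces)


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    assert parallel_transfer(payload, sessions=4) == payload
    print(split_ranges(len(payload), 4))
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;With separate sessions, each range becomes a distinct connection that RSS can assign to a different CPU, so no single processor bounds the whole transfer.&lt;/FONT&gt;&lt;/P&gt;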
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;For further technical details on RSS, see &lt;A href="http://msdn.microsoft.com/en-us/library/ms795609.aspx"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-ascii-font-family: Calibri; mso-hansi-font-family: Calibri; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff&gt;http://msdn.microsoft.com/en-us/library/ms795609.aspx&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;. One interesting aspect of the RSS specification is that the DPC, not the ISR, is responsible for re-enabling the processor for more interrupts from the NIC. This prevents the NIC from sending any more Receive packets to the processor until the previous set has been completely processed. This effectively acts as both a serialization mechanism and a form of interrupt moderation that adaptively adjusts the delay between interrupts based on the specific processing load at the CPU.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT size=2&gt;This blog entry is continued &lt;A class="" title="Link to Part V" href="http://blogs.msdn.com/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/18/mainstream-numa-and-the-tcp-ip-stack-final-thoughts.aspx"&gt;here&lt;/A&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8935373" width="1" height="1"&gt;</description></item><item><title>Performance improvements in Service Pack 1 for VS 2008 and .NET FX 3.5</title><link>http://blogs.msdn.com/ddperf/archive/2008/08/13/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx</link><pubDate>Wed, 13 Aug 2008 14:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8860575</guid><dc:creator>David Berg</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8860575.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8860575</wfw:commentRss><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;We just announced the release of &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/vstudio/products/cc533447.aspx" mce_href="http://msdn.microsoft.com/en-us/vstudio/products/cc533447.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Service Pack 1 for VS 2008 and .NET FX 3.5&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;A major push for this release was continuing to enhance performance and reliability, as Soma noted in his &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/somasegar/archive/2008/08/11/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx" mce_href="http://blogs.msdn.com/somasegar/archive/2008/08/11/service-pack-1-for-vs-2008-and-net-fx-3-5-released.aspx"&gt;&lt;FONT face=Calibri size=3&gt;most recent blog entry&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;. 
&lt;/FONT&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;I want to take a minute to drill into the major performance improvements you will find in this release of Visual Studio.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Framework Performance Enhancements&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.NET FX (CLR):&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;New .NET Framework Client Profile - a smaller .NET Framework redist optimized for .NET client applications.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;The new redist weighs in at around 28 MB, enabling a smaller, faster, more reliable installation experience for .NET client applications on machines that do not already have the .NET Framework installed. The framework was refactored so that it now includes system core libraries and components (including LINQ), language support, XML, Windows Forms, WPF, Deployment, Web Services remoting and serialization, data access, and a few others.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;See the &lt;/FONT&gt;&lt;FONT face=Calibri size=3&gt;BCL Team &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bclteam/archive/2008/05/21/net-framework-client-profile-justin-van-patten.aspx" mce_href="http://blogs.msdn.com/bclteam/archive/2008/05/21/net-framework-client-profile-justin-van-patten.aspx"&gt;&lt;SPAN style="mso-comment-continuation: 2"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN class=MsoCommentReference&gt;&lt;SPAN style="FONT-SIZE: 8pt; LINE-HEIGHT: 115%"&gt;&lt;SPAN style="mso-special-character: comment"&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; for the full list and more 
details.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Client applications should also see an improvement in cold startup scenarios, especially for graphics-rich WPF-based apps. We also improved the working set of Ngen’d images, which further helps cold startup scenarios.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Support for Address Space Layout Randomization (ASLR) on Vista and WS 2008.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;ASLR uses fast kernel mode virtual base address relocation to improve both memory layout and security.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;WPF:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Cold startup up to 40% faster, depending on the scenario and application size, without the need to modify any of your code.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Additional support for text and graphics to deliver better performance. For example, effects like DropShadow and Blur were initially implemented using software rendering; with SP1 these are now implemented using hardware acceleration.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Faster text rendering, mostly when used in specific scenarios such as VisualBrushes, DrawingBrushes, and Viewport2DVisual3D.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Scrolling improvements with Container Recycling.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved working set using TreeView virtualization &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;A much improved WriteableBitmap that enables real-time bitmap updates from a software surface.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Jossef Goldberg’s &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jgoldb/default.aspx" mce_href="http://blogs.msdn.com/jgoldb/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; is a great source of information on WPF performance tips and tricks.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;His detailed list of SP1 performance improvements is posted &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/jgoldb/archive/2008/05/15/what-s-new-for-performance-in-wpf-in-net-3-5-sp1.aspx" mce_href="http://blogs.msdn.com/jgoldb/archive/2008/05/15/what-s-new-for-performance-in-wpf-in-net-3-5-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;WCF:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Support for asynchronous HTTP module/handlers on IIS 7.0.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Supports better thread management and improved throughput for systems with heavy backend processing requirements. (See Wenlong’s &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/wenlong/archive/2008/08/13/orcas-sp1-improvement-asynchronous-wcf-http-module-handler-for-iis7-for-better-server-scalability.aspx" mce_href="http://blogs.msdn.com/wenlong/archive/2008/08/13/orcas-sp1-improvement-asynchronous-wcf-http-module-handler-for-iis7-for-better-server-scalability.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; for the technical details.)&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Windows Forms:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;General performance improvements, mostly due to underlying improvements in the CLR.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Data handling:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved throughput in ADO.NET scenarios (2x+ requests/second for some scenarios).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Performance improvements in XLINQ over XML containing many small elements.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Visual Studio Performance Enhancements&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual Web Developer:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Typing performance in the designer is improved 100x in complex pages (especially those with the MultiView control).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Fixed the issues with &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/webdevtools/archive/2008/06/18/faster-switch-to-design-view-in-vs-2008-sp1-rtm.aspx" mce_href="http://blogs.msdn.com/webdevtools/archive/2008/06/18/faster-switch-to-design-view-in-vs-2008-sp1-rtm.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Switching to Design View&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Web Sites is up to 10x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Building Web Sites is up to 3x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Web Forms is up to 2x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;General performance improvements in startup and shutdown.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Plus lots of new features and fixes (see the &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/webdevtools/archive/2008/08/11/web-development-updates-in-visual-studio-2008-sp1.aspx" mce_href="http://blogs.msdn.com/webdevtools/archive/2008/08/11/web-development-updates-in-visual-studio-2008-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;team blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual Basic .NET:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Performance improvements in Intellisense and listing errors.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improvements in compiler and build throughput (most notably for projects with large amounts of XML comments in a single file).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Visual C#:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Up to 2x improvements in bringing up Intellisense with a large number of types.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;XAML Editing:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraph style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved designer startup and form load time.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Debugging:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improvements in symbol and source downloading and the ability to cancel out of symbol download from a slow symbol server.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Fix to a performance problem in the debugger when you are stepping through source code that is downloaded from Microsoft Reference Source Server that was caused by downloading the source files again for each breakpoint. Previously released as &lt;/FONT&gt;&lt;A href="http://support.microsoft.com/kb/944899" mce_href="http://support.microsoft.com/kb/944899"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;KB944899&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;. &lt;SPAN style="COLOR: red"&gt;(Please &lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en" mce_href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en"&gt;uninstall this KB&lt;/A&gt; before installing the SP.)&lt;/SPAN&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;U&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;XML Editing:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Loading XML is up to 3x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved editing performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Team Foundation Server:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;In this Service Pack, a large part of the focus was to improve the performance and scalability of Team Foundation Server. Key changes include faster synchronization with Active Directory, improved check-in concurrency, a faster way to create source tree branches, online index rebuilding for less maintenance downtime and better support for checking very large sets of code. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;TFS also increased the number of projects a server can support. You should see better scalability on the server, as well as a better client experience when connecting to a server with a large number of projects on it.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Opening Source Controlled Solutions is up to 2.5x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Deleting files is up to 2x faster!&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved Work Item performance (loading, saving, querying).&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved UI navigation performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved performance working with TFS work items in Excel and Project.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Improved performance and reliability of the Visual SourceSafe migration tool.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo1; tab-stops: list .5in"&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-bidi-font-size: 11.0pt"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;See Brian Harry’s blog for more about the &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bharry/archive/2008/08/11/vs-vsts-tfs-net-3-5-sp1-is-shipping.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2008/08/11/vs-vsts-tfs-net-3-5-sp1-is-shipping.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Service Pack Release&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; and &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/bharry/archive/2008/04/28/team-foundation-server-2008-sp1.aspx" mce_href="http://blogs.msdn.com/bharry/archive/2008/04/28/team-foundation-server-2008-sp1.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Team Foundation Server improvements&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; and scalability.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;H1 style="MARGIN: 24pt 0in 0pt"&gt;&lt;FONT size=5&gt;&lt;FONT color=#365f91&gt;&lt;FONT face=Cambria&gt;Other:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;And, of course, there are lots of new features, including the new ADO.NET Entity Framework, ADO.NET Data Services, support for SQL Server 2008’s new features, updated components for Visual Basic and Visual C++ (including an MFC-based Office 2007 Ribbon), and new designer capabilities that improve performance indirectly by improving developer productivity.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Some of these performance fixes were previously released as hot fixes (see our &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/05/12/vs2008-sp1-and-net-fx-beta-performance-improvements.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/05/12/vs2008-sp1-and-net-fx-beta-performance-improvements.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog on the beta&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you installed any of those hot fixes, you may need to &lt;/FONT&gt;&lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en" mce_href="http://www.microsoft.com/downloads/details.aspx?FamilyId=A494B0E0-EB07-4FF1-A21C-A4663E456D9D&amp;amp;displaylang=en"&gt;&lt;FONT face=Calibri size=3&gt;remove them&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt; before installing the Service Pack.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;See the release notes on the download page for more information.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Should you encounter any performance problems we’ve missed, please continue to let us know here on the blog or by e-mail to &lt;/FONT&gt;&lt;A href="mailto:devperf@Microsoft.com" mce_href="mailto:devperf@Microsoft.com"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;devperf@Microsoft.com&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8860575" width="1" height="1"&gt;</description></item><item><title>Mainstream NUMA and the TCP/IP stack, Part III: A look back at older strategies to scale high-speed networking</title><link>http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx</link><pubDate>Wed, 06 Aug 2008 02:04:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8835243</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8835243.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8835243</wfw:commentRss><description>&lt;FONT face=Calibri&gt;
&lt;P&gt;This is a continuation of Part II of this article posted &lt;A class="" title=Link-back-to-Part2 href="http://blogs.msdn.com/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;/FONT&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;By necessity, both the hardware and the software devoted to processing network traffic need to evolve in the many-core era to become multiprocessor-oriented. On servers that have NUMA architectures, that multiprocessing support needs to acquire a NUMA flavoring. The technology that allows network interrupts to be processed concurrently across multiple processors includes support for &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;multiple Descriptor Queues in the networking hardware, &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;Extended Message Signaled Interrupts (MSI-X) to allow hardware interrupts to be serviced concurrently on more than one processor, and&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri&gt;the software support in Windows known as Receive-Side Scaling (or RSS). &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;With ccNUMA architecture machines becoming more mainstream, it is clear that multi-processor support should also include being NUMA-aware. &lt;/FONT&gt;&lt;/P&gt;
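&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To make the idea behind Receive-Side Scaling a little more concrete, here is a minimal sketch of how a connection might be steered to a processor: hash the connection's 4-tuple, then look the result up in an indirection table of CPU numbers. This is an illustration only, not the Windows implementation. The function name rss_cpu, the table contents and the sample addresses are all hypothetical, and CRC32 is used purely as a stand-in for the Toeplitz hash that RSS-capable hardware actually computes.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import zlib


def rss_cpu(src_ip, src_port, dst_ip, dst_port, indirection_table):
    """Pick a CPU for a TCP connection from its 4-tuple (illustrative only)."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    # Stand-in hash: real RSS hardware uses a Toeplitz hash, not CRC32.
    bucket = zlib.crc32(key) % len(indirection_table)
    return indirection_table[bucket]


# Hypothetical indirection table spreading 8 hash buckets over 4 CPUs.
TABLE = [0, 1, 2, 3, 0, 1, 2, 3]

# Six made-up connections from one client to one server port.
flows = [("10.0.0.1", 40000 + i, "10.0.0.9", 80) for i in range(6)]
for flow in flows:
    print(flow, "goes to CPU", rss_cpu(*flow, TABLE))
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Because the CPU is derived only from the 4-tuple, every packet of a given connection lands on the same processor, while different connections can be serviced in parallel.&lt;/FONT&gt;&lt;/P&gt;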
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;For an idea of how fast the NICs are getting, a typical 1 Gb Ethernet card supports 4 transmit and 4 receive interrupt queues and can spread interrupts across as many as 4 host processors for load balancing under RSS. A 10 Gb Ethernet card necessarily supports even higher levels of parallelism. For example, the dual-ported Intel 82598 10 Gigabit Ethernet Controller provides 32 transmit queues and 64 receive queues per port, which can be mapped to a maximum of 16 processor cores. Note that this increase in parallel processing capacity is only a 4x improvement over recent 1 Gb Ethernet cards, which is probably inadequate to exploit the increased bandwidth fully.&lt;/FONT&gt;&lt;/P&gt;
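&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The arithmetic behind that last remark can be checked with a few lines of Python. The processor counts and line rates are the ones quoted above; the assumption that traffic is spread perfectly evenly across processors is mine and is a best case.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
# Figures quoted in the paragraph above.
gbe1_max_cpus = 4     # 1 Gb card: RSS spreads interrupts over up to 4 processors
gbe10_max_cpus = 16   # Intel 82598: queues map to at most 16 processor cores

parallelism_ratio = gbe10_max_cpus / gbe1_max_cpus

# Line rate each processor must absorb, assuming a perfectly even spread.
per_cpu_1g = 1 / gbe1_max_cpus      # Gb/s per processor on a 1 Gb card
per_cpu_10g = 10 / gbe10_max_cpus   # Gb/s per processor on a 10 Gb card

print("parallelism ratio:", parallelism_ratio)
print("per-CPU load at 1 Gb:", per_cpu_1g, "Gb/s")
print("per-CPU load at 10 Gb:", per_cpu_10g, "Gb/s")
print("per-CPU load increase:", per_cpu_10g / per_cpu_1g)
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Bandwidth grows 10x while the number of processors that can share the work grows only 4x, so at full line rate each processor has to absorb 2.5 times as much traffic as before.&lt;/FONT&gt;&lt;/P&gt;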
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Let’s consider briefly some of the ideas to improve TCP/IP performance that have been implemented in the recent past. The strategies discussed here either increase the efficiency of host computer processing of TCP/IP packets or attempt to offload some of these software functions onto networking hardware. These strategies have proven effective, but they do not offer enough capacity relief to keep pace with the steady advance of networking speeds. The way out of the current mismatch between high-speed networks and the host processing requirements they generate is a parallel processing approach where TCP/IP interrupts are distributed across multiple CPUs. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Interestingly, some of the changes outlined here to streamline TCP/IP host processing fly in the face of a major precedent. The layered architecture of the networking stack is widely regarded as one of the storied accomplishments of software engineering. Several of the efforts to improve the performance of host processing of TCP/IP interrupts involve shattering the strict isolation of components that the layered networking model advocates.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In theory, at least, the layered approach simplifies complex software that is designed to function smoothly in diverse environments. Layering also supports development of components that can proceed independently and in parallel. In principle, each layer defines and adheres to a standard set of services, or &lt;I style="mso-bidi-font-style: normal"&gt;interfaces&lt;/I&gt;, that it provides to the component in the layer immediately above it. An upper layer communicates with a level below it using only this predefined set of abstract interfaces. (The set of services provided and consumed by two adjacent layers, in effect, defines a &lt;I style="mso-bidi-font-style: normal"&gt;contract&lt;/I&gt;.) Furthermore, in the design of the networking protocol, components are isolated. It is an article of faith among software engineers that layered architectures, when properly defined and implemented, greatly contribute to the robustness and reliability of the software built using those design principles.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;
&lt;TABLE class=MsoNormalTable style="BORDER-RIGHT: medium none; BORDER-TOP: medium none; BACKGROUND: #f9f9f9; BORDER-LEFT: medium none; BORDER-BOTTOM: medium none; BORDER-COLLAPSE: collapse; mso-border-alt: solid #AAAAAA .5pt; mso-yfti-tbllook: 1184" cellSpacing=0 cellPadding=0 border=1 class="MsoNormalTable"&gt;
&lt;TBODY&gt;
&lt;TR style="HEIGHT: 18.7pt; mso-yfti-irow: 0; mso-yfti-firstrow: yes"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #aaaaaa 1pt solid; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; WIDTH: 384.9pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; HEIGHT: 18.7pt; mso-border-alt: solid #AAAAAA .5pt" width=513 colSpan=4&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="COLOR: black; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; mso-bidi-font-size: 10.0pt"&gt;&lt;FONT size=3&gt;TCP/IP Layers&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 1"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt"&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Data unit&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt"&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Layer&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Function&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; BACKGROUND: #f2f2f2; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt; TEXT-ALIGN: center" align=center&gt;&lt;B&gt;&lt;SPAN style="FONT-SIZE: 10pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Example&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/B&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 2"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Data&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;5. &lt;/SPAN&gt;&lt;A title="Application layer" href="http://en.wikipedia.org/wiki/Application_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Application&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Application-specific&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN lang=FR style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; mso-ansi-language: FR"&gt;HTTP, SMTP, RPC, SOAP, etc.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 3"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Segment&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;4. &lt;/SPAN&gt;&lt;A title="Transport layer" href="http://en.wikipedia.org/wiki/Transport_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Transport&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;End-to-end connections (sessions) and reliable delivery&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;TCP/ UDP&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 4"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Packet/Datagram&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;3. &lt;/SPAN&gt;&lt;A title="Network layer" href="http://en.wikipedia.org/wiki/Network_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Network&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;A title="Logical address" href="http://en.wikipedia.org/wiki/Logical_address"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Logical addressing&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt; and routing; segmentation &amp;amp; re-assembly&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;IP&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 5"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Frame&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;2. &lt;/SPAN&gt;&lt;A title="Data link layer" href="http://en.wikipedia.org/wiki/Data_link_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Data link&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Physical addressing (MAC)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Ethernet, ATM&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR style="mso-yfti-irow: 6; mso-yfti-lastrow: yes"&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #aaaaaa 1pt solid; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Bit&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;1. &lt;/SPAN&gt;&lt;A title="Physical layer" href="http://en.wikipedia.org/wiki/Physical_layer"&gt;&lt;SPAN style="FONT-SIZE: 9pt; COLOR: windowtext; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; TEXT-DECORATION: none; text-underline: none"&gt;Physical&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 2.4pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 2.4pt; PADDING-BOTTOM: 2.4pt; BORDER-LEFT: #f0f0f0; WIDTH: 148.2pt; PADDING-TOP: 2.4pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=198&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Media, signal and binary transmission&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;
&lt;TD class="" style="BORDER-RIGHT: #aaaaaa 1pt solid; PADDING-RIGHT: 0.75pt; BORDER-TOP: #f0f0f0; PADDING-LEFT: 0.75pt; PADDING-BOTTOM: 0.75pt; BORDER-LEFT: #f0f0f0; WIDTH: 103.5pt; PADDING-TOP: 0.75pt; BORDER-BOTTOM: #aaaaaa 1pt solid; BACKGROUND-COLOR: transparent; mso-border-alt: solid #AAAAAA .5pt; mso-border-top-alt: solid #AAAAAA .5pt; mso-border-left-alt: solid #AAAAAA .5pt" vAlign=top width=138&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;SPAN style="FONT-SIZE: 9pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'"&gt;Optical fiber, coax, twisted pair&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt;Table 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt; The TCP/IP layered networking model.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Calibri&gt;&lt;SPAN style="FONT-SIZE: 10pt; COLOR: black; LINE-HEIGHT: 115%; mso-bidi-font-size: 11.0pt"&gt;&lt;o:p&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Table 4 is a standard representation of the layered networking model used in TCP/IP, which has gained almost universal acceptance in computer-to-computer communications. Take the ubiquitous IP layer, for instance. IP implements a Best Effort service model to deliver packets from one station to another using routing. By design, it is connectionless, session-less and unreliable. Delivery of packets to the correct destination is not guaranteed, but IP does take a “best effort” approach to accomplish this. For applications that require guaranteed delivery, the higher-level TCP Host protocol ensures that packets are delivered reliably and in order to the designated application layer above it. It does this using a session-oriented protocol that preserves the state of the message-passing session between packets. TCP has also evolved complicated, performance-oriented flow and congestion control mechanisms that are beyond the scope of the current discussion.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The layering approach to networking introduces one additional and crucial design constraint. When one station is transferring a message to another, only components at the same level in the protocol stack can exchange data and communicate with each other. For example, only the TCP component in the receiver is supposed to be able to understand and process information placed into the TCP packet header by the sender. However, both the TCP Offload Engine and Receive-Side Scaling utilize knowledge of what is going on in the upper-layer TCP protocol down in the Data Link (or MAC) layer in the receiver, a serious violation of the principle that the layers in the protocol stack remain totally isolated from each other. Apparently, this is a case where serious performance issues trump pure design principles, and the evolution of TCP/IP has always been sensitive to practical issues of scaling. It is not that you absolutely cannot violate the contract that governs the ways layers communicate, but it is something that should be done very thoughtfully so that your once clean interfaces don’t start to look like Swiss cheese. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;A crucial factor that works to encourage breaking with precedent is that the protocols from the TCP layer down to the hardware all adhere very strictly to the standards in order to promote interoperability. This strict compliance has the effect of hardening the services and interfaces between these layers in cement. This rigidity actually reduces the risk of side effects when a lower level component presumptuously usurps a service that architecturally is defined as the responsibility of some higher level.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Another factor is also at work. Within a layer, components that conform to the same layer contract can, in theory, be freely substituted for each other. This principle of the layered approach is supposed to promote development of a profusion of components that implement different sets of services, but still adhere to the strict requirements of the standard. In fact, the need for interoperability severely limits the proliferation of components that can be freely substituted for each other. Ethernet is almost always the hardware used at the bottom of the stack due to its superior cost/performance. Ethernet is always followed by IP, which is then usually followed by TCP. UDP can be freely substituted for TCP at the host processing layer, but only when the TCP services that ensure reliable delivery of packets and flow control can be dispensed with. In practice, there is very little variety among the components you will see operating in every networking protocol stack. TCP/IP over Ethernet, being ubiquitous, achieves the highest possible degree of interconnectivity and interoperability.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;With the TCP/IP stack so pervasive and so stable and dominant, it then becomes possible to think the unthinkable. It becomes difficult to resist the temptation to violate the principle of isolation if you can demonstrate a big enough performance win. Having the Ethernet layer peek into the TCP packet headers and optimize their processing is acceptable when the violation of this sacrosanct principle of layering yields sufficient performance or scalability improvements. &lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt" mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 style="MARGIN: 10pt 0in 0pt"&gt;&lt;FONT face=Cambria color=#4f81bd size=3&gt;Recent performance improvements to TCP/IP host processing. &lt;/FONT&gt;&lt;/H3&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Before we drill into the current set of architectural changes to the networking stack, let’s briefly explore some of the more successful strategies that have been tried in the past for reducing the host computer processing requirements associated with TCP/IP interrupt processing. Stateless processing functions associated with the IP layer were some of the earliest functions identified that could be performed on the NIC to eliminate some amount of host processing. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;These offloaded functions include checksum calculation and segmentation for large Sends, both of which were supported in the Windows 2000 timeframe. Because the IP protocol is stateless and connectionless, there are virtually no side effects to performing these functions on the NIC, even if it does violate the principle of strict isolation between layers in the protocol stack.&lt;/P&gt;
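&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;To make the “stateless” point concrete, here is a minimal sketch of a 16-bit one’s-complement checksum in the style of RFC 1071, the kind of calculation a NIC can take over from the host. The function name and the sample header bytes are illustrative assumptions, not taken from any Windows or NIC driver interface. Notice that the result depends only on the bytes handed in, so no connection state needs to be shared with the host.&lt;/P&gt;
&lt;PRE&gt;
```python
def internet_checksum(data):
    """16-bit one's-complement checksum in the style of RFC 1071 (illustrative sketch)."""
    if len(data) % 2 == 1:
        data = data + b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Combine two bytes into one big-endian 16-bit word and accumulate.
        total += data[i] * 256 + data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total // 65536 != 0:
        total = (total % 65536) + (total // 65536)
    return 65535 - total  # one's complement of the folded sum


# Hypothetical 20-byte IPv4 header with its checksum field set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;A receiver can validate a header by running the same routine over the bytes with the checksum field filled in; a result of zero indicates the header arrived intact.&lt;/P&gt;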
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Another set of performance improvements that have been implemented recently do potentially generate serious side effects that must be handled rather delicately. These include: &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;interrupt moderation&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;jumbo frames&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol"&gt;&lt;SPAN style="mso-list: Ignore"&gt;·&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;TCP Offload engine (TOE)&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;We will drill into these three approaches next, discussing some of the potential side effects, performance trade-offs, and other issues they raise. It is probably also worthwhile to mention NetDMA, which is the Windows support for Intel’s I/O Acceleration Technology (I/OAT), in this context. I/OAT makes targeted improvements in the processor memory architecture to improve the efficiency of NIC-to-memory transfers. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Each of these approaches has worked to a degree, but none has produced enough of a breakthrough in performance to address the underlying condition, the growing mismatch between host processing requirements and network bandwidth. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;As noted earlier, the CPU load associated with processing Ethernet packets with TCP/IP at a server is a long-standing and persistent performance problem that has escaped a satisfactory solution in the past. For many years, the thrust of conventional solutions to the problem was straightforward: use any means possible to reduce the number of interrupts that the host computer needs to process. Two of the more effective approaches to reducing the number of interrupts are to use some form of &lt;I style="mso-bidi-font-style: normal"&gt;interrupt moderation&lt;/I&gt; or so-called &lt;I style="mso-bidi-font-style: normal"&gt;jumbo frames&lt;/I&gt;, basically larger packets than the Ethernet standard supports. Both approaches are effective to a degree, but also have serious built-in limitations and drawbacks.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Interrupt moderation&lt;/I&gt; on the NIC is widely used today to reduce the host interrupt processing rate. It is successful, but only to the extent of addressing the processing load associated with each interrupt, which, as indicated in the protocol overhead measurements discussed earlier, is relatively minor. A NIC that supports interrupt moderation can delay the host interrupt for up to a specified period of time in the hope that the NIC will receive additional network packets to process during the delay. Then, instead of each packet causing an interrupt, the host processor can process multiple packets in a single interrupt. In the measurements reported in Part 1, interrupt moderation was used to cut the host processor interrupt rate in half. When you consider, as we have earlier, the potential rate of network interrupts that a 10Gb Ethernet card can drive, some form of interrupt moderation on the NIC becomes essential for the smooth operation of the host processor.&lt;/P&gt;
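&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The batching effect can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name, the arrival times, and the single fixed hold timer are all hypothetical, and real NICs use more elaborate (often adaptive) moderation schemes. The model starts a timer when the first packet of a batch arrives and delivers every packet that arrives before the timer expires in one interrupt.&lt;/P&gt;
&lt;PRE&gt;
```python
def count_interrupts(arrival_times_us, delay_us):
    """Count host interrupts when the NIC holds each interrupt for delay_us.

    Simplified, hypothetical model: the first packet of a batch starts a timer,
    and every packet arriving before the timer expires shares that interrupt.
    """
    interrupts = 0
    window_end = None  # time at which the pending interrupt fires
    for t in sorted(arrival_times_us):
        if window_end is None or t > window_end:
            interrupts += 1            # this packet starts a new batch
            window_end = t + delay_us  # the interrupt is held until this time
    return interrupts


# Made-up workload: one packet every 10 microseconds.
arrivals = [0, 10, 20, 30, 40, 50, 60, 70]
print(count_interrupts(arrivals, 0))   # no moderation: one interrupt per packet
print(count_interrupts(arrivals, 15))  # a 15 microsecond hold batches packets in pairs
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;In this toy workload the hold timer halves the interrupt count, but each packet may wait up to the full delay before the host sees it, which is the latency cost of the technique.&lt;/P&gt;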
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Interrupt moderation helps, but not enough to relieve the bottleneck at the host CPU. The host processing associated with the TCP/IP protocol appears to scale as a function of &lt;I style="mso-bidi-font-style: normal"&gt;both&lt;/I&gt; the number of interrupts and the amount of data being transferred between the NIC and the host computer. As the average size of data payloads increases, the processing bottleneck shifts to memory latency. See, for example, the bottleneck analysis presented in the Intel white paper “&lt;A href="http://www.intel.com/technology/ioacceleration/306484.pdf"&gt;&lt;FONT color=#0000ff&gt;Accelerating High-Speed Networking with Intel® I/O Acceleration Technology&lt;/FONT&gt;&lt;/A&gt;.” &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Interrupt moderation should be used cautiously in situations where the fastest possible network latency is required, such as two communicating infrastructure servers connected to the same high-speed networking backbone. It also has to be implemented carefully to ensure it does not interfere with the TCP congestion control functions that try to measure round trip time (RTT). &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Jumbo frames&lt;/I&gt;. Sending data across the wire in so-called &lt;I style="mso-bidi-font-style: normal"&gt;jumbo frames&lt;/I&gt; also significantly reduces the number of host interrupts. And there is little question that the size of the Ethernet MTU is sub-optimal for many networking transmission workloads. Consider the relatively large data payloads that routinely need to be transferred between a back-end database machine and the clusters of front-end and middle tier machines in a typical clustered, multi-tier web service application today. Using jumbo frames of, say, 9K payloads on the high speed network backbone linking these servers leads to a 6:1 reduction in the number of host processor interrupts required to transfer sizable blocks of data. When servers are connected to a Storage Area Network (SAN) using iSCSI, even larger frames are desirable.&lt;/P&gt;
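&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The 6:1 figure follows directly from the ratio of the payload sizes. The short calculation below is a back-of-the-envelope sketch that assumes one interrupt per frame, ignores header overhead, and uses a made-up transfer size; the function and constant names are illustrative only.&lt;/P&gt;
&lt;PRE&gt;
```python
STANDARD_PAYLOAD = 1500  # bytes, standard Ethernet MTU
JUMBO_PAYLOAD = 9000     # bytes, the 9K jumbo payload from the example above


def frames_needed(transfer_bytes, payload_bytes):
    """Frames required to carry transfer_bytes, ignoring header overhead."""
    return -(-transfer_bytes // payload_bytes)  # ceiling division


def interrupt_reduction(transfer_bytes):
    """Ratio of standard frames to jumbo frames, assuming one interrupt per frame."""
    standard = frames_needed(transfer_bytes, STANDARD_PAYLOAD)
    jumbo = frames_needed(transfer_bytes, JUMBO_PAYLOAD)
    return standard / jumbo


# Hypothetical 900,000-byte transfer between two back-end servers.
print(frames_needed(900000, STANDARD_PAYLOAD))
print(frames_needed(900000, JUMBO_PAYLOAD))
print(interrupt_reduction(900000))
```
&lt;/PRE&gt;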
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;In fact, jumbo frames appear to be such a simple, effective solution within the confines of the data center that they naturally lead to consideration of what other aspects of the TCP/IP protocol that are sub-optimal in that environment could also be modified. For example, when there is frequent high speed communication between very reliable components, the TCP/IP requirement to acknowledge positively the receipt of every packet is overkill, and it is very tempting to break with the standard and relax that requirement. The superior cost/performance of high speed Ethernet-based networking makes it very tempting to consider as an alternative interconnect technology to use with both SANs and High Performance Computing (HPC) clusters. In both these cases there are alternative linkage technologies that outperform TCP/IP but are also considerably more expensive. For a further discussion of this issue in the context of SAN performance, see “&lt;A href="http://www.demandtech.com/Resources/Papers/Intro%20to%20SAN%20capacity%20planning.pdf"&gt;&lt;FONT color=#0000ff&gt;An Introduction to SAN Capacity Planning&lt;/FONT&gt;&lt;/A&gt;.” And for the HPC flavor of this same discussion, see Jeffrey Mogul’s “&lt;A href="http://portal.acm.org/citation.cfm?id=1251059&amp;amp;dl=ACM&amp;amp;coll=portal&amp;amp;CFID=71988909&amp;amp;CFTOKEN=98964748"&gt;&lt;FONT color=#0000ff&gt;TCP offload is a dumb idea whose time has come&lt;/FONT&gt;&lt;/A&gt;.”&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Unfortunately, using non-standard jumbo frames introduces a significant compatibility problem that severely limits the effectiveness of the solution. The great majority of network clients will reject frames larger than the standard Ethernet MTU of 1500 bytes. In effect, you can send jumbo frames between specific host computers that are equipped to handle them on a dedicated backbone segment readily enough, but you cannot reliably send them to just any machine connected using the IP internetworking layer. So implementing jumbo frames requires more complicated routing schemes. TCP/IP RFC 2923 section 2.1, which is supported in Windows XP SP3, Vista, and Windows Server 2008, allows two TCP peers to negotiate the largest size MTU that can be transmitted between them. But the connectionless and stateless IP routing mechanism means that no single packet transmitted between station A and B need follow the same route twice. Given that the precise route to the destination station is dynamically constructed for each packet, any intermediate router that did not support jumbo frames would reject any non-standard packets it received and prevent successful transmission to the receiver.&lt;/P&gt;
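&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The routing problem can be sketched in a few lines. The model below assumes, as the discussion above does, that any hop simply rejects a frame larger than the MTU it supports; the function names and the per-hop MTU values are hypothetical. A frame gets through only if it fits the smallest MTU on whichever route it happens to take.&lt;/P&gt;
&lt;PRE&gt;
```python
def path_mtu(hop_mtus):
    """Largest frame that every hop on the route will accept."""
    return min(hop_mtus)


def can_deliver(frame_bytes, hop_mtus):
    """True if no hop on this route rejects a frame of frame_bytes."""
    return path_mtu(hop_mtus) >= frame_bytes


# Two hypothetical routes between the same pair of stations.
backbone_route = [9000, 9000, 9000]  # every hop supports jumbo frames
mixed_route = [9000, 1500, 9000]     # one intermediate router is standard-only

print(can_deliver(9000, backbone_route))
print(can_deliver(9000, mixed_route))
print(can_deliver(1500, mixed_route))
```
&lt;/PRE&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Because each packet may take a different route, a jumbo frame that succeeds on one transmission can fail on the next unless every possible route supports the larger size.&lt;/P&gt;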
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;TCP Offload Engine&lt;/I&gt;. A TCP Offload Engine (or TOE) is another solution that has been implemented to reduce the host processing required for TCP/IP interrupts. As the name suggests, in this approach, certain TCP/IP protocol functions are performed directly on the NIC, either reducing the amount of processing that must be performed in the host machine, or eliminating host interrupts associated with certain TCP/IP housekeeping operations entirely. Areas where significant performance gains are experienced with TOE include the elimination of expensive memory copy operations, offloading segmentation and reassembly (a function of the IP layer), and offloading some of the TCP housekeeping functions that ensure reliable connections (mainly, ACK processing and TCP retransmission timers). Moving these functions onto the NIC results in a reduction of the total number of interrupts that need to be processed by the host machine. Potential performance benefits associated with TOE are quantified here: “&lt;A href="http://www.dell.com/downloads/global/vectors/ps3q06-20060132-Broad_com.pdf"&gt;&lt;FONT color=#0000ff&gt;Boosting Data Transfer with TCP Offload Engine Technology&lt;/FONT&gt;&lt;/A&gt;.” You can see that the potential CPU savings are considerable. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;TCP Offload Engine, however, is a grievous violation of the layered architecture of the networking protocol stack. The TCP Chimney Offload feature that provides TOE support in Windows, for example, required an extensive re-architecture of the TCP/IP stack. TOE introduces many breaking changes. See the Knowledge Base article entitled “&lt;A href="http://support.microsoft.com/kb/951037"&gt;&lt;FONT color=#0000ff&gt;Information about the TCP Chimney Offload feature in Windows Server 2008&lt;/FONT&gt;&lt;/A&gt;,” which details the many limitations on which networking functions can and can’t safely be offloaded in which computing environments. For instance, any networked machine that enforces an IPsec-based security policy where it is necessary to inspect each individual packet cannot use TOE. Neither is TOE currently compatible with either server virtualization technology or common forms of clustering based on virtual IP addresses. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;The modest benefits in many environments and the complexities introduced due to explicit violations of the layered model of the network protocol argue against a general TOE solution. Another strong criticism of the TOE approach is that it merely moves the bottleneck from the host processor to the NIC. As RSS penetrates the market for high-speed networking, I believe that interest in the TOE approach will wane. If you do have a processing bottleneck on the host machine as a result of high-speed networking, with an RSS solution, at least, the bottleneck is visible, and there are inexpensive mechanisms to help deal with it. A processing bottleneck on the NIC is opaque and resists any capacity solution other than to swap in a more expensive card, assuming one exists, and hope that the new one is significantly faster and more powerful than the old one. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;Intel I/OAT&lt;/I&gt;. Intel’s I/OAT introduces memory architecture improvements that give the NIC access to a dedicated DMA (direct memory access) engine for copying data between host memory and the NIC. These architectural changes are known as the Intel QuickData Technology DMA subsystem. With both interrupt moderation and the TCP Offload of IP segmentation and re-assembly, the processor tends to receive fewer interrupts to process, but each interrupt delivers a larger amount of data that needs to be processed by the host. The networking protocol stack services the initial interrupt from the NIC and examines the Receive data block while it is running in kernel mode. Most data blocks associated with networking I/O subsequently need to be copied into the networking application’s private address space. A performance analysis showed that, especially with larger blocks of data, this second memory-to-memory copy operation was responsible for a very large portion of the host processor load. An Intel white paper &lt;A href="http://www.intel.com/technology/ioacceleration/317106.pdf"&gt;&lt;FONT color=#0000ff&gt;here&lt;/FONT&gt;&lt;/A&gt; describes this analysis in some detail. &lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;Ultimately, the result of this performance analysis was the set of I/OAT architectural improvements that permit this second memory-to-memory operation to be performed by a DMA provider engine (located on the Northbridge chip set currently) that requires no additional host processor bandwidth. The memory copy operation occupies the memory controller, but does not consume Front Side Bus bandwidth, which also frees up the host processor to perform other CPU tasks. Interestingly, Windows support for this technology, described in the &lt;A href="http://msdn.microsoft.com/en-us/library/cc264906.aspx"&gt;&lt;FONT color=#0000ff&gt;Driver Development Kit (DDK) documentation&lt;/FONT&gt;&lt;/A&gt;, is actually very general, but to date the only netDMA client available is the tcpip.sys kernel mode driver that processes networking interrupts. It ought to be possible for disk I/O controllers to also exploit I/OAT architectural improvements sometime in the future. However, data blocks associated with disk I/O, which are cached by default in the system address space in Windows, are not necessarily subject to multiple copy operations, depending on the cache interface used.&lt;/P&gt;
&lt;P style="MARGIN: 10pt 0in 0pt" mce_keep="true"&gt;&lt;A class="" title="Mainstream NUMA and TCP/IP: Part IV" href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/09/09/mainstream-numa-and-the-tcp-ip-stack-part-iv-paralleling-tcp-ip.aspx"&gt;Continue to Part IV&lt;/A&gt;.&lt;/P&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8835243" width="1" height="1"&gt;</description></item><item><title>Mainstream NUMA &amp; the TCP/IP stack: Part 2: Programming ccNUMA machines</title><link>http://blogs.msdn.com/ddperf/archive/2008/07/27/mainstream-numa-and-the-tcp-ip-stack-part-i-programming-ccnuma-machines.aspx</link><pubDate>Sun, 27 Jul 2008 21:02:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8780016</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8780016.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8780016</wfw:commentRss><description>&lt;P&gt;This is a continuation of Part I of this article posted &lt;A class="" title="Link-back to Part 1" href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;here&lt;/A&gt;.&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In Part 1 of this article, we looked at the capacity issues that are driving architectural changes in the TCP/IP networking stack. While network interfaces are increasing in throughput capacity, processor speeds in the multi-core era are not keeping pace. Meanwhile, the TCP/IP protocol has grown in complexity so that host processing requirements are increasing, too. The only way for networked computers to scale in the multi-core era is to begin distributing networking I/O operations across multiple processors. Since bigger server machines rely on NUMA architectures for scalability, high speed networking is also evolving to exploit machines with NUMA architectures in an optimal fashion.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Machines with NUMA (non-uniform memory access speeds) architectures are usually large scale multiprocessors that are assembled using building blocks, or &lt;I style="mso-bidi-font-style: normal"&gt;nodes&lt;/I&gt;, that each contain some number of CPUs, some amount of RAM, and various other peripheral connections. Nodes are often configured on separate boards, for example, or specific segments of a board. Multiple nodes are then interconnected with high speed links of some sort that permit all the memory that is configured to be available to executing programs. There are many schools of thought on what the best interconnection technology is. Some manufacturers favor tree structures, some favor directory schemes, some favor network-like routing. A key feature of the architecture is that the latency of a memory fetch depends on the physical location of the RAM being accessed. Accessing RAM attached to the local node is faster than a memory fetch to a remote location that is physically located on another node. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Within &lt;A class="" title="Nehalem Hyperlink1" href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&amp;amp;mode=print" mce_href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&amp;amp;mode=print"&gt;one of the new Intel Nehalem many-core microprocessors&lt;/A&gt;, for example, all the processor cores and their logical processors can access local memory at a uniform speed. Figure 3 is a schematic diagram depicting a 4-way Nehalem multiprocessor chip that is connected to a bank of RAM. The configuration of processors and RAM shown in Figure 3 is a building block that is used to create a larger scale machine by connecting two or more such nodes together. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Quad-core node" style="WIDTH: 362px; HEIGHT: 520px" height=520 alt="Quad-core node" src="http://5l3vgw.bay.livefilestore.com/y1pysaX_fyaHyL_hZhhGIyXP5RhSKILXbj8AXnupeLec_hHtxoKXb6Z48TZYahS02yXpSrpH6b9-mY/Quad-core%20Node%20Drawing.jpg" width=362 mce_src="http://5l3vgw.bay.livefilestore.com/y1pysaX_fyaHyL_hZhhGIyXP5RhSKILXbj8AXnupeLec_hHtxoKXb6Z48TZYahS02yXpSrpH6b9-mY/Quad-core%20Node%20Drawing.jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 3.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;A schematic diagram depicting a NUMA node showing locally-attached RAM and a multi-core socket.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;A two-node NUMA server is illustrated in Figure 4, which shows a direct connection between the memory controller on node A and the memory controller on node B. This is the relatively simple case. A thread executing on node A can access any RAM location on either node, but an access to a local memory address is considerably faster. Access to a remote memory location is several times slower. (Definitive timings are not available as of this writing because early versions of the hardware are just starting to become available.)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;A class="" title="Two-node NUMA server based on Nehalem" href="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg" mce_href="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg"&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;&lt;IMG title="Two-node NUMA server based on Nehalem" style="WIDTH: 407px; HEIGHT: 1072px" height=1072 alt="Two-node NUMA server based on Nehalem" src="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg" width=407 mce_src="http://5l3vgw.bay.livefilestore.com/y1pNycwWsLj-tQabletYlwpg3Jnn7wvCJGYF-7IKnkz7PITD2CeK6cTdNqU3uDM8GRBK0iw64sEKZ8/Simple%20Two%20Node%20NUMA%20Server%20Drawing%20(vertical%20orientation).jpg"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/B&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 4.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;A two-node NUMA server showing a cross-node link that is used when a thread on one node needs to access a remote memory location.&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;As the number of nodes increases, it is no longer feasible for every node to be directly connected to every other node, nor can each bank of RAM that is installed be accessed in a single hop. The specific technology used to link nodes may introduce additional variation in the cost of accessing remote memory. From any one node, it could take longer to access memory on some nodes than others. For instance, some nodes may be accessed in a single hop across a direct link, while other accesses may require multiple hops. Some manufacturers favor routing through a shared directory service, for example. Your mileage may vary.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Specifically, in the Intel architecture, manufacturers are supplying a cache coherent flavor of NUMA servers (ccNUMA). Cache coherence is implemented using a snooping protocol to ensure that threads executing on each NUMA node have access to the most current copy of the contents of the distributed memory. Details of the snooping protocol used in Intel ccNUMA machines are discussed &lt;/FONT&gt;&lt;A href="http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&amp;amp;mode=print"&gt;&lt;FONT face=Calibri color=#0000ff&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;AMD has taken a somewhat different tack in building its multi-core processors. For communication on chip between processors, AMD uses a technology known as HyperTransport, which is a dedicated, per-processor 2-way high speed link. Multiple processor cores are then linked on the chip in a ring topology as depicted in Figure 5. The ring topology has the effect of scaling the bus bandwidth that is used as an interconnect linearly with the number of processors. But the architecture leads to NUMA characteristics. A thread executing on CPU 0 can access a local memory location, a remote memory location that is local to CPU 1 at the cost of one hop across the HT link, or a remote memory location that is local to CPU 2 at the cost of two hops across HT links.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="AMD multi-core socket" style="WIDTH: 435px; HEIGHT: 435px" height=435 alt="AMD multi-core socket" src="http://5l3vgw.bay.livefilestore.com/y1pt8apQ0QRaEO0kR9KRDE29WNelvL0WCkG3i6aQTMLuL52t-DmDG1bUcWKUlO_qNaHWOaCGRePA_w/AMD%20multicore%20socket.jpg" width=435 mce_src="http://5l3vgw.bay.livefilestore.com/y1pt8apQ0QRaEO0kR9KRDE29WNelvL0WCkG3i6aQTMLuL52t-DmDG1bUcWKUlO_qNaHWOaCGRePA_w/AMD%20multicore%20socket.jpg"&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt;Figure 5.&lt;/SPAN&gt;&lt;/B&gt;&lt;SPAN style="FONT-SIZE: 9pt; mso-bidi-font-size: 11.0pt"&gt; &lt;EM&gt;The AMD approach to multi-core processors has NUMA characteristics. A program executing on CPU 0 that accesses RAM that is local to CPU 2 requires two hops across the HyperTransport links that connect the processors in a ring.&lt;o:p&gt;&lt;/o:p&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
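&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The hop counts described above can be sketched in a few lines of Python. This is an illustration only: it assumes a ring of four CPUs whose links can be traversed in either direction, and the &lt;I&gt;ring_hops&lt;/I&gt; function is a hypothetical helper, not an AMD or Windows API.&lt;/FONT&gt;&lt;/P&gt;

```python
def ring_hops(src, dst, n_cpus=4):
    """Fewest link hops between two CPUs on a bidirectional ring.

    n_cpus=4 is an assumption made for this sketch.
    """
    one_way = (dst - src) % n_cpus          # hops going one way around the ring
    return min(one_way, n_cpus - one_way)   # or the other way, if that is shorter


# Cost, in hops, for a thread on CPU 0 to reach memory local to each CPU.
for cpu in range(4):
    print("CPU 0 to memory local to CPU", cpu, "takes", ring_hops(0, cpu), "hop(s)")
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Under these assumptions, memory local to CPU 1 or CPU 3 is one hop away from CPU 0 and memory local to CPU 2 is two hops away, which is the non-uniformity that gives the design its NUMA characteristics.&lt;/FONT&gt;&lt;/P&gt;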
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Historically, application development for NUMA machines meant understanding the performance costs associated with accessing remote memory on a specific hardware platform. Since manufacturers employ different proprietary interconnection schemes in their multi-tiered NUMA machines, application developers are challenged to find the right balance in exploiting a specific proprietary architecture, which may then limit the ability to port the application to a different platform in the future. It may be possible to connect nodes in a NUMA machine in an asymmetric configuration, for example, where the performance cost function associated with accessing different memory locations is decidedly irregular.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To scale well, a multi-threaded program running on a NUMA machine needs to be aware of the machine environment and understand which memory references are local to the node and which are remote. A thread that migrates from one NUMA node to another pays a heavy price every time it has to fetch results from remote memory locations. The difficulty programmers face when trying to develop a scalable, multi-threaded application for a NUMA architecture machine is understanding their memory usage pattern and how it maps to the NUMA topology. When NUMA considerations were confined to expensive, high-end supercomputers, the inherent complexities developers faced in programming them were considered relatively esoteric concerns. However, in the era of many-core processors, NUMA is poised to become a mainstream architecture. &lt;/FONT&gt;&lt;/P&gt;
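&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;A rough way to see why locality matters is to compute the average cost of a memory access as a weighted mix of local and remote accesses. The sketch below is a simplified model, not a measurement: the latency figures are placeholders chosen only so that a remote access costs three times a local one, and the function name is hypothetical.&lt;/FONT&gt;&lt;/P&gt;

```python
# Placeholder costs in arbitrary units; real values depend on the hardware.
LOCAL_COST = 1.0    # assumed cost of a local memory access
REMOTE_COST = 3.0   # assumed cost of a remote access ("several times slower")


def average_access_cost(local_fraction, local=LOCAL_COST, remote=REMOTE_COST):
    """Weighted average cost when local_fraction of accesses stay on the node."""
    return local_fraction * local + (1.0 - local_fraction) * remote


# Compare a thread that keeps most of its data local with one that has migrated.
for fraction in (1.0, 0.9, 0.5, 0.0):
    cost = average_access_cost(fraction)
    print("local fraction", fraction, "average cost", round(cost, 2))
```

&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In this model a thread that migrates away from all of its data pays the full remote cost on every access, while one that keeps nearly all of its references local stays close to the local cost.&lt;/FONT&gt;&lt;/P&gt;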
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;In theory, it is possible to craft an optimal solution when threads and the memory they access are &lt;I style="mso-bidi-font-style: normal"&gt;balanced&lt;/I&gt; across NUMA processing nodes. In order to achieve an optimal balancing of the machine's resources without overloading any of them, programs need to understand the CPU and memory resources that individual tasks executing in parallel require and understand how to best map those resources to the topology of the machine. Then they require a suitable scheduling mechanism to achieve the desired result. As a practical matter, achieving an optimal balance is not easy in the face of variability in the resources required by any of the execution threads, a complication that may then require dynamic adjustments to the scheduling policy in effect. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;The Windows OS is already NUMA-aware to a degree and, thus, supports a NUMA programming model. For example, once dispatched, threads have node affinity and tend to stay dispatched on an available processor within a node. Windows OS memory management is also NUMA-aware, maintaining per-node allocation pools. The OS not only resists migrating threads to another node, it also tries to ensure that most memory allocations are satisfied locally using per-node memory management data structures. &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;Windows also provides a number of NUMA-oriented APIs that applications can use to keep their threads from migrating off-node and also enable them to direct memory allocations to a specific physical processing node. For more information on the NUMA support in Windows, see the MSDN Help topic “&lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/library/aa363804.aspx"&gt;&lt;FONT face=Calibri color=#0000ff&gt;NUMA Support&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt;.” &lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;FONT face=Calibri&gt;To help application developers deal better with the complexities of NUMA architectures in the future, the Windows NUMA support needs to evolve. One potential approach would be for the OS to attempt to calculate a performance cost function at start-up that it would then expose to driver and application programs when they start up and run. Conceivably, the OS might also need to adjust this performance cost function in response to configuration changes that occur dynamically, such as any power management event that affects memory latency. These changes would then have to be communicated to NUMA-aware drivers and applications somehow so they could adapt to changing conditions.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=CMGpaperbody style="MARGIN: 0in 0in 3pt"&gt;&lt;A class="" title=Link-to-Part3 href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/08/06/mainstream-numa-and-the-tcp-ip-stack-part-iii-a-look-back-at-strategies-to-scale-high-speed-networking.aspx"&gt;Continue to Part III of this article.&lt;/A&gt; &lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8780016" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+Engineering/default.aspx">Performance Engineering</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category></item><item><title>Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx</link><pubDate>Thu, 19 Jun 2008 00:33:20 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8618468</guid><dc:creator>jonathanh</dc:creator><slash:comments>8</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/8618468.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=8618468</wfw:commentRss><description>&lt;p&gt;This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. 
Along the way we discuss useful benchmarking tools, how to validate results, and why it pays to know exactly what hardware you're running on.&lt;/p&gt;  &lt;p&gt;This all started in our performance test lab. During the development of Visual Studio, each new build undergoes a suite of automated performance tests, running in a lab full of identical machines. These performance tests allow us to track Visual Studio's performance over time, and &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/05/20/visual-studio-performance-testing-noise-is-enemy-1.aspx"&gt;detect performance regressions&lt;/a&gt; (when something gets unexpectedly worse). We recently added a batch of new machines in our lab, and that's when the fun started.&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Pop Quiz: How Much Faster?&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;Old machine: dual-core Intel Pentium D 830 processor, running at 3 GHz, with 1 GB of RAM.&lt;/p&gt;  &lt;p&gt;New machine: quad-core Intel Xeon 5355 processor, running at 2.66 GHz, with 4 GB of RAM. &lt;/p&gt;  &lt;p&gt;Given the differences in the two hardware configurations above, how much faster would you expect the new machine to be when running a Visual Studio performance test? Lower than, same as, twice, three times or four times the performance of the older machine? &lt;/p&gt;  &lt;p&gt;One line of reasoning might look at the relative clock frequencies of the processors on the two machines. This might lead you to expect the newer processor cores to perform slower than the older cores, since their clock frequency is 11% lower. By this reasoning you might conclude that single-threaded applications would perform poorly on the new machine. &lt;/p&gt;  &lt;p&gt;Another line of reasoning would factor in the number of cores in the two systems. Since the new machine has twice the number of cores, you might expect it to have about twice the performance on multi-threaded applications. 
(If you also accounted for the lower clock frequency, you'd end up with a figure of 1.78 times the performance of the old machine.) &lt;/p&gt;  &lt;p&gt;A third approach might estimate the impact of RAM size. We’ve quadrupled the amount of RAM, so maybe any benchmarks that used to page to disk can now execute entirely in memory and hence will be orders of magnitude faster. [We'll cheat here and tell you that our benchmarks are generally not memory constrained]. &lt;/p&gt;  &lt;p&gt;So far, all these options seem plausible. What's your guess? &lt;/p&gt;  &lt;p&gt;What we naively expected to find lay somewhere between the first two lines of reasoning - that the new machines would be 1-2 times faster than the old machines, depending on the particular benchmark.&lt;/p&gt;  &lt;p&gt;What we actually found is that many of our single-threaded CPU-bound benchmarks run about &lt;strong&gt;twice as fast&lt;/strong&gt; on the new machine, while scalable multi-threaded benchmarks run up to &lt;strong&gt;four times as fast&lt;/strong&gt;. This was a pleasant surprise, because it significantly reduces the overall time to run all the benchmarks. But it did leave us wondering why we were getting much greater speedups than our naive explanations would suggest. The rest of this post explores that question.&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Using WinSAT and SPEC to Validate Benchmark Results&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;To make sure this wasn't a fluke result, we used the &lt;a href="http://msdn.microsoft.com/en-us/library/ms737378(VS.85).aspx"&gt;Windows System Assessment Tool&lt;/a&gt; (winsat.exe). This is a built-in tool that can quickly give a representative view of a machine's performance. It is multi-threaded, taking full advantage of all the cores on a machine. 
Here are the WinSAT CPU results: &lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Compression (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;70.5&lt;/td&gt;        &lt;td valign="top" width="95"&gt;262.0&lt;/td&gt;        &lt;td valign="top" width="47"&gt;3.7&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Encryption (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;52.3&lt;/td&gt;        &lt;td valign="top" width="95"&gt;139.3&lt;/td&gt;        &lt;td valign="top" width="47"&gt;2.7&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;We also wanted to validate our results against other real-world benchmarks. For this we turned to the &lt;a href="http://www.spec.org/"&gt;SPEC website&lt;/a&gt;. SPEC produces a series of benchmark suites, plus a very formal process that ensures results are reproducible and can fairly be applied across different manufacturers. More importantly for our purposes, SPEC posts all reported benchmark results on their web site. You won’t always be able to find your exact machine listed, but after using results from a tool like CPU-Z you can generally find results from a machine with the same CPU configuration and clock speed. &lt;/p&gt;  &lt;p&gt;We used the &amp;quot;CINT2006&amp;quot; benchmarks – this is a widely-used benchmark suite concentrating on integer performance. 
We compared results for both CINT2006, which is a good test of single-threaded performance, and CINT2006 Rate, which tests the ability of a system to execute multiple copies of CINT2006, and is therefore a better test of multi-threaded performance. For two representative machines that are similar to our old and new hardware, here are the results:&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CINT2006&lt;/td&gt;        &lt;td valign="top" width="89"&gt;9.85&lt;/td&gt;        &lt;td valign="top" width="95"&gt;15.5&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.6&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CINT2006 Rate&lt;/td&gt;        &lt;td valign="top" width="89"&gt;18.0&lt;/td&gt;        &lt;td valign="top" width="95"&gt;44.4&lt;/td&gt;        &lt;td valign="top" width="47"&gt;2.5&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;The WinSAT and SPEC results confirm that the new machines are much faster than our naive expectations, even for benchmarks such as CINT2006 that cannot take advantage of the extra cores. So what were we missing? &lt;/p&gt;  &lt;p&gt;&lt;b&gt;Using CPU-Z to Examine Machine Configurations&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;To answer this, we need a deeper understanding of the configurations of the two systems. &lt;/p&gt;  &lt;p&gt;Unfortunately, finding detailed configuration information isn't always straightforward. 
For example, we know that level two (L2) cache size impacts performance, but Windows doesn't report it, and it's not easy to reboot into the BIOS to take a look at cache size when the machine is located in a remote test lab. This is where machine reporting tools like &lt;a href="http://www.cpuid.com/cpuz.php"&gt;CPU-Z&lt;/a&gt; come in. You can run CPU-Z remotely on an unknown machine and get back a nicely formatted HTML report showing exactly what the hardware is. Here's a deeper look at our old and new systems:&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="408" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="155"&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="118"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="141"&gt;&lt;strong&gt;New Machine&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;CPU name&lt;/td&gt;        &lt;td valign="top" width="118"&gt;Pentium D 830          &lt;br /&gt;(“Smithfield”)&lt;/td&gt;        &lt;td valign="top" width="141"&gt;Xeon X5355          &lt;br /&gt;(“Clovertown”)&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;CPU speed&lt;/td&gt;        &lt;td valign="top" width="118"&gt;3.00 GHz&lt;/td&gt;        &lt;td valign="top" width="141"&gt;2.66 GHz&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;Number of cores&lt;/td&gt;        &lt;td valign="top" width="118"&gt;2&lt;/td&gt;        &lt;td valign="top" width="141"&gt;4&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;L1 cache (per core)&lt;/td&gt;        &lt;td valign="top" width="118"&gt;16 KB&lt;/td&gt;        &lt;td valign="top" width="141"&gt;32 KB&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;L2 cache (total)&lt;/td&gt;        &lt;td valign="top" width="118"&gt;2 
MB&lt;/td&gt;        &lt;td valign="top" width="141"&gt;8 MB&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="155"&gt;System RAM&lt;/td&gt;        &lt;td valign="top" width="118"&gt;1 GB DDR2&lt;/td&gt;        &lt;td valign="top" width="141"&gt;4 GB DDR2&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;&lt;b&gt;Using BCDEdit to Disable Cores&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;Now we can try to tease out the relative impacts of the many changes from the old configuration to the new configuration. The first and easiest step is to disable two out of four cores on a new machine, to enable a fairer &amp;quot;apples to apples&amp;quot; comparison of cores between old and new machines.&lt;/p&gt;  &lt;p&gt;To do this we used the Windows BCDEdit tool, which replaces the old method of editing BOOT.INI by hand. Here we were particularly concerned with the order in which cores are disabled. This is important because the 8 MB of L2 cache in the Xeon “Clovertown” processors is divided: two of the four cores share 4 MB, and the other two cores share the other 4 MB. To keep our benchmark comparisons as fair as possible, we wanted to make sure that only one of the L2 caches was in use after disabling two cores. We used CPU-Z again after rebooting to confirm this.&lt;/p&gt;  &lt;p&gt;Now we were in a position to do a fairer “cores to cores” comparison between the old and new machines. 
Here's a summary from WinSAT: &lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;Old Machine&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;New (2 cores)&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Compression (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;70.5&lt;/td&gt;        &lt;td valign="top" width="95"&gt;131.9&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.9&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;CPU – Encryption (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;52.3&lt;/td&gt;        &lt;td valign="top" width="95"&gt;69.7&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.3&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;Memory Bandwidth (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;4,041&lt;/td&gt;        &lt;td valign="top" width="95"&gt;3,360&lt;/td&gt;        &lt;td valign="top" width="47"&gt;0.8&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;Now we can really see the advantage of the latest processors – on a core-for-core basis, they are 1.3-1.9x faster on the CPU-intensive WinSAT benchmarks, despite having lower clock frequencies.&lt;/p&gt;  &lt;p&gt;Good, now on to the next… wait a second. Look at that memory bandwidth result. Our new machines have &lt;i&gt;less&lt;/i&gt; memory bandwidth than the old machines? That doesn't look right: although memory performance hasn't been keeping pace with CPU speeds, it &lt;i&gt;has&lt;/i&gt; been improving over time. 
Compared to a three-year-old machine, we'd expect these new machines to have slightly better memory bandwidth, and definitely not worse. What gives?&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Memory Channels&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;A primary limiting factor to memory bandwidth is the number of memory channels that are in use. And this turns out to be the problem here: although the new machines have four memory channels and eight memory slots, only two of those slots are filled, because the vendor supplied us with two 2 GB memory modules per machine. This maximizes future expansion potential – we can take the machine up to 16 GB without throwing away any of our initial investment in memory. But in the meantime using two memory slots limits us to two memory channels in use. If instead we had four 1 GB memory modules we'd have four memory channels in use, improving memory interleaving from 2:1 to 4:1 and increasing memory bandwidth. To confirm this, we populated four memory slots on one of the new machines (going from 4 GB to 8 GB) and reran WinSAT:&lt;/p&gt;  &lt;table cellspacing="0" cellpadding="2" width="422" border="0"&gt;&lt;tbody&gt;     &lt;tr&gt;       &lt;td valign="top" width="189"&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="89"&gt;&lt;strong&gt;2 channels&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="95"&gt;&lt;strong&gt;4 channels&lt;/strong&gt;&lt;/td&gt;        &lt;td valign="top" width="47"&gt;&lt;strong&gt;Speedup&lt;/strong&gt;&lt;/td&gt;     &lt;/tr&gt;      &lt;tr&gt;       &lt;td valign="top" width="189"&gt;Memory Bandwidth (MB/s)&lt;/td&gt;        &lt;td valign="top" width="89"&gt;3,360&lt;/td&gt;        &lt;td valign="top" width="95"&gt;4,134&lt;/td&gt;        &lt;td valign="top" width="47"&gt;1.2&lt;/td&gt;     &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;&lt;b&gt;Conclusions&lt;/b&gt;&lt;/p&gt;  &lt;p&gt;It's always possible to run more experiments to further isolate and explain 
benchmark results, but after a while you reach a point of diminishing returns. With the results we have so far, we can already draw some useful conclusions. &lt;/p&gt;  &lt;p&gt;The first conclusion is that our naive explanations greatly underestimated just how much better the newer processors are at executing real benchmarks, despite their slower clock speeds. The results from WinSAT and SPEC clearly show this, with core-to-core performance that is 1.3-1.9x faster on the new machines, depending on the benchmark. &lt;/p&gt;  &lt;p&gt;This is perhaps the most important lesson for developers to learn: clock speeds are no longer a good indicator of true performance. Although clock speeds have plateaued, processor designers continue to find ways to make each new generation significantly faster than the last. In our case, the old machines have Pentium D processors (“Smithfield”), while the new machines have Xeon 5-series processors (“Clovertown”).&amp;#160; And while the newer processors have slightly slower clock speeds, their micro-architecture executes more instructions per clock cycle. &lt;/p&gt;  &lt;p&gt;The second conclusion is that it's very hard to perform fair comparisons. The two machines have several configuration differences, including clock frequency, number of cores, core micro-architecture, cache sizes, bus speed, memory size and speed, and so on. We showed an example of isolating the effect of just one of these differences, the number of cores, using the BCDEdit tool. Isolating the effect of every single difference would require much more effort.&lt;/p&gt;  &lt;p&gt;Indeed, some of these differences are interrelated, and it is hard to change one without affecting another. For example, CPU architects make their micro-architecture design decisions based on cache sizes. Now imagine a hypothetical experiment that tried to isolate the effect of L2 cache size by giving each core just 1 MB of cache. 
This would be especially hard on the newer processors, which have been designed on the assumption that they have 2 MB of L2 cache per core&lt;a href="file://tkzaw-pro-13/#_ftn1_6097" name="_ftnref1_6097"&gt;[1]&lt;/a&gt;. In trying to perform a fairer comparison, we would have actually handicapped one system!&lt;/p&gt;  &lt;p&gt;Our final conclusion is that it truly pays to benchmark and compare systems. In our case, the simplest possible benchmark (WinSAT) showed an unexpected memory bandwidth loss, which we then traced back to a machine misconfiguration. So that was the final pleasant surprise: if we hadn't gotten curious about why the new machines were so much faster, we would never have found that they could be faster still!&lt;/p&gt;  &lt;p&gt;David Berg    &lt;br /&gt;Sunny Egbo     &lt;br /&gt;Jonathan Hardwick     &lt;br /&gt;Peter Okonski&lt;/p&gt;  &lt;hr align="left" width="33%" size="1" /&gt;  &lt;p&gt;&lt;a href="file://tkzaw-pro-13/#_ftnref1_6097" name="_ftn1_6097"&gt;[1]&lt;/a&gt; Because two cores share a single 4 MB L2 cache on the Clovertown processors, the amount of cache used by each core is not fixed at 2 MB; it varies during program execution. Cache-hungry threads might get more of the cache, while less cache-hungry threads get less. Even when two cache-hungry threads run on the two cores, their memory hotspots are not synchronized, so the net effect is that each thread gets more of the cache when it needs it and less when it doesn’t.&lt;/p&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+testing/default.aspx">Performance testing</category></item></channel></rss>