<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Developer Division Performance Engineering blog : Hardware</title><link>http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx</link><description>Tags: Hardware</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Are we taking advantage of Parallelism?</title><link>http://blogs.msdn.com/ddperf/archive/2009/05/02/are-we-taking-advantage-of-parallelism.aspx</link><pubDate>Sun, 03 May 2009 01:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9584046</guid><dc:creator>Sunny Egbo</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9584046.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9584046</wfw:commentRss><description>&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Recently, a colleague of mine, Mark Friedman, posted a blog titled “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx#9576239"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Parallel Scalability Isn’t Child’s Play&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: 
minor-latin"&gt;&lt;FONT size=3&gt;” in which he reviewed the merits of Amdahl’s Law vs. Gunther’s Law for determining the practical limits to parallelization. I would not argue with the premise of Mark’s blog that parallelism is not child’s play. However, I do have alternative views on the use of Amdahl’s Law and Gunther’s Law that I posted on his blog. &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;I think my views and comments on Mark’s blog warrant a post of their own to explain fully.&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Speaking of child’s play: my 10-year-old son recently made a two-part movie titled “&lt;I style="mso-bidi-font-style: normal"&gt;the Way&lt;/I&gt;” and “&lt;I style="mso-bidi-font-style: normal"&gt;the Way Back&lt;/I&gt;” complete with a full storyline, multiple sound tracks and narrations. He put these movies together with only the help of his eight-year-old sister, using sample movie clips and stock photographs he found on his computer hard drive. He asked me for help getting his two masterpieces onto a DVD capable of playing on the average home DVD player. Also, he asked about the length of a typical movie playing in movie theaters around the U.S. (approximately 2 hours) and how much these movies cost at the movie theater (approximately $12 for adults and $8 for children, minus the popcorn). Based on my answers, he determined that he would charge 25 cents for people to watch his movies, because he wanted everyone to attend. I wanted to ask him how much he would charge someone who decided to watch only one of the clips. However, I didn’t because I did not want to lose a price-haggling war with a 10-year-old. Besides, it would be terrible if you could not find your way back.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case, his movies were quite impressive. The most technologically savvy thing I did as a 10-year-old kid was to build a telephone line with tomato soup cans and a string. Movie making was out of reach for me; but now it is child’s play.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, parallelism is not child’s play. However, I hold out hope that in the future the typical computer program will be written with parallelism in mind. Will parallelism ever be child’s play the way movie making is today? &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Parallelism exists everywhere: &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;instruction level, memory level, loop level and task level parallelism, etc. Also, parallelism has been with us for quite some time now. For the past several decades, hardware engineers have quietly been busy solving problems in parallel to improve processor and system level performance. However, for the past four or more years, hardware designers have encountered the twin brick walls created by memory speed and power. These walls have forced CPU architects and hardware designers to go multi-core in a major way. The doubling of CPU frequency every 18 months, which held for many decades, is no longer practical and has come to an abrupt end. Although hardware performance continues to improve as my colleagues and I pointed out in our blog “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Investigating a Pleasant Surprise&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;FONT size=3&gt;&lt;FONT face="Times New Roman"&gt;,&lt;/FONT&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;” the pace of CPU frequency increases has slowed considerably. Instead, hardware designers have been doubling the number of cores available on a single CPU socket every couple of years. &lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;To get the same level of performance that was previously possible, software engineers now need to step up to the plate and write software in a parallel and scalable fashion. They need tools and frameworks that allow them to think about their problems, identify opportunities for parallelism, and analyze their solutions correctly and efficiently. &lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;I am a big fan of Amdahl’s Law as an analysis framework. However, I do not subscribe to the narrow view that Amdahl’s Law applies only to parallelism, as most people who write about it seem to imply. I prefer the broader treatment of the Law by Hennessy and Patterson in their famous book “&lt;/FONT&gt;&lt;/SPAN&gt;&lt;A href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1" mce_href="http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1241294120&amp;amp;sr=1-1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT color=#0000ff size=3&gt;Computer Architecture: A Quantitative Approach&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;”, where Amdahl’s Law is used to compare the opportunities offered by competing designs. Amdahl’s Law is very powerful for showing the areas that will likely yield the most fruitful performance gains. In my performance design, tuning and optimization work, I use Amdahl’s Law to prioritize the areas where I focus my efforts to gain performance.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
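&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;As a minimal sketch of that kind of prioritization, the Python below applies the general form of Amdahl’s Law, overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of run time affected by an enhancement and s is the speedup of that fraction. The stage names and numbers are hypothetical and stand in for a real profile.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of run time is sped up by a given factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical profile: (stage name, fraction of run time, achievable local speedup)
candidates = [
    ("parse", 0.10, 10.0),
    ("render", 0.60, 2.0),
    ("io", 0.30, 3.0),
]

# Rank the candidate optimizations by the overall speedup each would deliver.
ranked = sorted(
    ((name, amdahl_speedup(f, s)) for name, f, s in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, overall in ranked:
    print(f"{name}: overall speedup {overall:.2f}x")
```

&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;With these made-up numbers, doubling the speed of the stage that takes 60% of the run time beats a tenfold gain on the stage that takes 10%.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;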
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Amdahl’s Law is not the limit to either absolute performance or parallelism as many authors seem to suggest. Gunther’s and Gustafson’s Laws are helpful for putting Amdahl’s Law in perspective. However, like Amdahl’s Law, these laws are not fundamental limits. The use of these three laws to estimate the level of parallelism that is possible is very flawed. Specifically, the use of these laws as fundamental limits can obfuscate the level of parallelism and performance inherent in typical computing problems. These laws gloss over a number of important points and practical aspects of obtaining parallelism in general purpose computing, including that:&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT size=3&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;1.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Many user tasks are non-monolithic and can be solved in a distributed fashion. Background tasks (e.g., virus scans) that often block single processor execution can now be done in a way that improves user experiences. The key is to identify unnecessary dependencies that would allow these tasks to proceed in parallel with other tasks in a multi-core computer.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;2.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some problems that have inefficient sequential solutions surprisingly have efficient parallel solutions. This fact should be comforting to fans of algorithms. For example, many applications require matrix multiplication, which turns out to be easily parallelizable. Although the best sequential algorithm for matrix multiplication has a time complexity of O(n&lt;SUP&gt;2.376&lt;/SUP&gt;), a straightforward parallel solution has an asymptotic time complexity of O(log n) using n&lt;SUP&gt;2.376&lt;/SUP&gt; processors.&amp;nbsp;In other words, we can readily find a parallel solution for matrix multiplication that improves its runtime as more and more processor cores become available. Of course, you might have difficulty conceiving of n&lt;SUP&gt;2.376&lt;/SUP&gt; processors in a system, as a colleague mentioned recently. However, this is just another way of saying that matrix multiplication benefits from more and more processors.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;3.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Some poor sequential algorithms can be easily parallelized to execute in less time than their sequential solutions. Also, we know that some algorithms with the best asymptotic time complexities achieve their speed by introducing data dependencies that make parallelization difficult, and that the best asymptotic time complexity does not necessarily translate to the best runtime in real life. Hence, at some point, the benefit of the simpler parallelization of some poor sequential algorithms that have few data dependencies can outweigh the benefit of more efficient sequential counterparts that have data dependencies. Therefore, when considering parallel solutions, it is not always necessary to start with the sequential solution with the best time complexity [also, see the comment about Fortune and Wyllie below].&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;4.&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;The real world performance of applications is not determined exclusively by the asymptotic time complexity of algorithms. Because of the increasing gap between CPU and memory speed, memory accesses are increasingly dominating the performance of applications running on modern CPUs. Although the gap can be mitigated with large caches, every cache miss takes hundreds of CPU cycles to complete. Even a modest overlap in these memory accesses (Memory Level Parallelism) can improve application performance in noticeable ways.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="FONT-SIZE: 11pt; LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
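&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;The O(log n) figure for matrix multiplication in item 2 can be made concrete with a small model. The Python sketch below assumes one processor per scalar product (n*n*n of them, which is an assumption of this sketch and more than the processor count quoted above), so that all products take one parallel step and each n-term sum is then reduced pairwise in about log2(n) further steps. It runs sequentially and only counts the modeled parallel steps; the function names are illustrative.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

```python
def tree_sum(values):
    """Sum values in pairwise rounds; return (total, rounds).

    Each round models one parallel step in which every pair is added at once.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # the odd element carries over to the next round
        level = paired
        rounds += 1
    return level[0], rounds


def matmul_parallel_steps(a, b):
    """Multiply two square matrices; also return the modeled parallel step count."""
    n = len(a)
    max_rounds = 0
    result = []
    for i in range(n):
        row = []
        for j in range(n):
            # In the model, all n*n*n products are computed in one parallel step.
            products = [a[i][k] * b[k][j] for k in range(n)]
            total, rounds = tree_sum(products)
            row.append(total)
            max_rounds = max(max_rounds, rounds)
        result.append(row)
    return result, 1 + max_rounds


product, steps = matmul_parallel_steps([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, steps)
```

&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;For two 4 x 4 matrices the model reports 3 steps (one for the products, two for the pairwise sums), where a single processor would perform the 64 multiplications one after another.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;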
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Over the years, there have been efforts to classify computationally intractable problems. Many decision problems (i.e., Yes/No) and their optimization counterparts have been categorized into NP-Complete and NP-Hard sets respectively. The Travelling Salesman (TSP), Online Bin-Packing and 3-Dimensional Matching problems are three famous examples of NP-Complete problems. In a similar fashion, problems that are difficult to parallelize have been categorized into the P-Complete set or the set of problems that are known to be inherently sequential. As you can imagine, sorting is not P-Complete. Likewise, Matrix Multiplication is not in the P-Complete set. Processor scheduling can be done in O(log n) time units using &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processors, so it is not P-Complete either. In an ultimate twist of irony, many NP-Hard problems have heuristic solutions that can be executed in parallel to approximate the real solutions. Hence, the natural inclination to think that NP-Complete problems cannot be parallelized is not borne out in practice.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;As it turns out, the real limit to parallelism seems to be defined not by Amdahl’s Law, Gunther’s Law, Gustafson’s Law or NP-Completeness, but by the P-Complete set. It appears that parallelizable problems are related to the asymptotic space complexity of their sequential solutions. According to Fortune and Wyllie’s Parallel Processing Thesis, any problem that can be solved with a poly-logarithmic space complexity can be parallelized efficiently. Because of the time-space trade-off of algorithms, this implies that the sequential algorithm that achieves this space complexity is not necessarily the algorithm with the best asymptotic time complexity. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
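&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;For reference, the three scaling laws named above can each be written in a line or two. The Python sketch below uses their standard textbook forms; the parallel fraction and the contention and coherency coefficients are hypothetical values chosen only to make the shapes of the curves visible, not measurements of any real system.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

```python
def amdahl(n, parallel_fraction):
    """Fixed-size speedup on n processors (Amdahl's Law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def gustafson(n, parallel_fraction):
    """Scaled-size speedup on n processors (Gustafson's Law)."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n


def gunther_usl(n, contention, coherency):
    """Relative capacity at n processors (Gunther's Universal Scalability Law)."""
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


# Hypothetical parameters: 95% parallel work, 5% contention, 1% coherency cost.
for n in (1, 2, 4, 8, 16, 32):
    print(
        n,
        round(amdahl(n, 0.95), 2),
        round(gustafson(n, 0.95), 2),
        round(gunther_usl(n, 0.05, 0.01), 2),
    )
```

&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;With these values the Amdahl column levels off below 20, the Gustafson column keeps growing linearly, and the Gunther column peaks near 10 processors and then falls. With the coherency term set to zero, Gunther’s formula reduces to Amdahl’s.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;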
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;In any case, because one can evaluate problems on multiple levels beyond algorithms (e.g., at the instruction, memory and data access, loop and task levels), the set of problems that can be parallelized appears to be quite large. The question is how to identify and take advantage of the parallelization opportunities that may be inherently available and to do so in an efficient and scalable manner. How can we parallelize loops? How do we overlap high latency activities such as accesses to physical memory or I/O to amortize the cost of those activities? How do we minimize synchronizations? How do we partition tasks to eliminate bottlenecks from the critical paths? How do we dispatch work efficiently to improve system utilization, throughput and latency? What areas of our application can benefit from what sets of efforts? These are some of the questions that allow for scalable designs.&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Today, the tools to identify parallelism and scalability opportunities are very limited. Programming languages that allow programmers to express parallelism in a natural way are completely lacking. The tools to analyze and troubleshoot parallel implementations are limited as well. Debugging parallel implementations is particularly hard. I suspect that, with some industry focus and incremental progress, we can make parallelism accessible to average programmers. However, we are still many years away.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size=3&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;W&lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;hat are some of the fundamental limits preventing such tools from being built? As Mark said on his blog, &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'"&gt;achieving improved scalability using parallel programming techniques is certainly very challenging. But can parallel programming be made less challenging with intuitive tools that expose parallel solutions in a natural way and allow programmers to exploit them? &lt;/SPAN&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;Can programming languages and tools improve to a point where a typical 10-year-old will be able to write a parallel program as easily as they can put together a multi-track movie today?&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;o:p&gt;&lt;FONT size=3&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;FONT size=3&gt;Sunny Egbo&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-FAMILY: 'Calibri','sans-serif'; mso-ascii-theme-font: minor-latin; mso-hansi-theme-font: minor-latin"&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9584046" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance+Engineering/default.aspx">Performance Engineering</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/.NET/default.aspx">.NET</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item><item><title>Parallel Scalability Isn’t Child’s Play</title><link>http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx</link><pubDate>Mon, 16 Mar 2009 20:39:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9481780</guid><dc:creator>MarkBFriedman</dc:creator><slash:comments>9</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9481780.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9481780</wfw:commentRss><description>&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In &lt;A title="Neil Gunther's blog" href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html" mce_href="http://perfdynamics.blogspot.com/2009/02/poor-scalability-on-multicore.html"&gt;a recent blog entry&lt;/A&gt;, Dr. 
Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned about unrealistic expectations being raised with regard to the performance of parallel programs on current multi-core hardware. Neil’s blog entry highlighted a dismal parallel programming experience publicized &lt;/FONT&gt;&lt;A title="Sandia Labs multi-core press release" href="http://www.sandia.gov/news/resources/releases/2009/multicore.html" mce_href="http://www.sandia.gov/news/resources/releases/2009/multicore.html"&gt;&lt;FONT color=#0000ff size=3 face=Calibri&gt;in a recent press release&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; from the Sandia Labs in Albuquerque, New Mexico. Sandia Labs is a research facility operated by the U.S. Department of Energy. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;According to the press release, scientists at Sandia Labs &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="LINE-HEIGHT: normal; MARGIN: 0in 0.2in 10pt" class=MsoNormal&gt;&lt;SPAN style="FONT-SIZE: 9pt"&gt;&lt;FONT face=Calibri&gt;“simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.” They concluded that this retrograde speed-up was due to deficiencies in “memory bandwidth as well as contention between processors over the memory bus available to each processor.”&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Holy cow. The Lab’s scientists, who are heavily invested in parallel programming on supercomputers, simulated running programs on sixteen cores encapsulating “key algorithms for deriving knowledge from large data sets” that gave no better performance than running the same program on two cores. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Please note that these are simulated performance results, because 16-core machines of the type being simulated don’t currently exist. Indeed, I would not expect that 16-core machines of the type being simulated would ever exist. Which leads me to wonder what the point of this Sandia Labs exercise was.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Of course, for developers experienced in parallel programming, this result actually isn’t in itself all that surprising. Quite frequently, experienced developers find that running their multi-threaded application on massively parallel hardware does not scale well with the hardware capabilities. This was apparently the case for the applications the Sandia Labs folks simulated. So what? Should we just give up in our quest for parallel program scalability? &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Before drilling into Dr. Gunther’s specific interest in this disclosure, it is worth looking into the Sandia Labs finding in a bit more detail. For instance, did anyone, besides me, wonder what applications were being simulated?&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In theory, “deriving knowledge from large data sets” is a category of computing program that readily lends itself to a solution using an &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/SIMD"&gt;&lt;FONT size=3 face=Calibri&gt;SIMD&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=3 face=Calibri&gt; (&lt;B style="mso-bidi-font-weight: normal"&gt;S&lt;/B&gt;ingle &lt;B style="mso-bidi-font-weight: normal"&gt;I&lt;/B&gt;nstruction, &lt;B style="mso-bidi-font-weight: normal"&gt;M&lt;/B&gt;ultiple &lt;B style="mso-bidi-font-weight: normal"&gt;D&lt;/B&gt;ata) approach. The canonical example of an SIMD approach to “deriving knowledge from large data sets” is a database Search function conducted in parallel where the data set of interest is partitioned across &lt;I style="mso-bidi-font-style: normal"&gt;n&lt;/I&gt; processing units and their locally attached disks. For example, when the Thinking Machines CM-1 supercomputer publicly debuted in the mid-80s, the company demonstrated its capabilities using a parallel search of a database that was partitioned across all 64K nodes of the machine, which was based on the Connection Machine originally designed by MIT whiz kid Danny Hillis. Parallel search when executed across a partitioned dataset should scale linearly, or close enough for government work (pun intended). &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Whenever a problem lends itself to an SIMD approach (also known as “divide and conquer”), linear scalability of the SIMD algorithm does require first partitioning the data being accessed and then proceeding to process that data in parallel. I am sure the point of the Sandia Labs press release was not to disparage the SIMD approach to parallel processing; after all, that is a tried-and-true technique that they have used with great success over the years. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;On the contrary, it appears to be a critique of an approach to building parallel processing hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;where you would increase the number of processing cores on the chip (just because you can with the most current semiconductor fabrication technology) without scaling the memory bandwidth proportionally. Since that is not what is happening hardware-wise, it strikes me that this implied criticism of the multi-core hardware&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;strategy Intel and AMD are pursuing is slaying a non-existent dragon. Both Intel and AMD recognize that memory bus bandwidth is a significant potential bottleneck in their multi-core products, and, as a result, both manufacturers are attempting to scale memory bandwidth proportional to the amount of processing power they deliver on a chip.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;So then what is all the fuss about? The Sandia Labs “news” starts to look like something the blogosphere is latching onto on an otherwise slow day for tech news, raising an alarm and potentially misleading naïve readers about what the conventional wisdom in multiprocessor chip architecture would be if anyone were actually trying to build multi-core microprocessors that way.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;STRONG&gt;Building a better multicore processor.&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT size=3 face=Calibri&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;The point of the Sandia Labs press release publicizing these simulation results appears to be to suggest what they consider better approaches to packaging multi-core processors on a single socket. They released the following chart that makes this point (reproduced here in Figure 1):&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&lt;IMG style="WIDTH: 450px; HEIGHT: 353px" title="Sandia Labs multicore simulation results" alt="Sandia Labs multicore simulation results" src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg" width=450 height=353 mce_src="http://5l3vgw.bay.livefilestore.com/y1pr4F4aoYifbSInSEBRcbQ9TBEARzKw87EyYk2bricI-CoyRgTN--dE7SeFYj-q7Ll9D3mJePubLw_-B_yrrSvOQ/Sandia%20Labs%20simulated%20multicore%20performance%20(smaller).jpg"&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;&lt;FONT size=2&gt;&lt;FONT color=#4f81bd&gt;&lt;SPAN style="mso-no-proof: yes"&gt;Figure 1&lt;/SPAN&gt;. Sandia Labs simulation showing performance of their application vs. the number of processors.&lt;/FONT&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Exactly what the Sandia Labs folks are reporting here is a little sketchy. Presumably, the simulations are based on observing the behavior of some of their key programs where they were able to measure performance running on “conventional” multi-core processors, perhaps, something like the quad-core machine I recently installed for my desktop that uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the results that Sandia Labs published may turn on this crucial point.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This Sandia Labs simulation then extrapolates out to 16 cores per socket (and beyond), simulating the manufacturer adding more cores to the die, apparently &lt;I style="mso-bidi-font-style: normal"&gt;leaving the memory architecture fundamentally unchanged&lt;/I&gt; as they moved to more cores. The Sandia Labs chart in Figure 1 is labeled to indicate that the memory bandwidth was held constant at 10 GB/sec. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;This is more than a little suspicious. Hardware manufacturers like Intel and AMD understand clearly that the memory bus has to scale with the number of processors. The AMD &lt;/FONT&gt;&lt;A title="HyperTransport specifications" href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency" mce_href="http://www.hypertransport.org/default.cfm?page=TechnologyLowLatency"&gt;&lt;FONT size=3&gt;HyperTransport&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; bus architecture is quite explicit about this, and the latest spec for &lt;/FONT&gt;&lt;A title="HyperTransport 3.1" href="http://blogs.msdn.com/controlpanel/blogs/Exactly%20what%20the%20Sandia%20Labs%20folks%20are%20reporting%20here%20is%20a%20little%20sketchy.%20Presumably,%20the%20simulations%20are%20based%20on%20observing%20the%20behavior%20of%20some%20of%20their%20key%20programs%20where%20they%20were%20able%20to%20measure%20performance%20running%20on%20“conventional”%20multi-core%20processors,%20perhaps,%20something%20like%20the%20quad-core%20machine%20I%20recently%20installed%20for%20my%20desktop%20that%20uses%20a%20memory%20bus%20with%20bandwidth%20in%20the%20range%20of%2010%20GB/sec.%20The%20press%20release%20seems%20to%20imply%20that%20the%20Sandia%20Labs%20baseline%20measurements%20were%20taken%20on%20current%20quad-core%20machines%20from%20Intel%20like%20mine,%20not%20the%20newer%20Nehalem%20processors%20where%20the%20memory%20architecture%20has%20been%20re-worked%20extensively.%20How%20useful%20or%20meaningful%20the%20results%20that%20Sandia%20Labs%20published%20may%20turn%20on%20this%20crucial%20point." mce_href="http://blogs.msdn.com/controlpanel/blogs/Exactly what the Sandia Labs folks are reporting here is a little sketchy. 
Presumably, the simulations are based on observing the behavior of some of their key programs where they were able to measure performance running on “conventional” multi-core processors, perhaps, something like the quad-core machine I recently installed for my desktop that uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the results that Sandia Labs published may turn on this crucial point."&gt;&lt;FONT size=3&gt;HyperTransport version 3.1&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; has an aggregate bandwidth in excess of 50 GB/sec.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;Meanwhile, the memory bus capacity on the latest &lt;/FONT&gt;&lt;A title="Nhealem architecture announcement" href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/04/01/thoughts-on-intel-s-recent-hardware-announcements.aspx"&gt;&lt;FONT size=3&gt;Nehalem&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;-class processors from Intel has been boosted significantly. Alternatively, it is when you cannot scale the memory bus with processor capacity that machines with &lt;/FONT&gt;&lt;A title="Blogging about NUMA 2008" href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx"&gt;&lt;FONT size=3&gt;NUMA&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt; architectures become more attractive. The AMD processors use HyperTransport links&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;in a ring topology that implicitly leads to NUMA-characteristics. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;In Intel’s approach to NUMA scalability, some small number of processors share a common memory bus, forming a &lt;I style="mso-bidi-font-style: normal"&gt;node&lt;/I&gt;. Current Nehalem machines (also known as the Core i7 architecture) have four cores sharing the Front-side memory bus (FSB). The physical layout of this chip is photographed in Figure 2, showing four cores, connected to DDR3 DRAM using an integrated memory controller. I wasn’t able to come find a speed rating for the FSB in the Nehalem on Intel’s web site or elsewhere, other than ballpark estimates that puts its speed in the range of 30-40 GB/sec. The QuickConnect technology links that are used to link memory controllers support 25 GB/sec transfers, which is probably a safe lower bound on the capacity of the FSB. &lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;&lt;IMG style="WIDTH: 526px; HEIGHT: 363px" title="Core i7 4-way multiprocessor photo" alt="Core i7 4-way multiprocessor photo" src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg" width=526 height=363 mce_src="http://5l3vgw.bay.livefilestore.com/y1pS_IQwDypWmRE8pD4mgMliuhbypb0uOI730CaN7MKi5QtXsiDyzMJ9eE4o2-kp03n19hsvrPV-MEMRRbv9L1d3Q/Nehalem%20multicore%20chip%20photo.jpg"&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#4f81bd size=2&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;STRONG&gt;Figure 2. Aerial photograph showing the layout of the 4-way Core i7 (Nehalem) microprocessor chip.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The &lt;/FONT&gt;&lt;A href="https://cfwebprod.sandia.gov/cfdocs/CCIM/docs/pim-mpi.pdf"&gt;&lt;FONT color=#0000ff size=3&gt;PIM architecture&lt;/FONT&gt;&lt;/A&gt;&lt;FONT color=#000000 size=3&gt;, whose scalability curves are close to ideal for the Sandia Labs workloads is, probably not coincidentally, a processor architecture championed at Sandia Labs. The idea behind PIM machines &lt;SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Calibri','sans-serif'; FONT-SIZE: 11pt; mso-ascii-theme-font: minor-latin; mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin; mso-bidi-font-family: 'Times New Roman'; mso-bidi-theme-font: minor-bidi; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"&gt;(&lt;U&gt;P&lt;/U&gt;rocessor &lt;U&gt;I&lt;/U&gt;n &lt;U&gt;M&lt;/U&gt;emory) &lt;/SPAN&gt;is that the processor (or processors) is embedded into the memory chip itself, which is a pretty interest approach to solving the “memory wall” that limits performance in today’s dominant computer architectures. Instead of loading up the microprocessor socket with more and more cores, which is the professed hardware roadmap at Intel &amp;amp; AMD, integrating memory into the socket is an intriguing alternative. Such machines, if anyone were to build them, would obviously have NUMA performance characteristics.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT color=#000000 size=3&gt;The debate is a bit academic for my taste, however, until these PIM architecture machines are a reality. For PIM architecture machines to ever get traction, either the microprocessor manufacturers would have to start building DRAM chips or the DRAM manufacturers would have to start building microprocessors. The way the semiconductor fabrication business is stratified today, that does not appear to be very likely in the near future.&lt;/FONT&gt;&lt;/P&gt;&lt;FONT color=#000000&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;So, in the end, the point of the Sandia Labs press release appears to be trying to publicize the multiprocessor hardware direction espoused mainly by Sandia Labs’ own researchers. Frankly, there have been lots and lots of different architectural approaches to parallel processing over the years, and it doesn’t look like any one approach is optimal for all computing situations. You ought to be pick another parallel programming workload to simulate in Figure 1 and get an entirely different ranking of the approaches.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;Still, the Sandia Labs simulation data are interesting mainly for they say about how difficult it is going to be for developers to write parallel programs that scale well on multi-core machines. No, achieving parallel isn’t child’s play for hardware manufacturers. Nor is it for software developers attempting to take advantage of parallel processing hardware, which is the subject I will start to drill into next time.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoCaption&gt;&lt;A title="Continue to Part 2" href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx" mce_href="http://blogs.msdn.com/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx"&gt;Continue to Part 2....&lt;/A&gt;.&lt;/P&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9481780" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Scalability/default.aspx">Scalability</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Parallel+programming/default.aspx">Parallel programming</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item><item><title>Visual Studio 2010 Hardware Requirements</title><link>http://blogs.msdn.com/ddperf/archive/2008/12/23/visual-studio-2010-hardware-requirements.aspx</link><pubDate>Wed, 24 Dec 2008 10:56:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9251566</guid><dc:creator>David Berg</dc:creator><slash:comments>16</slash:comments><comments>http://blogs.msdn.com/ddperf/comments/9251566.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ddperf/commentrss.aspx?PostID=9251566</wfw:commentRss><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Soma’s been talking about the upcoming Visual Studio 2010 release on his &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/somasegar/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;, which means I’m starting to get questions about what type of hardware you’re going to need to run VS2010 on.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;Unfortunately, I can’t give you an official answer yet (other than to say, it depends on what you’re doing – obviously building small apps with one of the Express versions of Visual Studio won’t require the same resources as a multi-million line app using full blown Visual Studio Team System with lots of third party add-ins).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;What I can do is help put some of the things we’ve said about Visual Studio 2010 into context, to maybe help you make some better hardware decisions today:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpFirst style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;1)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;Memory – we’re trying to make VS2010 as frugal as we can here in order to run in as little memory as possible; however, we’re also adding a lot of functionality, and systems with more memory do tend to perform much better.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;So the general rule of buying systems still applies – spring for as much memory as you can afford.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;It’s hard to have too much memory; at a minimum, you want to make sure that you’re not paging.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;That said, there’s very little benefit to making a text editor 64-bit (and lots of reasons not to), so anything over 4GB is likely to be wasted (unless you’re running or writing apps that need more).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; mso-add-space: auto"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;2)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;CPU – modern CPUs with their larger caches and tuned instruction pipelines tend to perform much better than one’s from just a few years ago (see our &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx"&gt;&lt;FONT face=Calibri color=#0000ff size=3&gt;blog&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt;).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you’re going to do multi-threaded programming, you’ll want at least a dual core processor (and with the new &lt;/FONT&gt;&lt;A href="http://msdn.microsoft.com/en-us/concurrency/default.aspx"&gt;&lt;FONT face=Calibri size=3&gt;Parallel Computing&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri size=3&gt; support in VS 2010, you &lt;U&gt;will&lt;/U&gt; want to do multi-threaded programming).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;3)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT face=Calibri size=3&gt;GPU – VS2010 will leverage WPF heavily to create richer editing and visualizations, so a decent GPU that supports at least DX9 is highly recommended (DX10 is preferred, but requires Vista).&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpMiddle style="MARGIN: 0in 0in 0pt 0.5in"&gt;&lt;o:p&gt;&lt;FONT face=Calibri size=3&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoListParagraphCxSpLast style="MARGIN: 0in 0in 10pt 0.75in; TEXT-INDENT: -0.5in; mso-add-space: auto; mso-list: l0 level1 lfo1"&gt;&lt;SPAN style="mso-fareast-font-family: Calibri; mso-fareast-theme-font: minor-latin; mso-bidi-font-family: Calibri; mso-bidi-theme-font: minor-latin"&gt;&lt;SPAN style="mso-list: Ignore"&gt;&lt;FONT face=Calibri size=3&gt;4)&lt;/FONT&gt;&lt;SPAN style="FONT: 7pt 'Times New Roman'"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;Disk – If you’re building a large project or working with a large database, a large high-speed disk is pretty important.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;For large projects, you can often benefit by spreading your work across multiple disk spindles.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;At an extreme, putting your tools on one drive, your source code on another, and your object files on a third drive allows the three major sources of disk IO in building a project to be carried out independently of each other.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If you have to use a slower disk (e.g., in a notebook) then be sure to get lots of memory.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Also keep in mind that modern hard drives tend to have more built-in caching, so the same speed drive bought recently will likely outperform one bought a few years ago.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 10pt"&gt;&lt;FONT face=Calibri size=3&gt;So now that I’ve given you my thoughts on what hardware Visual Studio 2010 will need, what are your thoughts? &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;What kind of hardware are you developing on today, and what do you expect to be using in the next couple years?&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;What are your expectations on how we should be leveraging your hardware to create a productive development environment?&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9251566" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ddperf/archive/tags/Performance/default.aspx">Performance</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Visual+Studio/default.aspx">Visual Studio</category><category domain="http://blogs.msdn.com/ddperf/archive/tags/Hardware/default.aspx">Hardware</category></item></channel></rss>