<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx</link><description>This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. Along the way we discuss useful benchmarking tools, how to validate</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Just another day in the perf lab</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8618564</link><pubDate>Thu, 19 Jun 2008 00:55:22 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8618564</guid><dc:creator>Rico Mariani's Performance Tidbits</dc:creator><description>&lt;p&gt;Even though I've been doing general architecture work on Visual Studio for nearly a year now, my friends&lt;/p&gt;</description></item><item><title>re: Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8619745</link><pubDate>Thu, 19 Jun 2008 04:48:48 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8619745</guid><dc:creator>Antonio D</dc:creator><description>&lt;p&gt;To explain the memory bandwidth difference...&lt;/p&gt;
&lt;p&gt;The Pentium 4 (830D) was a more memory-hungry architecture. At the risk of gross oversimplification, I would say that the 830D aggressively transfers memory into its cache, even if it ends up not using it. So the bandwidth measured by a benchmark in a best-case scenario could be much higher than the actual bandwidth that your software can use.&lt;/p&gt;
&lt;p&gt;About the amount of cache used by a thread.&lt;/p&gt;
&lt;p&gt;I was under the impression that Windows scheduled threads on different cores, i.e., thread x is not guaranteed and is not going to run always on the first core of the processor. So if you have two concurrent threads, they are probably running both on both (or all four) cores. In this case the amount of cache used by each thread would be undetermined (except by further analysis).&lt;/p&gt;
&lt;p&gt;My understanding is that the part where you say that a thread gets more of the cache when it needs it is true regardless of the number of cores or whether these cores share a cache. What am I missing?&lt;/p&gt;
</description></item><item><title>re: Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8620320</link><pubDate>Thu, 19 Jun 2008 08:28:47 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8620320</guid><dc:creator>MarkBFriedman</dc:creator><description>&lt;p&gt;Antonio: &lt;/p&gt;
&lt;p&gt;It looks like we should have been a little clearer about what we meant when we used the word &amp;quot;thread.&amp;quot; Sorry about that. (Reminds me of the famous words of a former US President and semiotician, &amp;quot;It depends on what the meaning of the word &amp;quot;is&amp;quot; is.&amp;quot;) &lt;/p&gt;
&lt;p&gt;From the standpoint of the OS, the thread is the dispatchable unit. From the standpoint of the CPU, a thread is any set of executing instructions that aren't executing an Idle loop. There are software and hardware guys collaborating on this post. (It may be unusual, but they do get along -- most days.) And while we knew what we meant, it seems we used &amp;quot;thread&amp;quot; without clearly distinguishing the two meanings and contexts.&lt;/p&gt;
&lt;p&gt;Knowing the author and his tendencies, my guess is that the footnote about threads sharing the cache was written from the hardware perspective.&lt;/p&gt;
&lt;p&gt;On the new lab machines, there is a dedicated L1 cache for each processor core, and a shared L2 cache that each processor core can access. The L2 cache is dynamically allocated. If CPU A is idle and CPU B is cranking, CPU B is capable of allocating the entire L2 cache. (If you don't expect the CPUs sharing the cache to all be cranking all of the time, this is probably a good approach.) I hope that clarifies the point.&lt;/p&gt;
&lt;p&gt;Of course, I am a software guy, so, from the Windows point of view, let me also try to clarify your thread dispatching question:&lt;/p&gt;
&lt;p&gt;It is true that &amp;quot;thread x is not guaranteed and is not going to run always on the first core of the processor.&amp;quot; Having said that, however, the statement that follows isn't entirely true: &amp;quot;So if you have two concurrent threads, they are probably running both on both (or all four) cores.&amp;quot;&lt;/p&gt;
&lt;p&gt;Yes and no. On a symmetric multiprocessor (SMP), a thread by default tends to be a bit sticky to the processor it was last dispatched on. This is called &amp;quot;soft affinity&amp;quot; and is done to increase the probability that a cache warm start will occur. This stickiness is especially noticeable when the processors are lightly loaded. The stickiness is also prominent in the WinSat benchmarks described here, which run single-threaded and were run in isolation.&lt;/p&gt;
&lt;p&gt;But, in general, you are correct, and you will often observe threads switching back and forth between available processors. Because thread scheduling is priority-based and preemptive, and User mode threads are typically subject to dynamic priority adjustments, once the machine is loaded, threads will usually wander (somewhat randomly) from CPU to CPU.&lt;/p&gt;
&lt;p&gt;The SMP soft affinity scheduling algorithm is roughly as follows: A waiting thread that transitions to the Ready state has an &amp;quot;ideal processor&amp;quot; where it will run if that processor is currently idle or running a lower priority thread. If the ideal (i.e., last) processor is busy or running a higher priority thread, but another processor is idle or running a lower priority thread, the ready thread will be scheduled there. This is the preemptive scheduling bit -- the highest priority Ready threads are always dispatched. &lt;/p&gt;
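&lt;p&gt;The dispatch rule described above can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions, not the actual Windows dispatcher: the names Cpu, can_run, and pick_cpu are hypothetical, priorities are plain integers where a larger number means higher priority, and a ready thread that finds no eligible processor simply waits.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cpu:
    """Hypothetical model of one processor for this sketch."""
    cpu_id: int
    running_priority: Optional[int] = None  # None means the CPU is idle


def can_run(cpu, priority):
    """True if the CPU is idle or is running a lower-priority thread."""
    return cpu.running_priority is None or priority > cpu.running_priority


def pick_cpu(cpus, ideal_id, priority):
    """Pick a CPU for a thread that has just become Ready.

    The ideal (last-used) processor is tried first; failing that, any
    other processor that is idle or running lower-priority work is used.
    Returns the chosen cpu_id, or None if the thread must keep waiting.
    """
    ideal = cpus[ideal_id]
    if can_run(ideal, priority):
        return ideal.cpu_id
    for cpu in cpus:
        if can_run(cpu, priority):
            return cpu.cpu_id
    return None
```

&lt;p&gt;In this model, with CPU 0 running a priority-8 thread and CPU 1 idle, a ready priority-8 thread whose ideal processor is CPU 0 is dispatched on CPU 1, while a ready priority-10 thread preempts the work on CPU 0.&lt;/p&gt;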
&lt;p&gt;You will find more details in my Windows Server 2003 Resource Kit &amp;quot;Performance Guide&amp;quot; book: the priority scheme, hard processor affinity, etc. The ntttcp program discussed in my recent &amp;quot;Mainstream NUMA and the TCP/IP stack&amp;quot; post used hard processor affinity, for example, to ensure that all network processing was confined to a single CPU, which is why I only showed what was happening on that one CPU. Hard affinity is the exception, though, not the rule.&lt;/p&gt;
&lt;p&gt;Once you move to a NUMA architecture (see my earlier blog posts on this subject; like it or not, NUMA is in your future if you are running server-class machines), the thread scheduling scheme gets node-oriented. (Physical memory allocations are also node-oriented on NUMA machines.) Once scheduled on a node, a thread is likely to continue to be scheduled to run on one of that node's CPUs. (Subject to availability, similar to the SMP case.)&lt;/p&gt;
&lt;p&gt;This NUMA node-oriented soft affinity scheme works pretty well when the L2 cache is a resource that is shared by all the processor cores on the socket. In today's multi-core machines, so long as the thread is re-dispatched to a CPU on the same socket (or node) where it last ran, the thread will likely benefit from an L2 cache warm start. But for an L1 cache warm start, the thread still has to be dispatched on its ideal (and still preferred) processor since that resource is dedicated.&lt;/p&gt;
&lt;p&gt;This description of the behavior of the Windows Scheduler is also worth mentioning here because of its significance to my earlier &amp;quot;Mainstream NUMA and the TCP/IP stack&amp;quot; posting. In the next part of the &amp;quot;Mainstream NUMA&amp;quot; post, which I hope to have ready in another week or so, I will try to make this connection explicit.&lt;/p&gt;
&lt;p&gt;So, thanks for keeping us honest and stay tuned!&lt;/p&gt;
&lt;p&gt;-- Mark&lt;/p&gt;
</description></item><item><title>re: Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8621455</link><pubDate>Thu, 19 Jun 2008 15:19:03 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8621455</guid><dc:creator>Antonio D</dc:creator><description>&lt;p&gt;Thank you for the answer. I feel enlightened now!&lt;/p&gt;
&lt;p&gt;Especially about soft affinity. I kind of always worried about that.&lt;/p&gt;
</description></item><item><title>Performance analysis of multi-core systems</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8623010</link><pubDate>Thu, 19 Jun 2008 23:30:48 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8623010</guid><dc:creator>Jonathan Hardwick</dc:creator><description>&lt;p&gt;One of our main roles in DevDiv Performance Engineering is to help other teams with performance investigations&lt;/p&gt;
</description></item><item><title>re: Lessons from the test lab: investigating a pleasant surprise</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#8680410</link><pubDate>Wed, 02 Jul 2008 09:57:43 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8680410</guid><dc:creator>Mohit Nanda</dc:creator><description>&lt;p&gt;Thanks, Mark, for the enlightening details about 'Soft Affinity'; the reference to your Win2003 Performance Engineering Handbook was also useful.&lt;/p&gt;
&lt;p&gt;Looking forward to the next part of the &amp;quot;Mainstream NUMA&amp;quot; post.&lt;/p&gt;
</description></item><item><title>Visual Studio 2010 Hardware Requirements</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#9251567</link><pubDate>Wed, 24 Dec 2008 11:00:07 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9251567</guid><dc:creator>(Semi) Official Developer Division Performance Engineering blog</dc:creator><description>&lt;p&gt;Soma’s been talking about the upcoming Visual Studio 2010 release on his blog , which means I’m starting&lt;/p&gt;
</description></item><item><title>Are we taking advantage of Parallelism?</title><link>http://blogs.msdn.com/ddperf/archive/2008/06/18/lessons-from-the-test-lab-investigating-a-pleasant-surprise.aspx#9584047</link><pubDate>Sun, 03 May 2009 01:41:23 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9584047</guid><dc:creator>Developer Division Performance Engineering blog</dc:creator><description>&lt;p&gt;Recently, a colleague of mine, Mark Friedman, posted a blog titled “ Parallel Scalability Isn’t Child’s&lt;/p&gt;
</description></item></channel></rss>