<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx</link><description>One of the intriguing aspects of the onset of the many-core processor era is the necessity of using parallel programming techniques to reap the performance benefits of this and future generations of processor chips. Instead of significantly faster processors</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>Parallel Scalability Isn’t Child’s Play</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#9481952</link><pubDate>Mon, 16 Mar 2009 23:44:51 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9481952</guid><dc:creator>Developer Division Performance Engineering blog</dc:creator><description>&lt;p&gt;In a recent blog entry, Dr. 
Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9481952" width="1" height="1"&gt;</description></item><item><title>New NUMA Support with Windows Server 2008 R2 and Windows 7</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#9252046</link><pubDate>Wed, 24 Dec 2008 21:53:03 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9252046</guid><dc:creator>Regarding Windows Server 2008 R2</dc:creator><description>&lt;p&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors (LP)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9252046" width="1" height="1"&gt;</description></item><item><title>Mainstream NUMA and the TCP/IP stack: Part II: Programming ccNUMA machines</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8784837</link><pubDate>Mon, 28 Jul 2008 19:51:55 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8784837</guid><dc:creator>(Semi) Official Developer Division Performance Engineering blog</dc:creator><description>&lt;p&gt;This is a continuation of Part I of this article posted here . In Part 1 of this article, we looked at&lt;/p&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8784837" width="1" height="1"&gt;</description></item><item><title>re: Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8680655</link><pubDate>Wed, 02 Jul 2008 11:12:22 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8680655</guid><dc:creator>gopal</dc:creator><description>&lt;p&gt;I hope you noted that when you tested and saturated the CPU, IPsec was probably enabled and that took most of your CPU. Otherwise, CPU utilization would have been much lower (less than 25% on a mainstream dual-core machine).&lt;/p&gt;
&lt;p&gt;Yes, your concern remains valid: the stack should be able to take advantage of the multiple cores present on the machine.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8680655" width="1" height="1"&gt;</description></item><item><title>re: Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8642786</link><pubDate>Mon, 23 Jun 2008 19:37:52 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8642786</guid><dc:creator>Mark B Friedman</dc:creator><description>&lt;p&gt;Manu,&lt;/p&gt;
&lt;p&gt;Thanks for the correction &amp;amp; amplification on ethernet bit rates.&lt;/p&gt;
&lt;p&gt;The 1000BaseX standard is mainly for fiber connections where the nominal data rate is 1 Gb/sec.&lt;/p&gt;
&lt;p&gt;In my test I was running 1000BaseT over cat 6 wiring, in which case it looks like I would get the following (which looks more like 8/12 encoding):&lt;/p&gt;
&lt;p&gt;&amp;quot;The data is transmitted over four copper pairs, eight bits at a time. First, eight bits of data are expanded into four 3-bit symbols through a non-trivial scrambling procedure based on a linear feedback shift register; this is similar to what is done in 100BASE-T2, but uses different parameters.&amp;quot; &lt;/p&gt;
&lt;p&gt;So, if I am reading this correctly, the nominal data rate is even less than what I reported in the test I ran. I will look into this a little more and correct the post accordingly. Thanks, again.&lt;/p&gt;
&lt;p&gt;Hey, what are a few megabytes/sec here and there between friends?&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8642786" width="1" height="1"&gt;</description></item><item><title>re: Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8641758</link><pubDate>Mon, 23 Jun 2008 13:15:14 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8641758</guid><dc:creator>manu</dc:creator><description>&lt;p&gt;Thanks for your clarifications. I just wanted to raise the clock vs. instruction issue and, as you rightly pointed out, these things depend heavily on CPU architecture and are becoming increasingly complicated to calculate.&lt;/p&gt;
&lt;p&gt;For ethernet line rates, you may want to look at&lt;/p&gt;
&lt;p&gt;&lt;a rel="nofollow" target="_new" href="http://en.wikipedia.org/wiki/Gigabit_Ethernet"&gt;http://en.wikipedia.org/wiki/Gigabit_Ethernet&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;excerpt :&lt;/p&gt;
&lt;p&gt;--------&lt;/p&gt;
&lt;p&gt;&amp;quot;The IEEE 802.3z standard includes 1000BASE-SX for transmission over multi-mode fiber, 1000BASE-LX for transmission over single-mode fiber, and the nearly obsolete 1000BASE-CX for transmission over balanced copper cabling. These standards use 8B/10B encoding, which inflates the line rate by 25%, from 1000 Mbit/s to 1250 Mbit/s to ensure a DC balanced signal. The symbols are then sent using NRZ..&amp;quot;&lt;/p&gt;
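&lt;p&gt;As a rough check of the figures in the excerpt above, the 8B/10B arithmetic can be sketched in a few lines of Python. The function names are illustrative only and do not come from any standard or library.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch of the 8B/10B arithmetic quoted in the excerpt.
# All names here are illustrative assumptions.

def line_rate_mbps(data_rate_mbps, data_bits=8, code_bits=10):
    """Return the signalling rate on the wire for a block code that
    sends code_bits for every data_bits of payload."""
    return data_rate_mbps * code_bits / data_bits


def overhead_percent(data_bits=8, code_bits=10):
    """Return how much the block code inflates the line rate, in percent."""
    return (code_bits - data_bits) / data_bits * 100


gigabit_line_rate = line_rate_mbps(1000)   # Mbit/s on the wire
inflation = overhead_percent()             # percent above the data rate
data_rate_mbytes = 1000 / 8                # data rate in megabytes/sec

print(gigabit_line_rate, inflation, data_rate_mbytes)
```
&lt;/pre&gt;
&lt;p&gt;With these inputs the line rate works out to 1250 Mbit/s, a 25% inflation over the 1000 Mbit/s data rate, and the data rate itself is 125 megabytes/sec.&lt;/p&gt;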
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8641758" width="1" height="1"&gt;</description></item><item><title>re: Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8627477</link><pubDate>Sat, 21 Jun 2008 00:51:16 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8627477</guid><dc:creator>Mark B Friedman</dc:creator><description>&lt;p&gt;Manu,&lt;/p&gt;
&lt;p&gt;Thanks for the comment.&lt;/p&gt;
&lt;p&gt;On the nominal data rate of an Ethernet link using 10/8 encoding, I am going to hold my ground. But you are certainly correct when you ask me to account for the Ethernet preamble, postamble, and the header itself. Ethernet is actually not capable of delivering the full nominal data rate to the next higher level protocol. I deliberately neglected that detail here, but it is an important detail that people are prone to overlook.&lt;/p&gt;
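&lt;p&gt;To put a rough number on the framing overhead, here is a minimal sketch in Python. It assumes standard untagged Ethernet framing (8 bytes of preamble and start-of-frame delimiter, a 14-byte header, a 4-byte frame check sequence, and a 12-byte interframe gap) and a 1500-byte payload; the names are illustrative, and IP and TCP headers are not counted.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch only: per-frame overhead for standard untagged Ethernet, in bytes.
PREAMBLE_AND_SFD = 8   # preamble plus start-of-frame delimiter
HEADER = 14            # destination MAC, source MAC, EtherType
FCS = 4                # frame check sequence (the trailer)
INTERFRAME_GAP = 12    # mandatory idle time between frames


def payload_rate_mbps(nominal_mbps, payload_bytes=1500):
    """Return the share of the nominal data rate left for the payload
    handed to the next higher protocol layer."""
    wire_bytes = (PREAMBLE_AND_SFD + HEADER + payload_bytes
                  + FCS + INTERFRAME_GAP)
    return nominal_mbps * payload_bytes / wire_bytes


# Assumed example: full-size frames on a 1000 Mbit/s link.
print(f"{payload_rate_mbps(1000):.1f} Mbit/s of payload")
```
&lt;/pre&gt;
&lt;p&gt;Under these assumptions, full-size frames deliver roughly 975 Mbit/s of payload on a 1000 Mbit/s link, and smaller frames deliver proportionally less.&lt;/p&gt;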
&lt;p&gt;On the instruction execution rate vs. clock cycle rate, your comment pushed me to make a correction I should have made earlier. If you double-check above, I changed &amp;quot;instructions&amp;quot; to &amp;quot;clocks&amp;quot; (or cycles) in equation 2 &amp;amp; 3 because that is clearly what I was measuring here.&lt;/p&gt;
&lt;p&gt;I should not have been so free about mixing up instruction execution rate and clocks, so thank you for making me clarify that. &lt;/p&gt;
&lt;p&gt;In this article I wanted to avoid discussing the Intel microarchitecture. But your question invites me to deal with it a little bit. Please understand that in writing &amp;quot;instructions&amp;quot; instead of &amp;quot;clocks&amp;quot; in Equations 2 &amp;amp; 3, I was deliberately being conservative. My expectation for this workload is that the CPI (cycles per instruction) would actually be less than 1, although I admit I did not go to the trouble of measuring that in this case.&lt;/p&gt;
&lt;p&gt;To explain...&lt;/p&gt;
&lt;p&gt;You are correct when you say instructions require multiple clock cycles to complete. Instruction execution is pipelined, so even simple instructions require multiple clock cycles to execute. In the original 486 and Pentium chips, the pipeline depth was 5, I believe. Starting with the P6 microarchitecture, though, things got considerably more complicated. The goal of a pipelined architecture is to achieve a CPI of 1, but for a variety of reasons (whenever it is necessary to stall the pipeline) it is almost impossible to get there. (For really deep background on this, I recommend reading &amp;quot;Computer Architecture&amp;quot; by Hennessy and Patterson. If that looks intimidating, try their introductory textbook &amp;quot;Computer Organization and Design.&amp;quot;)&lt;/p&gt;
&lt;p&gt;The current microarchitecture aims to do better than 1 CPI. The processor does not execute native x86 or x64 instructions directly. It translates native instructions into micro-ops, and these can be executed out of order &amp;amp; in parallel. The microarchitecture does ensure that instructions are &amp;quot;retired&amp;quot; in their original sequence. (The most important performance metric is the rate at which native x86 instructions are &amp;quot;retired&amp;quot; per second.) With Intel processors today, the pipeline depth is model-dependent (and more like 10-20 cycles per instruction), but they are all capable of retiring as many as 3 or 4 instructions per clock cycle. &lt;/p&gt;
&lt;p&gt;For critical code paths like the TCP/IP stack in Windows, it is safe to assume that you are executing optimized code that will retire more than 1 instruction per clock. That is why I felt safe using &amp;quot;instructions&amp;quot; without further explanation in this context, but &amp;quot;clocks&amp;quot; or &amp;quot;cycles&amp;quot; is still better.&lt;/p&gt;
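&lt;p&gt;The relationship between clock rate and instruction rate can be put in numbers. This is only a sketch in Python: the 2.2 GHz figure comes from this thread, while the CPI values are hypothetical, since CPI was not measured for this workload.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch only: the CPI values below are hypothetical, not measured.

def instructions_per_second(clock_hz, cpi):
    """Return the instruction retirement rate implied by a clock rate
    and an average cycles-per-instruction (CPI) figure."""
    return clock_hz / cpi


CLOCK_HZ = 2.2e9  # 2.2 GHz, the clock rate mentioned in this thread

for cpi in (2.0, 1.0, 0.5):
    rate = instructions_per_second(CLOCK_HZ, cpi)
    print(f"CPI {cpi}: {rate / 1e9:.1f} billion instructions/sec")
```
&lt;/pre&gt;
&lt;p&gt;A CPI below 1 means more than one instruction is retired per clock, so treating clocks as instructions understates the instruction rate rather than overstating it.&lt;/p&gt;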
&lt;p&gt;Thanks, again.&lt;/p&gt;
&lt;p&gt;-- Mark&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8627477" width="1" height="1"&gt;</description></item><item><title>re: Mainstream NUMA and the TCP/IP stack: Part I.</title><link>http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx#8626071</link><pubDate>Fri, 20 Jun 2008 20:47:31 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8626071</guid><dc:creator>manu</dc:creator><description>&lt;p&gt;&amp;quot;With 10/8 encoding, a 10 Megabit Ethernet card has a nominal data rate of 1 Megabyte/sec, a 100 Mb Ethernet NIC transmits data at a 10 MB rate, etc. Similarly, the 10 Gb Ethernet card has the capacity to transmit application data at 1 GB/sec.&lt;/p&gt;
&lt;p&gt;&amp;quot;&lt;/p&gt;
&lt;p&gt;This is a rather wrong assumption. A 1 Gbps Ethernet link generally has a line rate of 1.25 Gbps in order to deliver 1 Gbps of data (1000/8 = 125 MB/sec). Because of the overhead of the Ethernet frame, though, the actual data throughput to the higher layer will be less.&lt;/p&gt;
&lt;p&gt;Secondly, your instruction calculation is less convincing. 2.2 GHz does not mean 2.2 billion instructions per second; it is just that many clock ticks. One instruction can take more than one tick.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=8626071" width="1" height="1"&gt;</description></item></channel></rss>