<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Van's House : Optimization</title><link>http://blogs.msdn.com/xiangfan/archive/tags/Optimization/default.aspx</link><description>Tags: Optimization</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Detect Shift Overflow</title><link>http://blogs.msdn.com/xiangfan/archive/2009/06/13/detect-shift-overflow.aspx</link><pubDate>Sat, 13 Jun 2009 08:19:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9742030</guid><dc:creator>xiangfan</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/xiangfan/comments/9742030.aspx</comments><wfw:commentRss>http://blogs.msdn.com/xiangfan/commentrss.aspx?PostID=9742030</wfw:commentRss><description>&lt;P&gt;This is an intellectual exercise: when shifting a 32-bit unsigned integer left in C++, how can we efficiently detect whether the calculation overflows?&lt;/P&gt;
&lt;P&gt;Here is the function prototype. shl_overflow returns true if v &amp;lt;&amp;lt; cl overflows, that is, if any set bit is shifted out of the 32-bit result. We assume that cl is between 0 and 31, that sizeof(unsigned long) == 4, and that sizeof(unsigned long long) == 8.&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;bool&lt;/SPAN&gt; shl_overflow(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; v, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; cl)&lt;/P&gt;
&lt;P&gt;The most natural way to implement this function is to extend v to 64-bit integer:&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;bool&lt;/SPAN&gt; shl_overflow(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; v, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; cl)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; vl = v;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;return&lt;/SPAN&gt; (vl &amp;lt;&amp;lt; cl &amp;gt;&amp;gt; 32) != 0;&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;Now let’s look at the generated assembly. We’ll limit the discussion to x86.&lt;BR&gt;&lt;BR&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, DWORD PTR _v$[esp-4]&lt;BR&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ecx, DWORD PTR _cl$[esp-4]&lt;BR&gt;xor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; edx, edx&lt;BR&gt;call&amp;nbsp;&amp;nbsp;&amp;nbsp; __allshl&lt;BR&gt;xor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, eax&lt;BR&gt;or&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, edx&lt;BR&gt;jne&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; overflow&lt;/P&gt;
&lt;P&gt;This implementation is tied to three specific registers (eax, edx and ecx), and it makes an expensive call to an external helper function.&lt;/P&gt;
&lt;P&gt;If you step into __allshl in the debugger, you will find that it uses shld to shift the 64-bit integer. VC provides intrinsics that map to CPU instructions; for example, __ll_lshift maps to shld.&lt;/P&gt;
&lt;P&gt;Because the high dword of vl is 0, we can simplify our code:&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;bool&lt;/SPAN&gt; shl_overflow(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; v, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; cl)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; vl = __ll_lshift(v, cl);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;return&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;static_cast&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;FONT color=#0000ff&gt;long&lt;/FONT&gt;&amp;gt;(vl &amp;gt;&amp;gt; 32)) != 0;&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;The assembly looks like:&lt;/P&gt;
&lt;P&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, DWORD PTR _v$[esp-4]&lt;BR&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ecx, DWORD PTR _cl$[esp-4]&lt;BR&gt;xor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; edx, edx&lt;BR&gt;shld&amp;nbsp;&amp;nbsp;&amp;nbsp; edx, eax, cl&lt;BR&gt;test&amp;nbsp;&amp;nbsp;&amp;nbsp; edx, edx&lt;BR&gt;jne&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; overflow&lt;/P&gt;
&lt;P&gt;Much better now.&lt;/P&gt;
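&lt;P&gt;When experimenting with variants like these, a portable reference to compare against is handy. The following is only a minimal sketch, not the code discussed above: it uses fixed-width integer types instead of assuming the sizes of unsigned long and unsigned long long, and the name shl_overflow_ref is made up for illustration.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference implementation (the name is chosen for this sketch).
// Widens v to 64 bits, shifts, and reports whether any bit landed in the
// upper 32 bits, i.e. whether v << cl loses bits in 32-bit arithmetic.
bool shl_overflow_ref(std::uint32_t v, int cl)
{
    assert(cl >= 0 && cl <= 31);  // same precondition as in the post
    const std::uint64_t wide = static_cast<std::uint64_t>(v) << cl;
    return (wide >> 32) != 0;
}
```
</PRE>]]>
&lt;P&gt;For example, shl_overflow_ref(1, 31) is false because 1 &amp;lt;&amp;lt; 31 still fits in 32 bits, while shl_overflow_ref(2, 31) is true.&lt;/P&gt;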
&lt;P&gt;Another approach is based on the bit representation.&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;bool&lt;/SPAN&gt; shl_overflow(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; v, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; cl)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; v = _rotl(v, cl);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;long&lt;/SPAN&gt; index;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;return&lt;/SPAN&gt; _BitScanForward(&amp;amp;index, v) ? index &amp;lt; cl : &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;false&lt;/SPAN&gt;;&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;The idea is simple: v &amp;lt;&amp;lt; cl overflows exactly when at least one of the most significant cl bits of v is 1.&lt;/P&gt;
&lt;P&gt;There are two ways to test that.&lt;/P&gt;
&lt;P&gt;1. Scan v for its most significant set bit, and test the index against 32 – cl. However, we have to handle the case when cl = 0.&lt;/P&gt;
&lt;P&gt;2. Rotate v left by cl bits first, so that the most significant cl bits become the least significant cl bits. Then we can scan from the least significant bit and test the index against cl directly.&lt;/P&gt;
&lt;P&gt;Note that the scan fails if v is 0, in which case there is no overflow. The second way is simpler and more efficient.&lt;/P&gt;
&lt;P&gt;The assembly looks like:&lt;/P&gt;
&lt;P&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ecx, DWORD PTR _cl$[esp-4]&lt;BR&gt;mov&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, DWORD PTR _v$[esp-4]&lt;BR&gt;rol&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, cl&lt;BR&gt;bsf&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, eax&lt;BR&gt;je&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; notoverflow&lt;BR&gt;cmp&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, ecx&lt;BR&gt;jl&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; overflow&lt;/P&gt;
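&lt;P&gt;For readers without the VC intrinsics at hand, the rotate-and-scan idea can be sketched in portable C++. This is an illustration under assumptions rather than the code behind the assembly above: rotl32, lowest_set_bit and shl_overflow_rot are hypothetical names, and the bit scan is a plain loop instead of bsf.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by n bits (0 <= n <= 31) without intrinsics.
// The n == 0 case is split out because shifting by 32 is undefined.
std::uint32_t rotl32(std::uint32_t v, int n)
{
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

// Index of the lowest set bit; the caller must pass a nonzero value.
int lowest_set_bit(std::uint32_t v)
{
    int index = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        ++index;
    }
    return index;
}

// Hypothetical portable counterpart of the rotate-and-scan version.
bool shl_overflow_rot(std::uint32_t v, int cl)
{
    assert(cl >= 0 && cl <= 31);
    const std::uint32_t r = rotl32(v, cl);
    if (r == 0) {
        return false;  // no set bit at all, so nothing can be shifted out
    }
    // After the rotation the former top cl bits sit in positions 0..cl-1,
    // so overflow means the lowest set bit has an index below cl.
    return lowest_set_bit(r) < cl;
}
```
</PRE>]]>
&lt;P&gt;The early return for a zero value plays the role of the "je notoverflow" branch: when no bit is set, the scan has nothing to report.&lt;/P&gt;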
&lt;P&gt;The rotate-and-scan version uses only two registers, and it can be extended to handle 64-bit shifts. One drawback is the extra conditional jump. (The jump could be replaced by "cmovz eax, ecx", but there is no way to ask the compiler to generate that.)&lt;/P&gt;</description><category domain="http://blogs.msdn.com/xiangfan/archive/tags/C_2B002B00_/default.aspx">C++</category><category domain="http://blogs.msdn.com/xiangfan/archive/tags/VC/default.aspx">VC</category><category domain="http://blogs.msdn.com/xiangfan/archive/tags/Optimization/default.aspx">Optimization</category></item><item><title>Optimize Your Code: Matrix Multiplication</title><link>http://blogs.msdn.com/xiangfan/archive/2009/04/28/optimize-your-code-matrix-multiplication.aspx</link><pubDate>Tue, 28 Apr 2009 15:30:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9573500</guid><dc:creator>xiangfan</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/xiangfan/comments/9573500.aspx</comments><wfw:commentRss>http://blogs.msdn.com/xiangfan/commentrss.aspx?PostID=9573500</wfw:commentRss><description>&lt;P&gt;Matrix multiplication is common and the algorithm is easy to implement. Here is one example:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Version 1:&lt;/STRONG&gt;&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult1(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m1, T** m2, T** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] = 0;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] += m1[i][k] * m2[k][j];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;This implementation is straightforward; you can find it in textbooks and many online samples.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Version 2:&lt;/STRONG&gt;&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult2(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m1, T** m2, T** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; T c = 0;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c += m1[i][k] * m2[k][j];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] = c;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;} &lt;/P&gt;
&lt;P&gt;This version uses a temporary to hold the intermediate sum, which saves a lot of unnecessary memory writes. Note that the optimizer cannot do this for us, because it doesn't know whether "result" is an alias of "m1" or "m2".&lt;/P&gt;
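&lt;P&gt;To see why aliasing blocks the optimizer, here is a small sketch that deliberately passes m1 as the result as well. The two multiplication routines are condensed copies of Version 1 and Version 2; the helper aliased_top_left and its 2x2 inputs are made up for this illustration.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>

// Version 1 from the post: accumulates directly into result[i][j].
template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++)
                result[i][j] += m1[i][k] * m2[k][j];
        }
}

// Version 2 from the post: accumulates into a local temporary.
template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++)
                c += m1[i][k] * m2[k][j];
            result[i][j] = c;
        }
}

// Hypothetical helper: multiplies {{1,2},{3,4}} by {{5,6},{7,8}} while
// passing m1 as the result as well, and returns the top-left element.
int aliased_top_left(bool use_version2)
{
    assert(sizeof(int) >= 2);  // the values used here fit in any int
    int a0[] = {1, 2}, a1[] = {3, 4};
    int b0[] = {5, 6}, b1[] = {7, 8};
    int* m1[] = {a0, a1};
    int* m2[] = {b0, b1};
    if (use_version2)
        SeqMatrixMult2(2, m1, m2, m1);
    else
        SeqMatrixMult1(2, m1, m2, m1);
    return m1[0][0];
}
```
</PRE>]]>
&lt;P&gt;With these inputs, Version 1 leaves 14 in the top-left element while Version 2 leaves 19. Since the two versions are observably different when the arguments alias, the compiler is not allowed to turn one into the other on its own.&lt;/P&gt;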
&lt;P&gt;&lt;STRONG&gt;Version 3:&lt;/STRONG&gt;&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; Transpose(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = i + 1; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; std::swap(m[i][j], m[j][i]);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;}&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult3(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m1, T** m2, T** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; T c = 
0;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c += m1[i][k] * m2[j][k];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] = c;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;This optimization is trickier. If you profile the function, you'll find a lot of data cache misses: the inner loop reads m2[k][j] with k varying, so consecutive reads come from different rows. We transpose m2 so that both m1[i] and m2[j] are accessed sequentially in the inner loop. This can greatly improve the memory read performance.&lt;/P&gt;
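&lt;P&gt;Version 3 transposes m2 in place and restores it afterwards. If modifying the input, even temporarily, is not acceptable, one alternative is to transpose into a scratch buffer. The sketch below assumes a different storage layout from the post (one flat row-major std::vector per matrix), and the name MultWithTransposedCopy is invented for the example; it has not been benchmarked against the versions above.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant of Version 3 for row-major std::vector storage:
// b is copied into a transposed scratch buffer, so the caller's matrix is
// never modified and the inner loop still reads both operands sequentially.
std::vector<double> MultWithTransposedCopy(std::size_t n,
                                           const std::vector<double>& a,
                                           const std::vector<double>& b)
{
    assert(a.size() == n * n && b.size() == n * n);

    std::vector<double> bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    std::vector<double> result(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double c = 0;
            for (std::size_t k = 0; k < n; ++k)
                c += a[i * n + k] * bt[j * n + k];  // both walk one row
            result[i * n + j] = c;
        }
    return result;
}
```
</PRE>]]>
&lt;P&gt;The price is an extra size * size buffer; in exchange the inputs can be const and can be shared safely between threads.&lt;/P&gt;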
&lt;P&gt;&lt;STRONG&gt;Version 4:&lt;/STRONG&gt;&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult4(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m1, T** m2, T** result);&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// assume size % 2 == 0&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// assume m1[i] and m2[i] are 16-byte aligned&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// require SSE3 (haddpd)&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult4(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;double&lt;/SPAN&gt;** m1, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;double&lt;/SPAN&gt;** m2, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;double&lt;/SPAN&gt;** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;__m128d&lt;/SPAN&gt; c = _mm_setzero_pd();&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: 
rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k += 2) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&amp;amp;m1[i][k]), _mm_load_pd(&amp;amp;m2[j][k])));&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c = _mm_hadd_pd(c, c);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm_store_sd(&amp;amp;result[i][j], c);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;}&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// assume size % 4 == 0&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// assume m1[i] and m2[i] are 16-byte aligned&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;// require SSE3 (haddps)&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; SeqMatrixMult4(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;float&lt;/SPAN&gt;** m1, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;float&lt;/SPAN&gt;** m2, &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;float&lt;/SPAN&gt;** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j = 0; j &amp;lt; size; j++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;__m128&lt;/SPAN&gt; c = _mm_setzero_ps();&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k += 4) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&amp;amp;m1[i][k]), _mm_load_ps(&amp;amp;m2[j][k])));&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c = _mm_hadd_ps(c, c);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; c = _mm_hadd_ps(c, c);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm_store_ss(&amp;amp;result[i][j], c);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Transpose(size, m2);&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;For floating-point types, we can use SIMD instructions to process several elements in parallel.&lt;/P&gt;
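&lt;P&gt;Version 4 assumes an even size, 16-byte aligned rows and SSE3. As a rough sketch of how those assumptions could be relaxed for the double case, the inner dot product can be written with SSE2 only. The helper name DotSse2 is hypothetical, the code targets x86 compilers that ship emmintrin.h, and it was not part of the measurements below.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: dot product of two double arrays using only SSE2.
// Unaligned loads and a scalar tail remove the alignment and even-size
// assumptions of Version 4, possibly at some cost in speed.
double DotSse2(const double* a, const double* b, int n)
{
    assert(n >= 0);
    __m128d acc = _mm_setzero_pd();
    int k = 0;
    for (; k + 1 < n; k += 2) {
        acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + k), _mm_loadu_pd(b + k)));
    }
    // Horizontal add without haddpd: move the high lane down, then add.
    const __m128d high = _mm_unpackhi_pd(acc, acc);
    double sum = _mm_cvtsd_f64(_mm_add_sd(acc, high));
    if (k < n) {
        sum += a[k] * b[k];  // leftover element when n is odd
    }
    return sum;
}
```
</PRE>]]>
&lt;P&gt;In a transposed multiplication, result[i][j] would then be DotSse2(m1[i], m2[j], size).&lt;/P&gt;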
&lt;P&gt;&lt;STRONG&gt;Parallel version using PPL (&lt;A href="http://msdn.microsoft.com/en-us/magazine/dd434652.aspx" target=_blank mce_href="http://msdn.microsoft.com/en-us/magazine/dd434652.aspx"&gt;Parallel Patterns Library&lt;/A&gt;) and lambda in VC2010 CTP:&lt;/STRONG&gt;&lt;BR&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;template&lt;/SPAN&gt;&amp;lt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;typename&lt;/SPAN&gt; T&amp;gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;void&lt;/SPAN&gt; ParMatrixMult1(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; size, T** m1, T** m2, T** result)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;using&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;namespace&lt;/SPAN&gt; Concurrency;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; i = 0; i &amp;lt; size; i++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; parallel_for(0, size, 1, [&amp;amp;](&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; j) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] = 0;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="COLOR: rgb(0,0,255)"&gt;for&lt;/SPAN&gt; (&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;int&lt;/SPAN&gt; k = 0; k &amp;lt; size; k++) {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; result[i][j] += m1[i][k] * m2[k][j];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
});&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Result&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Here are the test results (what really matters is the relative time between the different versions):&lt;/P&gt;
&lt;P&gt;Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)&lt;/P&gt;
&lt;TABLE border=1 cellSpacing=0 cellPadding=2 width=600&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;&amp;nbsp;&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;int&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;long long&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;float&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;double&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 1&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.931119s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;2.945134s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.774894s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.984585s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 2&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.571003s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;2.310568s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.724161s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.929064s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 3&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.239538s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.823095s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.570772s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.241691s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 4&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.063196s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.187614s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 1 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.847534s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;1.683765s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.589513s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.994161s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 2 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.380174s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;1.190713s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.409321s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.594859s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 3 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.135760s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.495152s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.370499s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.185800s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 4 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.041959s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.157932s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;
&lt;P&gt;Matrix size = 500 (Intel&amp;nbsp;Xeon E5430,&amp;nbsp;4 cores, L2 cache 12MB)&lt;/P&gt;
&lt;TABLE border=1 cellSpacing=0 cellPadding=2 width=600&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;&amp;nbsp;&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;int&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;long long&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;float&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;double&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 1&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.514330s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;1.434509s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.455168s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.608127s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 2&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.314554s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;1.231696s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.447607s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.593517s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 3&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.180176s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.591002s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.432129s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.149511s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 4&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.042900s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.083286s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 1 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.308766s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.482934s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.175585s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.309159s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 2 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.105717s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.325413s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.124862s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.164156s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 3 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.073418s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.193824s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.116971s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.061268s&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD vAlign=top width=120&gt;Version 4 + PPL&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;N/A&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.017891s&lt;/TD&gt;
&lt;TD vAlign=top width=120&gt;0.031734s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;
&lt;P&gt;From the results, you can see that:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)&lt;/LI&gt;
&lt;LI&gt;Eliminating unnecessary memory writes (Version 2) helps the parallelism&lt;/LI&gt;
&lt;LI&gt;Data cache misses can be a big issue when there are lots of memory accesses (Version 3)&lt;/LI&gt;
&lt;LI&gt;Using SIMD instead of FPU on aligned data is beneficial (Version 4)&lt;/LI&gt;
&lt;LI&gt;Different data types, data sizes and host architectures may have different kinds of bottlenecks&lt;/LI&gt;&lt;/UL&gt;</description><category domain="http://blogs.msdn.com/xiangfan/archive/tags/C_2B002B00_/default.aspx">C++</category><category domain="http://blogs.msdn.com/xiangfan/archive/tags/Optimization/default.aspx">Optimization</category></item><item><title>Magic behind ValueType.Equals</title><link>http://blogs.msdn.com/xiangfan/archive/2008/09/01/magic-behind-valuetype-equals.aspx</link><pubDate>Mon, 01 Sep 2008 18:35:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8916815</guid><dc:creator>xiangfan</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/xiangfan/comments/8916815.aspx</comments><wfw:commentRss>http://blogs.msdn.com/xiangfan/commentrss.aspx?PostID=8916815</wfw:commentRss><description>&lt;P&gt;In "Effective C#", Bill Wagner says "Always create an override of ValueType.Equals() whenever you create a value type". His main concern is performance, because reflection is needed to compare two value types member by member. 
&lt;P&gt;In fact, the framework provides an optimization for "simple" value types. Let's look at the magic behind it. 
&lt;P&gt;Here is the code disassembled by Reflector: 
&lt;P&gt;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;bool&lt;/SPAN&gt;&amp;nbsp;Equals(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;object&lt;/SPAN&gt;&amp;nbsp;obj)&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;//Compare&amp;nbsp;type&amp;nbsp;&lt;/SPAN&gt;&lt;BR&gt;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;&lt;/SPAN&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;object&lt;/SPAN&gt;&amp;nbsp;a&amp;nbsp;=&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;this&lt;/SPAN&gt;;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;if&lt;/SPAN&gt;&amp;nbsp;(CanCompareBits(&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;this&lt;/SPAN&gt;))&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,0,255)"&gt;return&lt;/SPAN&gt;&amp;nbsp;FastEqualsCheck(a,&amp;nbsp;obj);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&amp;nbsp;&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;SPAN style="COLOR: rgb(0,128,0)"&gt;//Compare&amp;nbsp;using&amp;nbsp;reflection&lt;/SPAN&gt;&lt;BR&gt;}&lt;BR&gt;
&lt;P&gt;The magic lies in the two functions "CanCompareBits" and "FastEqualsCheck". Both carry the attribute "[MethodImpl(MethodImplOptions.InternalCall)]", which indicates that they are implemented in native DLLs. 
&lt;P&gt;Walking through the source of &lt;A href="http://research.microsoft.com/sscli/" mce_href="http://research.microsoft.com/sscli/"&gt;rotor&lt;/A&gt;, you can find that these functions are in "clr\src\vm\comutilnative.cpp" 
&lt;P&gt;The comment on CanCompareBits says "Return true if the valuetype does not contain pointer and is tightly packed". FastEqualsCheck uses "memcmp" to speed up the comparison. 
&lt;P&gt;Then you may think you can safely rely on this optimization and stick to the default ValueType.Equals implementation. But wait a minute. Do you see anything wrong with CanCompareBits? 
&lt;P&gt;The problem is that the condition in the comment doesn't guarantee that a bitwise comparison gives the right answer. 
&lt;P&gt;Imagine you have a structure that contains only a float. What happens if one instance holds +0.0 and the other holds -0.0? They should compare equal, but their underlying binary representations differ.&lt;BR&gt;The optimization also fails if you nest another structure that overrides the Equals method. 
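&lt;P&gt;The signed-zero case is easy to reproduce outside the CLR. The following C++ sketch only models the situation: Wrapper, MemberwiseEqual and BitwiseEqual are illustrative names, not the runtime's code, and it assumes IEEE 754 floats, which is what common platforms provide.&lt;/P&gt;
<![CDATA[<PRE>
```cpp
#include <cassert>
#include <cstring>

// A "simple" value type in the sense of the post: one float, no pointers,
// no padding.
struct Wrapper {
    float value;
};

// Member-wise comparison, which is what Equals is expected to mean.
bool MemberwiseEqual(const Wrapper& a, const Wrapper& b)
{
    return a.value == b.value;
}

// Bitwise comparison, modeled on a memcmp-based fast path.
bool BitwiseEqual(const Wrapper& a, const Wrapper& b)
{
    return std::memcmp(&a, &b, sizeof(Wrapper)) == 0;
}

// Self-check that demonstrates the mismatch (assumes IEEE 754 floats).
void DemonstrateSignedZero()
{
    const Wrapper plus = {0.0f};
    const Wrapper minus = {-0.0f};
    assert(MemberwiseEqual(plus, minus));  // +0.0 == -0.0 compares true
    assert(!BitwiseEqual(plus, minus));    // but the sign bit differs
}
```
</PRE>]]>
&lt;P&gt;The two comparisons disagree exactly on the pair (+0.0, -0.0), which is the situation described above.&lt;/P&gt;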
&lt;P&gt;This behavior of ValueType.Equals looks like a bug, but it also provides another reason why you should always implement your own instance Equals for a value type.&lt;/P&gt;</description><category domain="http://blogs.msdn.com/xiangfan/archive/tags/Optimization/default.aspx">Optimization</category><category domain="http://blogs.msdn.com/xiangfan/archive/tags/Bug/default.aspx">Bug</category><category domain="http://blogs.msdn.com/xiangfan/archive/tags/C_2300_/default.aspx">C#</category></item></channel></rss>