[updated 10:50am 5/2/05: It turns out that I copied and pasted an error in my code from the newsgroup posting I was answering. A kind reader was able to spot it and I've fixed it; I'm gathering new data and will update the graphs later today. The points of the article remain valid.]

[updated 8:04am 5/3/05: Added new graphs for data from the fixed code. As expected, the results are the same; the peaks just moved to the left.]

Subtitle: Micro-optimizations for 64-bit platforms

DISCLAIMER: As usual with performance discussions, your mileage will vary, and it is highly recommended that you test your specific scenario and not make any vast over-generalizations about the data contained here.

The other day on the new MSDN forums a question came up about the performance difference of a piece of code when run on 64-bit vs. 32-bit. Specifically, the poster asked about the difference between managed code running natively on the 64-bit X64 CLR and the corresponding managed code running natively on the 32-bit X86 CLR, for a simple loop that copies data from one byte array to another. I decided it would be interesting to do an analysis of this, and so here we are.

I wrote up a little unsafe C# code that approximates what I believe the poster was talking about; it goes something like this:

class ByteCopyTest
{
    byte[] b1;
    byte[] b2;

    int iters;

    public ByteCopyTest (int size, int iters)
    {
        b1 = new byte[size];
        b2 = new byte[size];
       
        this.iters = iters;
    }

    unsafe public ulong DoIntCopySingle()
    {
        Timer t = new Timer();   // simple high-resolution timing helper (not shown)

        t.Start();
        int intCount = b1.Length / 4;   // 4 bytes per int
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j = 0; j < iters; j++)
            {
                int* piSrc = (int*)pbSrc;
                int* piDst = (int*)pbDst;

                for (int i = 0; i < intCount; ++i)
                {
                    *piDst++ = *piSrc++;   // one 4-byte move per iteration
                }
            }
        }
        t.End();

        return t.GetTicks();
    }
}

Here we have a simple piece of code that copies the contents of one byte array into another. It is then easy enough to run this test under both the 64-bit and 32-bit CLR to compare performance. In this case I varied the byte array size from 256 bytes up to 256MB and ran a varying number of iterations so that each time measured is for copying the same amount of total data (about 5.1GB).
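As a rough sketch (this is my own illustration, not the article's actual harness; the exact sizes and total are assumptions), the size/iteration schedule can be chosen so that size times iterations stays constant across every measurement:

```csharp
// Sketch of a size/iteration schedule: iterations scale inversely with
// array size so each measurement copies the same total number of bytes.
// The 5GB constant here is illustrative; the article used about 5.1GB.
using System;

class CopySchedule
{
    static void Main()
    {
        const long totalBytes = 5L * 1024 * 1024 * 1024;   // ~5GB per run
        for (long size = 256; size <= 256L * 1024 * 1024; size *= 2)
        {
            long iters = totalBytes / size;
            Console.WriteLine("size={0,10}  iters={1,10}  total={2}",
                              size, iters, size * iters);
        }
    }
}
```

Since every array size here is a power of two that divides the total evenly, each run copies exactly the same number of bytes.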

Something to note about these tests is that they aren't actually testing the internal workings of the CLR so much as the code-generation capabilities of the JIT32 and JIT64 and the memory/cache behavior of the machine the test is run on.

Here, when using a copy loop that copies a single int (4 bytes) at a time from one array to the other, the JIT32 seems to generate better code, and in many cases the 32-bit version wins. In both cases the time taken goes up drastically when we go from 1MB to 2MB and then levels off somewhat. This is where the processor's on-die cache can no longer keep up and our program's run time becomes dominated by memory access; we will see later that at this point the particular implementation of the copy loop ceases to matter much.

While that is interesting, it might be even more interesting to compare a copy loop that uses a long (8 bytes) instead of an int. Given that registers are 8 bytes wide on the X64 platform, we can fit the whole long into a single register in the inner copy loop.
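The source for the long-based single copy loop isn't shown in the post; here is a self-contained reconstruction (mine, patterned after DoIntCopySingle, with System.Diagnostics.Stopwatch standing in for the article's Timer helper):

```csharp
// My reconstruction of a long-based single copy loop, patterned after
// DoIntCopySingle; Stopwatch stands in for the article's Timer helper.
// Compile with /unsafe.
using System;
using System.Diagnostics;

class LongCopyDemo
{
    static byte[] b1 = new byte[4096];
    static byte[] b2 = new byte[4096];

    unsafe static long DoLongCopySingle(int iters)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int longCount = b1.Length / 8;          // 8 bytes per long
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j = 0; j < iters; j++)
            {
                long* plSrc = (long*)pbSrc;
                long* plDst = (long*)pbDst;

                for (int i = 0; i < longCount; ++i)
                {
                    *plDst++ = *plSrc++;        // one 8-byte move
                }
            }
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    static void Main()
    {
        new Random(42).NextBytes(b1);
        DoLongCopySingle(1);

        bool ok = true;
        for (int i = 0; i < b1.Length; i++)
            if (b1[i] != b2[i]) { ok = false; break; }
        Console.WriteLine(ok ? "copy OK" : "copy FAILED");
    }
}
```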

Here we can see that the long-based copy loops definitely outperform the int-based copy loops, and they do so consistently on both platforms. That this is faster even on 32-bit is interesting: it turns out that the loop overhead is great enough that breaking the long down into two 4-byte pieces to copy inside the loop is still a win. Effectively we've just made the JIT unroll our loop one level for us.

loop2$    |   |   mov      esi,ebx
          |   |   add      ebx,0x8
          |   |   mov      ecx,ebp
          |   |   add      ebp,0x8
          |   |   mov      eax,[ecx]  // first 4 bytes
          |   |   mov      edx,[ecx+0x4] // second 4 bytes
          |   |   mov      [esi],eax
          |   |   mov      [esi+0x4],edx
          |   |   add      edi,0x1
          |   |   cmp      edi,[esp]
          |   |<--jl       ByteCopyTest.DoLongCopySingle()+0xb6 (loop2$)

We can see, however, that even with 32-bit beating its int-based implementation, the 64-bit version has a considerably smaller inner loop with fewer memory accesses. This shows in the data above, where we consistently see the 64-bit long-based copy loop winning.

loop2$    |   |   lea      rdx,[r9+r10]
          |   |   lea      rax,[r9+0x8]
          |   |   mov      rcx,r9
          |   |   mov      r9,rax
          |   |   mov      rax,[rcx]
          |   |   mov      [rdx],rax
          |   |   inc      r11d
          |   |   cmp      r11d,esi
          |   |<--jl       ByteCopyTest.DoLongCopySingle()+0xb0 (loop2$)

We still see a plateau starting at copies of 2MB; here the latency of memory access takes over and the processor can't keep up with the code. At this point the processor spends many cycles spinning, waiting for data, and a few extra instructions aren't going to hurt as badly.

The positive result of using the long copy loop on 32-bit invites us to try a copy loop that copies two ints or longs at a time instead of one, to try to better utilize the processor. An implementation would look like this:

    unsafe public ulong DoLongCopyDouble()
    {
        Timer t = new Timer();
        t.Start();
        int longCount = b1.Length / 16;   // two 8-byte longs per iteration
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j = 0; j < iters; j++)
            {
                long* plSrc = (long*)pbSrc;
                long* plDst = (long*)pbDst;

                for (int i = 0; i < longCount; ++i)
                {
                    plDst[0] = plSrc[0];   // first 8 bytes
                    plDst[1] = plSrc[1];   // second 8 bytes
                    plDst += 2;
                    plSrc += 2;
                }
            }
        }
        t.End();

        return t.GetTicks();
    }
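For comparison, the double int variant isn't shown in the post; my reconstruction of it (following the same pattern, as another member of the same ByteCopyTest class, with Timer again being the article's own helper) would look something like:

```csharp
    // My reconstruction of the double-int variant (not from the article),
    // following the same pattern as DoLongCopyDouble above.
    unsafe public ulong DoIntCopyDouble()
    {
        Timer t = new Timer();
        t.Start();
        int pairCount = b1.Length / 8;   // two 4-byte ints per iteration
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j = 0; j < iters; j++)
            {
                int* piSrc = (int*)pbSrc;
                int* piDst = (int*)pbDst;

                for (int i = 0; i < pairCount; ++i)
                {
                    piDst[0] = piSrc[0];   // first 4 bytes
                    piDst[1] = piSrc[1];   // second 4 bytes
                    piDst += 2;
                    piSrc += 2;
                }
            }
        }
        t.End();

        return t.GetTicks();
    }
```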

We will call this a “double” copy loop (and our former code a “single” copy loop). Let’s look and see how the double copy loops do on 64-bit:

Here we can see that the double long copy loop wins over the others, and, interestingly, the double int and single long loops are very close. This is expected, as they are copying the same amount of data per iteration of the inner loop; however, the double int implementation uses more instructions to do it and does look to be a bit slower through most of the graph.

When we put everything together into a single graph, we can see that the best of the implementations (double long on 64-bit) beats the worst (single int on 64-bit) by around 50%, which is significant. Most of the implementations fall somewhere in the middle, however, and vary minimally from one to the next.

We can also see that unrolling the loop only goes so far before diminishing returns set in: on the 32-bit platform the double long implementation isn't that much faster than the double int implementation, even though it is moving twice as much data per iteration of the inner loop. This code is getting to the point where loop overhead is lost in the noise of memory access.

What is the moral of the story? This code can be faster on 64-bit in certain scenarios, but if you're writing it you might have to think about it (once again, good engineering triumphs over good hardware). For instance, you might have written the single int copy loop for some super-optimized routine while thinking about a 32-bit target; if so, that piece of code may run marginally slower on 64-bit (or not, see the other graphs below), and if it's really important you might consider revising it to be long-based for a 64-bit win. In the end we've seen that making it long-based actually results in a win on both 32-bit and 64-bit platforms. This supports an assertion that you will commonly hear me broadcasting to anyone who will listen: "Good Engineering == Better Performance". It's true regardless of platform.

While examining this copy loop is a fun game to play, chances are that most of your code isn't this low-level. Chances are also good that most of your code is already fast enough on both platforms. As Rico is apt to say, premature optimization is the root of all evil. I highly recommend that you profile, a lot, and then make educated decisions about the parts of your program where it would make sense to do some work specifically to optimize for 64-bit. The likelihood is high that places where you can find something very low-level that is 64-bit specific are few and far between. Often the hot spots you find will be places where optimization just plain makes sense regardless of the target hardware platform. Then it's just a task of thinking about that general optimization and, hopefully, keeping 64-bit in mind.


Well, we’ve managed to make it to the end of this post without me directly answering the question posed in the title… In case you’ve forgotten, it is “Isn’t my code going to be faster on 64-bit???”

Maybe.

I know, a pointy-haired answer at best. The fact of the matter is that there are a lot of cases where 64-bit processors will provide a significant boost to current applications that can take advantage of the additional memory and registers. However, there are some applications that, just by their nature, will run at a comparable speed to their 32-bit siblings, and some that will run somewhat slower. It is unfortunately impossible to provide a universal answer to the question for every application under the sun.

The big blocker to a universal speed boost from 64-bit processors is that they don't fundamentally change one of the big limiting factors of many applications: I/O, both to memory and to the disk or network. Given that processors in modern machines spend most of their time spinning, waiting for something to do, the difference of a few instructions in a tight loop can be so small as to not matter when you're waiting on memory.

Which brings us to an interesting point: as can clearly be seen in the graphs above, running out of cache can be a significant problem on modern processors. This, unfortunately, is the current challenge for 64-bit computing, a challenge somewhat increased by managed runtimes, which tend to encourage coding patterns that are very reference-heavy. References (pointers, for you old-school C++ types like me) grow on 64-bit; in fact they double in size, from 4 bytes (32 bits) to 8 bytes (64 bits). Depending on application architecture this can have a big effect on cache utilization and, correspondingly, on performance.
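You can see the reference-size doubling concretely with a tiny sketch (mine, for illustration): IntPtr.Size reports 4 on a 32-bit CLR and 8 on a 64-bit CLR, so every reference field in an object graph costs twice the cache space on 64-bit.

```csharp
// Illustration: a managed reference occupies IntPtr.Size bytes, which is
// 4 on a 32-bit CLR and 8 on a 64-bit CLR, so reference-heavy object
// graphs take up more cache lines on 64-bit.
using System;

class RefSizeDemo
{
    class Node
    {
        public Node Next;   // one reference: 4 bytes on x86, 8 on x64
        public int Value;   // 4 bytes on either platform
    }

    static void Main()
    {
        Console.WriteLine("Reference size on this CLR: {0} bytes", IntPtr.Size);
        // A linked list of N nodes carries N references of that size on
        // top of its payload, doubling that overhead on 64-bit.
    }
}
```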

So, maybe.

I’ll leave you with this sentiment: “Good Engineering == Good 64-bit Performance!”

 

The code for this example can be found here.