-
In a number of blog entries I have discussed how on 64-bit machines .Net applications can run as either 32-bit or 64-bit processes depending on how the exe is produced. Generally it is highly recommended that developers use the compiler options provided by Whidbey compilers to specify the platform on which to run.
However, some people want to change the loading characteristics of an application after it has been compiled, or maybe you don’t have access to the code, etc… For that case we have an SDK tool called corflags.exe. Corflags allows you to modify some of the loading characteristics of a managed app.
Here is the command-line help for Corflags.exe:
Microsoft (R) .NET Framework CorFlags Conversion Tool. Version 2.0.50405.00
Copyright (C) Microsoft Corporation. All rights reserved.
Usage: Corflags.exe Assembly [options]
If no options are specified, the flags for the given image are displayed.
Options:
  /ILONLY+ /ILONLY-    Sets/clears the ILONLY flag
  /32BIT+  /32BIT-     Sets/clears the 32BIT flag
  /UpgradeCLRHeader    Upgrade the CLR Header to version 2.5
  /RevertCLRHeader     Revert the CLR Header to version 2.0
  /Force               Force an assembly update even if the image is
                       strong name signed.
                       WARNING: Updating a strong name signed assembly
                       will require the assembly to be resigned before
                       it will execute properly.
  /nologo              Prevents corflags from displaying logo
  /? or /help          Display this usage message
WARNING: Corflags is a powerful tool, and you can break code in weird ways by using it incorrectly. As mentioned previously, it is highly recommended that you control the loading characteristics of your application through compiler switches.
Here are a couple of examples of Corflags.exe usage:
Scenario: you have an application foo.exe which was compiled “any cpu” with a Whidbey compiler, but you want to force it to run as a 32-bit application even on 64-bit machines.
Run: corflags.exe /32BIT+ foo.exe
Result: foo.exe is now marked as if it had been compiled with /platform:x86
Scenario: you have an Everett (.Net 1.1) application bar.exe which you would like to enable to run on 64-bit.
Run: corflags /UpgradeCLRHeader bar.exe
Result: bar.exe now looks like a Whidbey (.Net 2.0) “any cpu” application and will load under the 64-bit 2.0 runtime on a 64-bit OS. It is still Everett compatible however and will run as a 32-bit 1.1 application on a 32-bit OS.
See this blog entry for more info on what /UpgradeCLRHeader does: http://blogs.msdn.com/joshwil/archive/2004/10/15/243019.aspx
It is notable that Corflags.exe doesn’t have the ability to force an application to only run on a 64-bit machine; that is something that needs to be done at the compiler level (Corflags.exe operates on PE32 images, whereas 64-bit only applications need to be PE32+ images).
Corflags wears another hat as a diagnostic tool for figuring out what the loading characteristics of your application will be:
E:\temp>corflags chartest.exe
Microsoft (R) .NET Framework CorFlags Conversion Tool. Version 2.0.50405.00
Copyright (C) Microsoft Corporation. All rights reserved.
Version : v2.0.41026
CLR Header: 2.5
PE : PE32
CorFlags : 1
ILONLY : 1
32BIT : 0
Signed : 0
If you run Corflags.exe and pass a managed application name without any switches it will tell you what the state of the interesting flags in the image is. It is educational to play with compiling some test code using the different /platform:X switches and then running Corflags on the exe to see what the state of the executable is. Briefly:
ILONLY: Managed images are allowed to contain native code; however, C# and VB images don’t. To be “any cpu” an image may only contain IL.
32BIT: Even if you have an image that only contains IL it still might have platform dependencies. The 32BIT flag is used to distinguish “x86” images from “any cpu” images. 64-bit images are distinguished by the fact that they have a PE type of PE32+.
CLR Header: 2.0 indicates a .Net 1.0 or .Net 1.1 (Everett) image, 2.5 indicates a .Net 2.0 (Whidbey) image. Very confusing, unfortunate but true.
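The same flags can also be read from managed code. The sketch below is not part of corflags; it uses Module.GetPEKind (available since .Net 2.0) and maps the result onto the names used above. The class name, method name and wording of the returned strings are my own, and flags added in later framework versions (such as Preferred32Bit) are deliberately ignored.

```csharp
using System;
using System.Reflection;

public static class PlatformKind
{
    // Maps the flags reported by Module.GetPEKind onto the /platform names used in this post.
    public static string Classify(PortableExecutableKinds kind)
    {
        if ((kind & PortableExecutableKinds.PE32Plus) != 0)
            return "64-bit only (PE32+)";
        if ((kind & PortableExecutableKinds.Required32Bit) != 0)
            return "x86 (32BIT flag set)";
        if ((kind & PortableExecutableKinds.ILOnly) != 0)
            return "any cpu (ILONLY set, 32BIT clear)";
        return "not IL only (native code present, or not a managed image)";
    }

    public static void Main()
    {
        // Inspect the module that contains this very code.
        Module module = typeof(PlatformKind).Module;
        PortableExecutableKinds kind;
        ImageFileMachine machine;
        module.GetPEKind(out kind, out machine);
        Console.WriteLine("{0} / {1}: {2}", kind, machine, Classify(kind));
    }
}
```

Compiling this with different /platform switches and running it should give roughly the same picture as running Corflags on the exe.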
Here’s some other interesting reading on the subject of where managed applications run on 64-bit machines:
http://blogs.msdn.com/joshwil/archive/2005/04/08/406567.aspx
http://blogs.msdn.com/joshwil/archive/2004/03/13/89163.aspx
http://blogs.msdn.com/joshwil/archive/2004/03/11/88280.aspx
-
[updated 10:50am 5/2/05: It turns out that I copied and pasted an error in my code from the newsgroup posting I was answering. However, a kind reader was able to spot it and I've fixed it. I'm getting new data and will update the graphs later today; however, the points of the article remain valid]
[updated 8:04am 5/3/05: Added new graphs for data from fixed code. As expected, the results are the same, the peaks just moved to the left]
Subtitle: Micro-optimizations for 64-bit platforms
DISCLAIMER: As usual with performance discussions, your mileage will vary, and it is highly recommended that you test your specific scenario and not make any vast over-generalizations about the data contained here.
The other day on the new MSDN forums a question came up about what the performance difference of a piece of code would be when it was run on 64-bit vs. 32-bit. In this case the poster specifically asked about the difference between managed code running natively on an X64 64-bit CLR and the corresponding managed code running natively on a 32-bit X86 CLR, for a simple copy loop which moves data from one byte array to another. I decided it would be interesting to do an analysis of this, and so here we are.
I wrote up a little unsafe C# code which approximates what I believe the poster to be talking about, it goes something like this:
class ByteCopyTest
{
    byte[] b1;
    byte[] b2;
    int iters;

    public ByteCopyTest (int size, int iters)
    {
        b1 = new byte[size];
        b2 = new byte[size];
        this.iters = iters;
    }

    unsafe public ulong DoIntCopySingle()
    {
        Timer t = new Timer();
        t.Start();
        int intCount = b1.Length / 4;
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j=0; j<iters; j++)
            {
                int* piSrc = (int*)pbSrc;
                int* piDst = (int*)pbDst;
                for (int i=0; i<intCount; ++i)
                {
                    *piDst++ = *piSrc++;
                }
            }
        }
        t.End();
        return t.GetTicks();
    }
}
Here we can see a simple piece of code that facilitates copying from one byte array into another. It is then easy enough to run this test under both the 64-bit and 32-bit CLR to compare performance. In this case I varied the byte array in size from 256 bytes up to 256MB and ran a varying number of iterations so that each time measured is for copying the same amount of total data (about 5.1GB).
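The Timer type used by the test is not shown in this post. A minimal stand-in, assuming it is nothing more than a thin wrapper over a high-resolution counter, could look like the following. The Stopwatch-based implementation and the IterationsFor helper are my guesses at the shape of the harness, not the original code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the Timer type used by ByteCopyTest.
public class Timer
{
    private readonly Stopwatch watch = new Stopwatch();

    public void Start() { watch.Reset(); watch.Start(); }
    public void End() { watch.Stop(); }
    public ulong GetTicks() { return (ulong)watch.ElapsedTicks; }
}

public static class TimerDemo
{
    // Number of passes over a buffer of bufferSize bytes needed to move
    // totalBytes in all, so every buffer size copies the same amount of data.
    public static int IterationsFor(long totalBytes, int bufferSize)
    {
        return (int)(totalBytes / bufferSize);
    }

    public static void Main()
    {
        Timer t = new Timer();
        t.Start();
        long sum = 0;
        for (int i = 0; i < 1000000; i++) sum += i;   // some work to time
        t.End();
        Console.WriteLine("sum={0}, ticks={1}", sum, t.GetTicks());
        Console.WriteLine("passes for a 1MB buffer and 1GB total: {0}",
            IterationsFor(1L << 30, 1 << 20));
    }
}
```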
Something to note about these tests is that they aren’t actually testing the internal workings of the CLR so much as they test the code-generation capabilities of the JIT32 and JIT64 and the memory/cache of the machine that the test is run on.
Here, when using a copy loop that copies a single int (4 bytes) at a time from one array to another, we can see that the JIT32 seems to generate better code, and in many cases the 32-bit version wins. In both cases the time taken goes up drastically when we go from 1MB to 2MB and then levels off somewhat. This is where the processor’s on-die cache stops being able to keep up and our program’s run time ends up being ruled by memory access; we will see later that the particular implementation of the copy loop ceases to matter much at this point.

While that is interesting, it might be even more interesting to compare a copy loop that uses a long (8 bytes) instead of an int. Given that registers are 8 bytes wide on the X64 platform, we can fit the whole long into a single register in the inner copy loop.

Here we can see that the long-based copy loops definitely outperform the int-based copy loops, and they do so consistently on both platforms… That this is faster on 32-bit is interesting: it turns out that the loop overhead is so great that breaking the long down into two 4-byte pieces to copy it inside the loop is a win. Effectively we’ve just made the JIT unroll our loop one level for us.
loop2$ | | mov esi,ebx
| | add ebx,0x8
| | mov ecx,ebp
| | add ebp,0x8
| | mov eax,[ecx] // first 4 bytes
| | mov edx,[ecx+0x4] // second 4 bytes
| | mov [esi],eax
| | mov [esi+0x4],edx
| | add edi,0x1
| | cmp edi,[esp]
| |<--jl ByteCopyTest.DoLongCopySingle()+0xb6 (loop2$)
We can see, however, that even with the 32-bit long-based loop beating its int-based implementation, the 64-bit version has a considerably smaller inner loop with fewer memory accesses. This shows in the data above, where we consistently see the 64-bit long-based copy loop winning.
loop2$ | | lea rdx,[r9+r10]
| | lea rax,[r9+0x8]
| | mov rcx,r9
| | mov r9,rax
| | mov rax,[rcx]
| | mov [rdx],rax
| | inc r11d
| | cmp r11d,esi
| |<--jl ByteCopyTest.DoLongCopySingle()+0xb0 (loop2$)
We still see a plateau starting at copies of 2MB; here the latency of memory access takes over and the processor can’t keep up with the code. At this point the processor will be spending many cycles spinning waiting for data, and a few extra instructions aren’t going to hurt as badly.
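The source for DoLongCopySingle isn’t listed in this post; only its disassembly is. For readers who want to experiment with the idea of moving 8 bytes per iteration without compiling with /unsafe, here is a safe-code analogue. It is not the code that was measured: it assumes a runtime that has Span<T> and MemoryMarshal (.Net Core 2.1 or later), and the class and method names are made up.

```csharp
using System;
using System.Runtime.InteropServices;

public static class LongCopySketch
{
    // Copies src into dst eight bytes at a time by viewing both arrays as
    // spans of long, then finishes any leftover tail bytes one at a time.
    public static void CopyAsLongs(byte[] src, byte[] dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("dst is too small");

        ReadOnlySpan<long> from = MemoryMarshal.Cast<byte, long>(src.AsSpan());
        Span<long> to = MemoryMarshal.Cast<byte, long>(dst.AsSpan());

        for (int i = 0; i < from.Length; i++)
        {
            to[i] = from[i];
        }
        for (int i = from.Length * 8; i < src.Length; i++)
        {
            dst[i] = src[i];
        }
    }

    public static void Main()
    {
        byte[] src = new byte[20];
        for (int i = 0; i < src.Length; i++) src[i] = (byte)(i + 1);
        byte[] dst = new byte[20];
        CopyAsLongs(src, dst);
        Console.WriteLine(BitConverter.ToString(dst));
    }
}
```

Span indexing is bounds checked, so the generated inner loop will not necessarily match the listings above.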
The positive results of using the long copy loop on 32-bit invite us to try a copy loop which copies two ints or longs at a time instead of one, to try and better utilize the processor. An implementation of this would look like:
unsafe public ulong DoLongCopyDouble()
{
    Timer t = new Timer();
    t.Start();
    int longCount = b1.Length / 16;
    fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
    {
        for (int j=0; j<iters; j++)
        {
            long* plSrc = (long*)pbSrc;
            long* plDst = (long*)pbDst;
            for (int i=0; i<longCount; ++i)
            {
                plDst[0] = plSrc[0];
                plDst[1] = plSrc[1];
                plDst += 2;
                plSrc += 2;
            }
        }
    }
    t.End();
    return t.GetTicks();
}
We will call this a “double” copy loop (and our former code a “single” copy loop). Let’s look and see how the double copy loops do on 64-bit:

Here we can see that the double long copy loop wins over the others and, interestingly, the double int and single long loops are very close. This would be expected as they are copying the same amount of data per iteration through the inner loop; however, the double int implementation uses more instructions to do it and does look to be a bit slower through most of the graph.
When we put everything together into a single graph we can see that the best of the implementations (double long on 64-bit) beats the worst of the implementations (single int on 64-bit) by around 50% which is significant. Most of the implementations fall somewhere in the middle however and vary minimally from implementation to implementation.
We can see that unrolling the loop only works so far before we see diminishing returns in that on the 32-bit platform the double long implementation isn’t that much faster than the double int implementation even though it is moving twice as much data per iteration of the inner loop. This code is getting to the point where loop overhead is lost in the noise of memory access.

What is the moral of the story? This code can be faster on 64-bit for certain scenarios, but if you’re writing it you might have to think about it (once again good engineering triumphs over good hardware). For instance, you might have written the single int copy loop for some super optimized routine in your code when thinking about a 32-bit target. If that is the case, then that piece of code may run marginally slower on 64-bit (or not, see other graphs below), and if it’s really important you might consider revising it to be long based for a 64-bit win. In the end we’ve seen that making it long based actually results in a win for both 32-bit and 64-bit platforms. This supports an assertion that you will commonly hear me broadcasting to anyone who will listen, “Good Engineering == Better Performance”. It’s true regardless of platform.
While examining this copy loop is a fun game to play, chances are that most of your code isn’t this low level. Chances are also good that most of your code is already fast enough on both platforms. As Rico is apt to say, premature optimization is the root of all evil. I highly recommend that you profile, a lot. And then make educated decisions about the parts of your program which it would make sense to specifically do some work to try and optimize for 64-bit. The likelihood is high that places where you can find something very low level that is 64-bit specific are few and far between. Often the hot spots that you find will be places where optimization just plain makes sense regardless of the target hardware platform. Then it’s just a task to think about that general optimization and hopefully keep 64-bit in mind.
Well, we’ve managed to make it to the end of this post without me directly answering the question posed in the title… In case you’ve forgotten, it is “Isn’t my code going to be faster on 64-bit???”
Maybe.
I know, a pointy haired answer at best… The fact of the matter is that there are a lot of cases where 64-bit processors will provide a significant boost to current applications which can take advantage of the additional memory and registers. However, there are some applications which just by their nature will run at a comparable speed to their 32-bit siblings. And some that will run somewhat slower. It is unfortunately impossible to provide a universal answer to the question for every application under the sun.
The big blocker to a universal speed boost from 64-bit processors is that they don’t fundamentally change one of the big limiting factors of many applications, I/O, both to memory and to the disk or network. Given that most of the time processors in modern machines are spinning, waiting for something to do, the difference of a few instructions in a tight loop when you’re waiting on memory can be so small as to not matter.
Which brings us to an interesting point: as can be clearly seen in the graphs above, running out of cache can be a significant problem on modern processors… This unfortunately is the current challenge for 64-bit computing, a challenge which is somewhat increased by managed runtimes, which have a tendency to encourage coding patterns that are very reference heavy. References (pointers for you old-school C++ types like me) grow on 64-bit; in fact they double in size from 4 bytes (32 bits) to 8 bytes (64 bits). Depending on application architecture this can have a big effect on cache utilization and correspondingly performance.
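If you want to see the reference size for the process you are actually running in, IntPtr.Size reports it directly. The little helper below is only illustrative arithmetic (count times pointer size); it says nothing about object headers or array overhead.

```csharp
using System;

public static class ReferenceSize
{
    // Bytes occupied by 'count' references alone, for a given pointer size.
    public static long ReferenceBytes(long count, int pointerSize)
    {
        return count * pointerSize;
    }

    public static void Main()
    {
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        Console.WriteLine("Pointer size:   {0} bytes", IntPtr.Size);
        Console.WriteLine("1,000,000 references: {0} bytes",
            ReferenceBytes(1000000, IntPtr.Size));
    }
}
```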
So, maybe.
I’ll leave you with this sentiment: “Good Engineering == Good 64-bit Performance!”
The code for this example can be found here.
-
[fixed typo: 9:37am]
I received a question about this recently, so I figured I'd elaborate here with a little example...
Let's assume we have the following three dlls:
anycpu.dll -- compiled "any cpu"
x86.dll -- compiled "x86"
x64.dll -- compiled "x64"
And the following three exes:
anycpu.exe -- compiled "any cpu"
x86.exe -- compiled "x86"
x64.exe -- compiled "x64"
What happens if you try to use these exes and dlls together? We have to consider two possible scenarios, running on a 32-bit machine and running on a 64-bit machine...
On a 32-bit x86 machine:
anycpu.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll
x86.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll
x64.exe -- will get BadImageFormatException when it tries to run
On a 64-bit x64 machine:
anycpu.exe -- runs as a 64-bit process, can load anycpu.dll and x64.dll, will get BadImageFormatException if it tries to load x86.dll
x86.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll
x64.exe -- runs as a 64-bit process, can load anycpu.dll and x64.dll, will get BadImageFormatException if it tries to load x86.dll
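The two lists above boil down to two small rules, written out here as code so they are easy to check against each case. This is just a restatement of the table, not how the loader is implemented; the enum and method names are invented for the example and only x86 and x64 machines are modeled.

```csharp
using System;

public enum TargetPlatform { AnyCpu, X86, X64 }

public static class BitnessRules
{
    // Returns 32 or 64 for the bitness the exe runs as, or 0 if it fails
    // to start (BadImageFormatException). osBits is 32 or 64.
    public static int ProcessBits(TargetPlatform exe, int osBits)
    {
        switch (exe)
        {
            case TargetPlatform.X86: return 32;
            case TargetPlatform.X64: return osBits == 64 ? 64 : 0;
            default: return osBits;   // "any cpu" follows the machine
        }
    }

    // True if a dll built for the given platform can load into a process
    // of the given bitness; false means BadImageFormatException.
    public static bool CanLoad(TargetPlatform dll, int processBits)
    {
        switch (dll)
        {
            case TargetPlatform.X86: return processBits == 32;
            case TargetPlatform.X64: return processBits == 64;
            default: return processBits != 0;
        }
    }

    public static void Main()
    {
        foreach (int osBits in new int[] { 32, 64 })
            foreach (TargetPlatform exe in Enum.GetValues(typeof(TargetPlatform)))
                foreach (TargetPlatform dll in Enum.GetValues(typeof(TargetPlatform)))
                {
                    int bits = ProcessBits(exe, osBits);
                    Console.WriteLine("{0}-bit OS, {1} exe ({2}), {3} dll: {4}",
                        osBits, exe, bits == 0 ? "fails to start" : bits + "-bit",
                        dll, CanLoad(dll, bits) ? "loads" : "BadImageFormatException");
                }
    }
}
```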
-
Last night I played with Paint.Net for a while and got it running on the native 64-bit CLR with Whidbey... Pretty cool stuff!!!
Actually, a while might be overstating things... It took longer to install a copy of VS 2005 on my clean AMD64 machine than it did to get Paint.Net running natively, in fact there were no code changes needed to get it to run on .Net 2.0 (just had to change a couple project settings) and only 3 small changes to the code to get it to run on 64-bit (the biggest being removing a dependency on some assembly code used to figure out whether or not a machine has HyperThreading). All of this took < 20 minutes.
Check out a blog about Paint.Net here: http://blogs.msdn.com/rickbrew/
Here are some screenshots to whet your appetite:
Paint.Net running on 64-bit CLR
Copy and Paste between 32-bit and 64-bit versions of Paint.Net
A 2GB image in Paint.Net, note the 4GB memory usage (it actually has a 6GB VM footprint at this point, yes... this machine does have a lot of RAM).
Note: this is all running on a "very close to done" internal Beta2 .Net Framework.
-
I just ran across these, they are a couple of really good reads on the .Net GC...
http://blogs.msdn.com/maoni/archive/2004/06/15/156626.aspx
http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx
-
Here are a couple of really worthwhile entries written by Junfeng on the GAC, why you might compile platform-specific, and 64-bit... I don't know if I need to bother with my intended GAC entry anymore...
http://blogs.msdn.com/junfeng/archive/2004/08/11/212555.aspx
http://blogs.msdn.com/junfeng/archive/2004/09/12/228635.aspx
-
[10/15, 2:04pm, fixed a couple typos; 10/15, 4:51pm, clarified a point]
Before you read this entry, you might want to read these two entries:
- http://blogs.msdn.com/joshwil/archive/2004/03/13/89163.aspx
- http://blogs.msdn.com/joshwil/archive/2004/03/11/88280.aspx
In case you skipped the links and kept on reading I’ll summarize the first post linked to (I do, however, believe that they are really worth reading):
- “Bitness” is what we call an assembly’s ability to tell the OS and CLR what type of machine the assembly is safe to run on (32-bit vs. 64-bit, and on 64-bit: X64, IA64 or both).
- .Net 1.0 and 1.1 assemblies didn’t know anything about bitness
- There are things that you can do in a managed assembly which force you to need to bind it to one platform; these include: p/invoke, unsafe code, managed c++, etc…
The point is, in v1.0 and v1.1 we let you create assemblies that might very well have problems running on a 64-bit platform. Because of that we have decided that with .Net 2.0 (which is the first version of .Net to support 64-bit platforms natively) there will be an enforced loader policy such that 1.0 and 1.1 apps will be loaded by the 32-bit runtime on the machine.
We have gone back and forth on the right way to do this and have settled on the following:
- .Net 2.0 compilers will produce PE images with the IMAGE_COR20_HEADER.MinorRuntimeVersion set to 5, this matters in the case of MSIL (/platform:anycpu) images where the loader has the opportunity to choose whether the image should be loaded as a 32-bit or 64-bit image.
- The OS Loader will check whether MinorRuntimeVersion is less than 5 to determine if the assembly is an older .Net assembly rather than a .Net 2.0+ assembly
- This will cause only .Net 2.0 executables to be loaded by the 64-bit runtime.
For reference, here is the definition of the IMAGE_COR20_HEADER:
// CLR 2.0 header structure.
typedef struct IMAGE_COR20_HEADER
{
    // Header versioning
    ULONG                   cb;
    USHORT                  MajorRuntimeVersion;
    USHORT                  MinorRuntimeVersion;

    // Symbol table and startup information
    IMAGE_DATA_DIRECTORY    MetaData;
    ULONG                   Flags;
    ULONG                   EntryPointToken;

    // Binding information
    IMAGE_DATA_DIRECTORY    Resources;
    IMAGE_DATA_DIRECTORY    StrongNameSignature;

    // Regular fixup and binding information
    IMAGE_DATA_DIRECTORY    CodeManagerTable;
    IMAGE_DATA_DIRECTORY    VTableFixups;
    IMAGE_DATA_DIRECTORY    ExportAddressTableJumps;

    // Precompiled image info (internal use only - set to zero)
    IMAGE_DATA_DIRECTORY    ManagedNativeHeader;
} IMAGE_COR20_HEADER;
It turns out that the 2.0 in the header structure name is kind of a misnomer and results from some pre-1.0 definitions of this structure. All released versions of the .Net runtime have had images that contain an IMAGE_COR20_HEADER structure. Additionally, all released versions of the .Net runtime have specified the MajorRuntimeVersion=2 and MinorRuntimeVersion=0. Kind of weird you might say for a v1.0/1.1 product? Yeah… That’s history for you…
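If you would like to look at these header fields for a real image, the sketch below reads them with the System.Reflection.PortableExecutable types. Those types did not exist when this entry was written; they ship with .Net Core and are available for the desktop framework through the System.Reflection.Metadata package, so treat this as one possible way to peek at the header rather than the tool described here. The class and method names are made up.

```csharp
using System;
using System.IO;
using System.Reflection.PortableExecutable;

public static class ClrHeaderDump
{
    // Returns a one-line summary of the PE type and the CLR header fields
    // (MajorRuntimeVersion, MinorRuntimeVersion, Flags) of the file at 'path'.
    public static string Describe(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (PEReader reader = new PEReader(stream))
        {
            PEHeaders headers = reader.PEHeaders;
            CorHeader cor = headers.CorHeader;
            if (cor == null || headers.PEHeader == null)
                return "not a managed image";

            string peType = headers.PEHeader.Magic == PEMagic.PE32Plus ? "PE32+" : "PE32";
            return string.Format("CLR Header: {0}.{1}, PE: {2}, Flags: {3}",
                cor.MajorRuntimeVersion, cor.MinorRuntimeVersion, peType, cor.Flags);
        }
    }

    public static void Main(string[] args)
    {
        // With no argument, describe the assembly that defines System.Object.
        string path = args.Length > 0 ? args[0] : typeof(object).Assembly.Location;
        Console.WriteLine(Describe(path));
    }
}
```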
What does this mean?
Well, it means that if you have a v1.0/1.1 assembly and run it on a 64-bit box with .Net 2.0 installed it will run under the 32-bit runtime in the WOW64. Whether it runs under a 1.1 32-bit runtime or the 2.0 32-bit runtime will be determined by the CLR’s loader policy, just as it would on a native 32-bit box. If you’re using v2.0 assemblies on a 64-bit machine and you have compiled them without the /platform switch or with /platform:anycpu (the default) then your image will load in the native runtime on whatever box you put it on. And of course the /platform:x86, /platform:x64, /platform:itanium switches work as they imply.
NOTE: /platform:anycpu doesn’t keep you from shooting yourself in the foot with a bad P/Invoke signature, unsafe code, etc… See this (http://blogs.msdn.com/joshwil/archive/2004/03/16/90612.aspx) blog entry for an example.
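As a small illustration of the kind of mistake that note is talking about, consider a struct passed to native code that holds a handle or pointer. The structs below are hypothetical (they don’t correspond to any real API); the point is only that declaring the pointer-sized field as int gives the same layout on every platform, which is exactly what you don’t want when the native side uses a real pointer.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical native layout: struct { void* handle; int count; }

[StructLayout(LayoutKind.Sequential)]
public struct PortableInfo
{
    public IntPtr Handle;   // pointer sized on every platform
    public int Count;
}

[StructLayout(LayoutKind.Sequential)]
public struct FragileInfo
{
    public int Handle;      // only matches the native side when pointers are 4 bytes
    public int Count;
}

public static class PInvokeSizes
{
    public static void Main()
    {
        Console.WriteLine("IntPtr.Size:  {0}", IntPtr.Size);
        Console.WriteLine("PortableInfo: {0} bytes", Marshal.SizeOf(typeof(PortableInfo)));
        Console.WriteLine("FragileInfo:  {0} bytes", Marshal.SizeOf(typeof(FragileInfo)));
    }
}
```

In a 32-bit process both structs marshal to the same size, which is why this sort of bug tends to stay hidden until the image is loaded by a 64-bit runtime.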
If you do have a managed v1.0/1.1 app which you believe to be 64-bit safe and Whidbey compatible there is an easy way to make it run in 64-bit mode… You just have to whack the MinorRuntimeVersion to 5 in the IMAGE_COR20_HEADER for the image. Along these lines there will be a tool in the .Net 2.0 SDK called corflags.exe which will allow you to modify an image this way. A v1.0/1.1 image which has had its MinorRuntimeVersion whacked will still be compatible with the v1.0/1.1 runtimes (as per the version it was compiled against).
WARNING: If you don’t know what’s going on inside of some v1.0/1.1 image (if you didn’t write it say) you should be _VERY_ careful bumping it up to 64-bit as it may break in unexpected ways.
If you are a compiler writer and you want your application to run in 64-bit native mode under .Net 2.0 then you will need to produce images with this updated IMAGE_COR20_HEADER.
NOTE: .Net 2.0 assemblies that were compiled against the Beta 1 version of the framework (and the community drops up to now [10/15/04]) will act like v1.0/1.1 assemblies on newer builds of 64-bit OSes and load under the WOW64 in 32-bit mode.
What about v1.0/1.1 dlls?
If you make a .Net 2.0 executable and link against a v1.0/1.1 dll you will be able to load it into your process as if it was a .Net 2.0 MSIL assembly. If that v1.0/1.1 dll has code that isn’t safe to run in 64-bit mode it may crash.
Appendix: Why I think this is a good solution:
There are a number of ways in which we could have solved this problem. An easy one would have been to use the version string contained within the metadata for managed images. That string looks something like “v1.1.4322”; it represents the version of the framework which the assembly was compiled against, and .Net 2.0 assemblies will look something like “v2.0.X” (X still to be determined). You can argue that this makes sense; after all, Microsoft’s .Net 2.0 compilers (csc.exe, vbc.exe, cl.exe, etc…) will produce images that will be marked correctly. They need to since they support new features of the .Net 2.0 runtime (generics comes to mind) and bind against 2.0 frameworks libraries.
This however forces a tight binding between the runtime version and the compiler version, and doesn’t recognize the fact that there are plenty of assemblies compiled against the v1.0/1.1 frameworks that will run fine in 64-bit mode under v2.0. But even given that, its downfall is that it is a very Microsoft centric view, it assumes that the compilers will be updated at the same time as the runtime (which currently our compilers are). That tight binding wouldn’t recognize that there are compilers out there that only produce code which is safe to run in either 32-bit or 64-bit mode. Those compilers currently produce an image that works with the v1.0 or v1.1 runtime, and the compiler vendors may very well want to produce an image that also runs in 64-bit mode under the v2.0 runtime.
That’s the distinction, while Microsoft is producing compilers in this version of .Net that won’t create images which will run under any runtime prior to v2.0, other compiler vendors might very well want to produce images that run under prior versions (for deployments that maybe don’t have v2.0 installed yet) and yet still run in 64-bit native mode under v2.0 on a 64-bit machine.
This does put a great responsibility upon compiler vendors who choose to produce images that are marked this way; those compilers should only produce code which is safe to run in both 32-bit and 64-bit mode. Alternatively, they can do as our compilers do and let you shoot yourself in the foot if you want to by producing MSIL images with bad P/Invoke signatures and unsafe code… That puts the onus on the developer.
-
I was just playing around with some stuff at home where I was wishing that we had compile time attributes, it looks like XC# supports what I was looking for, I'll have to play with it a bit to be sure...
http://www.resolvecorp.com/Products.aspx
-
Check out Mike's blog for tons of great insight into the CLR debugging APIs.
http://blogs.msdn.com/jmstall/
-
<edited post, 3:51pm, something in the formatting was messing up the whole page...>
The dynamically growing ArrayList structure is an interesting one, the other day I was looking at memory usage of some performance scenarios on 64-bit machines vs. 32-bit machines and noticed that ArrayLists are double the size on 64-bit machines. After looking at the screen for a second wondering why that was it hit me… Under the covers an ArrayList is just a big Object[], and what’s an Object[]? It’s an array of references to Objects, references which are bigger on 64-bit platforms because they have to be able to reference more memory. Unfortunately that’s just a fact of life with bigger pointers.
But that realization did make me wonder what’s the difference in memory usage characteristics between using an ArrayList and using our new (v2.0) generic List<T> when you’re filling the list with primitive (non-reference) data types. It turns out to be interesting (at least to me) though predictable, so I thought I’d show it here…
As a spoiler, if you think that using the generic List<T> is better, you’re right! But you might not have guessed how right you are!!
Here’s my test code (I use a compile time define to compile with the generic vs. non-generic list). I decided to just play with differences between lists containing ints here because they have an interesting property: the size of a primitive int is the same as an Object reference on a 32-bit machine. More about that at the end.
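using System.Collections;
using System.Collections.Generic;

// The class name is arbitrary; the post only shows the members.
class ListTest
{
    // SIZE is big enough that the list size dominates my memory usage
    const int SIZE = 1000000;

    public static void Main()
    {
#if ARRAYLIST
        ArrayList list = new ArrayList();
#else
        List<int> list = new List<int>();
#endif
        for (int i=0; i<SIZE; i++)
        {
            list.Add(i);
        }
    }
}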
First off let’s think about what happens in these two scenarios:
1) ArrayList, under the covers this is an Object[], so to put an int value into the ArrayList we’ll have to box it (in the process creating a heap allocated Int32) and then put the reference to the boxed value into the array.
2) List<int>, with the generic implementation under the covers we’re using an int[], no boxing needed and instead we just shove the actual value being added into the array (thereby avoiding creating the heap allocated Int32).
Given that, we would expect the ArrayList implementation to be more of a memory hog, and it is.
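You can get a rough feel for this without a profiler by asking the GC how much memory is live before and after building each list. This is only a sketch: GC.GetTotalMemory reports what is still reachable after a collection, not the total amount allocated along the way (which is what the profiler numbers in this entry show), so the figures will not match the ones quoted here. The class and method names are made up.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class ListMemory
{
    const int Size = 1000000;

    // Approximate number of bytes kept alive by whatever 'build' returns.
    public static long Measure(Func<object> build)
    {
        long before = GC.GetTotalMemory(true);
        object list = build();
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(list);   // make sure the list is still reachable at 'after'
        return after - before;
    }

    public static object BuildArrayList()
    {
        ArrayList list = new ArrayList();
        for (int i = 0; i < Size; i++) list.Add(i);   // boxes every int
        return list;
    }

    public static object BuildGenericList()
    {
        List<int> list = new List<int>();
        for (int i = 0; i < Size; i++) list.Add(i);   // stores the int directly
        return list;
    }

    public static void Main()
    {
        Console.WriteLine("ArrayList: {0} bytes live", Measure(BuildArrayList));
        Console.WriteLine("List<int>: {0} bytes live", Measure(BuildGenericList));
    }
}
```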
Results from a 32-bit machine:

Here we can see that 19MB is allocated by the 32-bit ArrayList test, 11MB of which is simply heap allocated Int32 objects which result from the boxing of the int values. We also have 8MB worth of Object[] that have been allocated; notice that the amount of allocated Object[] is larger than the actual amount of space needed because I didn’t set the ArrayList capacity to something near what I knew we would actually need. Instead I relied on the ArrayList auto-resizing, which uses a 2x growth factor and can thus end up with up to ½ of the underlying array being unused. In this example that doesn’t affect the point that I’m trying to make, so I ignored it.
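The effect of pre-sizing is easy to observe through the Capacity property. The helper below is a made-up example, not part of the test above; it simply reports the capacity the ArrayList ends up with when it is and isn’t given its final size up front.

```csharp
using System;
using System.Collections;

public static class CapacityDemo
{
    // Adds 'count' ints and returns the resulting capacity of the ArrayList.
    public static int CapacityAfterAdds(int count, bool presize)
    {
        ArrayList list = presize ? new ArrayList(count) : new ArrayList();
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
        }
        return list.Capacity;
    }

    public static void Main()
    {
        Console.WriteLine("auto-grown: {0}", CapacityAfterAdds(1000000, false));
        Console.WriteLine("pre-sized:  {0}", CapacityAfterAdds(1000000, true));
    }
}
```

Note that every time the list grows, the old array becomes garbage, so the total amount of Object[] allocated over the run is larger than the final array alone.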

On the other hand, using the List<int> version of the test we allocate a large Int32[] under the hood instead of an Object[], and we don’t need to heap allocate any Int32 objects. These differences result in memory usage of only 8MB for the Int32[]. Conveniently for my example, on a 32-bit machine the size of an Object reference is the same as a 32-bit int, so these arrays are the same size.
What do we see on a 64-bit machine?

In this case the ArrayList test uses approximately twice as much memory as the 32-bit version; this is unfortunately a fact of life in the 64-bit world… Given this data we can see that the reason is twofold:
1) References are bigger, and therefore our Object[] which underlies the ArrayList is double the size of its 32-bit version (16MB instead of 8MB)
2) Objects take more space in memory in a 64-bit process. Every managed object has a couple pieces of CLR goo attached to it, a MethodTable* and a sync block, and these each end up being 8 bytes instead of 4 bytes like they are on 32-bit. In a small object like Int32 this, combined with the fact that we keep the total size pointer size aligned, results in an object that is two times as big on 64-bit. This results in 23MB of Int32 objects in a 64-bit process vs. 11MB in a 32-bit process. (see appendix below)
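The arithmetic behind point 2 can be written down as a tiny model: two pointer-sized pieces of per-object overhead plus the payload, rounded up to a multiple of the pointer size. This is a simplification for illustration (the method name is made up and real runtimes have additional rules such as a minimum object size), but it reproduces the 12-byte and 24-byte sizes that SOS reports in the appendix.

```csharp
using System;

public static class ObjectSizeModel
{
    // sync block + MethodTable* (one pointer each) + payload,
    // rounded up to the next multiple of the pointer size.
    public static int BoxedSize(int payloadBytes, int pointerSize)
    {
        int raw = 2 * pointerSize + payloadBytes;
        return (raw + pointerSize - 1) / pointerSize * pointerSize;
    }

    public static void Main()
    {
        const int count = 1000000;
        foreach (int pointerSize in new int[] { 4, 8 })
        {
            int size = BoxedSize(sizeof(int), pointerSize);
            Console.WriteLine("{0}-bit: {1} bytes per boxed Int32, {2:F1}MB for {3} of them",
                pointerSize * 8, size, (double)size * count / (1024 * 1024), count);
        }
    }
}
```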

However, we can see that if we had instead used the generic List<int> we would have only allocated the same 8MB that we did in a 32-bit process. It is of note that since object references are 8 bytes on a 64-bit machine but an int is still 4 bytes, we end up with the size of the underlying array being 50% smaller in the List<int> test case than in the ArrayList case, whereas on 32-bit they were the same. With smaller primitive data types (short, char, bool) the difference is of course more significant and is visible on both platforms, since at that point the types are smaller than an object reference on 32-bit platforms as well.
Given that, what’s the difference in memory usage on the two platforms for the two solutions?
          ArrayList    List<int>    ArrayList / List<int> (%)
32-bit    19MB         8MB          237%
64-bit    39MB         8.1MB        481%
We can see that though it is definitely a better idea to use the generic List<int> on both platforms, it becomes especially important on a 64-bit machine…
Then again, how many people are using ArrayLists for big lists of ints and other primitive types? That’s a question that I don’t know the answer to…
The analysis (from a memory perspective) would be a lot different if we were talking reference types instead of ints. In that case we don’t have the option to avoid creating object instances; and the size of the reference to the reference type is the same size as that of a reference to an object. There are still plenty of reasons to use the generic List<T> in that case… But memory usage would no longer factor in.
Appendix: using SOS to look at Int32 objects in memory
64-bit, Int32:
0:003> !do 000007ffe724d768
Name: System.Int32
MethodTable: 000000005cbf4408
EEClass: 000000005cc66380
Size: 24(0x18) bytes
(C:\WINDOWS\Microsoft.NET\Framework64\v2.0.amd64chk\assembly\GAC_64\mscorlib\2.0.3600.0__b77a5c561934e089\mscorlib.dll)
Fields:
MT Field Offset Type Attr Value Name
000000005cbf4408 40003cf 8 System.Int32 instance 3 m_value
32-bit, Int32:
0:003> !do 00a56adc
Name: System.Int32
MethodTable: 78c12528
EEClass: 78c4ae44
Size: 12(0xc) bytes
(C:\WINDOWS\Microsoft.NET\Framework\v2.0.x86ret\assembly\GAC_32\mscorlib\2.0.3600.0__b77a5c561934e089\mscorlib.dll)
Fields:
MT Field Offset Type Attr Value Name
78c12528 40003bc 4 System.Int32 instance 437 m_value
-
Sometimes the differences between platforms can show up in interesting ways. Last week I was looking at a bug that was filed about a difference in error mode between IA64 and x64/x86 platforms… I thought the investigation led me down an interesting path so I thought I’d share it with you.
What might you assume the thread stack layout to look like on x86, x64, IA64? Well, fundamentally they all look pretty similar, something like this (artistic license has been taken):
-- top of stack
0x9000 -- Frame A
0x8000 -- Frame B
0x7000 -- Frame C
0x6000 -- Frame D [call graph looks like: A()->B()->C()->D()]
0x5FF8 <return address>
0x5FF0 BYTE* ptr
0x5F00 BYTE[] a (stack allocated byte array, size F0)
0x5000 <space>
0x1000 Soft guard
0x0000 Hard guard
Of course that has been significantly simplified for the purposes of this discussion, and some of the addresses might be a little bogus as I just made them up. The interesting takeaways however are:
1) The stack “grows” down. That is D() is called by C() and therefore D’s frame is at a lower address than C’s.
2) At the end of the stack is a “guard” region, which exists to implement stack overflow exception handling. If you touch the soft guard, the OS will raise a stack overflow exception and you will be given the stack space that is the guard region to deal with it.
3) After the soft guard is the hard guard. The hard guard is always unallocated memory which will cause an AV if you touch it. In fact, if you’re dealing with a stack overflow caused by touching the soft guard and you use up too much stack and touch the hard guard, then you’ll get an AV which will take down the process.
4) There is no stack guard region at the top of the stack to protect you from stack underflow, up there is just some random memory, could be another thread stack, could be the managed heap, could be the end of memory…
If you read off the top of your stack the results are undefined, but you can safely assume that if you keep reading then at some point you’ll get an AV, or so it seems.
Who would read off the top of the stack you might ask? Probably no one, but yesterday I ran into a test case that was doing just that. It would create a stack based byte array and then pass a pointer to the first element of the array to our unsafe string constructor which takes a byte* and a length. Instead of giving it the actual length of the byte[] that was created on the stack, the test case would proceed to pass a length like Int32.MaxValue or some other such huge (and incorrect) thing.
What happens behind the scenes in the BCL at this point isn’t exactly rocket science: we create a string and proceed to read bytes out of the passed in byte[]. It is very similar in concept to writing the following C# code yourself:
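public unsafe void EventuallyBlowUp()
{
    // 256 bytes allocated on the current thread's stack
    SByte* p1 = stackalloc SByte[256];
    SByte temp;
    // Read upward from the buffer, far past its 256-byte end
    for (int i = 0; i < Int32.MaxValue; i++)
    {
        temp = *(p1 + i);
    }
}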
That’s oversimplified. Really we make a string after doing some range checks and such and then memcpy the data from the byte* into the string (there’s a reason that this code is marked as unsafe). Note that while the stack grows down, our reading of the data from the SByte[] results in addresses that grow up. Therefore at some point, if the offset gets big enough, we read off the top of the stack and into random memory.
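For reference, here is a minimal sketch of the well-behaved version of that test case, assuming it used the String(sbyte*, int, int) constructor (the post only says "a byte* and a length", so the exact overload is an assumption). The class name and buffer contents are invented for illustration, and the code needs to be compiled with /unsafe.

```csharp
using System;

public static class StackStringDemo
{
    public static unsafe string ReadFromStack()
    {
        const int length = 5;

        // Stack allocated buffer, as in the test case described above.
        sbyte* p = stackalloc sbyte[length];
        p[0] = (sbyte)'h';
        p[1] = (sbyte)'e';
        p[2] = (sbyte)'l';
        p[3] = (sbyte)'l';
        p[4] = (sbyte)'o';

        // Correct usage: the length matches the allocation.
        // The broken test instead passed something like Int32.MaxValue,
        // which makes the constructor read far past the end of the buffer.
        return new string(p, 0, length);
    }

    public static void Main()
    {
        Console.WriteLine(ReadFromStack());
    }
}
```

With ASCII bytes like these the result is the string "hello"; swapping in a huge length is what walks the read off the top of the stack.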
In this specific test case we were looking for the “expected” AV to happen and be converted into our new AccessViolationException (I think this is new in v2.0). But on IA64 it wasn’t, instead it was coming back as a StackOverflowException. Confusion ensued… For a while I was convinced we had something weird going on where in this random case we had two thread stacks next to each other and for some reason instead of getting the expected AV when we hit the hard guard for the next stack we were skipping into its soft guard and getting a stack overflow instead, the problem however didn’t turn out to be nearly so convoluted.
First a little background, the IA64 platform actually has two stacks for a thread, the “normal” stack and the “backing store”. I really should get around to writing up a piece on the IA64 calling convention and by association the interesting thing that is the backing store and rotating register stack… but for now it is enough to know that IA64 has this other thing called the backing store which is used for storing register values to memory from registers that have been allocated by a function for use as input, locals and output… And this backing store is laid out in memory such that it is next to and contiguous with the “normal stack”… And it grows up instead of down. The picture looks something like this:
0x1a000 Backing store hard guard
0x19000 Backing store soft guard
0x14000 <space>
0x13000 -- Frame D rotating register store
0x12000 -- Frame C rotating register store
0x11000 -- Frame B rotating register store
0x10000 -- Frame A rotating register store
-- “top” of “backing store” stack
-- top of “normal” stack
0x9000 -- Frame A
0x8000 -- Frame B
0x7000 -- Frame C
0x6000 -- Frame D [call graph looks like: A()->B()->C()->D()]
0x5FF8 <return address>
0x5FF0 BYTE* ptr
0x5F00 BYTE[] a (stack allocated byte array, size F0)
0x5000 <space>
0x1000 Soft guard
0x0000 Hard guard
When we have code like that which we saw above, and we run it on an IA64 box the result of running off the top of our “normal” stack (where the byte[] is allocated) is different. Instead of immediately running into random memory (and presumably AVing), we will consistently run into a known piece of memory that is the backing store stack. And, as we continue reading up that stack eventually we will run into the backing store soft guard region and cause the OS to issue a stack overflow exception which the CLR will convert to a managed StackOverflowException and return to the code in EventuallyBlowUp(). Maybe EventuallyBlowUp’s caller deals with the stack overflow, maybe not, of course the same can be said for the AV.
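The difference can be modelled with a toy walk over the two diagrams. This is only a sketch: the region boundaries reuse the made-up addresses from the pictures, the assumption that the normal stack ends at 0xA000 (flat) or runs right up to the backing store at 0x10000 (IA64) is mine, and all type and method names are hypothetical.

```csharp
using System;

public static class StackWalkSketch
{
    // One address range [Start, End) from the simplified diagrams.
    public struct Region
    {
        public readonly string Name;
        public readonly long Start;
        public readonly long End;
        public readonly bool IsGuard;

        public Region(string name, long start, long end, bool isGuard)
        {
            Name = name; Start = start; End = end; IsGuard = isGuard;
        }
    }

    // Read upward from 'address' and report the first guard region touched,
    // or "unknown memory" once we leave the regions we know about.
    public static string FirstHitReadingUp(Region[] regionsLowToHigh, long address)
    {
        long current = address;
        foreach (Region r in regionsLowToHigh)
        {
            if (r.End <= current) continue;   // entirely below the read position
            if (r.Start > current) break;     // gap: we left the known layout
            if (r.IsGuard) return r.Name;     // touching a guard raises the fault
            current = r.End;                  // readable memory: keep going up
        }
        return "unknown memory";
    }

    public static Region[] FlatLayout()
    {
        return new Region[]
        {
            new Region("hard guard", 0x0000, 0x1000, true),
            new Region("soft guard", 0x1000, 0x5000, true),
            new Region("normal stack", 0x5000, 0xA000, false),
        };
    }

    public static Region[] Ia64Layout()
    {
        return new Region[]
        {
            new Region("hard guard", 0x0000, 0x1000, true),
            new Region("soft guard", 0x1000, 0x5000, true),
            new Region("normal stack", 0x5000, 0x10000, false),
            new Region("backing store", 0x10000, 0x19000, false),
            new Region("backing store soft guard", 0x19000, 0x1A000, true),
            new Region("backing store hard guard", 0x1A000, 0x1B000, true),
        };
    }

    public static void Main()
    {
        // 0x5F00 is where the stack allocated byte array sits in both pictures.
        Console.WriteLine(FirstHitReadingUp(FlatLayout(), 0x5F00));
        Console.WriteLine(FirstHitReadingUp(Ia64Layout(), 0x5F00));
    }
}
```

In the flat layout the walk ends in unknown memory (where an AV is the likely outcome), while in the IA64 layout it reaches the backing store soft guard first, which is the stack overflow seen in the bug.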
The moral of the story: it’s difficult to completely abstract away the underlying platform. In this case we had a discussion about whether or not to “fix the bug” in the string constructor such that it would always return an AV by checking whether or not the requested start offset and length when used with the given pointer (if it was stack allocated) would result in stack underflow. We decided for now to leave it like it is because it’s unsafe code and the current implementation makes the failure mode match that of a programmer writing similar unsafe code themselves.
Fixing the general unsafe code stack underflow case is of course far from trivial, and of debatable value.
-
I know... I've been meaning to write more, but I've been _really_ busy. I am going to Utah for a little spring skiing this weekend and I promise to spend some quality time with my laptop on the plane and crank out the finishing touches on a couple of half finished entries...
-
Ok, so Word for the Mac is failing me right now. I've tried twice to start this entry there and both times Word has gone kaput on me. Back to my trusty text editor… As for the inevitable "why Mac?" question... Well, I still haven't found a laptop I like as much as my Titanium Powerbook. What can I say, I'm a hardware snob.
I originally intended to write my next entry about the GAC and its usefulness on 64-bit machines (for both the 32-bit and 64-bit CLR(s) that live there). I think this is an interesting topic, especially given this article on Chris Sells' site. Alas, in writing it I realized that I need to do a little research and talk to a couple people before I feel completely competent with my facts.
So, to whet your appetite while we wait, how about an entry about the managed x64 calling convention, and a fun PInvoke bug that shows up on 64bit platforms because of the hardware difference and this calling convention? I would highly recommend reading Raymond's treatment of x64 calling convention.
One of the nice things about x64 is that we have narrowed ourselves down to one standard calling convention unlike x86 (of course if you're writing assembly you can do whatever you want). Both native code generated by the VC++ compiler and JITted managed code follow this convention. And it goes something like this:
- Arguments 1-4 are passed in registers rcx,rdx,r8,r9 (or if floating point Xmm0-Xmm3)
- Spill space is allocated by the caller for the enregistered parameters
- Additional parameters are passed on the stack previous (stack grows down) to this spill space in SLOT sized (read: 8-byte) chunks (i.e. even if you have a 1-byte bool, if you pass it on the stack it will take 8 bytes).
- The call instruction pushes an 8-byte return address onto the stack; this value will immediately follow the spill space for rcx.
- The stack must always be aligned to 16 bytes by non-leaf functions (read: if you make a call then you have to align it in the prolog).
- Non floating point returns are through rax (exception "retbufarg" which is treated later).
- Floating point returns are in Xmm0.
Floating point note: enregistered floating point parameters are put into the floating point register corresponding to their correct position in the argument list (e.g. if parameter 3 is floating point then it will be in Xmm2 instead of r8, this is different from IA64 where the floating point registers are filled using a "next available" heuristic).
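The position-based register assignment described above can be sketched as a small lookup. The class and method names here are hypothetical, and the sketch only covers the first rule set (no "this" or "retbufarg" shifting, which comes next).

```csharp
using System;

public static class X64Args
{
    private static readonly string[] IntegerRegisters = { "rcx", "rdx", "r8", "r9" };
    private static readonly string[] FloatRegisters = { "xmm0", "xmm1", "xmm2", "xmm3" };

    // 'position' is the 1-based slot in the argument list.
    // The register is chosen by position, not by "next available".
    public static string RegisterForArgument(int position, bool isFloatingPoint)
    {
        if (position < 1) throw new ArgumentOutOfRangeException("position");
        if (position > 4) return "stack";   // passed in an 8-byte stack slot
        return isFloatingPoint
            ? FloatRegisters[position - 1]
            : IntegerRegisters[position - 1];
    }

    public static void Main()
    {
        // Signature shape: (int, int, double, int, int)
        bool[] isFloat = { false, false, true, false, false };
        for (int i = 0; i < isFloat.Length; i++)
        {
            Console.WriteLine("arg{0} -> {1}", i + 1, RegisterForArgument(i + 1, isFloat[i]));
        }
    }
}
```

For that signature the third argument lands in xmm2 (r8 is left unused for it) and the fifth goes to the stack.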
That's the basics. Here are a couple of rules that build on that:
- If there is a "this" parameter (i.e. instance methods) it is put at the front of the argument list as arg1 and other args are moved by 1 slot.
- If there is a "retbufarg" then it will be treated as arg1, moving other arguments by 1 slot (including the “this” parameter). (e.g. arg1=retbufarg, arg2=this, arg3=declaration arg1, arg4=declaration arg2, etc...)
Most people (at least those reading this blog right?) know what a "this" parameter is, but what's a "retbufarg" parameter? It is a "secret" reference to space that is caller-allocated to receive the return value. This "retbufarg" parameter is passed when we can't put the return value in the return register rax. On x64 this happens when:
- the return value > 64-bits (i.e. won't fit in rax), excepting Doubles which will be returned through Xmm0.
- the size of the return value is not a power of two. e.g. a 7-byte value class (struct) returned by value will be returned by reference in a retbufarg.
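Put together, the two retbufarg conditions and the argument shifting rules look roughly like the sketch below. It is a simplification that only encodes the rules as stated here for non floating point value types; the names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class RetBufRules
{
    // True when a value of this size cannot come back in rax:
    // either it is larger than 8 bytes or its size is not a power of two.
    public static bool NeedsRetBuf(int sizeInBytes)
    {
        bool fitsInRax = sizeInBytes <= 8;
        bool isPowerOfTwo = sizeInBytes > 0 && (sizeInBytes & (sizeInBytes - 1)) == 0;
        return !fitsInRax || !isPowerOfTwo;
    }

    // Final slot order: retbufarg (if any), then "this" (if any), then the declared arguments.
    public static List<string> ArgumentOrder(bool hasRetBuf, bool hasThis, params string[] declared)
    {
        List<string> order = new List<string>();
        if (hasRetBuf) order.Add("retbufarg");
        if (hasThis) order.Add("this");
        order.AddRange(declared);
        return order;
    }

    public static void Main()
    {
        Console.WriteLine("7-byte struct needs retbuf: {0}", NeedsRetBuf(7));
        Console.WriteLine("8-byte struct needs retbuf: {0}", NeedsRetBuf(8));
        Console.WriteLine(string.Join(", ", ArgumentOrder(true, true, "a", "b").ToArray()));
    }
}
```

An instance method returning a 7-byte struct therefore ends up with its first declared argument in the third slot.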
Ok, so that's all well and good, but why did I need to know that you might be asking? Well, because it can affect lots of things. Let's take a PInvoke example that one of the devs on the 64-bit CLR team ran into on Thursday:
// definition that worked on 32-bit
[DllImport(ExternDll.User32, ExactSpelling=true)]
public static extern IntPtr MonitorFromPoint(int x, int y, int flags);
The actual Win32 API specified that MonitorFromPoint() takes a POINT structure and an int argument named flags. Someone decided that it would be nice to not have to define a POINT structure (which is just an 8-byte structure consisting of two ints, x and y) and instead wrote their PInvoke using the two ints shown above.
This works on x86 where those parameters are passed on the stack. In fact, because you get lucky with the calling convention they look to the Win32 API as if you had correctly declared the POINT structure and passed it instead.
But!! On x64 this breaks in a rather interesting way... Let's go back to the calling convention discussion above. Using this scheme, the parameters will be set up as such:
rcx <- x
rdx <- y
r8 <- flags
Now, these register slots on x64 are 8 bytes wide, which means our 8-byte POINT structure, when passed by value, should actually be passed in a single register. What was the MonitorFromPoint() Win32 API expecting?
rcx <- POINT { LONG x, LONG y }
rdx <- flags
NOTE: keep in mind that the LONG as specified by MSDN here is the c++ LONG which is still 32 bits on 64-bit platforms, not 64 bits like the C# long. It is the equivalent of the C# int.
[correction made here, x/y high/low were reversed]
MonitorFromPoint() expected that x was the low 4 bytes of rcx and y was the high 4 bytes. As can be imagined, this code failed horribly on x64 as such:
- Specifically, the call was in some code that tried to compensate for multiple monitors by putting a dialog on the monitor where your mouse is.
return new Screen(SafeNativeMethods.MonitorFromPoint(point.X, point.Y, MONITOR_DEFAULTTONEAREST));
- The calculation depends on the x and y that you pass it (remember that the monitor’s upper left hand corner actually starts at 2000, 2000 or something like that)
[correction made here, re:messing up x/y position within struct... wrote it too late at night]
- The calculation that we do ends up FUBAR because the x you give the method ends up being seen by the Win32 API as the whole POINT structure. Thus, it thinks that y==0, and the dialog ends up pretty much unusable up in the upper left hand corner of the screen (halfway off the screen) with its title bar inaccessible to grab it and move it.
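The register contents in the two cases can be worked through with a little bit arithmetic. This is a sketch only: the helper names are made up, the coordinates are arbitrary, and it assumes the upper half of rcx happens to be zero in the broken call (which matches the y==0 behaviour described above but is not something the calling convention guarantees).

```csharp
using System;

public static class PointPacking
{
    // How an 8-byte POINT passed by value sits in one register:
    // x in the low 4 bytes, y in the high 4 bytes.
    public static ulong PackPoint(int x, int y)
    {
        unchecked
        {
            return (ulong)(uint)x | ((ulong)(uint)y << 32);
        }
    }

    // What the callee reads back out of that register.
    public static void UnpackPoint(ulong register, out int x, out int y)
    {
        unchecked
        {
            x = (int)(uint)(register & 0xFFFFFFFF);
            y = (int)(uint)(register >> 32);
        }
    }

    public static void Main()
    {
        int x, y;

        // Correct declaration: rcx holds the whole POINT.
        UnpackPoint(PackPoint(2100, 2050), out x, out y);
        Console.WriteLine("correct: x={0}, y={1}", x, y);

        // Broken declaration: rcx holds only x, so the callee sees y == 0
        // (and the real y shows up in rdx, where flags was expected).
        UnpackPoint((ulong)2100, out x, out y);
        Console.WriteLine("broken:  x={0}, y={1}", x, y);
    }
}
```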
So, the fix, if you haven’t already guessed, is to define a POINT structure containing 2 ints “x” and “y” which you then correctly define as the first parameter to MonitorFromPoint(), in this way ensuring that the usage of MonitorFromPoint() is correct.
public static extern IntPtr MonitorFromPoint(NativeMethods.POINT pt, int flags);
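A self-contained version of the corrected declaration might look like the following. The struct and class names are stand-ins for the NativeMethods.POINT and SafeNativeMethods types mentioned above (whose actual definitions are not shown here), and "user32.dll" is written out in place of the ExternDll.User32 constant.

```csharp
using System;
using System.Runtime.InteropServices;

// Two 4-byte ints laid out in declaration order: 8 bytes total, matching the Win32 POINT.
[StructLayout(LayoutKind.Sequential)]
public struct POINT
{
    public int x;
    public int y;

    public POINT(int x, int y)
    {
        this.x = x;
        this.y = y;
    }
}

public static class MonitorNative
{
    public const int MONITOR_DEFAULTTONEAREST = 2;

    // The POINT is passed by value, so on x64 it travels in a single register.
    [DllImport("user32.dll", ExactSpelling = true)]
    public static extern IntPtr MonitorFromPoint(POINT pt, int flags);
}
```

A call then reads MonitorNative.MonitorFromPoint(new POINT(px, py), MonitorNative.MONITOR_DEFAULTTONEAREST), and the same declaration works for both 32-bit and 64-bit processes.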
NOTE: this will fail in the same way on IA64, but since this is an entry about the x64 calling convention, I thought I'd stick to talking about x64.
PInvoke errors are insidious because you might take for granted that the method you're calling is declared correctly. You would be likely to spend hours having to convince yourself that your managed code is correct. Or even worse, spend hours looking at your unmanaged code (or the disassembly of some unmanaged code in Win32 for instance), convinced it is broken. Usually if there are PInvokes involved, I would take a look at those first. Hopefully some of the CDP (customer debug probes) that are going into CLR for V2.0 will help out a lot. I haven't really played with them at all, but Adam Nathan's blog would probably be a good place to start.
Additionally, Raymond discusses what can go wrong when you mismatch calling conventions. This is something you might think impossible on 64-bit as we only have the one... But, a PInvoke declaration can have calling convention assumptions built into it, as seen above... Yet another case of "old problem, new form"!
-
As I alluded to in my previous post there are multiple ways that we can go in terms of supporting legacy 1.0/1.1 assemblies on a Win64 machine. The context of the 1.0/1.1 support story is made somewhat simpler by the current plan of having both a v2.0 32bit CLR and a v2.0 64bit CLR on the box but no 1.0/1.1 CLR bits.
I mentioned that the 1.0/1.1 compilers didn't know anything about “bitness”. Basically they spit out a PE image that said “Hey! I'm managed! Run me with the CLR!” (gross simplification), whereas the v2.0 compilers produce images that range from “Hey! I'm managed, and I can run everywhere!!” to “Hey! I'm managed and I only run on x86!” etc...
This brings us to the fundamental question of this post -- what to do with 1.0/1.1 assemblies?
Option 1: call them “Legacy” assemblies since they don't know about “bitness”. Require them to run in the WOW64 under the 32bit CLR as we can't say for sure that the developer who created them was thinking about 64bit compatibility when they were created (remember that many of these were created years before even a 64bit alpha of .NET was available at PDC last year). Additionally, make the loader get angry and spew something along the lines of “BAD_IMAGE_FORMAT” if you try to load a legacy assembly in a native 64bit managed process just as if you had tried to load a v2.0 assembly marked x86 only.
Option 2: treat them like the v2.0 notion of MSIL assemblies, allow them to be used from both 32bit and 64bit managed processes. By default if they are an exe kick off the 64bit CLR when someone tries to start them. This would cause them to run as a 64bit process even though their creators probably didn't have that potential in mind when the code was written and tested.
Cases can be made for both sides. Right now the more conservative approach is “Option 1” which is what we are leaning towards. But there are definitely some negatives to that, the primary one in my mind being that it makes the transition to 64bit harder for groups that have dependencies on a lot of managed code that they don't own but are willing to do the testing legwork to make sure they work in 64bit mode anyway. In effect it makes the 1.0/1.1 managed code assemblies much like 32bit native code components as dependencies for moving your app to 64bit because in that scenario we won't let you load 1.0/1.1 assemblies in your 64bit process.
One of the great things about managed code is that frequently there isn't much if any work to be done to move it to 64bit. But given “Option 1” above we would at least require the work of a recompile (though someone could imagine a tool that would be frightfully dangerous which would modify 1.0/1.1 headers to look like 2.0 headers to pretend to be a v2.0 compiled MSIL image... Please don't do this!!). If you don't own the managed code you're using that means waiting for whoever does to recompile and give you the properly tagged version before you can move your app to 64bit.
Mind you, that is probably better than the alternative. If we were to just load up 1.0/1.1 images in a 64bit process expecting that they should be the equivalent of v2.0’s MSIL (which is what the compilers are currently producing as a default) you could end up with all manner of random execution failure, usually related to calling into some native 32bit code or other... “Option 2” would allow those who are willing to do the legwork in testing to do their due diligence, test their application in a 64bit environment thoroughly even though it might contain 1.0/1.1 components and be able to say with reasonable confidence that their customers won't have problems running 64bit native. The fact that I said “willing to do the legwork in testing” and “due diligence” etc. should be setting off huge danger signals in your head. How many people are willing to thoroughly test some component they paid for, isn’t that part of what you paid for??
There are of course all manner of “in-between” scenarios, few of which are supportable or justifiable, so for the purposes of this debate let's stick to these two options.
The main reason I started writing this post however wasn't to make up your mind but to poll your thoughts…
Thoughts?
-
So, what is the WOW64? If you already have a firm grasp please feel free to skip these first few paragraphs to the section titled “Break in here for the .Net perspective”.
When I went and Googled a bit to see if anyone had a convincing answer out there, what I seemed to find instead was confusion. Note this article titled “WOW64 for AMD Released to the Public”. With a firmer understanding of what the WOW64 is, the fact that this headline is misleading would not have escaped that journalist. So let's start with the name:
WOW64 = Windows On Windows64
It might be more appropriately written “Windows32 On Windows64” but then the acronym isn’t nearly as cool. Basically it is a layer of code that allows for 32bit processes to run just like they were running on a 32bit system under a 32bit OS (e.g. your normal managed app running on the CLR on WindowsXP on a Pentium 4) even though in reality they are running on a 64bit OS. In fact you can find a good indicator of this by looking in your WINDOWS directory where now you will not only find the misnamed SYSTEM32 directory (which holds the 64bit dlls for the 64bit OS) but also the SYSWOW64 directory which holds 32bit versions of dlls for the virtual 32bit OS.
<historical note> I believe that there was a WOW32 effort as well during the move from Win16 to Win32. That move was accompanied by a significant change in programming model as the Win32 API was introduced (The Win16 stuff still lives in WINDOWS\SYSTEM). Generally the Win64 platform uses the Win32 API with platform specific data types (pointers and such) just expanded to fit the new hardware. This is why Windows decided to stick with the SYSTEM32 directory as the primary home for Win64 dlls even though the name can be misleading. This entry and discussion on Raymond’s blog is an interesting read that talks about the joys of backwards compatibility and Win16. </historical note>
You can imagine (and not be too far off) that once you’ve been kicked off by the loader as a 32bit process WINDOWS\SysWow64 gets aliased (for your purposes) to WINDOWS\System32. This has side effects, like the fact that one of my favorite little text editors (EmEditor) which is 32bit can’t find files in the real WINDOWS\System32 directory when running in the WOW64 on my 64bit boxes (it took me a while to figure out why I couldn’t find a couple of scripts that I knew for sure I had dropped in there). There are other parts of the system which are also split into two parts, most notably the registry, unfortunately I don’t know nearly enough about the technical details of that split to discuss it intelligently here.
Under the hood Windows is doing a lot of work to make sure things continue to look just like you’re on a 32bit box and this generally pays off in app compatibility. As would be expected however there is a performance hit that you take for running in this mode. It is an especially heavy hit on IA64 where in the current implementation the x86 instruction set is emulated with software. It is worth noting however that it does work pretty darn well, and the x86 CLR that we’ll be shipping to Win64 users to run in the WOW64 is the same build as that which we ship to x86 users (we just have to mess with the installer a bit). Other large apps can run in the WOW64 and be quite snappy on 64bit Extended hardware platforms (read: AMD64, IA32e) which can natively run the 32bit x86 instructions in a special mode of the processor. I've played with VS for instance running on an AMD64 machine like this and it's great.
It is important to understand that once you’re loaded up in the WOW64 you’re a 32bit process and there’s no turning back. You have to use 32bit dlls all around. On the other hand, once you start up 64bit you’re 64bit, no 32bit dlls allowed. So, unfortunately this is not one of those things you can slowly transition. It is possible to kick off another process of a different bitness from your own and communicate with it through RPC or out of proc COM, and this can serve as a transition for some. But hopefully over the next couple months I’ll help to convince you that the move to running native as a 64bit process is going to be worth it.
>> Break in here for the .NET perspective <<
So, what does this mean for .Net and the CLR? The Whidbey CLR will be available in 32bit and 64bit versions with the 64bit version supporting x64 (AMD64/IA32e) and IA64 (Itanium). And the current plan is that when you install the 64bit CLR on a machine we’ll slap down the 32bit version of Whidbey at the same time. Yes, you read that right – your 64bit machine will end up with two copies of the runtime. The 32bit version will be installed in \WINDOWS\Microsoft.Net\Framework just as it would on a native 32bit machine whereas the 64bit version ends up in \WINDOWS\Microsoft.Net\Framework64, no rocket science here, though the GAC issues can be interesting (I’ll talk about these for sure in a later blog entry). Why do we need two frameworks on the machine you might ask? Isn’t one of the cool things about compiling to IL that the JIT takes care of the hardware specific stuff and your code just automatically takes advantage of the new platform when it is there?
Well… There are many cases where you just want your code to “float up” to the new platform and run as native 64bit code. In fact, if you’re writing a fully managed app today chances are good that this is what you desire. But, there are still a lot of apps that use PInvoke, COM Interop to inproc servers (including using VB libraries pre-.Net) or APIs that won’t initially be available in 64bit which could break horribly. Remember that once your process is started as 64bit that’s it. Trying to load and execute code from a 32bit dll at that point is undefined and your process will certainly die a quick and painful death. While it is true that most of the dlls that come with Windows will be available on both 32bit and 64bit platforms, much of the custom unmanaged code out there hasn’t yet made the jump. The plus side to not going to a new Win64 API is that if you define your PInvoke signatures correctly, most of the Win32 API is still there in a 64bit version and you’ll just pick up the new dlls and run native! In cases where you’re using some unmanaged code that you have no control over and isn’t available in a 64bit version it behooves you to specify that your app requires that it be run in a 32bit process.
How do you specify that? Well, fundamentally it is specified in the CLR headers of your PE image (the format of your dll or exe on disk and in memory). Realistically it is set at compile time (e.g. with C# use the /platform switch). Your exes and dlls can be marked one of four things: MSIL, x86, x64, IA64 (names vary depending on who you talk to). Let’s start out with the obvious ones:
x86: on a 32bit platform this will just run, no hassles. On a 64bit platform this will get started up as a 32bit process in the WOW64 running under the 32bit CLR. You’ll load 32bit .Net FX images out of the GAC (like MSCorLib, System.Web, System.Windows.Forms, etc..). Your code will be JITted by the x86 JITter and will run (in emulation on IA64) as x86 code.
x64: on a 64bit x64 platform this will run as a 64bit process. On an IA64 or x86 box you’ll get a BadImageFormatException. When you’re running as a 64bit process your image will be JITted 64bit native, we’ll pull 64bit images out of the GAC (be it by loading a 64bit native image that has been previously ngen’d or grabbing IL out of the GAC and JITting it at runtime to be 64bit).
IA64: like x64, but swap x64 for IA64.
Then there is MSIL (or “anycpu” as I think the C# compiler switch goes, some people will also refer to this as agnostic or neutral), this indicates that the code here really isn’t processor specific, on a 32bit platform it will get loaded up as a 32bit process and on a 64bit platform it will float up to a 64bit process and run under the 64bit CLR. This is rather important to know as this is currently the default for applications that are compiled using Whidbey compilers (disclaimer: this is still being hotly debated as to whether or not it is the right default, “currently” in this case means if you look at the build that I’m working with on my box on 3/11/04, what we ship may/may not reflect that).
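The four markings and their outcomes can be summarized as a small decision function. This is just a restatement of the descriptions above in code form; the enum and method names are hypothetical and it says nothing about how the loader actually implements the decision.

```csharp
using System;

public enum AssemblyPlatform { MSIL, X86, X64, IA64 }
public enum Machine { X86, X64, IA64 }

public static class PlatformRules
{
    // What happens when an exe marked 'platform' is started on 'machine'.
    public static string HowItRuns(AssemblyPlatform platform, Machine machine)
    {
        switch (platform)
        {
            case AssemblyPlatform.X86:
                // Runs everywhere, but only ever as a 32bit process.
                return machine == Machine.X86
                    ? "32bit process"
                    : "32bit process under WOW64";
            case AssemblyPlatform.X64:
                return machine == Machine.X64
                    ? "64bit process"
                    : "BadImageFormatException";
            case AssemblyPlatform.IA64:
                return machine == Machine.IA64
                    ? "64bit process"
                    : "BadImageFormatException";
            default:
                // MSIL / anycpu: floats up to whatever the machine is.
                return machine == Machine.X86
                    ? "32bit process"
                    : "64bit process";
        }
    }

    public static void Main()
    {
        foreach (AssemblyPlatform p in Enum.GetValues(typeof(AssemblyPlatform)))
        {
            foreach (Machine m in Enum.GetValues(typeof(Machine)))
            {
                Console.WriteLine("{0} on {1}: {2}", p, m, HowItRuns(p, m));
            }
        }
    }
}
```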
<1.0&1.1 Note> As for 1.0/1.1 apps that are deployed onto a 64bit box, right now we’re probably going with a model that assumes that those assemblies (which didn’t know about bitness) will run in a 32bit process under the WOW64 as if they had been compiled with /platform:x86. The thought process leading us this direction is that we’d prefer to be conservative and by running 1.0/1.1 apps in the WOW64 we ensure that they don’t run into bitness issues that their developers might not have anticipated as it wasn’t part of the .NET world when they were being developed.
Under that model, if you do want your application to run in a 64bit process you’ll need to explicitly recompile it with one of the Whidbey compilers which know about bitness. Then (currently, see disclaimer above) if you don’t specify /platform:x86 your app will run 64bit when/if it is deployed to an x64 or IA64 box. </1.0&1.1 Note>
<GEEK> There is actually some interesting stuff that goes on under the covers here as well, that is if you’re interested in PE images and loader magic. x64 and IA64 images generated by the compiler with /platform:x64 or :IA64 will be PE32+ images (the 64bit extension to PE32) whereas x86 and MSIL are PE32 images (otherwise they wouldn’t work on 32bit OS’s). When the OS loader comes across a managed image the first thing it does is hand it to some CLR code we call the shim (mscoree.dll) which interrogates the image, potentially makes some fixups and gives it back to the OS loader to let the OS then kick it off and trigger the actual runtime to start up. But, to get the OS loader to load your app as a 64bit process you have to give it a 64bit image, so in the shim on 64bit machines we will actually modify MSIL images in memory to turn them into PE32+ images before handing them back to the OS loader. This then in turn causes the loader to start up the right (64bit) runtime. </GEEK>
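If you want to see which flavor a file on disk is, the PE32 / PE32+ distinction is recorded in the "magic" field at the start of the optional header (0x10B for PE32, 0x20B for PE32+). The sketch below reads that field from a byte array; the class name is made up, it assumes a little-endian machine for BitConverter, and it only looks at the on-disk image, so it cannot show the in-memory rewrite the shim performs.

```csharp
using System;

public static class PeKind
{
    public static string Describe(byte[] image)
    {
        // DOS header: starts with "MZ", and the PE header offset lives at 0x3C.
        if (image == null || image.Length < 0x40 || image[0] != 'M' || image[1] != 'Z')
            return "not a PE image";

        int peOffset = BitConverter.ToInt32(image, 0x3C);
        if (peOffset < 0 || peOffset > image.Length - 26)
            return "not a PE image";

        // PE signature: "PE\0\0".
        if (image[peOffset] != 'P' || image[peOffset + 1] != 'E' ||
            image[peOffset + 2] != 0 || image[peOffset + 3] != 0)
            return "not a PE image";

        // The optional header follows the 4-byte signature and the 20-byte COFF header.
        ushort magic = BitConverter.ToUInt16(image, peOffset + 24);
        switch (magic)
        {
            case 0x10B: return "PE32";
            case 0x20B: return "PE32+";
            default: return "unknown optional header";
        }
    }
}
```

Typical usage would be PeKind.Describe(System.IO.File.ReadAllBytes(path)) for some exe or dll path of your choosing.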
>> End .NET perspective <<
Recap: So, what have I talked about?
- WOW64 isn’t the OS per se, but a subset of the Win64 OS which enables a 32bit application to run inside of a 32bit process on a 64bit OS while using 32bit system dlls and such.
- Whidbey CLR will include both 32bit and 64bit versions, both of which will be installed on 64bit machines. This allows both 32bit and 64bit managed applications to run in bitness-correct native processes depending on how the assemblies are tagged at compile time.
- Once a process is started up as either 32bit or 64bit all of the dlls/assemblies that are loaded into that process have to be compatible with that bitness. There is a significant complexity for instance in the GAC to make this possible for .NET FX images (I promise to talk more about that in a later blog entry).
- If you have a 32bit managed app which has dependencies on 32bit unmanaged code then you’ll need to either find a 64bit version of the unmanaged code or tag your managed app as x86 at compile time to make sure that you don’t float up to a native 64bit process. This will result in you having to live with your process running under the WOW64.