
Haven't posted in a while... New challenges... New peers?

So, I haven't posted in quite a while, and I need to get on that... In the meantime I've moved on from the CLR team after about 5 years of fun times, and am now having fun exploring some new stuff that unfortunately I can't talk about... However, the peers are awesome, and now we're looking for more awesome people:

http://www.douglasp.com/blog/2007/11/15/MyTeamIsHiring.aspx

Interested in kick ass technology with crazy smart people? Email Doug (douglasp@microsoft.com).

Posted by joshwil | 1 Comments

Patrick is blogging, make sure to check it out!

http://blogs.msdn.com/patrick_dussud/default.aspx

Patrick is super-smart... Hopefully his blog will pick up where Chris Brumme's left off a couple of years back.

 -josh


PerfConsole is my kind of 20% time...

Other companies have their much-talked-about 20% time... That sounds great to me! Here at Microsoft (at least in my group) we do too, kinda. I find that every one of my peers has some interesting project that they're working on outside of their normal work activities. Some of the categories, for people I know of on the CLR team, are:

- pet projects related to the product (in this case CLR) which they are prototyping in the hope of getting some traction in the problem space

- pet projects related to the engineering effort around the product (e.g. static analysis, etc...)

- tools related to software engineering in general (much of my stuff falls into this category)

- random stuff using managed code just to use managed code (eating your own dog food is good, right? especially when it tastes this good)

Many of the tools that we use to build and test the CLR started this way, two public examples being PerfConsole and MDBG. Both of these were started as pet projects and have become an integral part of CLR processes (admittedly MDBG significantly more so than PerfConsole at the moment, MDBG being the CLR's primary managed debugging test harness).

 

Anecdotally, I believe that there are lots of people inside Microsoft doing this as well. Internally we have a system for sharing tools you've written that help you get things done; last time I checked it had 4500+ tools on it, and that's just the stuff that people bother to publish.

 

I strongly believe that pet projects should be encouraged within any software development team. I'm glad that my experience with Microsoft has been that they do encourage experimentation, after all where do you think the CLR came from?


PerfConsole is unleashed...

So, Rico beat me to the punch: http://blogs.msdn.com/ricom/archive/2006/08/03/688019.aspx

 

As we have seen in the previous entry, it is nice that the VSTS Profiler team has provided a tool which converts their .VSP files into something which is more easily readable by other programs. While I was working on CLR performance last year I spent a lot of time playing with the output of the VSTS Profiler before the UI pieces were ready, so I spent a lot of time using the data in the .CSV files to diagnose problems. After just a short amount of time doing this by hand I decided to go ahead and automate the process: I created a number of utilities to parse the output CSV files and find some interesting information. In that first rev you basically had to rerun the application any time you wanted to change the inputs to the query (different string, different cutoffs for various things, etc.). It didn't take too long to realize that this wasn't going to scale, and PerfConsole was born...

What we can see above are some of the basics of PerfConsole's command syntax; overall it strives to mix some of the console model that cmd.exe or Windows PowerShell has with what is more or less a domain-specific language for interpreting profile data. A number of commands are included in the base package, among them the following, which we can see above:

- load: loads a profile from disk, commonly you specify a .VSP file and PerfConsole will look for the (required) associated .CSV files which were generated either from the "export" function in VS or using the command line as outlined here.

- functions: takes a profile as input and returns the piece that corresponds to a list of functions which were seen in the profile with Inclusive and Exclusive times.

- modules: takes a profile as input and returns the piece that contains a list of modules which were seen in the profile with Inclusive and Exclusive times.

- sort: takes in a list and returns a sorted list ordered by some property (in this case "ex" is shorthand for "ExclusivePercentage")

- top: takes in a list and returns a subset of that list including X elements starting at element 0

Through the example above we can start to see the basic syntax of using PerfConsole: you use the '|' to connect statements where each statement (reading left to right) has input from the return value of the previous statement and the parameters specified. Like Windows PowerShell when things are passed from left to right they are full fledged .Net objects, and PerfConsole can use those objects to do some type checking as you go in order to tell you if you're combining commands in unsupported ways.
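To make the left-to-right, type-checked piping a little more concrete, here is a minimal C# sketch. It is not PerfConsole's actual implementation: the `IStage` interface, the `Stage<TIn, TOut>` class and the `Pipeline` helper are hypothetical names invented for illustration, and the only idea borrowed from the text is that each statement declares what it accepts and returns so that bad combinations can be rejected.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stage abstraction: each stage declares what it accepts and returns.
public interface IStage
{
    string Name { get; }
    Type InputType { get; }
    Type OutputType { get; }
    object Run(object input);
}

public sealed class Stage<TIn, TOut> : IStage
{
    private readonly Func<TIn, TOut> _body;

    public Stage(string name, Func<TIn, TOut> body)
    {
        Name = name;
        _body = body;
    }

    public string Name { get; private set; }
    public Type InputType { get { return typeof(TIn); } }
    public Type OutputType { get { return typeof(TOut); } }
    public object Run(object input) { return _body((TIn)input); }
}

public static class Pipeline
{
    // Returns null if every stage can accept its predecessor's output,
    // otherwise a message naming the first stage that cannot.
    public static string Validate(IList<IStage> stages)
    {
        for (int i = 1; i < stages.Count; i++)
        {
            Type produced = stages[i - 1].OutputType;
            Type expected = stages[i].InputType;
            if (!expected.IsAssignableFrom(produced))
            {
                return "'" + stages[i].Name + "' cannot accept " + produced.Name;
            }
        }
        return null;
    }

    // Runs the stages left to right, feeding each result to the next stage.
    public static object Run(IList<IStage> stages, object seed)
    {
        object current = seed;
        foreach (IStage stage in stages)
        {
            current = stage.Run(current);
        }
        return current;
    }
}
```

Because the check only looks at declared types, it can run before any stage executes, which is roughly the kind of early feedback described above.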

Incidentally I initially tried really hard to reverse the execution order and instead execute from right to left, since it made for statements that read like a sentence: "top 5 | sort ex | functions @board_ngen" == "get me the top 5 exclusive functions in @board_ngen". However, the people internally revolted and said that if that is what I wanted I couldn't overload '|' to do it.

As a side effect (or design point) of the way that PerfConsole passes objects between statements I have been able to implement a very powerful help system. At any point you're able to pipe the result of any statement or set of statements to '?' and it will tell you the resulting data type and which commands will accept that type as input (in a strongly typed manner). For instance:

> functions @board_ngen | ?
Data type is: PerfConsole.Data.FunctionSummaryData

This data type can be input to the following commands:

bottom
count
find
getitem
getproperty
sort
top
trim
timebytype
partition

Here we see that the functions command returns a FunctionSummaryData object which can be input to a number of other commands, including the top and sort commands. Here we also see that functions has this weird "@board_ngen" parameter. In PerfConsole @ denotes a temporary (you can create these using the save command), and the temporary named "board_ngen" is the profile that I've imported for this demo.

Help also works well with commands. For instance:

> ? sort
Sort a ISortable (e.g. FunctionSummary, ModuleSummary). Each type has its own
sort types (e.g. 'name', 'address', 'exclusive', etc...).

Usage:
ISortable | sort <field:SortField> | ISortable
ISortable | sort <field:SortField> <direction:SortDirection> | ISortable

Have enum paramters:
SortDirection { Ascending, Descending }
SortField { Name, ExclusivePercentage, InclusivePercentage, Address }

Here we can see that the sort command takes as input something which implements ISortable (which, by inference from the previous help output, we believe FunctionSummaryData does). It can also take up to two parameters, which are values of the enums SortField and SortDirection (these are allowed to be shortened arbitrarily so long as they remain disambiguated from their peers). As a return value sort specifies that it returns an ISortable; if we pipe the return value from sort to the help we see:

> functions @board_ngen | sort ex | ?
Data type is: PerfConsole.Data.FunctionSummaryData

This indicates that sort actually returns a FunctionSummaryData (which implements ISortable), behind the scenes the sort command clones the original FunctionSummaryData and returns the modified copy.
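The "shortened arbitrarily so long as they remain disambiguated" rule for enum parameters can be sketched in a few lines of C#. This is only my guess at one way to do it, not PerfConsole's code: the `EnumAbbreviation.Resolve` helper is a hypothetical name, while the two enums are copied from the help output above.

```csharp
using System;
using System.Collections.Generic;

public enum SortDirection { Ascending, Descending }
public enum SortField { Name, ExclusivePercentage, InclusivePercentage, Address }

public static class EnumAbbreviation
{
    // Resolves a case-insensitive prefix to exactly one enum member.
    // Throws if the prefix matches no member or more than one.
    public static TEnum Resolve<TEnum>(string prefix) where TEnum : struct
    {
        var matches = new List<string>();
        foreach (string name in Enum.GetNames(typeof(TEnum)))
        {
            if (name.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                matches.Add(name);
            }
        }

        if (matches.Count == 0)
        {
            throw new ArgumentException(
                "'" + prefix + "' matches nothing in " + typeof(TEnum).Name);
        }
        if (matches.Count > 1)
        {
            throw new ArgumentException(
                "'" + prefix + "' is ambiguous: " + string.Join(", ", matches.ToArray()));
        }
        return (TEnum)Enum.Parse(typeof(TEnum), matches[0]);
    }
}
```

With these enums, `Resolve<SortField>("ex")` picks ExclusivePercentage, which lines up with the "ex" shorthand used earlier, while an empty prefix is rejected as ambiguous.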

 

Note that for simplicity's sake I've used PerfConsole with the /console command line option which causes results to show in the console, the default is to show results as HTML which you can click around to do more queries and such, in fact, before I end this blog entry which has already managed to get too long I'll show you Rico's favorite command:

> calltree @board_ngen | trim in > 5 | compacttree

 

Anyway, that's a primer on using PerfConsole, please download it here. When you start it up, the ?? command will open a longer document which should hopefully get you started quickly.


Performance Analysis

I have spent the last couple years investigating the performance of various parts of the CLR and managed code, in that time I have learned that a profiler is an invaluable tool. There are quite a few profilers available including: CLR Profiler, Intel's VTune, AMD's CodeAnalyst, Microsoft's VSTS Profiler, Kernrate and various profilers which Microsoft has developed internally (including the predecessors to VSTS Profiler) as well as many others which I don't list here only because I don't have experience with them. While every profiler has strengths and weaknesses, lately I've been finding that I turn to the VSTS Profiler more often than not for my first analysis.

If you have VSTS and want to learn more about the profiler see the team's blog: http://blogs.msdn.com/profiler/

VSTS Profiler has many options and through those can get at a lot of data about your application, however I usually find that some of the most interesting can be obtained with simple vanilla sample based profiling. You can do this through the VS interface, or from the command line using the VSTS Profiler command line utilities, basically the steps are as follows:

1) Only if profiling managed code: vsPerfClrEnv.cmd /sampleOn 

2) vsPerfCmd.exe /start:sample /output:<name>.VSP

3) vsPerfCmd.exe /launch:<application>.exe

    <<wait for application to close>>

4) vsPerfCmd.exe /shutdown

By following those 4 steps you can profile an application; the result is a .VSP file named <name>.VSP. This contains all the profile data collected during the run, and is readable directly by VSTS. There are however interesting analyses to do on the data that the VSTS 2005 interface doesn't have support for. To address this, the VSTS Profiler team has included a utility called vsPerfReport.exe which can translate a .VSP file into a set of comma-separated value (CSV) files, one for each of the major views seen in VSTS. In order to do this you execute:

5) vsPerfReport.exe /summary:all <name>.VSP

At the end of all this you will have a set of files named <name>_<type>.CSV which are views into the data in the VSP. You can open these in Notepad to view the text, or import them into Excel to see the columns more clearly called out.
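As a rough illustration of how little code it takes to start poking at these files yourself, here is a C# sketch that orders the rows of one report by a numeric column. Everything about it is an assumption on my part: the `CsvReportSketch` class is a hypothetical name, the column position is whatever you pass in, and the parsing is deliberately naive (it assumes a header line and no quoted commas, which the real report files may not honor).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CsvReportSketch
{
    // Returns the data rows ordered by the numeric value in `column`, largest first.
    // Assumes the first line is a header and that fields contain no quoted commas.
    public static List<string[]> SortByColumnDescending(IEnumerable<string> lines, int column)
    {
        var rows = new List<string[]>();
        bool isHeader = true;
        foreach (string line in lines)
        {
            if (isHeader) { isHeader = false; continue; }
            if (line.Trim().Length == 0) { continue; }
            rows.Add(line.Split(','));
        }

        rows.Sort(delegate(string[] a, string[] b)
        {
            double x = double.Parse(a[column], CultureInfo.InvariantCulture);
            double y = double.Parse(b[column], CultureInfo.InvariantCulture);
            return y.CompareTo(x); // reversed comparison gives descending order
        });
        return rows;
    }
}
```

To try it on a real report you would pass in something like `System.IO.File.ReadAllLines(path)` along with the index of the column you care about.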

Once the data is in this form we can do whatever we like with it, for instance sorting by a particular column in Excel or searching for a particular row using a string in Notepad. However, wouldn't it be interesting to be able to do more?

Stay tuned...


Visual Studio Team System 180 day trial...

I just found out that VSTS has a free 180-day trial edition.

http://msdn.microsoft.com/vstudio/products/trial/

This is AWESOME because VSTS has a great profiler included in the package!


Should I choose to take advantage of 64-bit?

Here's the guts of a response that I posted a while back to an internal mailing list re: tradeoffs of running your managed code as 64-bit vs 32-bit. YMMV, and I'll remind you that every perf question has a thousand answers depending on the situation.

 

>>>>>>> snip >>>>>>>>>>>>>>>>>

 

Here's my own personal list of the big pluses and minuses of moving to 64-bit code...

 

Pluses:

- more memory (+++++)

- better 64-bit math (+++)

- X64 OS kernel takes advantage of more memory to do good things for a lot of stuff (+++)

Minuses:

- things need more memory (pointers are bigger, and especially in managed code references are everything and are everywhere) (--)

- the processor's cache is effectively smaller (when comparing against the same machine in 32-bit vs 64-bit mode) because of the prior point (----)

- code also tends to be bigger because of extra prefix bytes and instructions that carry around 8-byte immediate values instead of 4 byte immediate values

 

What this tends to mean is that code that runs extremely well on 32-bit, doesn't have any 64-bit math (or otherwise take advantage of improvements in the 64-bit processor), and runs well in < 2GB of memory without having to bother hitting the disk for anything will likely continue to run on 64-bit, but with somewhat MORE memory usage and a little bit slower, because the processor's cache is effectively smaller when compared to the bloated size of the things that need to be in it.

In the scenario described above you get the minuses of the platform without taking advantage of the pluses.
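To put a number on the "pointers are bigger" minus, here is a back-of-the-envelope C# sketch. The `PointerSizeSketch` class is a hypothetical name, and the arithmetic only counts the reference slots themselves (it ignores array and object headers), so treat it as an illustration rather than a measurement.

```csharp
using System;

public static class PointerSizeSketch
{
    // Bytes taken by the reference slots alone of an object[] with `count` elements,
    // for a given pointer size (4 on 32-bit, 8 on 64-bit). Headers are ignored.
    public static long ReferenceSlotBytes(long count, int pointerSize)
    {
        return count * pointerSize;
    }

    // Builds a one-line comparison and reports the pointer size of the current process.
    public static string Describe(long count)
    {
        return count + " references: "
            + ReferenceSlotBytes(count, 4) + " bytes as 32-bit, "
            + ReferenceSlotBytes(count, 8) + " bytes as 64-bit; "
            + "this process uses " + IntPtr.Size + "-byte pointers";
    }
}
```

The same data simply takes twice as many bytes of reference slots on 64-bit, and that is the part that eats into the cache.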

If however you have an application or set of applications that can take advantage of the pluses to offset the minuses they can come out in the black (sometimes _very_ much so). We have seen a number of large applications which used to be memory starved on 32-bit and had some type of home-grown paging able to throw that more or less out the window and see their performance go up by 2, 3 or even 4X. PaintDotNet (which is a pretty cool photo editing application, Rick Brewster's blog: http://blogs.msdn.com/rickbrew/default.aspx) rewrote a bunch of their filters to take advantage of 64-bit math and saw speed boosts moving to x64 of 3X+ for those filters. I just saw a presentation the other day where microsoft.com was saying that they have seen both significant reliability boosts and throughput increases moving to 64-bit (however they were running 12 app pools on a box and were definitely running into the memory limits of the 32-bit system).


Blog about writing profiler stubs to interact with the 2.0 runtime.

Check out this blog entry (http://blogs.msdn.com/jkeljo/archive/2005/08/11/450506.aspx) to see some information that is rather interesting to people writing managed profilers, and probably not very interesting to everyone else.


BigArray<T>, getting around the 2GB array size limit

I’ve received a number of queries as to why the 64-bit version of the 2.0 .Net runtime still has array maximum sizes limited to 2GB. Given that it seems to be a hot topic of late I figured a little background and a discussion of the options to get around this limitation was in order.

First some background; in the 2.0 version of the .Net runtime (CLR) we made a conscious design decision to keep the maximum object size allowed in the GC Heap at 2GB, even on the 64-bit version of the runtime. This is the same as the current 1.1 implementation of the 32-bit CLR, however you would be hard pressed to actually manage to allocate a 2GB object on the 32-bit CLR because the virtual address space is simply too fragmented to realistically find a 2GB hole. Generally people aren’t particularly concerned with creating types that would be >2GB when instantiated (or anywhere close), however since arrays are just a special kind of managed type which are created within the managed heap they also suffer from this limitation.

<Sidenote> managed arrays: arrays are a first class type in the CLR world and they are laid out in one contiguous block of memory in the managed garbage collected heap. In the CLR 1.1 they can be thought of as the only generic type (in 2.0 we’re introducing a much more universal concept of generics) in that you can have an array that is of the type of any managed type that you like (primitive types, value types, reference types). It is interesting to think about what that means in context of the 2GB object instance size limit imposed on objects in the managed heap. With value types (bool, char, int, long, struct X {}, etc…) the actual data of the instance for each element in the array will be laid out contiguous with the next element in memory. Since the 2GB limit discussed earlier applies to the total array size, and the array size is a multiple of the type size, the maximum number of elements you can store in an array of type X will vary inversely with the size of X.

Differing from this are arrays of reference types (e.g. objects, strings, class Y {}, etc…), for these arrays the actual array will be that of a bunch of references, initially null. To initialize the array your code will need to go through one element at a time and create or assign an appropriate instance of the type to that array element. The 2GB size limit for arrays applies to this array of references, not the instances of the objects themselves. On a 32-bit machine if you create an array of type object (object[]) and one instance of type object per element in the array then your available virtual address space will end up limiting the size of your array as you will never be able to fit enough objects in memory to be able to fill up a 2GB object array with unique object references.</Sidenote>
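The relationship between element size and maximum element count from the sidenote can be written down as a small C# sketch. The `ArrayLimitSketch` class is a hypothetical name and the 2GB figure is used as a round number; the real ceiling is slightly lower because of per-object bookkeeping, so the results are approximate.

```csharp
using System;

public static class ArrayLimitSketch
{
    // Roughly 2GB; the real ceiling is slightly lower because of per-object overhead.
    public const long ApproxMaxObjectBytes = 2L * 1024 * 1024 * 1024;

    // Approximate number of elements of the given size that fit in one array.
    public static long ApproxMaxElements(int elementSizeInBytes)
    {
        if (elementSizeInBytes <= 0)
        {
            throw new ArgumentOutOfRangeException("elementSizeInBytes");
        }
        long bySize = ApproxMaxObjectBytes / elementSizeInBytes;
        // Array indexes are signed 32-bit integers, so that caps the count as well.
        return Math.Min(bySize, int.MaxValue);
    }
}
```

For example, an int[] (4-byte elements) tops out around 512M elements and a long[] around 256M, while for an array of references the element size is the pointer size rather than the size of the objects being pointed at.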

The developer visible side of this is that array indexes are signed integers (with a byte[] you can use the full positive space of the signed integer as an index (assuming the array is 0 based), with other types you use some subset of that space until the total array size is 2GB). While some of the BCL APIs that deal with arrays have alternate signatures that take longs this isn’t yet ubiquitous in the framework (i.e. the IList interface (which the BCL’s Array class implements) uses int indexes).

It is debatable whether or not we should have included a “Big Array” implementation in the 2.0 64-bit runtime, and I’m sure that debate will rage for some years to come. However, as 2.0 is getting ready to ship and there currently isn’t any support for this we are going to have to live without it until at least the next version.

So, what is there to do in .Net 2.0 if you have an application which requires arrays that are very large?

Well, first switch to 64-bit! As mentioned, it is next to impossible to allocate a full 2GB array on 32-bit because of the way that the virtual address space is broken up by modules and other various allocations. Simply switching to 64-bit will buy you the ability to allocate those full 2GB blocks (well, close anyway, the total object size is limited to 2GB, but there is some CLR book-keeping goo in there that takes a few bytes).

What if that still isn’t enough? You have a couple of choices:

A) Rethink your application’s design? Do you really need a single gigantor array to store your data? Keep in mind that if you’re allocating 8GB of data in a single array and then accessing it in a sparse and random manner you’re going to be in for a world of paging pain unless you have a ton of physical memory. It is very possible that there is another data organization scheme you can use where you can group data into frequently used groups of some sort or another and manage to keep under the 2GB limit for an individual group. If you choose correctly you can vastly improve your application’s performance due to lower paging and better cache access characteristics that come from keeping things that are used together close to one another.

B) Use native allocations. You can always P/Invoke to NT’s native heap and allocate memory which you can then use unsafe code to access. This isn’t going to work if you want to have an array full of object references, but if you just need a huge byte[] to store an image this might work out fine, even great. The added cost of the P/Invoke is low because the NT APIs have simple signatures that don’t require marshaling and the code executed when allocating an 8GB block is probably mostly zeroing the memory anyway. If you choose this option you will have to write a small memory management class of some kind and be comfortable using unsafe code. I know that Paint.Net (http://blogs.msdn.com/joshwil/archive/2005/04/07/406218.aspx) uses this very method for allocating the memory in which they store the image (and its various layers) which you’re editing. This is a good solution for the case where you really need a single unbroken allocation of some large size. While it isn’t a very general purpose solution it works out great for the Paint.Net guys.

C) Write your own BigArray class.
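Before getting to option C, here is a rough sketch of what the "small memory management class" from option B might look like. It takes two shortcuts that differ from the text, so read it as an assumption-laden illustration: it allocates through `Marshal.AllocHGlobal` rather than P/Invoking the NT heap APIs directly, and it reads and writes through `Marshal.ReadByte`/`Marshal.WriteByte` instead of unsafe pointers (simpler to show, but slower). `NativeByteBuffer` is a hypothetical name, and this is not how Paint.Net does it.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around one large unmanaged allocation.
public sealed class NativeByteBuffer : IDisposable
{
    private IntPtr _base;
    private readonly long _length;

    public NativeByteBuffer(long length)
    {
        if (length <= 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        _length = length;
        // AllocHGlobal does not zero the memory; contents are undefined until written.
        _base = Marshal.AllocHGlobal(new IntPtr(length));
    }

    public long Length { get { return _length; } }

    // 64-bit index, so the buffer is not limited to int.MaxValue elements.
    public byte this[long index]
    {
        get { return Marshal.ReadByte(AddressOf(index)); }
        set { Marshal.WriteByte(AddressOf(index), value); }
    }

    private IntPtr AddressOf(long index)
    {
        if (_base == IntPtr.Zero)
        {
            throw new ObjectDisposedException("NativeByteBuffer");
        }
        if (index < 0 || index >= _length)
        {
            throw new IndexOutOfRangeException();
        }
        return new IntPtr(_base.ToInt64() + index);
    }

    public void Dispose()
    {
        if (_base != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_base);
            _base = IntPtr.Zero;
        }
        GC.SuppressFinalize(this);
    }

    // Safety net in case the owner forgets to call Dispose.
    ~NativeByteBuffer()
    {
        Dispose();
    }
}
```

Since the memory lives outside the GC heap, nothing here counts against the 2GB object limit, but you are now responsible for freeing it, hence the `IDisposable` plumbing.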

I’d stress that option C is my least favorite of the above three, but I will acknowledge that there are probably cases where it is the right thing to do. Given that, I have gone and written one myself. This is a very bare bones implementation, just the array allocation and accessors are implemented, I will leave implementing any extended functionality (like the functionality provided by the static members of the Array class, Sort, Copy, etc… or writing big collections on top of it) as an exercise for the reader.

// Goal: create an array that allows for a number of elements > int.MaxValue
class BigArray<T>
{
    // These need to be const so that the getter/setter get inlined by the JIT into
    // calling methods just like with a real array to have any chance of meeting our
    // performance goals.
    //
    // BLOCK_SIZE must be a power of 2, and we want it to be big enough that we allocate
    // blocks in the large object heap so that they don't move.
    internal const int BLOCK_SIZE = 524288;
    internal const int BLOCK_SIZE_LOG2 = 19;

    // Don't use a multi-dimensional array here because then we can't right size the last
    // block and we have to do range checking on our own and since there will then be
    // exception throwing in our code there is a good chance that the JIT won't inline.
    T[][] _elements;
    ulong _length;

    // maximum BigArray size = BLOCK_SIZE * int.MaxValue
    public BigArray(ulong size)
    {
        if (size == 0)
        {
            throw new System.ArgumentOutOfRangeException("size");
        }

        int numBlocks = (int)(size / BLOCK_SIZE);
        // cast before multiplying so the product can't overflow a 32-bit int
        if (((ulong)numBlocks * BLOCK_SIZE) < size)
        {
            numBlocks += 1;
        }

        _length = size;
        _elements = new T[numBlocks][];
        for (int i = 0; i < (numBlocks - 1); i++)
        {
            _elements[i] = new T[BLOCK_SIZE];
        }
        // by making sure to make the last block right sized then we get the range checks
        // for free with the normal array range checks and don't have to add our own
        int numElementsInLastBlock = (int)(size - ((ulong)(numBlocks - 1) * BLOCK_SIZE));
        _elements[numBlocks - 1] = new T[numElementsInLastBlock];
    }

    public ulong Length
    {
        get
        {
            return _length;
        }
    }

    public T this[ulong elementNumber]
    {
        // these must be _very_ simple in order to ensure that they get inlined into
        // their caller
        get
        {
            int blockNum = (int)(elementNumber >> BLOCK_SIZE_LOG2);
            int elementNumberInBlock = (int)(elementNumber & (BLOCK_SIZE - 1));
            return _elements[blockNum][elementNumberInBlock];
        }
        set
        {
            int blockNum = (int)(elementNumber >> BLOCK_SIZE_LOG2);
            int elementNumberInBlock = (int)(elementNumber & (BLOCK_SIZE - 1));
            _elements[blockNum][elementNumberInBlock] = value;
        }
    }
}

The beauty of this implementation is that the JIT already understands single dimensional array accesses intrinsically, including range checking code. In practice this class ends up being almost as fast as real array access for small arrays (< BLOCK_SIZE) and not too much slower once you get to reasonably big arrays. It doesn’t waste much space compared to a normal array because the last block is right sized and the performance is good because the getter and setter for array elements are simple enough that they should get inlined into the calling method, this becomes very important for getting anywhere close to normal array access speeds.

Here is an example of big array usage:

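using System;

class Program
{
    public static void Main()
    {
        // BigArray takes and returns ulong, so use ulong for sizes and indexes
        ulong size = 0x1FFFFFFFF;   // ~8.6 billion ints, about 32GB of memory
        BigArray<int> baInt = new BigArray<int>(size);
        ulong len = baInt.Length;
        for (ulong i = 0; i < len; i++)
        {
            // the index is wider than the element type, so truncate it explicitly
            baInt[i] = unchecked((int)i);
        }
        Console.WriteLine("baInt[len/2]=" + baInt[len / 2]);
    }
}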

You could imagine also exposing the fact that this BigArray<T> implementation has blocks through a couple of properties and an indexer of this[int block, int element], which would allow people to intelligently write code to do block based access on the array (e.g. merge sorts that are block intelligent). This can be important for performance as we know that elements within a single block are contiguous in memory, however we cannot make that guarantee about elements in neighboring blocks.
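Here is a minimal sketch of that idea. To keep it self-contained it is written as a separate class rather than as an extension of the one above; `BlockedArray<T>`, `BlockCount`, `GetBlockLength` and `BlockedArrayDemo` are hypothetical names chosen for illustration, and the block size is simply copied from BigArray<T>.

```csharp
using System;

// Hypothetical, self-contained variant of BigArray<T> that also exposes its blocks.
public class BlockedArray<T>
{
    public const int BlockSize = 524288;   // must be a power of 2
    private const int BlockSizeLog2 = 19;

    private readonly T[][] _blocks;
    private readonly ulong _length;

    public BlockedArray(ulong size)
    {
        int numBlocks = (int)((size + BlockSize - 1) / BlockSize);
        _length = size;
        _blocks = new T[numBlocks][];
        for (int i = 0; i < numBlocks; i++)
        {
            // every block is full sized except possibly the last one
            ulong remaining = size - (ulong)i * BlockSize;
            _blocks[i] = new T[remaining < BlockSize ? (int)remaining : BlockSize];
        }
    }

    public ulong Length { get { return _length; } }
    public int BlockCount { get { return _blocks.Length; } }
    public int GetBlockLength(int block) { return _blocks[block].Length; }

    // Flat access, same arithmetic as BigArray<T>.
    public T this[ulong index]
    {
        get { return _blocks[(int)(index >> BlockSizeLog2)][(int)(index & (BlockSize - 1))]; }
        set { _blocks[(int)(index >> BlockSizeLog2)][(int)(index & (BlockSize - 1))] = value; }
    }

    // Block-aware access: no shifting or masking needed.
    public T this[int block, int element]
    {
        get { return _blocks[block][element]; }
        set { _blocks[block][element] = value; }
    }
}

public static class BlockedArrayDemo
{
    // Walks the array one block at a time so the inner loop stays in contiguous memory.
    public static long Sum(BlockedArray<int> data)
    {
        long total = 0;
        for (int b = 0; b < data.BlockCount; b++)
        {
            int n = data.GetBlockLength(b);
            for (int e = 0; e < n; e++)
            {
                total += data[b, e];
            }
        }
        return total;
    }
}
```

The `Sum` helper shows the intended usage pattern: the outer loop walks blocks and the inner loop is an ordinary single-dimensional array walk.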

It is worth noting that given the allocation scheme of the BigArray<T> constructor we may very well have multiple garbage collections while it runs, because of this you don’t really want to be using large instances of this class in a throw away manner. My advice would be to use this carefully and sparingly, instead favoring architectures which don’t require such large single arrays.

 

What is the difference in a P/Invoke signature between “byref byte” and “byte[]”?

Lately we’ve seen a spate of issues coming up on 64-bit platforms within the Developer Division around usages of P/Invoke signatures which declare a parameter as type “byref byte” where the developer really means “byte[]” (the corresponding native parameter type being something like LPBYTE). Usually when something works on 32-bit and doesn’t work on 64-bit we quickly get a phone call or email indicating that this must be a CLR problem, and this case was no different.

I received an email which pointed me to the following P/Invoke signature:
[DllImport("kernel32.dll")]
public static extern int ReadProcessMemory(
                             IntPtr hProcess,
                             IntPtr lpBaseAddress,
                             ref byte lpBuffer,
                             IntPtr nSize, 
                             IntPtr lpNumberOfBytesWritten
);

Looking at MSDN we can see that the C prototype for this function is:
BOOL ReadProcessMemory(
         HANDLE hProcess,
         LPCVOID lpBaseAddress,
         LPVOID lpBuffer,
         SIZE_T nSize,
         SIZE_T* lpNumberOfBytesRead
);

There are a number of problems with the P/Invoke declaration (its return type should be bool, to match the native BOOL, for instance, and the nSize parameter should probably be a UIntPtr instead of IntPtr). Those aside, the real problem is that the lpBuffer parameter shouldn’t be defined as a byref byte. The intended usage was:

byte[] b = new byte[100];
ReadProcessMemory(…, …, ref b[0], …, …);

The expectation being that this would result in a pointer to the beginning of the byte array being delivered to the native code to play with. However that wasn’t happening and ReadProcessMemory was returning a failure (something that was actually very convenient in tracking down this bug). In the end though, this isn’t a CLR problem, it is a usage problem with the P/Invoke signature declaration. If that’s the case then you might ask: why did it work on 32-bit in the first place? Well, because of an “optimization” (I put it in quotes for a reason) in the x86 P/Invoke code “byref byte” means that we just happen to pin the reference to the byte which is passed through the P/Invoke layer and we pass that pinned original reference on to the native code.

This means that if you pass in a reference to the first byte (or any byte for that matter) of an array of bytes then we will pass a pointer to that and the native code can party on the rest of the array just as if we passed an interior pointer into the object (well, we did). It is very possible that this makes a lot of sense to those C++ programmers out there who have become very accustomed to a reference and a pointer being the same thing, and being able to do fancy pointer math on references just by casting them to pointers.

It turns out that in the 64-bit implementation of P/Invoke (which under the covers is radically different than the 32-bit implementation) we decided to more accurately represent a “byref byte” as a reference to a single byte, in fact, we allocate the byte on the interop layer’s stack and pass along a reference to that to the native code. On the way back to managed code we copy that byte back into the GCHeap wherever the managed object identified by the incoming object reference is currently living (in this case some byte in a byte array). This decision was also made as an “optimization” to avoid some of the frequent pinning that the CLR does during interop (as pinning can be rather hard for the GC to deal with, especially for very small objects and generally the less of it the better).

We do this for small native types that we can move around with an instruction or two. For larger types, however (like an actual byte array, specified as a “byte[]” in a P/Invoke signature, or a “byref byte” identified by an array attribute of sorts), we still go ahead and pin the reference in the GC Heap and pass along the pinned reference to native code to party on. This is what the developer of the above code intended to happen.

The correct P/Invoke signature would be (conveniently this can be found on http://www.pinvoke.net):
[DllImport("kernel32.dll")]
public static extern bool ReadProcessMemory(
                              IntPtr hProcess,
                              IntPtr lpBaseAddress,
                              [Out] byte[] lpBuffer,
                              UIntPtr nSize,
                              IntPtr lpNumberOfBytesRead
);

Given this fixed signature we will pin the byte[] reference and pass along a pointer to the unmanaged code and everything will work as expected. Fortunately in this case for the group that wrote this code ReadProcessMemory was able to return a failure when it received what it deemed to be a bad pointer for lpBuffer, in most cases you will probably just end up seeing spectacular failures when the native code that you’re P/Invoke-ing out to starts overwriting your application’s stack. So it is very important to remember to get your P/Invoke signatures right!!! It will save you some serious debugging later.
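If you want to see what "pin the array and hand native code a pointer" looks like from managed code, the following C# sketch does the pinning by hand. It is only an analogy for what the marshaler does with a byte[] parameter, not the interop layer's actual code path, and `PinningSketch` is a hypothetical name; the `GCHandle` and `Marshal` calls are the standard ones from System.Runtime.InteropServices.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinningSketch
{
    // Pins `buffer`, then writes `value` to every element through the raw address,
    // the way native code would after receiving a pointer to the array's first byte.
    public static void FillThroughPinnedPointer(byte[] buffer, byte value)
    {
        if (buffer.Length == 0)
        {
            return;
        }

        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr first = Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0);
            for (int i = 0; i < buffer.Length; i++)
            {
                Marshal.WriteByte(first, i, value);
            }
        }
        finally
        {
            // Always unpin, or the GC can never move or compact around this array.
            handle.Free();
        }
    }
}
```

The point is that while the handle is alive the whole array stays put, so writes past the first byte land in the array. That is exactly the guarantee a lone "byref byte" does not give you on 64-bit.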

 

Bit specific code in agnostic assemblies???

In previous blog entries I’ve spent some time talking about how to mark assemblies as bit specific and how the loader deals with those markings.

What however is the preferred mode of an application? I will posit that it is to be compiled agnostic and to run equally well on both 32-bit and 64-bit platforms. It makes a lot of things easier: development, build, testing, deployment, servicing…

Caveat: The following discussion deals only with fully IL assemblies. If you generate managed C++ code you may end up with some native code in your image at which point it has to be tied to one specific platform.

If you have a reason to tie yourself to only one platform (e.g. x86 because you only have an x86 version of some native DLL that you need to P/Invoke to) then your decision is easy and it’s been made for you. Just flip the /platform:x86 switch on your compiler and go. However, if you have some code that works on both 32-bit and 64-bit platforms with just some subtle difference then you have a couple of options to think about for implementing the differences:

1) Use compile time defines (#if/#else in C#) to separate your 64-bit code from 32-bit code. Use the /platform:X switch of your compiler to generate different assemblies for 32-bit and (both) 64-bit platforms.
2) Use runtime if/else blocks to separate 32-bit and 64-bit code.
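Option 1 might look something like the following C# sketch. The `TARGET_64BIT` symbol and the `CompileTimeBitness` class are hypothetical; the compiler does not define such a symbol for you, so you would have to pass it yourself (for example with /define) alongside the matching /platform switch.

```csharp
using System;

public static class CompileTimeBitness
{
    // TARGET_64BIT is a hypothetical symbol. Define it yourself, e.g. with
    // /define:TARGET_64BIT, when building the 64-bit flavor of the assembly.
#if TARGET_64BIT
    public const int AssumedPointerSize = 8;
#else
    public const int AssumedPointerSize = 4;
#endif

    // Sanity check that the assembly is running on the platform it was built for.
    public static bool MatchesRunningProcess()
    {
        return AssumedPointerSize == IntPtr.Size;
    }
}
```

The catch is visible right away: the choice is baked in at build time, so you end up building, testing and shipping one assembly per platform.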

Both of these end up having their place. In most cases I’ve seen, people have only a small amount of code which needs to be bit specific, and in those cases dealing with the rest of the hassles around building and deploying multiple assemblies isn’t really worth it…

But, what about the runtime cost of the check? What if your bit specific code is on the hot path? Won’t that hurt?

Actually, that’s the cool part, if you do it right it won’t[1]. So, what are your options for determining bitness of a process at runtime?

A) if (Marshal.SizeOf(typeof(IntPtr)) == 8) {/*64*/} else {/*32*/ }
B) if (IntPtr.Size == 8) {/*64*/} else {/*32*/}
C) readonly static bool is64Bit = (IntPtr.Size==8);
    if (is64Bit) {/*64*/} else {/*32*/}

Of those options there are 2 right ways and a wrong way. Unfortunately, some of the early information coming out of Microsoft indicated that you should use Marshal.SizeOf, which is definitely the wrong way to do this. That check involves a call to the marshaling code in mscorlib.dll and since the JIT (or ngen) compiler doesn’t know at JIT (or ngen) time what the result will be the unused half of the code can’t be optimized away as dead code.

The easiest way to do this is B. Since IntPtr.Size is a constant which is hard-coded into mscorlib.dll when we build the runtime, the JIT (or ngen) can check the loaded mscorlib.dll (which will vary depending on bitness) and optimize away the check and the unused half of the code.

Option C works also, but it has a potentially subtle bug to it. If you don’t mark the static variable definition as readonly then the JIT (and ngen) won’t be able to optimize away the check and unused code. This is because it has to assume that the value can change at runtime. This is very important to remember because without this keyword this solution will become almost as bad as A.

Recommendation: for simple cases, use “if (IntPtr.Size==8)” to determine 64-bitness. For more complex cases consider using a static boolean, but remember to mark it as readonly.
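Spelled out as compilable code, the two recommended forms look something like the sketch below. The class and method names are placeholders made up for this example; only IntPtr.Size comes from the framework.

```csharp
using System;

static class Bitness
{
    // Option C: the field must be readonly, otherwise the JIT has to
    // assume the value could change at runtime.
    static readonly bool is64Bit = (IntPtr.Size == 8);

    // Option B: test IntPtr.Size directly at the use site.
    public static int PointerBitsDirect()
    {
        if (IntPtr.Size == 8) { return 64; } else { return 32; }
    }

    // Option C in use: branch on the cached readonly flag.
    public static int PointerBitsCached()
    {
        if (is64Bit) { return 64; } else { return 32; }
    }
}
```

Both methods return the same answer in a given process; the difference discussed above is only in what the JIT is able to throw away.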

Unfortunately the if/else solution won’t work well for cases where you need different structure definitions on 64-bit and 32-bit platforms for P/Invoke-ing to native routines. If you have a very small number of usages you might consider having two separate structure definitions and P/Invoke declarations, and using if/else to determine which one you use (maybe hiding the bitness stuff behind a wrapper). However, if it is a frequently used structure then it probably makes more sense to just use platform specific assemblies and compile time defines to determine structure layout as then changes only need to be made at the structure definition site.
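For the small-number-of-usages case, the wrapper idea might look roughly like this. Record32 and Record64 are hypothetical layouts invented for the example (a native structure packed on 4-byte boundaries in the 32-bit build and 8-byte boundaries in the 64-bit build); no native routine is actually called here, the sketch only shows picking the definition by bitness.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layout of the native structure in a 32-bit build.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct Record32
{
    public int Id;
    public long Value;
}

// Hypothetical layout of the same structure in a 64-bit build.
[StructLayout(LayoutKind.Sequential, Pack = 8)]
struct Record64
{
    public int Id;
    public long Value;
}

static class RecordWrapper
{
    // Callers never see the bitness split; they just ask the wrapper.
    public static int NativeRecordSize()
    {
        if (IntPtr.Size == 8)
        {
            return Marshal.SizeOf(typeof(Record64));
        }
        else
        {
            return Marshal.SizeOf(typeof(Record32));
        }
    }
}
```

In a real wrapper each branch would also call its own P/Invoke declaration taking the matching structure type, which is where the maintenance cost of keeping two definitions in sync starts to show.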

If you’d like to see some of this stuff in action I’ve posted the source to a test that you can run and then inspect in the debugger to see what the JIT does (http://homepage.mac.com/willij3/blog/testing_bitness.cs). I’m sure it can be done more easily in VS, but I’ve been using WinDbg with SOS’s !name2ee and !u commands to disassemble the resulting code.


[1] Well, there is a small cost involved in the JIT having to parse the extra IL code for both platforms before it can evaluate the const condition and throw away half. However this cost is minimal and for frequently executed code is trivial. For ngen’d code the cost at runtime is nonexistent.

Ferrari 4000

I am forced to admit that this is one damn fine notebook. Thanks to the helpful instructions on Volker's blog (http://blogs.msdn.com/volkerw/) I was able to get it up and running with 32-bit and 64-bit OSes very quickly. I'm currently trying to live with the 64-bit OS for a while before I fully commit to it. I've been fully 64-bit on my dev machines at work for over a year now and most everything works seamlessly. I feel like it's the smart thing, however, to give it a bit of a test run before fully committing to 64-bit on a laptop. I'll keep you updated. Also, once I kill the 32-bit install I'll have room for 64-bit Longhorn (whoops, I mean Vista).

My only complaint about the laptop is its size. Generally I'm more of a ThinkPad X series form factor type of guy. I like my laptops small and light. This is neither, though 6 lbs and change isn't bad given the size. The screen is great, battery life seems reasonable, having a CD/DVD burner is cool and the DVI out on the back has me contemplating ditching my desktop machine at home and just keeping an extra LCD around to do dual-monitor with the laptop for when I'm working from home.

Here's to the 64-bit future of computing. Now all I yearn for is a quad core laptop so that I can do all my builds quickly on the run!

 

p.s. Does anyone have any suggestions for a low-power/quiet case that I can stuff my old desktop P4 into to turn it into a media server hidden in the closet? The current power supply sucks way too much juice to leave it on all the time...

Posted by joshwil | 5 Comments

Flipping bits on managed images to make them load with the right bitness...

In a number of blog entries I have discussed how on 64-bit machines .Net applications can run as either 32-bit or 64-bit processes depending on how the exe is produced. Generally it is highly recommended that developers use the compiler options provided by Whidbey compilers to specify the platform on which to run.

However, some people want to change the loading characteristics of an application after it has been compiled, or maybe you don’t have access to the code, etc… For that case we have an SDK tool called corflags.exe. Corflags allows you to modify some of the loading characteristics of a managed app.

Here is the command line help for Corflags.exe:
 
Microsoft (R) .NET Framework CorFlags Conversion Tool.  Version  2.0.50405.00
Copyright (C) Microsoft Corporation. All rights reserved.

Usage: Corflags.exe Assembly [options]

If no options are specified, the flags for the given image are displayed.

Options:
/ILONLY+ /ILONLY-     Sets/clears the ILONLY flag
/32BIT+  /32BIT-      Sets/clears the 32BIT flag
/UpgradeCLRHeader     Upgrade the CLR Header to version 2.5
/RevertCLRHeader      Revert the CLR Header to version 2.0
/Force                Force an assembly update even if the image is
                      strong name signed.
                      WARNING: Updating a strong name signed assembly
                       will require the assembly to be resigned before
                       it will execute properly.
/nologo               Prevents corflags from displaying logo
/? or /help           Display this usage message

WARNING: Corflags is a powerful tool, and you can break code in weird ways by using it incorrectly. As mentioned previously, it is highly recommended that you control the loading characteristics of your application through compiler switches.

Here are a couple of examples of Corflags.exe usage:

Scenario: you have an application foo.exe which was compiled “any cpu” with a Whidbey compiler, but you want to force it to run as a 32-bit application even on 64-bit machines.

Run: corflags.exe /32BIT+ foo.exe

Result: foo.exe is now marked as if it was compiled /platform:x86

Scenario: you have an Everett (.Net 1.1) application bar.exe which you would like to enable to run on 64-bit.

Run: corflags /UpgradeCLRHeader bar.exe

Result: bar.exe now looks like a Whidbey (.Net 2.0) “any cpu” application and will load under the 64-bit 2.0 runtime on a 64-bit OS. It is still Everett compatible however and will run as a 32-bit 1.1 application on a 32-bit OS.

See this blog entry for more info on what /UpgradeCLRHeader does: http://blogs.msdn.com/joshwil/archive/2004/10/15/243019.aspx

 

It is notable that Corflags.exe doesn’t have the ability to force an application to only run on a 64-bit machine, that is something that needs to be done at the compiler level (Corflags.exe operates on PE32 images, whereas 64-bit only applications need to be PE32+ images).


Corflags wears another hat as a diagnostic tool for figuring out what the loading characteristics of your application will be:

E:\temp>corflags chartest.exe
Microsoft (R) .NET Framework CorFlags Conversion Tool.  Version  2.0.50405.00
Copyright (C) Microsoft Corporation. All rights reserved.

Version   : v2.0.41026
CLR Header: 2.5
PE        : PE32
CorFlags  : 1
ILONLY    : 1
32BIT     : 0
Signed    : 0

If you run Corflags.exe and pass a managed application name without any switches it will tell you what the state of the interesting flags in the image is. It is educational to play with compiling some test code using the different /platform:X switches and then running Corflags on the exe to see what the state of the executable is. Briefly:

ILONLY: Managed images are allowed to contain native code, however C# and VB images don’t. To be “any cpu” an image may only contain IL.

32BIT: Even if you have an image that only contains IL it still might have platform dependencies, the 32BIT flag is used to distinguish “x86” images from “any cpu” images. 64-bit images are distinguished by the fact that they have a PE type of PE32+.

CLR Header: 2.0 indicates a .Net 1.0 or .Net 1.1 (Everett) image, 2.5 indicates a .Net 2.0 (Whidbey) image. Very confusing, unfortunate but true.
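If you would rather look at the same flags from managed code, something like the following sketch should work. It uses Module.GetPEKind and the PortableExecutableKinds enum from System.Reflection; the class name, method names and the wording of the returned strings are my own, and the mapping onto the /platform names simply follows the descriptions above.

```csharp
using System;
using System.Reflection;

static class ImageKind
{
    // Maps the loader-relevant flags onto the compiler's /platform names.
    public static string Describe(PortableExecutableKinds kinds)
    {
        if ((kinds & PortableExecutableKinds.PE32Plus) != 0)
        {
            return "64-bit only (PE32+)";
        }
        if ((kinds & PortableExecutableKinds.Required32Bit) != 0)
        {
            return "x86 (32BIT set)";
        }
        if ((kinds & PortableExecutableKinds.ILOnly) != 0)
        {
            return "any cpu (ILONLY set, 32BIT clear)";
        }
        return "not IL only";
    }

    // Reads the flags of an already loaded assembly's manifest module.
    public static string DescribeAssembly(Assembly assembly)
    {
        PortableExecutableKinds kinds;
        ImageFileMachine machine;
        assembly.ManifestModule.GetPEKind(out kinds, out machine);
        return Describe(kinds) + ", machine " + machine;
    }
}
```

Calling ImageKind.DescribeAssembly(Assembly.GetExecutingAssembly()) from a test exe compiled with each /platform:X switch is a quick way to compare against what Corflags prints.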


Here’s some other interesting reading on the subject of where managed applications run on 64-bit machines:
http://blogs.msdn.com/joshwil/archive/2005/04/08/406567.aspx
http://blogs.msdn.com/joshwil/archive/2004/03/13/89163.aspx
http://blogs.msdn.com/joshwil/archive/2004/03/11/88280.aspx


 

Posted by joshwil | 2 Comments

Isn’t my code going to be faster on 64-bit???

[updated 10:50am 5/2/05: It turns out that I copied and pasted an error in my code from the newsgroup posting I was answering. However a kind reader was able to spot it and I've fixed it. I'm getting new data and will update the graphs later today; however, the points of the article remain valid]

[updated 8:04am 5/3/05: Added new graphs for data from fixed code. As expected, the results are the same, the peaks just moved to the left]

Subtitle: Micro-optimizations for 64-bit platforms

DISCLAIMER: As usual with performance discussions, your mileage will vary, and it is highly recommended that you test your specific scenario and not make any vast over-generalizations about the data contained here.

The other day on the new MSDN forums a question came up about how the performance of a piece of code would differ when run on 64-bit vs. 32-bit. The poster specifically asked about the performance difference between managed code running natively on an X64 64-bit CLR and the same managed code running natively on a 32-bit X86 CLR, for a simple copy loop which moves data from one byte array to another. I decided it would be interesting to do an analysis of this and so here we are.

I wrote up a little unsafe C# code which approximates what I believe the poster to be talking about, it goes something like this:

class ByteCopyTest
{
    byte[] b1;
    byte[] b2;

    int iters;

    public ByteCopyTest (int size, int iters)
    {
        b1 = new byte[size];
        b2 = new byte[size];
       
        this.iters = iters;
    }

    unsafe public ulong DoIntCopySingle()
    {
        Timer t = new Timer();
       
        t.Start();
        int intCount = b1.Length / 4;
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j=0; j<iters; j++)
            {
                int* piSrc = (int*)pbSrc;
                int* piDst = (int*)pbDst;
               
                for (int i=0; i<intCount; ++i)
                {
                    *piDst++ = *piSrc++;
                }
            }
        }
        t.End();
       
        return t.GetTicks();
    }
}

Here we can see a simple piece of code that facilitates copying from one byte array into another. It is then easy enough to run this test under both the 64-bit and 32-bit CLR to compare performance. In this case I varied the byte array in size from 256 Bytes up to 256MB and ran a varying number of iterations so that each time measured is for copying the same amount of total data (about 5.1GB).
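To make the "same amount of total data" idea concrete, here is a small sketch of how an iteration count could be derived per buffer size. The exact total and the doubling of sizes are assumptions on my part for illustration: 20 passes over the largest (256MB) buffer comes out at roughly the figure quoted above, and all of the names are made up.

```csharp
using System;

static class CopySchedule
{
    // Assumed total: 20 passes over a 256MB buffer (roughly 5.1GB).
    public const long AssumedTotalBytes = 20L * 256 * 1024 * 1024;

    // Smaller buffers get proportionally more iterations.
    public static long IterationsFor(long bufferBytes)
    {
        return AssumedTotalBytes / bufferBytes;
    }

    // Walks the assumed size range, doubling from 256 bytes up to 256MB.
    public static void PrintSchedule()
    {
        for (long size = 256; size <= 256L * 1024 * 1024; size *= 2)
        {
            Console.WriteLine("{0,12} bytes x {1,10} iterations", size, IterationsFor(size));
        }
    }
}
```

Because every size in that range divides the total evenly, each row of the schedule moves exactly the same number of bytes.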

Something to note about these tests, is that they aren’t actually testing the internal working of the CLR so much as they test the code-generation capabilities of the JIT32 and JIT64 and the memory/cache of the machine that the test is run on.

Here, when using a copy loop that copies a single int (4 bytes) at a time from one array to another, we can see that the JIT32 seems to generate better code and in many cases the 32-bit version wins. We can see that in both cases the time taken goes up drastically when we go from 1MB to 2MB and then levels off somewhat. This is where the processor’s on-die cache stops being able to keep up and our program’s run time ends up being ruled by memory access. We will see later that at this point the particular implementation of the copy loop ceases to matter much.

While that is interesting, it might be even more interesting to compare a copy loop that uses a long (8-byte) instead of an int given that registers are 8-bytes wide on the X64 platform that means we can fit the whole long into a single register in the inner copy loop. 
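For reference, a long based single copy loop could look something like the sketch below. It follows the same pattern as the int version above, but it is a standalone approximation rather than the exact test code: it takes the arrays as parameters and uses System.Diagnostics.Stopwatch for timing, both of which are my substitutions. It needs to be compiled with /unsafe.

```csharp
using System;
using System.Diagnostics;

static class LongCopySketch
{
    // Copies src into dst eight bytes at a time, 'iters' times over, and
    // returns the elapsed Stopwatch ticks. Tail bytes beyond a multiple of
    // 8 are not copied, mirroring the int based loop's handling of lengths.
    public static unsafe long DoLongCopySingle(byte[] src, byte[] dst, int iters)
    {
        int longCount = Math.Min(src.Length, dst.Length) / 8;
        if (longCount == 0)
        {
            return 0; // nothing to copy, and &src[0] would throw on an empty array
        }

        Stopwatch timer = Stopwatch.StartNew();
        fixed (byte* pbSrc = &src[0], pbDst = &dst[0])
        {
            for (int j = 0; j < iters; j++)
            {
                long* plSrc = (long*)pbSrc;
                long* plDst = (long*)pbDst;

                for (int i = 0; i < longCount; ++i)
                {
                    *plDst++ = *plSrc++;
                }
            }
        }
        timer.Stop();

        return timer.ElapsedTicks;
    }
}
```

The only real change from the int loop is the pointer type and the divisor used to compute the element count.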

Here we can see that the long based copy loops definitely outperform the int based copy loops, and they do so consistently on both platforms… That this is faster on 32-bit is interesting. It turns out that the loop overhead is so great that breaking the long down into two 4-byte pieces to copy it inside of the loop is a win; effectively we’ve just made the JIT unroll our loop one level for us.

loop2$    |   |   mov      esi,ebx
          |   |   add      ebx,0x8
          |   |   mov      ecx,ebp
          |   |   add      ebp,0x8
          |   |   mov      eax,[ecx]  // first 4 bytes
          |   |   mov      edx,[ecx+0x4] // second 4 bytes
          |   |   mov      [esi],eax
          |   |   mov      [esi+0x4],edx
          |   |   add      edi,0x1
          |   |   cmp      edi,[esp]
          |   |<--jl       ByteCopyTest.DoLongCopySingle()+0xb6 (loop2$)

We can see however that even with the 32-bit long based loop beating its int based implementation, the 64-bit version has a considerably smaller inner loop with fewer memory accesses, which shows in the data above where we consistently see the 64-bit long based copy loop winning.

loop2$    |   |   lea      rdx,[r9+r10]
          |   |   lea      rax,[r9+0x8]
          |   |   mov      rcx,r9
          |   |   mov      r9,rax
          |   |   mov      rax,[rcx]
          |   |   mov      [rdx],rax
          |   |   inc      r11d
          |   |   cmp      r11d,esi
          |   |<--jl       ByteCopyTest.DoLongCopySingle()+0xb0 (loop2$)

We still see a plateau starting at copies of 2MB; here the latency of memory access takes over and the processor can’t keep up with the code. At this point the processor will be spending many cycles spinning waiting for data and a few extra instructions aren’t going to hurt as badly.

The positive results of using the long copy loop on 32-bit invite us to try a copy loop which copies two ints or longs at a time instead of one to try and better utilize the processor. An implementation of this would look like:

    unsafe public ulong DoLongCopyDouble()
    {
        Timer t = new Timer();
        t.Start();
        int longCount = b1.Length / 16;
        fixed (byte* pbSrc = &b1[0], pbDst = &b2[0])
        {
            for (int j=0; j<iters; j++)
            {
                long* plSrc = (long*)pbSrc;
                long* plDst = (long*)pbDst;
               
                for (int i=0; i<longCount; ++i)
                {
                    plDst[0] = plSrc[0];
                    plDst[1] = plSrc[1];
                    plDst += 2;
                    plSrc += 2;
                }
            }
        }
        t.End();

        return t.GetTicks();
    }

We will call this a “double” copy loop (and our former code a “single” copy loop). Let’s look and see how the double copy loops do on 64-bit:

Here we can see that the double long copy loop wins over the others, and, interestingly, the double int and single long loops are very close. This would be expected as they are copying the same amount of data per iteration through the inner loop; however, the double int implementation uses more instructions to do it and does look to be a bit slower through most of the graph.

When we put everything together into a single graph we can see that the best of the implementations (double long on 64-bit) beats the worst of the implementations (single int on 64-bit) by around 50% which is significant. Most of the implementations fall somewhere in the middle however and vary minimally from implementation to implementation.

We can see that unrolling the loop only works so far before we see diminishing returns in that on the 32-bit platform the double long implementation isn’t that much faster than the double int implementation even though it is moving twice as much data per iteration of the inner loop. This code is getting to the point where loop overhead is lost in the noise of memory access.

What is the moral of the story? This code can be faster on 64-bit for certain scenarios, but if you’re writing it you might have to think about it (once again good engineering triumphs over good hardware). For instance, you might have written the single int copy loop for some super optimized routine in your code when thinking about a 32-bit target, if that is the case then that piece of code may run marginally slower on 64-bit (or not, see other graphs below), and if it’s really important you might consider revising it to be long based for a 64-bit win. In the end we’ve seen that making it long based actually results in a win for both 32-bit and 64-bit platforms. This supports an assertion that you will commonly hear me broadcasting to anyone who will listen, “Good Engineering == Better Performance”. It’s true regardless of platform.

While examining this copy loop is a fun game to play, chances are that most of your code isn’t this low level. Chances are also good that most of your code is already fast enough on both platforms. As Rico is apt to say, premature optimization is the root of all evil. I highly recommend that you profile, a lot. And then make educated decisions about the parts of your program which it would make sense to specifically do some work to try and optimize for 64-bit. The likelihood is high that places where you can find something very low level that is 64-bit specific are few and far between. Often the hot spots that you find will be places where optimization just plain makes sense regardless of the target hardware platform. Then it’s just a task to think about that general optimization and hopefully keep 64-bit in mind.


Well, we’ve managed to make it to the end of this post without me directly answering the question posed in the title… In case you’ve forgotten, it is “Isn’t my code going to be faster on 64-bit???”

Maybe.

I know, a pointy haired answer at best… The fact of the matter is that there are a lot of cases where 64-bit processors will provide a significant boost to current applications which can take advantage of the additional memory and registers. However, there are some applications which just by their nature will run at a comparable speed to their 32-bit siblings. And some that will run somewhat slower. It is unfortunately impossible to provide a universal answer to the question for every application under the sun.

The big blocker to a universal speed boost from 64-bit processors is that they don’t fundamentally change one of the big limiting factors of many applications, I/O, both to memory and to the disk or network. Given that most of the time processors in modern machines are spinning, waiting for something to do, the difference of a few instructions in a tight loop when you’re waiting on memory can be so small as to not matter.

Which brings us to an interesting point: as can be clearly seen in the graphs above, running out of cache can be a significant problem on modern processors… This unfortunately is the current challenge for 64-bit computing, a challenge which is somewhat increased by managed runtimes, which tend to encourage coding patterns that are very reference heavy. References (pointers for you old-school C++ types like me) grow on 64-bit; in fact they double in size from 4 bytes (32 bits) to 8 bytes (64 bits). Depending on application architecture this can have a big effect on cache utilization and correspondingly performance.
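A back-of-the-envelope way to see the effect is to count just the bytes taken up by the reference slots themselves. This sketch deliberately ignores object headers, alignment and everything else the runtime adds, and the names are made up for the example.

```csharp
using System;

static class ReferenceFootprint
{
    // Bytes occupied by 'referenceCount' reference slots in the current
    // process: 4 bytes each when 32-bit, 8 bytes each when 64-bit.
    public static long ReferenceBytes(long referenceCount)
    {
        return referenceCount * IntPtr.Size;
    }
}
```

For a structure holding a million references that is about 4MB of slots in a 32-bit process versus about 8MB in a 64-bit one, which is the kind of difference that decides whether a working set still fits in cache.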

So, maybe.

I’ll leave you with this sentiment: “Good Engineering == Good 64-bit Performance!”

 

The code for this example can be found here.

 

How do exes/dlls generated with the /platform:x switch interact?

[fixed typo: 9:37am]

I received a question about this recently, so I figured I'd elaborate here with a little example...

 

Let's assume we have the following three dlls:

   anycpu.dll      -- compiled "any cpu"

   x86.dll           -- compiled "x86"

   x64.dll           -- compiled "x64"

 

And the following three exes:

   anycpu.exe     -- compiled "any cpu"

   x86.exe          -- compiled "x86"

   x64.exe          -- compiled "x64"

 

What happens if you try to use these exes and dlls together? We have to consider two possible scenarios, running on a 32-bit machine and running on a 64-bit machine...

 

On a 32-bit x86 machine:

anycpu.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll

x86.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll

x64.exe -- will get BadImageFormatException when it tries to run

 

On a 64-bit x64 machine:

anycpu.exe -- runs as a 64-bit process, can load anycpu.dll and x64.dll, will get BadImageFormatException if it tries to load x86.dll

x86.exe -- runs as a 32-bit process, can load anycpu.dll and x86.dll, will get BadImageFormatException if it tries to load x64.dll

x64.exe -- runs as a 64-bit process, can load anycpu.dll and x64.dll, will get BadImageFormatException if it tries to load x86.dll
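If you want to poke at this matrix yourself, a small probe along these lines can be compiled with each /platform:X switch and pointed at each of the dlls. The helper name and the returned strings are my own; Assembly.LoadFrom and the two exception types are the framework's.

```csharp
using System;
using System.IO;
using System.Reflection;

static class LoadProbe
{
    // Tries to load the assembly at 'path' into the current process and
    // reports what happened instead of letting the exception escape.
    public static string TryLoad(string path)
    {
        try
        {
            Assembly assembly = Assembly.LoadFrom(path);
            return "loaded " + assembly.GetName().Name;
        }
        catch (BadImageFormatException)
        {
            // The dll's bitness does not match this process.
            return "BadImageFormatException";
        }
        catch (FileNotFoundException)
        {
            return "file not found";
        }
    }
}
```

Calling LoadProbe.TryLoad("x86.dll") from an exe running as a 64-bit process would be expected to land in the BadImageFormatException branch, per the lists above.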

 

Posted by joshwil | 9 Comments