12/6/2007 Update: The bug described below has been fixed in CLR 2.0 SP1.  (See this post for more info.)  However, you may still find this info useful as a tutorial on performing IL rewriting.

Previously, I posted about a bug with using the CLR Profiling API to inspect value-type return values. The workaround I described involved using IL rewriting to instrument managed functions to call your own custom managed function to do the inspection. In a cliffhanger worthy of a nighttime soap's season finale, I promised I'd return to describe another workaround where you instrument the IL to call into your own unmanaged code.

Even if you don't care about this bug (because you don't care about inspecting return values), you might still find it useful to watch IL rewriting in action as it injects calls from managed code to your unmanaged profiler code. You'll definitely want to look over that previous post first, as it has some basic IL rewriting details that will not be repeated here.

Also, a word of caution. The technique described below for calling into your unmanaged profiler code is dangerous, extremely difficult to get right, and relies on internal behavior of the runtime that can change from version to version. If you're a writer of applications, and not profilers, this technique cannot be used in your code, so you might as well stop reading now. If you're a writer of a profiler, it's still really easy to mess this up, so pay particular attention to the warnings and rules below. And, as usual for profiler writers, expect the CLR internals to change out from under you with each new version. There's no guarantee the technique or even calling conventions I describe below will remain in future versions of the CLR.

Who moved my data?

The biggest obstacle you're going to face when calling into your profiler's unmanaged code is ensuring you play nicely with the garbage collector. In the first workaround I described (same post), the GC was of no concern. We were just happy managed code calling more happy managed code. If a GC occurred in the middle, and we were looking at an object reference that got moved, no biggie! The GC will have paused our execution, updated our references to point to the current locations of the objects, and set us on our way in blissful ignorance that anything happened. But once you step outside the happy managed world and write unmanaged code to inspect managed values, the world gets a little more complicated.

Now hold it right there, Dave-o, some of you might be thinking. The bug we're working around involves inspecting VALUE TYPES. Those don't live on the GC heap, so who cares if a GC occurs while my code is running?

That's not entirely true.  Value types are copied around to each point of use.  A value type embedded in an instance of a reference type (i.e., Object) will therefore live on the GC heap and move around.  Now it's true that, as a value type is returned from one function to another, a copy must be made.  But in some cases, that copy might be made directly to a field in another object.  You need to keep all this in mind when rewriting IL to avoid any surprises.  Here are some more reasons you might care:

  • Perhaps you'd like all your inspection code to run through the same path, so you'd like to inspect reference types (i.e., Objects) by a similar mechanism.  And, of course, those always live on the GC heap.
  • Perhaps you'd like to do some deep inspection for some value types, and follow pointers inside the value type to the Objects they reference.  If you ignore the existence of the GC, and it moves those Objects, you'll follow the pointers to la-la-land (which, despite the name, is not a happy place).
  • If you ignore how all this stuff works and just have managed code P/Invoke out to your unmanaged code, the JIT may create stubs between the managed code and your unmanaged code, burning extra cycles, and potentially marshalling the arguments in ways you don't expect.
  • Depending on how you pass the value type to your unmanaged code, it might well end up on the GC heap (e.g., if you box it first, as I did in the first workaround).

Secret ingredient

The secret I'll use is the same secret used internally in the CLR. As you can imagine, it could be devastating if some thread were executing inside the bowels of the CLR, manipulating an Object, and then got interrupted by a GC that moved that Object out from under the thread. This mayhem is prevented by a rule used by the GC: the GC will not continue until all threads are "ready". One way for a thread not to be ready is for it to be executing unmanaged code that is not the target of a P/Invoke or COM call. In particular, if a thread is running inside the CLR to perform usual runtime duties (e.g., JITting, loading a type, etc.), that's not considered a P/Invoke or COM call, so that thread is not "ready", and it will block any GC attempt until it becomes ready.

The CLR takes advantage of this rule all the time, so why not your profiler?

Please stop reading

If any of you application writers are still out there, stop reading.  There's nothing to see here. Move along. Following this technique will doom your application to rely on internals of the runtime that may change from version to version, and will cause deadlocks and foul GC behavior inside your app.

If any of you profiler writers are still out there, you need to be very, very careful with how you use this technique.

Danger!  If you use the following technique to force IL to call your unmanaged code:
  • DO expect your profiler to break in future CLR versions, as it is relying on internal CLR implementation details
  • DO keep the unmanaged code down to a bare minimum, with only simple operations like reading and writing memory (but no heap allocations!)
  • DO relinquish control back to the runtime as soon as possible
  • DO NOT take any locks or perform any kind of thread synchronization
  • DO NOT call back into the runtime (e.g., don't use ICorProfilerInfo(2), metadata APIs, etc.)

In case you forget how easy it is to cause deadlocks, you might want to look at my previous post on stackwalking, particularly the section "GC helps you make a cycle".

Rewrite that IL!

Ok. If you're still with me, let's say I want to rewrite IL to call into my unmanaged inspection function in a way that will block the GC. How do I do this? Well, there are 3 possible IL instructions to use to make a function call: call, callvirt, and calli. The first two use a metadata token to identify the target of the call. But calli is special. It lets you use a plain old function pointer. Crack open your copy of the ECMA CLI Specification. Surely you have one at your fingertips by now! If not, it's currently on msdn at http://msdn.microsoft.com/netframework/ecma/.

Partition III, 3.20 describes the miracle that is calli. Its assembly format looks like this:

	calli callsitedescr

And its effect on the IL stack is as follows:

	…, arg1, arg2 … argN, ftn -> …, retVal (not always returned)

Setting up the IL stack is the easy part. I just load the args onto the IL stack, followed by that ftn dude. The ftn is simply an address to my unmanaged inspection function (i.e., good old-fashioned function pointer).

Where it gets interesting is that little callsitedescr you see in the assembly format for the calli instruction. What's that thing?! It's a metadata token that references a signature for the function. In particular, this can be a StandAloneMethodSig, as described in partition II, 23.2.3. Part of this signature includes the "calling convention" for my profiler's unmanaged inspection function. Partition II, 15.3 lists them out:

CallConv ::= [ instance [ explicit ]] [ CallKind ]
CallKind ::=
  default
| unmanaged cdecl
| unmanaged fastcall
| unmanaged stdcall
| unmanaged thiscall
| vararg

Where we start to get tricky is our choice of the calling convention. The CallKinds "default" and "vararg" are what are normally used when managed code calls managed code. When managed code calls unmanaged code, one of the other "unmanaged *" CallKinds is usually used. Specifying one of the "unmanaged *" CallKinds implicitly tells the JIT that I want all the usual baggage that comes with a P/Invoke or COM call: generate stub code to marshal parameters as necessary, and let the GC know that it's ok to do its thing while we're executing inside the unmanaged call.
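To make the distinction concrete, here's a small sketch of the calling-convention byte that begins a StandAloneMethodSig (ECMA Partition II, 23.2.3). The enum values mirror the corhdr.h constants named in the comments, but the enum itself and the helper `ImpliesUnmanagedTransition` are hypothetical, written only to restate the rule above in code; they are not runtime APIs.

```cpp
#include <cstdint>

// Hypothetical stand-ins whose values mirror the corhdr.h constants used in
// the first byte of a StandAloneMethodSig (ECMA II, 23.2.3).
enum class CallKind : uint8_t {
    Default           = 0x0, // IMAGE_CEE_CS_CALLCONV_DEFAULT (managed-to-managed)
    UnmanagedCdecl    = 0x1, // IMAGE_CEE_UNMANAGED_CALLCONV_C
    UnmanagedStdcall  = 0x2, // IMAGE_CEE_UNMANAGED_CALLCONV_STDCALL
    UnmanagedThiscall = 0x3, // IMAGE_CEE_UNMANAGED_CALLCONV_THISCALL
    UnmanagedFastcall = 0x4, // IMAGE_CEE_UNMANAGED_CALLCONV_FASTCALL
    Vararg            = 0x5  // IMAGE_CEE_CS_CALLCONV_VARARG
};

// Restates the rule above: the "unmanaged *" CallKinds make the JIT treat the
// calli like a P/Invoke (marshaling stub plus a GC-friendly transition);
// default and vararg do not.
constexpr bool ImpliesUnmanagedTransition(CallKind kind) {
    return kind == CallKind::UnmanagedCdecl ||
           kind == CallKind::UnmanagedStdcall ||
           kind == CallKind::UnmanagedThiscall ||
           kind == CallKind::UnmanagedFastcall;
}

// The byte that goes at the start of the signature blob.
constexpr uint8_t CallConvByte(CallKind kind) {
    return static_cast<uint8_t>(kind);
}
```

For my inspection function, the byte I want is the one for `CallKind::Default`, i.e., 0x00.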

So the trick is to use the default CallKind when calling into my profiler's unmanaged code. This forces the JIT not to add any stub code, so the GC will remain blocked for the duration of my call. Armed with this knowledge, let's say I'm going to try to call an unmanaged function with this prototype to do my inspection:

    void UnmanagedInspectValue(void * pv)

I should note that, in the real world, I wouldn't get very far with an inspection function like this. I'm going to want another parameter to identify the type of pv, so I can better inspect it. Maybe my inspection function will be capable of doing rich inspection of a set of known types, and then default to dumping the value in binary or string form for other types. For my purposes here, though, I'm keeping it simple, and leaving the prototype at one parameter.

Given that simple prototype, here is an example StandAloneMethodSig that uses the default CallKind and describes my inspection function:

COR_SIGNATURE unmanagedInspectValueSignature[] = {
	IMAGE_CEE_CS_CALLCONV_DEFAULT,        // Default CallKind!
	0x01,                                 // Parameter count
	ELEMENT_TYPE_VOID,                    // Return type
	ELEMENT_TYPE_PTR, ELEMENT_TYPE_VOID   // Parameter type (void *)
};
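Because a COR_SIGNATURE is just an array of bytes, it can help to see the values the array above actually holds. The sketch below rebuilds the same blob from local constants whose values match corhdr.h (IMAGE_CEE_CS_CALLCONV_DEFAULT is 0x00, ELEMENT_TYPE_VOID is 0x01, ELEMENT_TYPE_PTR is 0x0F), so it compiles without the SDK headers. `BuildInspectSig` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Values as defined in corhdr.h (ECMA II, 23.1.16 and 23.2.3).
constexpr uint8_t kCallConvDefault = 0x00; // IMAGE_CEE_CS_CALLCONV_DEFAULT
constexpr uint8_t kElementTypeVoid = 0x01; // ELEMENT_TYPE_VOID
constexpr uint8_t kElementTypePtr  = 0x0F; // ELEMENT_TYPE_PTR

// Hypothetical helper: builds the StandAloneMethodSig for
// "void UnmanagedInspectValue(void * pv)" using the default CallKind.
std::vector<uint8_t> BuildInspectSig() {
    return {
        kCallConvDefault,                 // default CallKind: no P/Invoke transition
        0x01,                             // parameter count (compressed integer)
        kElementTypeVoid,                 // return type: void
        kElementTypePtr, kElementTypeVoid // parameter 0: void *
    };
}
```

In a real profiler, I'd pass the blob to IMetaDataEmit::GetTokenFromSig, which hands back the mdSignature token that serves as the calli's callsitedescr.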

For this to work right, I'll need to make sure my unmanaged code uses the proper calling convention used by compiled managed code. On x86, this means using fastcall (well, sort of).

void __fastcall UnmanagedInspectValue(void * pv)
{
    // pv points to the value I will inspect
    // ...cool inspection code here...
}

This is where it might get a little confusing. First of all, I should be clear on the distinction between the CallKinds and the native calling convention my unmanaged inspection function will actually use. You may recall my list of CallKinds above. One of those is actually named "unmanaged fastcall". I'm not saying I should use that CallKind in my StandAloneMethodSig. I really do want to use the default CallKind. But when I define my unmanaged function that will be called by the IL I rewrite, that unmanaged function needs to use the fastcall calling convention, because that's the convention used when managed code calls managed code. And I'm trying to appear to the JIT as if that's exactly what I'm doing (even though I'm really having managed code call my unmanaged inspection function).

That brings me to the second fuzzy thing here. My unmanaged inspection function isn't really going to be fastcall. But it's pretty close! (Am I driving you mad yet?) Microsoft's CLR implementation uses a calling convention similar to fastcall for managed calls, but not exactly. The main difference arises when you have to pass more parameters than will fit into the two registers dedicated for fastcalls (ECX, EDX). Normal __fastcall has the remaining parameters pushed onto the stack right-to-left. But the CLR pushes those remaining parameters left-to-right. For more details, check out ECMA Partition II (be sure you're reading the Microsoft-specific version of the ECMA spec, or you won't find that section; again, this is currently on msdn at http://msdn.microsoft.com/netframework/ecma/).

So, to be completely perfect, I really should write my inspection function in assembly--or have an assembly wrapper that accepts the parameters as passed by the CLR, and then passes them to a C function that uses a more typical calling convention. However, since in my case I only have one parameter anyway, I can cheat and just call it __fastcall, and the compiler is none the wiser. Just so long as I remember that, once I add a third parameter to my function, the gig is up!
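Here's a toy model of where pointer-sized arguments land under each x86 convention, which shows why the one-parameter cheat is safe. It assumes every argument is a 4-byte integer or pointer eligible for ECX/EDX, and it models only placement and push order (not stack cleanup, return buffers, or other eligibility rules). `PlaceArgs`, `ArgPlacement`, and `ConventionsAgree` are hypothetical names.

```cpp
#include <vector>

// Where each argument goes in a simplified x86 model in which every
// argument is pointer-sized and register-eligible.
struct ArgPlacement {
    int ecx = -1;               // index of the argument passed in ECX (-1: none)
    int edx = -1;               // index of the argument passed in EDX (-1: none)
    std::vector<int> pushOrder; // indices of stack arguments, in the order pushed
};

// __fastcall: remaining arguments are pushed right-to-left.
// CLR managed convention (as described above): pushed left-to-right.
ArgPlacement PlaceArgs(int argCount, bool clrManaged) {
    ArgPlacement p;
    if (argCount > 0) p.ecx = 0;
    if (argCount > 1) p.edx = 1;
    if (clrManaged) {
        for (int i = 2; i < argCount; ++i) p.pushOrder.push_back(i);
    } else {
        for (int i = argCount - 1; i >= 2; --i) p.pushOrder.push_back(i);
    }
    return p;
}

// True when both conventions place every argument identically.
bool ConventionsAgree(int argCount) {
    ArgPlacement fc = PlaceArgs(argCount, false);
    ArgPlacement clr = PlaceArgs(argCount, true);
    return fc.ecx == clr.ecx && fc.edx == clr.edx && fc.pushOrder == clr.pushOrder;
}
```

Under these simplifying assumptions, the two conventions only visibly disagree once at least two arguments spill onto the stack; with a single stack argument the push order is moot. That's a narrower condition than the third-parameter rule above, but the rule is the safer habit, since real argument lists quickly break the model's assumptions.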

Although the original bug that spawned this post (and thus my workaround) is x86-only, it's worthwhile to mention that, on x64, the native calling convention I'd have to use is just the standard x64 calling convention used everywhere else.  In this calling convention, the first four parameters are passed in registers; the rest on the stack.  The CLR does not diverge from this in its implementation of managed calls, so that makes things a little simpler. You can read more about the x64 calling convention at http://msdn2.microsoft.com/en-us/library/ms235286.aspx.
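For comparison, here's a sketch of the x64 rule for integer and pointer arguments. Floating-point arguments (which use XMM0 through XMM3) and the stack's shadow space are not modeled, and `X64IntegerArgRegister` is a hypothetical name.

```cpp
#include <string>

// Returns the register that carries the integer/pointer argument at
// 0-based position `index` under the x64 calling convention, or "stack"
// when the argument is passed in memory.
std::string X64IntegerArgRegister(int index) {
    static const char* const kRegs[] = {"RCX", "RDX", "R8", "R9"};
    if (index >= 0 && index < 4) return kRegs[index];
    return "stack";
}
```

So on x64, pv simply arrives in RCX.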

You can also read about this sneaky trick of using calli to call native code directly in ECMA (MS-specific), Partition II.

Now that you have some more context, it feels like it's time for another warning. The fact that the CLR uses this pseudo-fastcall convention for calling managed code is an internal implementation detail. The fact that calli works against arbitrary function pointers, and calls through to the underlying unmanaged code while continuing to block the GC, is an internal implementation detail. There is no guarantee that either will continue to hold. And if you think you're going to call all sorts of nifty code in your unmanaged inspection function, scroll back up to the Danger! warning and refresh your memory. That said, profiler writers are often at the bleeding edge and take all sorts of dependencies on internals as a matter of course. Profilers often run code at dangerous times and need to be careful about what they do and when. It's simply my job to remind you that this is one of those times.

What have I got so far?

Let me pause for a breather to summarize my progress. I'm rewriting IL in the profiled code to call my unmanaged inspection function. So far I have:

    ldc.i4 UnmanagedInspectValue (This is my fcn pointer; use the 8-byte ldc.i8 on 64 bits)
    calli callsitedescr
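Here's how those two instructions might be encoded as raw IL bytes. The opcodes come from ECMA Partition III: ldc.i4 is 0x20 followed by a 4-byte little-endian integer, ldc.i8 is 0x21 followed by 8 bytes, and calli is 0x29 followed by the 4-byte StandAloneSig token. The function names `AppendLE` and `EmitInspectCall` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Appends the low `byteCount` bytes of `value` in little-endian order, as IL requires.
void AppendLE(std::vector<uint8_t>& il, uint64_t value, int byteCount) {
    for (int i = 0; i < byteCount; ++i) {
        il.push_back(static_cast<uint8_t>(value >> (8 * i)));
    }
}

// Emits "ldc.i4/ldc.i8 <fcn pointer>; calli <callsitedescr>".
// Uses ldc.i8 when the target is 64-bit, per the note above.
void EmitInspectCall(std::vector<uint8_t>& il, uint64_t fnAddress,
                     uint32_t callSiteSigToken, bool is64Bit) {
    if (is64Bit) {
        il.push_back(0x21);            // ldc.i8
        AppendLE(il, fnAddress, 8);
    } else {
        il.push_back(0x20);            // ldc.i4
        AppendLE(il, fnAddress, 4);
    }
    il.push_back(0x29);                // calli
    AppendLE(il, callSiteSigToken, 4); // StandAloneSig token (table 0x11)
}
```

The function pointer bytes are whatever address UnmanagedInspectValue has in the running process, so the rewritten IL is only meaningful in that process.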

Geez, Dave-o, after all this, all you've got is a push and a calli?!

Yeah, well, that callsitedescr was tricky.

Boxing is for wimps?

So what's next? I know the prototype for my unmanaged inspection function, and I know the IL I'll use to call it. Well, I obviously need some IL to pass the return value that I wish to inspect to my function. And actually, I could just use the code from the previous post detailing the first workaround (where I just dup and box the return value when it's already on the IL stack for me). That would produce this rewrite:



New and improved:

    dup                                              // Keep the return value on the stack for ret
    box (TypeRef / TypeDef token)                    // Box the copy
    ldc.i4(8) UnmanagedInspectValue (fcn pointer)
    calli callsitedescr

But you know what? Boxing is really the wimpy way out (please don't tell Mr. Tyson I said so). It's easy to code up, but there's overhead involved in creating a copy of the value on the GC heap, just so I can get a pointer to it. Really, I just want to do an "address-of" on that baby, and pass that in to my function. Boxing the value just for that is using a cannon to kill a mouse. Unfortunately, IL has no easy way for me to take the address of an item on the IL stack. What is one to do?

Real programmers use locals

I can store the value into a local, and then take the address of the local and push that onto the IL stack. That would produce the following IL rewrite:


New and improved:

    dup                  // Keep the return value on the stack for ret
    stloc(.s) indx       // Copy return value into local
    ldloca(.s) indx      // Push address of local onto stack
    ldc.i4(8) UnmanagedInspectValue (fcn pointer)
    calli callsitedescr
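The "(.s)" notation means each of those instructions has a short form for local indices up to 255 and a long form beyond that. Here's a sketch of choosing between them, using the Partition III opcodes (dup 0x25; stloc.s 0x13 or stloc 0xFE 0x0E; ldloca.s 0x12 or ldloca 0xFE 0x0D). It also emits a leading dup so the original return value is still on the IL stack for the ret that follows. `EmitStoreAndAddressOf` is a hypothetical helper, and the ldc/calli pair would be emitted right after it.

```cpp
#include <cstdint>
#include <vector>

// Emits: dup; stloc(.s) localIndex; ldloca(.s) localIndex
// The leading dup keeps the original return value on the IL stack for ret.
void EmitStoreAndAddressOf(std::vector<uint8_t>& il, uint16_t localIndex) {
    il.push_back(0x25);                          // dup
    if (localIndex <= 0xFF) {
        il.push_back(0x13);                      // stloc.s <uint8>
        il.push_back(static_cast<uint8_t>(localIndex));
        il.push_back(0x12);                      // ldloca.s <uint8>
        il.push_back(static_cast<uint8_t>(localIndex));
    } else {
        il.push_back(0xFE); il.push_back(0x0E);  // stloc <uint16, little-endian>
        il.push_back(static_cast<uint8_t>(localIndex & 0xFF));
        il.push_back(static_cast<uint8_t>(localIndex >> 8));
        il.push_back(0xFE); il.push_back(0x0D);  // ldloca <uint16, little-endian>
        il.push_back(static_cast<uint8_t>(localIndex & 0xFF));
        il.push_back(static_cast<uint8_t>(localIndex >> 8));
    }
}
```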

Hmmm, what's that indx thing? In IL, locals are referenced by their 0-based ordinal. I'd like to use a brand new local, so all I need to do is figure out the highest ordinal this function I'm rewriting uses, add one to it, and use that! The good news is, all functions have a signature blob that describes all the locals they use. So it's easy for me to parse through there to find the highest ordinal in use. The bad news is that all functions have a signature blob that describes all the locals they use... so I'm gonna have to update that blob when I add a new local to the function. Erk!

I got yer local vars signature right here, buddy

Ok, first things first. Where do I find this local variables signature blob, and how do I parse it? The signature blob is identified by a token that sits in the method header. The method header is what you get back when you call ICorProfilerInfo2::GetILFunctionBody. Despite the name, this method doesn't actually return the body directly to you. It returns a pointer to the header, which you must skip over before you get to the body. In reality, a profiler will often need to inspect the header as well, so it's rare that one would want to just skip it. Indeed, I'll need to look at that header to find my local variables signature blob token.

To make things interesting, there are two kinds of headers: tiny and fat. Tiny headers can be used only when the function is very small (less than 64 bytes of IL and a maximum stack depth of 8) and has no local variables or exception handlers. Otherwise, fat headers are used. Luckily, the .NET SDK ships with some helpful wrappers to abstract this stuff away. Open up corhlpr.h, and look for class COR_ILMETHOD_DECODER. That baby makes finding the sig token as easy as:

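        // pProfilerInfo: your ICorProfilerInfo2 pointer (name assumed here)
        LPCBYTE pMethodBytes = NULL;
        HRESULT hr = pProfilerInfo->GetILFunctionBody(
            m_moduleId, tkMethod, &pMethodBytes, NULL);
        if (FAILED(hr))
            return hr;

        COR_ILMETHOD_DECODER decoder((COR_ILMETHOD *) pMethodBytes);
        m_tkLocalVarSig = decoder.GetLocalVarSigTok();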
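If you'd like to see what COR_ILMETHOD_DECODER is doing for you here, below is a stripped-down sketch that reads just enough of the header to pull out the local variables signature token. It follows the layout in ECMA Partition II, 25.4: the low two bits of the first byte are 0x2 for a tiny header or 0x3 for a fat one, and a fat header stores LocalVarSigTok as a little-endian 4-byte value at offset 8. `LocalVarSigTokenFromHeader` is a hypothetical name, and real code should keep using the SDK decoder.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the LocalVarSigTok from an IL method header, or 0 when the header
// is tiny (tiny methods have no locals) or not a readable fat header.
uint32_t LocalVarSigTokenFromHeader(const uint8_t* header, size_t size) {
    if (header == nullptr || size < 1) return 0;
    uint8_t format = header[0] & 0x3;
    if (format == 0x2) {
        return 0;                 // CorILMethod_TinyFormat: no locals
    }
    if (format != 0x3 || size < 12) {
        return 0;                 // not a complete fat header
    }
    // Fat header: Flags/Size (2 bytes), MaxStack (2), CodeSize (4),
    // then LocalVarSigTok (4) at offset 8, little-endian.
    return static_cast<uint32_t>(header[8]) |
           (static_cast<uint32_t>(header[9]) << 8) |
           (static_cast<uint32_t>(header[10]) << 16) |
           (static_cast<uint32_t>(header[11]) << 24);
}
```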

If the function has a tiny header, or otherwise has no local variables signature, then m_tkLocalVarSig will be 0. Otherwise, I can use it to look up the signature blob to parse. For details on the format of the LocalVarSig, look in the ECMA spec, Partition II, 23.2.6. The SigParse sample is capable of parsing this kind of signature. But you'll see that parsing it manually is rather easy, since the format is relatively simple, and there's really very little information I need to get (really, just the number of locals). To modify the signature, again, it's pretty easy. I just increment the count of locals, keep the rest of the signature the same, and then append at the end the type information for the new local I'm adding. Some very rough sample code with very little error checking and almost no testing can be found here (use at your own risk!).
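The parse-and-append step might look something like the sketch below. It relies on the LocalVarSig layout from ECMA Partition II, 23.2.6 (a 0x07 LOCAL_SIG byte, a compressed local count, then one type per local) and the compressed-integer encoding from 23.2. It never walks the existing type entries: it only rewrites the count and appends the new type's bytes. The caller supplies those bytes, for example ELEMENT_TYPE_VALUETYPE (0x11) followed by the compressed TypeDefOrRef token of the return value's type. Moving the blob in and out of metadata (e.g., IMetaDataImport::GetSigFromToken and IMetaDataEmit::GetTokenFromSig) is left out, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Decodes an ECMA-335 compressed unsigned integer at sig[pos]; advances pos.
uint32_t DecodeCompressed(const std::vector<uint8_t>& sig, size_t& pos) {
    if (pos >= sig.size()) throw std::runtime_error("truncated signature");
    uint8_t b0 = sig[pos];
    if ((b0 & 0x80) == 0) { pos += 1; return b0; }            // 1-byte form
    if ((b0 & 0xC0) == 0x80) {                                  // 2-byte form
        if (pos + 2 > sig.size()) throw std::runtime_error("truncated signature");
        uint32_t v = ((b0 & 0x3Fu) << 8) | sig[pos + 1];
        pos += 2;
        return v;
    }
    if ((b0 & 0xE0) == 0xC0) {                                  // 4-byte form
        if (pos + 4 > sig.size()) throw std::runtime_error("truncated signature");
        uint32_t v = ((b0 & 0x1Fu) << 24) | (uint32_t(sig[pos + 1]) << 16) |
                     (uint32_t(sig[pos + 2]) << 8) | sig[pos + 3];
        pos += 4;
        return v;
    }
    throw std::runtime_error("bad compressed integer");
}

// Appends v using the shortest compressed encoding.
void EncodeCompressed(std::vector<uint8_t>& out, uint32_t v) {
    if (v <= 0x7F) {
        out.push_back(uint8_t(v));
    } else if (v <= 0x3FFF) {
        out.push_back(uint8_t(0x80 | (v >> 8)));
        out.push_back(uint8_t(v & 0xFF));
    } else if (v <= 0x1FFFFFFF) {
        out.push_back(uint8_t(0xC0 | (v >> 24)));
        out.push_back(uint8_t((v >> 16) & 0xFF));
        out.push_back(uint8_t((v >> 8) & 0xFF));
        out.push_back(uint8_t(v & 0xFF));
    } else {
        throw std::runtime_error("value too large to compress");
    }
}

// Returns a copy of oldSig with one more local of type newLocalType appended.
// An empty oldSig means the method had no locals signature (token 0).
// newLocalIndex receives the ordinal of the added local.
std::vector<uint8_t> AddLocal(const std::vector<uint8_t>& oldSig,
                              const std::vector<uint8_t>& newLocalType,
                              uint32_t& newLocalIndex) {
    const uint8_t kLocalSig = 0x07;             // IMAGE_CEE_CS_CALLCONV_LOCAL_SIG
    uint32_t count = 0;
    size_t pos = 0;
    if (!oldSig.empty()) {
        if (oldSig[0] != kLocalSig) throw std::runtime_error("not a LocalVarSig");
        pos = 1;
        count = DecodeCompressed(oldSig, pos);  // pos now at the first type
    }
    std::vector<uint8_t> newSig;
    newSig.push_back(kLocalSig);
    EncodeCompressed(newSig, count + 1);        // count may grow to a longer encoding
    newSig.insert(newSig.end(),                 // existing types, unchanged
                  oldSig.begin() + static_cast<std::ptrdiff_t>(pos), oldSig.end());
    newSig.insert(newSig.end(), newLocalType.begin(), newLocalType.end());
    newLocalIndex = count;                      // 0-based: the old count is the new ordinal
    return newSig;
}
```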

Ok, I'm almost done. The sample code linked above has rewritten the locals signature, given me a suitable index to use for my new local variable, and produced a new token to identify the new locals signature. I just need to stick this new token back into the rewritten method header. This is done at the time I call ICorProfilerInfo2::SetILFunctionBody, since that (like GetILFunctionBody) deals with a pointer to the method header that immediately precedes the method body. I just use the structures in the .NET SDK files, like IMAGE_COR_ILMETHOD_FAT, to help me.

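    BYTE * pNewMethod = (BYTE *) m_pIMethodMalloc->Alloc(totalSize);
    if (pNewMethod == NULL)
        return E_OUTOFMEMORY;
    IMAGE_COR_ILMETHOD_FAT * pHeader = (IMAGE_COR_ILMETHOD_FAT *) pNewMethod;
    pHeader->LocalVarSigTok = m_tkLocalVarSig;
    // More code here to fully fill out the header, and write the new IL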

But wait, Dave-o. What if the function you're rewriting had a tiny header? You're adding a local, and thus eliminating the ability to use the tiny header anymore.

Yes, indeed, I will always rewrite the method as having a fat header, even if it was tiny to begin with. This is always safe to do.
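Upgrading a tiny header mostly means spelling out what the tiny format leaves implicit: per ECMA Partition II, 25.4.2, a tiny method has no locals, no exception sections, and an implied MaxStack of 8. The sketch below serializes a 12-byte fat header by hand (25.4.3). `WriteFatHeader`, `FatHeaderFromTiny`, and the `extraStack` parameter are hypothetical; by my count, the dup, stloc, ldloca, ldc, calli sequence peaks two slots above the stack depth at the ret, which is why the usage passes 2. The sketch doesn't set InitLocals or handle extra data sections.

```cpp
#include <cstdint>
#include <vector>

// Flag values from ECMA II, 25.4.4 (also in corhdr.h).
constexpr uint16_t kFatFormat  = 0x3;  // CorILMethod_FatFormat
constexpr uint16_t kInitLocals = 0x10; // CorILMethod_InitLocals

// Serializes a 12-byte fat method header, little-endian.
std::vector<uint8_t> WriteFatHeader(uint16_t flags, uint16_t maxStack,
                                    uint32_t codeSize, uint32_t localVarSigTok) {
    // Low 12 bits: flags with the format forced to fat.
    // High 4 bits: header size in 4-byte units, always 3 for a fat header.
    uint16_t flagsAndSize = uint16_t((flags & 0x0FFC) | kFatFormat | (3u << 12));
    std::vector<uint8_t> h;
    auto put = [&h](uint32_t v, int bytes) {
        for (int i = 0; i < bytes; ++i) h.push_back(uint8_t(v >> (8 * i)));
    };
    put(flagsAndSize, 2);
    put(maxStack, 2);
    put(codeSize, 4);
    put(localVarSigTok, 4);
    return h;
}

// Builds a fat header for a method that originally had a tiny header,
// raising the tiny format's implied MaxStack of 8 by extraStack.
std::vector<uint8_t> FatHeaderFromTiny(uint32_t newCodeSize,
                                       uint32_t newLocalVarSigTok,
                                       uint16_t extraStack) {
    const uint16_t kTinyMaxStack = 8;
    return WriteFatHeader(0, uint16_t(kTinyMaxStack + extraStack),
                          newCodeSize, newLocalVarSigTok);
}
```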

More fun than sorting razorblades

To sum up, here's what I did:

  • Create an unmanaged function in my profiler to inspect value types from a void *
    • Native calling convention should be the CLR-pseudo-__fastcall on x86!
  • Create a StandAloneMethodSig to describe my unmanaged inspection function.
    • Must use the default CallKind!
  • Grab the local variables signature token out of the method header
  • Get the raw signature blob out of the metadata (using the token)
  • Parse the blob to figure out the current number of locals
  • Rewrite the blob to add one more local whose type matches the return value's type
  • Put the rewritten blob back into the metadata, and get back a new token for it
  • Rewrite the method header using the new token
  • Rewrite the method IL
    • Stick the return value into the new local
    • Push the address of that local onto the IL stack as argument for my inspection function
    • calli my unmanaged inspection function

What do you think? Interesting? Boring? Questions? Post your comments!