Welcome to MSDN Blogs Sign in | Join | Help

BUG: GetILFunctionBody returns wrong size

In case you missed it, there was a post on our forum here:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1366051&SiteID=1

and a comment on my blog here:

http://blogs.msdn.com/davbr/archive/2007/03/06/creating-an-il-rewriting-profiler.aspx#1881536

about an issue with GetILFunctionBody returning the wrong size.  This is indeed a bug in CLR 2.x, and it is recommended you follow the ECMA spec help you parse the function header to determine the actual size.  More details are in the forum post linked above.  This bug will be fixed in a future release of the CLR.

Posted by davbr | 3 Comments

Debugging Your Profiler II: SOS and IDs

In this debugging post, I'll talk about the various IDs the profiling API exposes to your profiler, and how you can use SOS to give you more information about the IDs.  As usual, this post assumes you're using CLR 2.x.

S.O.What Now?

SOS.DLL is a debugger extension DLL that ships with the CLR.  You'll find it sitting alongside mscorwks.dll.  While originally written as an extension to the windbg family of debuggers, Visual Studio can also load and use SOS.  If you search the MSDN blogs for "SOS" you'll find lots of info on it.  I'm not going to repeat all that's out there, but I'll give you a quick primer on getting it loaded.

In windbg, you'll need mscorwks.dll to load first, and then you can load SOS.  Often, I don't need SOS until well into my debugging session, at which point mscorwks.dll has already been loaded anyway.  However, there are some cases where you'd like SOS loaded at the first possible moment, so you can use some of its commands early (like !bpmd to set a breakpoint on a managed method).  So a surefire way to get SOS loaded ASAP is to have the debugger break when mscorwks gets loaded (e.g., "sxe ld mscorwks").  Once mscorwks is loaded, you can load SOS using the .loadby command:

0:000> sxe ld mscorwks
0:000> g
ModLoad: 79e70000 7a3ff000   C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=7efdd000 edi=20000000
eip=77a1a9fa esp=002fea38 ebp=002fea78 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
ntdll!NtMapViewOfSection+0x12:
77a1a9fa c22800          ret     28h
0:000> .loadby sos mscorwks

With SOS loaded, you can now use its commands to inspect the various IDs that the profiling API passes to your profiler.

Note: The following contains implementation details of the runtime.  While these details are useful as a debugging aid, your profiler code cannot make assumptions about them.  These implementation details are subject to change at whim.

FunctionID Walkthrough

For starters, take a look at FunctionIDs.  Your profiler receives a FunctionID anytime you hit a callback that needs to, well, identify a function!  For example, when it's time to JIT, the CLR issues JITCompilationStarted (assuming your profiler subscribed to that callback), and one of the parameters to the callback is a FunctionID.  You can then use that FunctionID in later calls your profiler makes back into the CLR, such as GetFunctionInfo2.

As far as your profiler is concerned, a FunctionID is just an opaque number.  It has no meaning in itself; it's merely a handle you can pass back into the CLR to refer to the function.  Under the covers, however, a FunctionID is actually a pointer to an internal CLR data structure called a MethodDesc.  I must warn you again that you cannot rely on this information when coding your profiler.  The CLR team reserves the right to change the underlying meaning of a FunctionID to be something radically different in later versions.  This info is for entertainment and debugging purposes only!

Ok, so FunctionID = (MethodDesc *).  How does that help you?  SOS just so happens to have a command to inspect MethodDescs: !dumpmd.  So if you're in a debugger looking at your profiler code that's operating on a FunctionID, it can beneficial to you to find out which function that FunctionID actually refers to.  In the example below, the debugger will break in my proifler's JITCompilationStarted callback and look at the FunctionID.  It's assumed that you've already loaded SOS as per above.

0:000> bu UnitTestSampleProfiler!SampleCallbackImpl::JITCompilationStarted
0:000> g
...
Breakpoint 0 hit
eax=00c133f8 ebx=00000000 ecx=10001218 edx=00000001 esi=002fec74 edi=00000000
eip=10003fc0 esp=002fec64 ebp=002feca4 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
UnitTestSampleProfiler!SampleCallbackImpl::JITCompilationStarted:
10003fc0 55              push    ebp

The debugger is now sitting at the beginning of my profiler's JITCompilationStarted callback.  Let's take a look at the parameters.

0:000> dv
           this = 0x00c133f8
     functionID = 0x1e3170
 fIsSafeToBlock = 1

Aha, that's the FunctionID about to get JITted.  Now use SOS to see what that function really is.

0:000> !dumpmd 0x1e3170
Method Name: test.Class1.Main(System.String[])
Class: 001e1288
MethodTable: 001e3180
mdToken: 06000001
Module: 001e2d8c
IsJitted: no
m_CodeOrIL: ffffffff

Lots of juicy info here, though the Method Name typically is what helps me the most in my debugging sessions.  mdToken tells us the metadata token for this method.  MethodTable tells us where another internal CLR data structure is stored that contains information about the class containing the function.  In fact, the profiing API's ClassID is simply a MethodTable *.  [Note: the "Class: 001e1288" in the output above is very different from the MethodTable, and thus different from the profiling API's ClassID.  Don't let the name fool you!]  So we could go and inspect a bit further by dumping information about the MethodTable:

0:000> !dumpmt 0x001e3180
EEClass: 001e1288
Module: 001e2d8c
Name: test.Class1
mdToken: 02000002  (C:\proj\HelloWorld\Class1.exe)
BaseSize: 0xc
ComponentSize: 0x0
Number of IFaces in IFaceMap: 0
Slots in VTable: 6

And of course, !dumpmt can be used anytime you come across a ClassID and want more info on it.

IDs and their Dumpers

Now that you see how this works, you'll need to know how the profiling IDs relate to the various SOS commands that dump info on them:

ID Internal CLR Structure SOS command
AssemblyID Assembly * !DumpAssembly
AppDomainID AppDomain * !DumpDomain
ModuleID Module * !DumpModule
ClassID MethodTable * !DumpMT
ThreadID Thread * !Threads (see note)
FunctionID MethodDesc * !DumpMD
ObjectID Object * (i.e., a managed object) !DumpObject

Note:  !Threads takes no arguments, but simply dumps info on all threads that have ever run managed code.  If you use "!Threads -special" you get to see other special threads separated out explicitly, including threads that perform GC in server-mode, the finalizer thread, and the debugger helper thread.

More Useful SOS Commands

It would probably be quicker to list what isn't useful!  I encourage you to do a !help to see what's included. Here's a sampling of what I commonly use:

!u is a nice SOS analog to the windbg command "u". While the latter gives you a no-frills disassembly, !u works nicely for managed code, including spanning the disassembly from start to finish, and converting metadata tokens to names.

!bpmd lets you place a breakpoint on a managed method. Just specify the module name and the fully-qualified method name. For example:

!bpmd MyModule.exe MyNamespace.MyClass.Foo 

If the method hasn't jitted yet, no worries. A "pending" breakpoint is placed.  If your profiler performs IL rewriting, then using !bpmd on startup to set a managed breakpoint can be a handy way to break into the debugger just before your instrumented code will run (which, in turn, is typically just after your instrumented code has been jitted). This can help you in reproducing and diagnosing issues your profiler may run into when instrumenting particular functions (due to something interesting about the signature, generics, etc.).

!PrintException: If you use this without arguments you get to see a pretty-printing of the last outstanding managed exception on the thread; or specify a particular Exception object's address.

 

Ok, that about does it for SOS. Hopefully this info can help you track down problems a little faster, or better yet, perhaps this can help you step through and verify your code before problems arise.

Posted by davbr | 2 Comments

Debugging Your Profiler I: Activation

This is the first of some tips to help you debug your profiler.  Note that these tips assume you're using CLR 2.x (see this entry for info on how CLR version numbers map to .NET Framework version numbers).  In today's post, I address a frequent question from profiler developers and users: "Why didn't my profiler load?".

Event log

In the Application event log, you'll see entries if the CLR attempts, but fails, to load and initialize your profiler.  So this is a nice and easy place to look first, as the message may well make it obvious what went wrong.

Weak link in the chain?

The next step is to carefully retrace this chain to make sure everything is registered properly:

Environment variables --> Registry --> Profiler DLL on File system.

The first link in this chain is to check the environment variables inside the process that should be profiled.  If you're running the process from a command-prompt, you can just try a "set co" from the command prompt:

C:\>set co
(blah blah, other vars beginning with "co")
Cor_Enable_Profiling=0x1
COR_PROFILER={C5F90153-B93E-4138-9DB7-EB7156B07C4C}

If your scenario doesn't allow you to just run the process from a command prompt, like say an asp.net scenario, you may want to attach a debugger to the process that's supposed to be profiled, or use IFEO (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options) to force a debugger to start when the worker process starts.  In the debugger, you can then use "!peb" to view the environment block, which will include the environment variables.

Once you verify Cor_Enable_Profiling and COR_PROFILER are ok, it's time to search the registry for the very same GUID set in your COR_PROFILER environment variable.  You should find it at a path like this:

HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{C5F90153-B93E-4138-9DB7-EB7156B07C4C}

Note!  If you're on a 64 bit box, be aware of the WOW64 redirectors, and ensure you're looking at the proper view of the environment and registry!

If the registry has the GUID value, it's finally time to check out your file system.  Go under the InprocServer32 subkey under the GUID:

HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{C5F90153-B93E-4138-9DB7-EB7156B07C4C}\InprocServer32

and look at the default value data.  It should be a full path to your profiler's DLL.  Verify it's accurate.  If not, perhaps you didn't properly run regsvr32 against your profiler, or maybe your profiler's DllRegisterServer had problems.

Time for a debugger

If the above investigation indicates everything's ok, then your profiler is properly registered and your environment is properly set up, but something bad must be happening at run time.  You'll want symbols for the CLR, which are freely available via Microsoft's symbol server.  If you set this environment variable, you can ensure windbg will always use the symbol server:

set _NT_SYMBOL_PATH=srv*C:\MySymbolCache*http://msdl.microsoft.com/download/symbols

Feel free to add more paths (separate them via ";") so you can include your profiler's symbols as well.  Now, from a command-prompt that has your Cor_Enable_Profiling and COR_PROFILER variables set, run windbg against the executable you want profiled.  The debuggee will inherit the environment, so the profiling environment variables will be propagated to the debuggee.

Note: The following contains implementation details of the runtime.  While these details are useful as a debugging aid, your profiler code cannot make assumptions about them.  These implementation details are subject to change at whim.

Once windbg is running, try setting this breakpoint:

bu mscordbc!EEToProfInterfaceImpl::CreateProfiler

Now go!  If you hit that breakpoint, that verifies the CLR has determined that a profiler has been requested to load from the environment variables, but the CLR has yet to read the registry.  Let's see if your DLL actually gets loaded.  You can use

sxe ld NameOfYourProfiler.dll

or even set a breakpoint inside your Profiler DLL's DllMain.  Now go, and see if your profiler is getting loaded.  If you can verify your profiler's DLL is getting loaded, then you now know your registry is pointing to the proper path, and any static dependencies your profiler has on other DLLs have been resolved.  But will your profiler COM object get instantiated properly?  Set breakpoints in your class factory (DllGetClassObject) and your profiler COM object's QueryInterface to see if you can spot problems there.  For example, if your profiler only works against CLR 1.x, then the CLR's call into your QueryInterface will fail, since you don't implement ICorProfilerCallback2.

If you're still going strong, set a breakpoint in your profiler's Initialize() callback.  Failures here are actually a popular cause for activation problems.  Inside your Initialize() callback, your profiler is likely calling QueryInterface for the ICorProfilerInfoX interface of your choice, and then calling SetEventMask, and doing other initialization-related tasks, like calling SetEnterLeaveFunctionHooks(2).  Do any of these fail?  Is your Initialize() callback returning a failure HRESULT?

Hopefully by now you've isolated the failure point.  If not, and your Initialize() is happily returning S_OK, then your profiler is apparently loading just fine.  At least it is when you're debugging it.  :-)

Posted by davbr | 5 Comments

Versions of Microsoft .NET Framework, CLR, and Your Profiler

With the recent release of the Microsoft .NET Framework 3.5, not to mention prior versions like Microsoft .NET Framework 3.0 and Microsoft .NET Framework 2.0 (and their service packs), it might not be obvious what versions of the Common Language Runtime (CLR) come alongside them and whether your profiler should care.  So I'm posting this to help clarify the various versions.

Note: I'm not talking about the Microsoft .NET Compact Framework or Microsoft .NET Micro Framework in this post.

Check it out!

Before I go any further, I invite you to learn more about the recently released Visual Studio 2008 and Microsoft .NET Framework 3.5.  Whether you write profilers or not, there's an amazing slew of very cool features you may want to check out.  Some great places to start: this post from ScottGu's blog, this post from Soma's blog, or directly at the Visual Studio page http://www.microsoft.com/vstudio.

Definitions

Ok, so what's the difference between "Common Language Runtime (CLR)" and "Microsoft .NET Framework"?  I think of it as:

CLR + managed libraries and tools = Microsoft .NET Framework

The CLR is the low-level technology (much of it written in unmanaged, native code) that includes the garbage collector, security subsystem, just-in-time compiler, type system, the profiling API (of course :-)), and other similar stuff.  Much of this tends to reside in mscorwks.dll.

If you then add onto that list the many rich managed libraries for implementing graphical user interfaces, web services, accessing Windows OS functionality, etc., as well as the managed language compilers and tools, you get the Microsoft .NET Framework.

Disclaimer: These definitions are how I, geek-Dave, keep things straight in my head.  I'm not in marketing so the names I'm using might not be perfectly accurate (e.g., I'm probably missing terms like "SDK" or "redistributable package").  Please don't take these as Microsoft Official Definitions.

Versions

You can get the various Microsoft .NET Framework versions via Windows Update, by searching on http://download.microsoft.com, or by installing the corresponding version of Visual Studio.  Also, Windows Vista comes with Microsoft .NET Framework 3.0 as an optionally installable component.

Here's how the Microsoft .NET Framework versions >= 2.0 correlate with the CLR versions:

Microsoft .NET Framework version CLR version Ships with Visual Studio Version
2.0 2.0 2005
2.0 SP1 2.0 SP1 2008 (via .NET 3.5 install)
3.0 2.0 (comes with Vista)
3.0 SP 1 2.0 SP1 2008 (via .NET 3.5 install)
3.5 2.0 SP1 2008

Note that the latest CLR version at the time of writing this post is CLR 2.0 SP1 (no such thing as CLR 3.x!).

One exciting note here is that Visual Studio 2008 will actually let you target all three of .NET 2.0, .NET 3.0, and .NET 3.5.

Your Profiler

CLR Versions

Your profiler interacts with the CLR Profiling API, so you likely care which CLR version you're interacting with.  Knowing that the .NET Framework 2.x/3.x versions are all based on CLR 2.x simplifies the picture for you, as you're assured that the profiling API remains compatible across those versions.  However, it's fair to expect that CLR service packs might bring small changes or bug fixes.  CLR 2.0 SP1, for example, contains the following profiling API fixes:

  • SetEnterLeaveFunctionHooks2 returned E_INVALIDARG if any of the specified callback pointers were NULL.  For example, one might wish to specify an Enter probe (non-NULL) but not specify a Leave probe (NULL); this was disallowed.  Fix: SetEnterLeaveFunctionHooks2 now allows one or more of the callback parameters to be NULL.
  • Various fixes around IMethodMalloc::Alloc
  • When a profiler was monitoring Leave calls in order to inspect non-primitive, value-type return values, the profiling API would sometimes specify an invalid value.  This has been fixed.  (More info about this bug was posted here and here.)

Microsoft .NET Framework Versions

Your profiler may also (optionally) take dependencies on the libraries.  For example, if you have an instrumenting profiler that performs IL rewriting on some framework library code, then your profiler should deal gracefully with the cases that the libraries themselves may be different in the various .NET Framework versions.  Other than that, your profiler probably doesn't need to care which version of the libraries is in use.

Long Story Short

Although there are several Microsoft .NET Framework versions out there, your CLR 2.x-based profiler should still be fine.  Of course, the CLR is under active development (those libraries teams aren't the only ones having fun), so look forward to exciting things for your profiler to take advantage of in the future.

Posted by davbr | 7 Comments

Test my Code! (And get paid for it.)

We've got a new job opening listed here.  If your idea of fun is writing debuggers and profilers that automate the testing of the CLR APIs on which they're built, then you'll want to check out this opening.  Be the first on your block to discover the latest and greatest CLR features (and then write code to pound the heck out of it!).  As a bonus, you get to open bugs and assign them to me.  :-).

Posted by davbr | 1 Comments

Enter, Leave, Tailcall Hooks Part 2: Tall tales of tail calls

For most people the idea of entering or returning from a function seems straightforward. Your profiler's Enter hook is called at the beginning of a function, and its Leave hook is called just before the function returns. But the idea of a tail call and exactly what that means for the Profiling API is less straightforward.

In Part 1 I talked about the basics of the Enter / Leave / Tailcall hooks and generally how they work. You may want to review that post first if you haven't seen it yet. This post builds on that one by talking exclusively about the Tailcall hook, how it works, and what profilers should do inside their Tailcall hooks.

Tail calling in general

Tail calling is a compiler optimization that saves execution of instructions and saves reads and writes of stack memory. When the last thing a function does is call another function (and other conditions are favorable), the compiler may consider implementing that call as a tail call, instead of a regular call.

    static public void Main()
    {
        Helper();
    }

    static public void Helper()
    {
        One();
        Two();
        Three();
    }

    static public void Three()
    {
        ...
    }

In the code above, the compiler may consider implementing the call from Helper() to Three() as a tail call. What does that mean, and why would that optimize anything? Well, imagine this is compiled without a tail call optimization. By the time Three() is called, the stack looks like this (my stacks grow UP):

Three
Helper
Main

Each of those functions causes a separate frame to be allocated on the stack. All the usual contents of a frame, including locals, parameters, the return address, saved registers, etc., get stored in each of those frames. And when each of those functions returns, the return address is read from the frame, and the stack pointer is adjusted to collapse the frame of the returning function. That's just the usual overhead associated with making a function call.

Now, if the call from Helper() to Three() were implemented as a tail call, we'd avoid that overhead, and Three() would just "reuse" the stack frame that had been set up for Helper(). While Three() is executing, the call stack would look like this:

Three
Main

And when Three() returns, it returns directly to Main() without popping back through Helper() first.

Folks who live in functional programming languages like Scheme use recursion at least as often as C++ or C# folks use while and for loops. Such functional programming languages depend on tail call optimizations (in particular tail recursion) to avoid overflowing the stack. While imperative languages like C++ or C# don't have such a vital need for tail call optimizations, it's still pretty handy as it reduces the number of instructions executed and the writes to the stack.  Also, it's worth noting that the amount of stack space used for a single frame can be more than you'd expect.  For example, in CLR x64, each regular call (without the tail call optimization) uses a minimum of 48 bytes of stack space, even if it takes no arguments, has no locals, and returns nothing.  So for small functions, the tail call optimization can provide a significant overhead reduction in terms of stack space.

The CLR and tail calls

When you're dealing with languages managed by the CLR, there are two kinds of compilers in play. There's the compiler that goes from your language's source code down to IL (C# developers know this as csc.exe), and then there's the compiler that goes from IL to native code (the JIT 32/64 bit compilers that are invoked at run time or NGEN time). Both the source->IL and IL->native compilers understand the tail call optimization. But the IL->native compiler--which I'll just refer to as JIT--has the final say on whether the tail call optimization will ultimately be used. The source->IL compiler can help to generate IL that is conducive to making tail calls, including the use of the "tail." IL prefix (more on that later). In this way, the source->IL compiler can structure the IL it generates to persuade the JIT into making a tail call. But the JIT always has the option to do whatever it wants.

When does the JIT make tail calls?

I asked Fei Chen and Grant Richins, neighbors down the hall from me who happen to work on the JIT, under what conditions the various JITs will employ the tail call optimization. The full answer is rather detailed. The quick summary is that the JITs try to use the tail call optimization whenever they can, but there are lots of reasons why the tail call optimization can't be used. Some reasons why tail calling is a non-option:

  • Caller doesn't return immediately after the call (duh :-))
  • Stack arguments between caller and callee are incompatible in a way that would require shifting things around in the caller's frame before the callee could execute
  • Caller and callee return different types
  • We inline the call instead (inlining is way better than tail calling, and opens the door to many more optimizations)
  • Security gets in the way
  • The debugger / profiler turned off JIT optimizations

Here are their full, detailed answers.

Note that how the JIT decides whether to use the tail calling optimization is an implementation detail that is prone to change at whim.  You must not take dependencies on this behavior. Use this information for your own personal entertainment only.

Your Profiler's Tailcall hook

I'm assuming you've already read through Part 1 and are familiar with how your profiler sets up its Enter/Leave/Tailcall hooks, so I'm not repeating any of those details here. I will focus on what kind of code you will typically want to place inside your Tailcall hook:

typedef void FunctionTailcall2(

                FunctionID funcId,

                UINT_PTR clientData,

                COR_PRF_FRAME_INFO func);

Tip: More than once I've seen profiler writers make the following mistake. They will take their naked assembly-language wrapper for their Enter2 and Leave2 hooks, and paste it again to use as the Tailcall2 assembly-language wrapper. The problem is they forget that the Tailcall2 hook takes a different number of parameters than the Enter2 / Leave2 hooks (or, more to the point, a different number of bytes is passed on the stack to invoke the Tailcall2 hook). So, they'll take the "ret 16" at the end of their Enter2/Leave2 hook wrappers and stick that into their Tailcall2 hook wrapper, forgetting to change it to a "ret 12". Don't make the same mistake!

It's worth noting what these parameters mean. With the Enter and Leave hooks it's pretty obvious that the parameters your hook is given (e.g., funcId) apply to the function being Entered or Left. But what about the Tailcall hook? Do the Tailcall hook's parameters describe the caller (function making the tail call) or the callee (function being tail called into)?

Answer: the parameters refer to tail caller.

The way I remember it is that the Tailcall hook is like an "Alternative Leave" hook. A function ends either by returning (in which case the CLR invokes your Leave hook) or a function ends by tail calling out to somewhere else (in which case the CLR invokes your Tailcall hook). In either case (Leave hook or Tailcall hook) the hook's parameters tell you about the function that's ending. If a function happens to end by making a tail call, your profiler is not told the target of that tail call. (The astute reader will realize that actually your profiler is told what the target of the tail call is--you need only wait until your Enter hook is called next, and that function will be the tail call target, or "tail callee". (Well, actually, this is true most of the time, but not all! (More on that later, but consider this confusing, nested series of afterthoughts a hint to a question I pose further down in this post.)))

Did you just count the number of closing parentheses to ensure I got it right? If so, I'd like to make fun of you but I won't--I'd have counted the parentheses, too. My house is glass.

Ok, enough dilly-dallying. What should your profiler do in its Tailcall hook? Two of the more common reasons profilers use Enter/Leave/Tailcall hooks in the first place is to keep shadow stacks or to maintain call traces (sometimes with timing information).

Shadow stacks

The CLRProfiler is a great example of using Enter/Leave/Tailcall hooks to maintain shadow stacks. A shadow stack is your profiler's own copy of the current stack of function calls on a given thread at any given time. Upon Enter of a function, you push that FunctionID (and whatever other info interests you, such as arguments) onto your data structure that represents that thread's stack. Upon Leave of a function, you pop that FunctionID. This gives you a live list of managed calls in play on the thread. The CLRProfiler uses shadow stacks so that whenever the managed app being profiled chooses to allocate a new object, the CLRProfiler can know the managed call stack that led to the allocation. (Note that an alternate way of accomplishing this would be to call DoStackSnapshot at every allocation point instead of maintaining a shadow stack. Since objects are allocated so frequently, however, you'd end up calling DoStackSnapshot extremely frequently and will often see worse performance than if you had been maintaining shadow stacks in the first place.)

 

OK, so when your profiler maintains a shadow stack, it's clear what your profiler should do on Enter or Leave, but what should it do on Tailcall? There are a couple ways one could imagine answering that question, but only one of them will work! Taking the example from the top of this post, imagine the stack looks like this:

Helper
Main

and Helper is about to make a tail call into Three().  What should your profiler do?

Method 1: On tailcall, pop the last FunctionID.  (In other words, treat Tailcall just like Leave.)

So, in this example, when Helper() calls Three(), we'd pop Helper().  As soon as Three() is called, our profiler would receive an Enter for Three(), and our shadow stack would look like this:

Three
Main

This approach mirrors reality, because this is what the actual physical stack will look like.  Indeed, if one attaches a debugger to a live process, and breaks in while the process is inside a tail call, the debugger will show a call stack just like this, where you see the tail callee, but not the tail caller.  However, it might be a little confusing to a user of your profiler who looks at his source code and sees that Helper() (not Main()) calls Three().  He may have no idea that when Helper() called Three(), the JIT chose to turn that into a tail call. In fact, your user may not even know what a tail call is. You might therefore be tempted to try this instead:

Method 2: On tailcall, "mark" the FunctionID at the top of your stack as needing a "deferred pop" when its callee is popped, but don't pop yet.

With this strategy, for the duration of the call to Three(), the shadow stack will look like this:

Three
Helper
(marked for deferred pop)
Main

which some might consider more user-friendly. And as soon as Three() returns, your profiler will sneakily do a double-pop leaving just this:

Main

So which should your profiler use: Method 1 or Method 2? Before I answer, take some time to think about this, invoking that hint I cryptically placed above in nested parentheses. And no, the fact that the parentheses were nested is not part of the actual hint.

The answer: Method 1. In principle, either method should be fine. However, the behavior of the CLR under certain circumstances will break Method 2. Those "certain circumstances" are what I alluded to when I mentioned "this is true most of the time, but not all" above.  These mysterious "certain circumstances" involve a managed function tail calling into a native helper function inside the runtime. Here's an example:
static public void Main()
{
    Thread.Sleep(44);
    Helper();
}

It just so happens that the implementation of Thread.Sleep makes a call into a native helper function in the bowels of the runtime. And that call happens to be the last thing Thread.Sleep does. So the JIT may helpfully optimize that call into a tail call. Here are the hook calls your profiler will see in this case:

(1) Enter (into Main)
(2) Enter (into Thread.Sleep)
(3) Tailcall (from Thread.Sleep)
(4) Enter (into Helper)
(5) Leave (from Helper)
(6) Leave (from Main)

Note that after you get a Tailcall telling you that Thread.Sleep is done, (in (3)), the very next Enter you get (in (4)) is NOT the Enter for the function being tail called. This is because the CLR only provides Enter/Leave/Tailcall hooks for managed functions, and the very next managed function being entered is Helper().  So, how will Method 1 and Method 2 fare in this example?

Method 1: Shadow stack works

By popping on every Tailcall hook, your shadow stack stays up to date.

Method 2: Shadow stack fails

At stage (4), the shadow stack looks like this:

Helper
Thread.Sleep
(marked for "deferred pop")
Main

If you think it might be complicated to explain tail calls to your users so they can understand the Method 1 form of shadow stack presentation, just try explaining why it makes sense to present to them that Thread.Sleep() is calling Helper()!

And of course, this can get arbitrarily nasty:

static public void Main()
{
    Thread.Sleep(44);
    Thread.Sleep(44);
    Thread.Sleep(44);
    Thread.Sleep(44);
    Helper();
}

would yield:

Helper
Thread.Sleep
(marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Main

And things get more complicated if you start to think about when you actually pop a frame marked for "deferred pop". In all the above examples, you would do so as soon as the frame above it gets popped. So once Helper() is popped (due to Leave()), you'd cascade-pop all the Thread.Sleeps. But what if there is no frame above the frames marked for "deferred pop"?  To wit:

static public void Main()
{
    Helper()
}

static public void Helper()
{
    Thread.Sleep(44);
    Thread.Sleep(44);
    Thread.Sleep(44);
    Thread.Sleep(44);
}

would yield:

Thread.Sleep (marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Thread.Sleep (marked for "deferred pop")
Helper
Main

until you get a Leave hook for Helper(). At this point, you need to pop Helper() from your shadow stack, but he's not at the top-- he's buried under all your "deferred pop" frames. So your profiler would need to perform the deferred pops if a frame above OR below them gets popped. Hopefully, the yuckiness of this implementation will scare you straight.  But the confusion of presenting crazy stacks to the user is the real reason to abandon Method 2 and go with Method 1.

Call tracing

The important lesson to learn from the above section is that sometimes a Tailcall hook will match up with the next Enter hook (i.e., the tail call you're notified of in your Tailcall hook will have as its callee the very function you're notified of in the next Enter hook), and sometimes the Tailcall hook will NOT match with the next Enter hook (in particular when the Tailcall hook refers to a tail call into a native helper in the runtime). And the sad fact is that the Enter/Leave/Tailcall hook design does not currently allow you to predict whether a Tailcall will match the next Enter.

As an illustration, consider two simple tail call examples:

Matching Example

    static public void Main()
    {
        One();
        ...(other code here)...
    }

    static public void One()
    {
        Two();
    }

Non-matching Example

    static public void Main()
    {
        Thread.Sleep(44);
        Two();
    }

In either case, your profiler will see the following hook calls

(1) Enter (into Main)
(2) Enter (into One / Thread.Sleep)
(3) Tailcall (from One / Thread.Sleep)
(4) Enter (into Two)
...

In the first example, (3) and (4) match (i.e., the tail call really does call into Two()). But in the second example, they do not (the tail call does NOT call into Two()).

Since you don't know when Tailcall will match the next Enter, your implementation of call tracing, like shadow stack maintenance, must treat a Tailcall hook just like a Leave. If you're logging when functions begin and end, potentially with the amount of time spent inside the function, then your Tailcall hook should basically do the same thing as your Leave hook. A call to your Tailcall hook indicates that the specified function is over and done with, just like a call to your Leave hook.

As with shadow stacks, this will sometimes lead to call graphs that could be confusing. "Matching Example" had One tail call Two, but your graph will look like this:

Main
 |
 |-- One
 |-- Two

But at least this effect is explainable to your users, and is self-correcting after the tail call is complete, while yielding graphs that are consistent with your timing measurements. If instead you try to outsmart this situation and assume Tailcalls match the following Enter, the errors can snowball into incomprehensible graphs (see the nasty examples from the shadow stack section above).

How often does this happen?

So when does a managed function in the .NET Framework tail call into a native helper function inside the CLR? In the grand scheme of things, not a lot. But it's a pretty random and fragile list that depends on which JIT is in use (x86, x64, ia64), and can easily change as parts of the runtime are rev'd, or even as JIT compilation flags are modified by debuggers, profilers, and other environmental factors while a process is active. So you should not try to guess this list and make dependencies on it.

Can't I just turn tail calling off?!

If all this confusion is getting you down, you might be tempted to just avoid the problem in the first place. And yes, there is a way to do so, but I wouldn't recommend it in general. If you call SetEventMask, specifying COR_PRF_DISABLE_OPTIMIZATIONS inside your mask, that will tell the JIT to turn off the tail call optimization. But the JIT will also turn off ALL optimizations. Profilers that shouldn't perturb the behavior of the app should definitely not do this, as the code generation will be very different.

Watching CLR tail calls in action

If you're writing a profiler with Enter/Leave/Tailcall hooks, you'll want to make sure you exercise all your hooks so they're properly tested. It's easy enough to make sure your Enter/Leave hooks are called--just make sure the test app your profiler runs against has a Main()! But how to make sure your Tailcall hook is called?

The surest way is to have a simple managed app that includes an obvious tail call candidate, and make sure the "tail." IL prefix is in place. You can use ilasm / ildasm to help build such an assembly. Here's an example I tried on x86 using C#.

Start with some simple code that makes a call that should easily be optimized into a tail call:

using System;
class Class1
{
    static int Main(string[] args)
    {
        return Helper(4);
    }

    static int Helper(int i)
    {
        Random r = new Random();
        i = (i / 1000) + r.Next();
        i = (i / 1000) + r.Next();
        return MakeThisATailcall(i);
    }

    static int MakeThisATailcall(int i)
    {
        Random r = new Random();
        i = (i / 1000) + r.Next();
        i = (i / 1000) + r.Next();
        return i;
    }
}

You'll notice there's some extra gunk, like calls to Random.Next(), to make the functions big enough that the JIT won't inline them. There are other ways to avoid inlining (including from the profiling API itself), but padding your test functions is one of the easier ways to get started without impacting the code generation of the entire process. Now, compile that C# code into an IL assembly:

    csc /o+ Class1.cs

(If you're wondering why I specified /o+, I've found that if I don't, then suboptimal IL gets generated, and some extraneous instructions appear inside Helper between the call to MakeThisATailcall and the return from Helper. Those extra instructions would prevent the JIT from making a tail call.)

Run ildasm to get at the generated IL

    ildasm Class1.exe

Inside ildasm, use File.Dump to generate a text file that contains a textual representation of the IL from Class1.exe.  Call it Class1WithTail.il.  Open up that file and add the tail. prefix just before the call you want optimized into a tail call (see highlighted yellow for changes):

  .method private hidebysig static int32 
          Helper(int32 i) cil managed
  {
    // Code size       45 (0x2d)
    // Code size       46 (0x2e)
    .maxstack  2
    .locals init (class [mscorlib]System.Random V_0)
    IL_0000:  newobj     instance void [mscorlib]System.Random::.ctor()
    IL_0005:  stloc.0
    IL_0006:  ldarg.0
    IL_0007:  ldc.i4     0x3e8
    IL_000c:  div
    IL_000d:  ldloc.0
    IL_000e:  callvirt   instance int32 [mscorlib]System.Random::Next()
    IL_0013:  add
    IL_0014:  starg.s    i
    IL_0016:  ldarg.0
    IL_0017:  ldc.i4     0x3e8
    IL_001c:  div
    IL_001d:  ldloc.0
    IL_001e:  callvirt   instance int32 [mscorlib]System.Random::Next()
    IL_0023:  add
    IL_0024:  starg.s    i
    IL_0026:  ldarg.0
    IL_0027:  call       int32 Class1::MakeThisATailcall(int32)
    IL_002c:  ret
    IL_0027:  tail.
    IL_0028:  call       int32 Class1::MakeThisATailcall(int32)
    IL_002d:  ret
  } // end of method Class1::Helper
  

Now you can use ilasm to recompile your modified textual IL back into an executable assembly.

    ilasm /debug=opt Class1WithTail.il

You now have Class1WithTail.exe that you can run!  Hook up your profiler and step through your Tailcall hook.

You Can Wake Up Now

If you didn't learn anything, I hope you at least got some refreshing sleep thanks to this post. Here's a quick recap of what I wrote while you were napping:

  • If the last thing a function does is call another function, that call may be optimized into a simple jump (i.e., "tail call").  Tail calling is an optimization to save the time of stack manipulation and the space of generating an extra call frame.
  • In the CLR, the JIT has the final say on when it employs the tail call optimization. The JIT does this whenever it can, except for a huge list of exceptions. Note that the x86, x64, and ia64 JITs are all different, and you'll see different behavior on when they'll use the tail call optimizations.
  • Since some managed functions may tail call into native helper functions inside the CLR (for which you won't get an Enter hook notification), your Tailcall hook should treat the tail call as if it were a Leave, and not depend on the next Enter hook correlating to the target of the last tail call.  With shadow stacks, for example, this means you should simply pop the calling function off your shadow stack in your Tailcall hook.
  • Since tail calls can be elusive to find in practice, it's well worth your while to use ildasm/ilasm to manufacture explicit tail calls so you can step through your Tailcall hook and test its logic.
David has been a developer at Microsoft for over 70 years (allowing for his upcoming time-displacement correction). He joined Microsoft in 2079, first starting in the experimental time-travel group. His current assignment is to apply his knowledge of the future to eliminate the "Wait for V3" effect customers commonly experience in his source universe. By using Retroactive Hindsight-ellisenseTM his goal is to "get it right the first time, this time" in a variety of product groups.
Posted by davbr | 4 Comments

New version of CLRProfiler released (for CLR 2.x only)

If you're having trouble with CLRProfiler crashing the aspnet_wp process, along the lines of this forum post:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1039428&SiteID=1

then you'll want to know that we have officially uploaded a fixed version of CLR Profiler here: 

http://www.microsoft.com/downloads/details.aspx?FamilyID=a362781c-3870-43be-8926-862b40aa0cd0&DisplayLang=en

So please feel free to go there directly to get the latest and greatest.  (The update is for CLR 2.x only).

Posted by davbr | 1 Comments

Enter, Leave, Tailcall Hooks Part 1: The Basics

The CLR Profiling API allows you to hook managed functions so that your profiler is called when a function is entered, returns, or exits via tailcall. We refer to these as Enter/Leave/Tailcall hooks, or “ELT” hooks. In this special multi-part investigative series, I will uncover the truth behind ELT. Today I'll write about some of the basics, NGEN, and a word on what we call "slow-path" vs. "fast-path".

Setting up the hooks

1.     On initialization, your profiler must call SetEnterLeaveFunctionHooks(2) to specify which functions inside your profiler should be called whenever a managed function is entered, returns, or exits via tail call, respectively.

(Profiler calls this…)

  HRESULT SetEnterLeaveFunctionHooks(

                [in] FunctionEnter    *pFuncEnter,

                [in] FunctionLeave    *pFuncLeave,

                [in] FunctionTailcall *pFuncTailcall);

 

     (Profiler implements these…)

typedef void FunctionEnter(

                FunctionID funcID);

typedef void FunctionLeave(

                FunctionID funcID);

typedef void FunctionTailcall(

                FunctionID funcID);

 

OR

 

(Profiler calls this…)

  HRESULT SetEnterLeaveFunctionHooks2(

                [in] FunctionEnter2    *pFuncEnter,

                [in] FunctionLeave2    *pFuncLeave,

                [in] FunctionTailcall2 *pFuncTailcall);

 

     (Profiler implements these…)

typedef void FunctionEnter2(

                FunctionID funcId,

                UINT_PTR clientData,

                COR_PRF_FRAME_INFO func,

                COR_PRF_FUNCTION_ARGUMENT_INFO *argumentInfo);

typedef void FunctionLeave2(

                FunctionID funcId,

                UINT_PTR clientData,

                COR_PRF_FRAME_INFO func,

                COR_PRF_FUNCTION_ARGUMENT_RANGE *retvalRange);

typedef void FunctionTailcall2(

                FunctionID funcId,

                UINT_PTR clientData,

                COR_PRF_FRAME_INFO func);

 

This step alone does not cause the enter/leave/tailcall (ELT) hooks to be called.  But you must do this on startup to get things rolling.

2.     At any time during the run, your profiler calls SetEventMask specifying COR_PRF_MONITOR_ENTERLEAVE in the bitmask.  Your profiler may set or reset this flag at any time to cause ELT hooks to be called or ignored, respectively.

FunctionIDMapper

In addition to the above two steps, your profiler may specify more granularly which managed functions should have ELT hooks compiled into them:

1.     At any time, your profiler may call ICorProfilerInfo2::SetFunctionIDMapper to specify a special hook to be called when a function is JITted.

(Profiler calls this…)

  HRESULT SetFunctionIDMapper(

                [in] FunctionIDMapper *pFunc);

 

     (Profiler implements this…)

typedef UINT_PTR __stdcall FunctionIDMapper(

                FunctionID funcId,

                BOOL *pbHookFunction);

 

2.     When FunctionIDMapper is called:

a.     Your profiler sets the pbHookFunction [out] parameter appropriately to determine whether the function identified by funcId should have ELT hooks compiled into it.

b.     Of course, the primary purpose of FunctionIDMapper is to allow your profiler to specify an alternate ID for that function.  Your profiler does this by returning that ID from FunctionIDMapper .  The CLR will pass this alternate ID to your ELT hooks (as funcID if you're using the 1.x ELT, and as clientData if you're using the 2.x ELT).

Writing your ELT hooks

You may have noticed that corprof.idl warns that your implementations of these hooks must be __declspec(naked), and that you've got to save registers you use. Yikes! This keeps things nice and efficient on the CLR code generation side, but at the expense of making life a little more difficult for profilers. For great low-level details of writing the hooks (including yummy sample code!) visit Jonathan Keljo's blog entry here.

NGEN /Profile

The profiling API makes use of the fact that it can control the JITting of functions to enable features like ELT hooks. When managed code is NGENd, however, this assumption goes out the door. Managed code is already compiled before the process is run, so there’s no opportunity for the CLR to bake in calls to ELT hooks.

The solution is “NGEN /Profile”. For example, if you run this command against your assembly:

ngen install MyAssembly.dll /Profile

 

it will NGEN MyAssembly.dll with the “Profile” flavor (also called “profiler-enhanced”). This flavor causes extra hooks to be baked in to enable features like ELT hooks, loader callbacks, managed/unmanaged code transition callbacks, and the JITCachedFunctionSearchStarted/Finished callbacks.

The original NGENd versions of all your assemblies still stay around in your NGEN cache. NGEN /Profile simply causes a new set of NGENd assemblies to be generated as well, marked as the “profiler-enhanced” set of NGENd assemblies. At run-time, the CLR determines which flavor should be loaded. If a profiler is attached and enables certain features that only work with profiler-enhanced (not regular) NGENd assemblies (such as ELT via a call to SetEnterLeaveFunctionHooks(2), or any of several other features that are requested by setting particular event flags via SetEventMask), then the CLR will only load profiler-enhanced NGENd images--and if none exist then the CLR degrades to JIT in order to support the features requested by the profiler. In contrast, if the profiler does not specify such event flags, or there is no profiler to begin with, then the CLR loads the regular-flavored NGENd assemblies.

So how does NGEN /Profile make ELT hooks work? Well, in a profiler-enhanced NGEN module, each function gets compiled with calls at enter, leave, and tailcall time to a thunk. At run-time, the CLR decides what this thunk does. Either nothing (if no profiler requested ELT hooks), or jmp to the profiler's ELT hook. For example, if a profiler is loaded, requesting ELT notifications, and the CPU is executing near the top of a function inside a profiler-enhanced NGEN module, the disassembly will look something like this:

5bcfb8b0 call mscorwks!JIT_Writeable_Thunks_Buf+0x1b8 (5d8401d8)

And where's the target of that call? Right here:

5d8401d8 jmp UnitTestSampleProfiler!Enter2Naked (023136b0)

As you may have guessed, I happen to have a profiler named "UnitTestSampleProfiler" loaded and responding to ELT notifications, so that thunk will jmp right into my Enter2 hook. When I return from my hook, control goes right back to the managed function that called the thunk.

Fast-path vs. Slow-path

There are two paths the CLR might take to get to your ELT hooks: fast & slow.  Fast means the JIT inserts a call from the JITted function directly into the profiler. (In profiler-enhanced NGEN modules, this translates to the thunk jumping directly to your ELT hook.) Slow means that some fixup must be done before control can be passed to your profiler, so the JIT inserts a call from the JITted function into helper functions in the CLR to do the fixup and finally forward the call to your profiler. (Or, in NGEN-land, the thunks jmp to those CLR helper functions.)

There are also two supported signatures for the ELT hooks: CLR 1.x (set via SetEnterLeaveFunctionHooks) and CLR 2.x-style (set via SetEnterLeaveFunctionHooks2).

If your profiler requests 1.x ELT hooks, then slow-path is used for them all, end of story.

If your profiler requests 2.x ELT hooks, then slow-path is used for them all if any of the following event flags were set by your profiler:

  • COR_PRF_ENABLE_STACK_SNAPSHOT:  “Slow” ensures that the CLR has an opportunity to do some housekeeping on the stack before your profiler is called so that if your profiler calls DoStackSnapshot from within the ELT hook, then the stack walk will have a marker to begin from.
  • COR_PRF_ENABLE_FUNCTION_ARGS: “Slow” gives the CLR an opportunity to gather the function’s arguments on the stack for passing to the profiler’s enter hook.
  • COR_PRF_ENABLE_FUNCTION_RETVAL: “Slow” gives the CLR an opportunity to gather the function’s return value on the stack for passing to your profiler’s leave hook.
  • COR_PRF_ENABLE_FRAME_INFO: “Slow” gives the CLR an opportunity to gather generics information into a COR_PRF_FRAME_INFO parameter to pass to your profiler.

Why do you care? Well, it's always good to know what price you're paying. If you don't need any of the features above, then you're best off not specifying those flags. Because then you'll see better performance as the managed code may call directly into your profiler without any gunk going on in the middle. Also, this information gives you some incentive to upgrade your profiler's old 1.x ELT hooks to the hip, new 2.x ELT style. Since 1.x ELT hooks always go through the slow path (so the CLR has an opportunity to rearrange the parameters to fit the old 1.x prototype before calling your profiler), you're better off using the 2.x style.

Next time...

That about covers it for the ELT basics. Next installment of this riveting series will talk about that enigma known as tailcall.

Posted by davbr | 13 Comments

Creating an IL-rewriting profiler

A frequent topic of discussion between those of us on the CLR Profiling API team at Microsoft and our customers is how to write a profiler that rewrites IL to do cool stuff. Unfortunately, there still is very little documentation on how to do this, and what documentation there is, is rather scattered. I'm not going to say anything new here. But I will try to bring together the scattered info into one place.

Q: Why do I care?

You may want to instrument managed code to insert custom calls into your profiler to measure timings or code coverage, or record execution flow. Maybe you want to perform custom actions, like taking note whenever thread synchronization is invoked (e.g., Monitor.Enter,Leave). One way to do this is to take the original IL of the application you're profiling, and rewrite that IL to contain extra code or hooks into your profiler or managed code you ship alongside your profiler.

Q: I can ship managed code alongside my profiler? Are you saying I can write my profiler in manage code?

No, sorry to get your hopes up. Your profiler DLL must be unmanaged. Your ICorProfilerCallback implementations must be unmanaged (and should not call managed code). However, if you rewrite IL, it's perfectly fine for that IL to call into managed code that you've written and shipped alongside your profiler.

Q: In a nutshell, what's involved?

Well, first off, you're making a profiler. That means you create an unmanaged in-proc COM server DLL. If this much is already new to you, you should probably stop reading this, search MSDN for "ICorProfilerCallback", and grope through the table of contents for background info on how to write a profiler in general.

Keep in mind there are many ways to do this. I'll outline one of the more straightforward approaches here, and the adventurous should feel free to substitute their own ingredients:

  • In your ICorProfilerCallback2::ModuleLoadFinished callback, you call ICorProfilerInfo2::GetModuleMetadata to get a pointer to a metadata interface on that module.
  • QI for the metadata interface you want. Search MSDN for "IMetaDataImport", and grope through the table of contents to find topics on the metadata interfaces.
  • Once you're in metadata-land, you have access to all the types in the module, including their fields and function prototypes. You may need to parse metadata signatures and this signature parser may be of use to you.
  • In your ICorProfilerCallback2::JITCompilationStarted callback, you may use ICorProfilerInfo2::GetILFunctionBody to inspect the original IL, and ICorProfilerInfo2::GetILFunctionBodyAllocator and then ICorProfilerInfo2::SetILFunctionBody to replace that IL with your own.

Q: What about NGEN?

If you want to rewrite IL of NGENd modules, well, it's kind of too late because the original IL has already been compiled into native code. However, you do have some options.  If your profiler sets the COR_PRF_USE_PROFILE_IMAGES monitor event flag, that will force the "NGEN /Profile" version of the modules to load if they're available.  (I've already blogged a little about "NGEN /Profile", including how to generate those modules, here.)  So, at run-time, one of two things will happen for any given module.

1) If you set COR_PRF_USE_PROFILE_IMAGES and the NGEN /Profile version is available, it will load.  You will then have the opportunity to respond to the JITCachedFunctionSearchStarted callback.  When a function from an NGEN /Profile module is about to be executed for the first time, your profiler receives the JITCachedFunctionSearchStarted callback.  You may then set the *pbUseCachedFunction [out] parameter to FALSE, and that will force the CLR to JIT the function instead of using the version that was already compiled into the NGEN /Profile module.  Then, when the CLR goes to JIT the function, your profiler receives the JITCompilationStarted callback and can perform IL rewriting just as it does above for functions that exist in non-NGENd mdoules.  What's nice about this approach is that, if you only need to instrument a few functions here and there, it can be faster not to have to JIT everything, just so you get the JITCompilationStarted callback for the few functions you're interested in.  This approach can therefore improve startup performance of the application while it's being profiled.  The disadvantage, though, is that your profiler must ensure the NGEN /Profile versions of all the modules get generated beforehand and get installed onto the user's machine.  Depending on your scenarios and customers, this may be too cumbersome to ensure.

2) If you set COR_PRF_USE_PROFILE_IMAGES and the NGEN /Profile version is not available, the CLR will refuse to load the regular NGENd version of that module, and will instead JIT everything from the module.  Thus, it's ensured that you have the opportunity to intercept JITCompilationStarted, and can replace the IL as described above.

Q: Any examples?

Here is an MSDN article that talks about making an IL rewriting profiler:
http://msdn.microsoft.com/msdnmag/issues/03/09/NETProfilingAPI

Some blog entries I wrote about using IL rewriting to work around a bug:
http://blogs.msdn.com/davbr/archive/2006/02/27/540280.aspx
http://blogs.msdn.com/davbr/archive/2006/06/07/620925.aspx

Q: Any caveats?

Rewriting IL in mscorlib.dll functions can be dangerous, particularly in functions that are executed during startup initialization of the managed app or any of its AppDomains. The app may not be initialized enough to handle executing some of the managed code that might get called (directly or indirectly) from your rewritten IL.

If you're going to modify the IL to call into some of your own managed code, be careful about which functions you choose to modify. If you're not careful, you might accidentally modify the IL belonging to your own assembly and cause infinite recursion.

And then there's the worst of both worlds: when you need to rewrite IL to call into their own assemblies and you happen to rewriting IL in mscorlib. Note that it's simply unsupported to force mscorlib.dll to reference any other assembly. The CLR loader treats mscorlib.dll pretty specially. The loader expects that, while everyone in the universe may reference mscorlib.dll, mscorlib.dll had better not reference any other assembly. If you absolutely must instrument mscorlib.dll by modifying IL, and you must have that IL reference some nifty new function of yours, you had better put that function into mscorlib.dll by dynamically modifying mscorlib.dll's metadata when it is loaded. In this case you no longer have the option of creating a separate assembly to house your custom code.

Q: Has anyone else tried making an IL-rewriting profiler?

Sure. If you want to learn from other people's experiences, read through the Building Development and Diagnostic Tools for .Net Forum. Here are some interesting threads:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1153030&SiteID=1
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1230140&SiteID=1

Posted by davbr | 9 Comments

Forum for Profiling API questions (and more)

The debugging forum has recently had its scope expanded to managed-code tool development in general, including profilers.  So if you're looking for more information on the profiling API or have a question, be sure to check out the forum here: http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=868&SiteID=1.
Posted by davbr | 0 Comments

WOW64 and Your Profiler

Has this ever happened to you?

My profiler loads and runs great on my 32 bit box. But when I try to run it on a 64 bit box, it never loads. Shouldn't the WOW just make this all work?!

When it comes to running your profilers on 64 bit boxes, there aren't any special rules dictated by the CLR that you must follow to get it to work. But if you're not careful, or you're unaware of how WOW64 works, it's easy to misconfigure your box.

OS Redirectors

First off, get familiar with WOW64. MSDN has essential information here and in child topics. I'm not going to repeat all that information here, but I will mention a couple things to take note of.

The registry redirector and file system redirector create a logical view under which a 32 bit app executes, so that the 32 bit app feels like it is executing on a 32 bit machine. The registry redirector will show a subset of the registry to a 32 bit app, with some automatic remapping. An important example (there are more!):

32 bit app asks for "HKEY_CLASSES_ROOT\CLSID".
32 bit app gets "HKEY_CLASSES_ROOT\Wow6432Node\CLSID"

This is how the system keeps 32-bit COM objects separate from 64-bit COM objects.

Similarly, file system paths get redirected to make the 32 bit app feel more at home. Example:

32 bit app asks for "%windir%\system32\Blah".
32 bit app gets "%windir%\SysWow64\Blah"

There are exceptions to this, but that's the general rule. For those learning this for the first time, it's worthwhile to emphasize this point.

The 64 bit system tools and DLLs are in system32.
The 32 bit system tools and DLLs are in SysWow64.

At first, this might seem a little reversed from what you'd expect, but you get used to it after a while.

When a 32 bit app starts up, its environment variables (particularly its PATH) are adjusted in a natural way so that the DLLs and other executables it wants are generally in the places it expects to find them.

Register this

Let's put this knowledge into practice. What happens when you run regedit.exe? The answer is, which regedit.exe are you running? Ok, that's just another question. To answer that, from what environment are you running regedit.exe? Darn it, did it again, another question. Let's say you spawn cmd.exe from Start.Run. That gives you a 64 bit command prompt, from the real %windir%\system32\cmd.exe (unless you do some wacky things to your path). That will give you the real 64 bit environment and paths. So if you type regedit.exe from that command prompt, you'll get the real 64 bit %windir%\system32\regedit.exe. That guy will show you the whole registry.

But you can also run a 32 bit command prompt from Start.Run via %windir%\SysWow64\cmd.exe. When you do that and type "regedit.exe" you're gonna get the 32 bit %windir%\SysWow64\regedit.exe, which will show you the 32-bit logical view of the registry, with Registry Redirection in place.

Now what if, instead of running regedit.exe, you run regsvr32.exe? Same deal: depending on your environment, you'll either get the 32-bit or the 64-bit version of regsvr32.exe. And let's say you were running regsvr32.exe to register your profiler DLL. Well, you'll need to make sure the regsvr32 you run corresponds to the "bitness" of your profiler DLL. One of the WOW64 rules says that a 64 bit process cannot load 32 bit DLLs, and a 32 bit process cannot load 64 bit DLLs. So if the bitness of regsvr32 doesn't match the bitness of your profiler DLL, then regsvr32 will fail to LoadLibrary your DLL.

Profilers: The magic ends with you

One of the nice features of the CLR is the option to go for platform independence. You can write a C# app, compile with csc.exe /platform:anycpu, and you get an executable with a pretty cool property. Slap that executable onto an x64 or ia64 machine (with the appropriate CLR installed), and it will run as a native 64-bit app in that machine's 64-bit architecture. Slap that same executable onto an x86 machine, and it will run as a native 32-bit app. For managed applications that don't make assumptions about bitness (e.g., that don't rely on the size of pointers or on native COM objects of a particular bitness being used), this is pretty nifty. See Josh William's informative blog entry for more information.

But profilers don't get to live in this magical world of chocolate fountains, lollipop skyscrapers, and dancing leprechauns. Profilers are native DLLs, and you have to explicitly choose whether to compile your profiler DLL for x86, x64, or ia64. And as you probably know, it's not a simple matter of swapping in a new compiler and saying, "go!". There's a fair amount of platform-specific code you need to write, depending on which features of the CLR Profiling API you use. Using the enter / leave / tailcall hooks is but one example. (See Jonathan Keljo's blog entry on that to get an idea.) And of course there are the little things, too, like several CLR Profiling API types (e.g., FunctionIDs) becoming 64 bits long on 64 bit architectures. Anyway, it's out of the scope of this topic to talk about all the pitfalls you can fall into. But when you embark on compiling for multiple architectures, it should hopefully become pretty clear what changes you'll need to make.

Once you've decided all the architectures you wish to support, you need to provide a version of your DLL for each architecture, and register it appropriately. As a quick recap, here's an easy way to do the registration:

x86 Machine
Just run regsvr32 like you usually do.

x64 or ia64 Machine

If you want to be able to profile 64 bit apps, run 64 bit regsvr32 against your 64 bit Profiler DLL
If you want to be able to profile 32 bit apps (WOW),   run 32 bit regsvr32 against your 32 bit Profiler DLL.
If you want to support both, do both!

Note that, although you may not mix an EXE of one bitness with a DLL of another bitness, you may still spawn child EXEs of any bitness you like from any parent EXE. So you can still author one installation program that copies and registers both 32 bit and 64 bit copies of your profiler DLL.

COR_PROFILER

Ok, so your profiler DLL is registered, but what about those pesky environment variables you're supposed to set in order to get your profiler DLL actually loaded when the managed app runs? This part is easy. Microsoft generally recommends that, if you're creating a COM object and compiling for both 32 and 64 bits, just use the same CLSID in both versions. If you register both versions of the COM object on the same machine, then that same CLSID will show up in both branches of the registry:

under HKEY_CLASSES_ROOT\CLSID for 64 bit apps
under HKEY_CLASSES_ROOT\Wow6432Node\CLSID for 32 bit apps

So if you just set COR_PROFILER to your one and only CLSID, then you don't even need to know beforehand the bitness of the managed application that will be spawned from that environment. 64 bit apps will know to look for your CLSID under the 64 bit version of the CLSID hive, and will find the path to the 64 bit version of your profiler DLL. 32 bit apps running under the WOW will get redirected to the 32 bit version of the CLSID hive, and will find the path to the 32 bit version of your profiler DLL.

Double your bits, double your pleasure

Hopefully this clears things up. Really, there's only one key thing to remember, which is to be aware of the various WOW64 redirectors. Once you understand how they work, it becomes clear how to register your profiler DLL, and where the managed app expects to find what you registered.

Posted by davbr | 9 Comments

Bug! Inspecting Value-Type Returns Workaround #2

12/6/2007 Update: The bug described below has been fixed in CLR 2.0 SP1.  (See this post for more info.)  However, you may still find this info useful as a tutorial on performing IL rewriting.

Previously, I posted about a bug with using the CLR Profiling API to inspect value-type return values. The workaround I described involved using IL rewriting to instrument managed functions to call your own custom managed function to do the inspection. In a cliffhanger worthy of a nighttime soap's season finale, I promised I'd return to describe another workaround where you instrument the IL to call into your own unmanaged code.

Even if you don't care about this bug (because you don't care about inspecting return values), you might still find it useful to watch IL rewriting in action as it injects calls from managed code to your unmanaged profiler code. You'll definitely want to look over that previous post first, as it has some basic IL rewriting details that will not be repeated here.

Also, a word of caution. The technique described below for calling into your unmanaged profiler code is dangerous, extremely difficult to get right, and relies on internal behavior of the runtime that can change from version to version. If you're a writer of applications, and not profilers, this technique cannot be used in your code, so you might as well stop reading now. If you're a writer of a profiler, it's still really easy to mess this up, so pay particular attention to the warnings and rules below. And, as usual for profiler writers, expect the CLR internals to change out from under you with each new version. There's no guarantee the technique or even calling conventions I describe below will remain in future versions of the CLR.

Who moved my data?

The biggest obstacle you're going to face when calling into your profiler's unmanaged code is ensuring you play nicely with the garbage collector. In the first workaround I described (same post), the GC was of no concern. We were just happy managed code calling more happy managed code. If a GC occurred in the middle, and we were looking at an object reference that got moved, no biggie! The GC will have paused our execution, updated our references to point to the current locations of the objects, and set us on our way in blissful ignorance that anything happened. But once you step outside the happy managed world and write unmanaged code to inspect managed values, the world gets a little more complicated.

Now hold it right there, Dave-o, some of you might be thinking. The bug we're working around involves inspecting VALUE TYPES. Those don't live on the GC heap, so who cares if a GC occurs while my code is running?

That's not entirely true.  Value types are copied around to each point of use.  A value type embedded in an instance of a reference type (i.e., Object) will therefore live on the GC heap and move around.  Now it's true that, as a value type is returned from one function to another, a copy must be made.  But in some cases, that copy might be made directly to a field in another object.  You need to keep all this in mind when rewriting IL to avoid any surprises.  Here are some more reasons you might care:

  • Perhaps you'd like all your inspection code to run through the same path, so you'd like to inspect reference types (i.e., Objects) by a similar mechanism.  And, of course, those always live on the GC heap.
  • Perhaps you'd like to do some deep inspection for some value types, and follow pointers inside the value type to the Objects they reference.  If you ignore the existence of the GC, and it moves those Objects, you'll follow the pointers to la-la-land (which, despite the name, is not a happy place).
  • If you ignore how all this stuff works and just have managed code P/Invoke out to your unmanaged code, the JIT may create stubs between the managed code and your unmanaged code, burning extra cycles, and potentially marshalling the arguments in ways you don't expect.
  • Depending on how you pass the value type to your unmanaged code, it might well end up on the GC heap (e.g., if you box it first, as I did in the first workaround).

Secret ingredient

The secret I'll use is the same secret used internally in the CLR. As you can imagine, it could be devastating if some thread were executing inside the bowels of the CLR, manipulating an Object, and then got interrupted by a GC that moved that Object out from under the thread. This mayhem is prevented by a rule used by the GC. The GC will not continue until all threads are "ready". And one way for a thread not to be ready, is for it to be executing unmanaged code that is not the target of a P/Invoke or COM call. In particular, if a thread is running inside the CLR to perform usual runtime duties (e.g., JITting, loading a type, etc.), that's not considered a P/Invoke or COM call, so that thread is not "ready". So that thread will block any GC attempt until that thread is ready.

The CLR takes advantage of this rule all the time, so why not your profiler?

Please stop reading

If any of you application writers are still out there, stop reading.  There's nothing to see here. Move along. Following this technique will doom your application to rely on internals of the runtime that may change from version to version, and will cause deadlocks and foul GC behavior inside your app.

If any of you profiler writers are still out there, you need to be very, very careful with how you use this technique.

Danger!  If you use the following technique to force IL to call your unmanaged code:
  • DO expect your profiler to break in future CLR versions, as it is relying on internal CLR implementation details
  • DO keep the unmanaged code down to a bare minimum, with only simple operations like reading and writing memory (but no heap allocations!)
  • DO relinquish control back to the runtime as soon as possible
  • DO NOT take any locks or perform any kind of thread synchronization
  • DO NOT call back into the runtime (e.g., don't use ICorProfilerInfo(2), metadata APIs, etc.)

In case you forget how easy it is to cause deadlocks, you might want to look at my previous post on stackwalking, particularly the section "GC helps you make a cycle".

Rewrite that IL!

Ok. If you're still with me, let's say I want to rewrite IL to call into my unmanaged inspection function in a way that will block the GC. How do I do this? Well, there are 3 possible IL instructions to use to make a function call: call, callvirt, and calli. The first two use a metadata token to identify the target of the call. But calli is special. It lets you use a plain old function pointer. Crack open your copy of the ECMA CLI Specification. Surely you have one at your fingertips by now! If not, it's currently on msdn at http://msdn.microsoft.com/netframework/ecma/.

Partition III, 3.20 describes the miracle that is calli. Its assembly format looks like this:

	calli callsitedescr

And its effect on the IL stack is as follows:

	…, arg1, arg2 … argN, ftn -> …, retVal (not always returned)

Setting up the IL stack is the easy part. I just load the args onto the IL stack, followed by that ftn dude. The ftn is simply an address to my unmanaged inspection function (i.e., good old-fashioned function pointer).

Where it gets interesting is that little callsitedescr you see in the assembly format for the calli instruction. What's that thing?! It's a metadata token that references a signature for the function. In particular, this can be a StandAloneMethodSig, as described in partition II, 23.2.3. Part of this signature includes the "calling convention" for my profiler's unmanaged inspection function. Partition II, 15.3 lists them out:

CallConv ::= [ instance [ explicit ]] [ CallKind ]
...
CallKind ::=
default
| unmanaged cdecl
| unmanaged fastcall
| unmanaged stdcall
| unmanaged thiscall
| vararg

Where we start to get tricky is our choice of the calling convention. The CallKinds "default" and "vararg" are what are normally used when managed code calls managed code. When managed code calls unmanaged code, usually one of the other "unmanaged *" CallKinds are used. Specifying one of the "unmanaged *" CallKinds implicitly tells the JIT that I want all the usual baggage with a P/Invoke or COM call: generate stub code to marshal parameters as necessary, and let the GC know that it's ok to do its thing while we're executing inside the unmanaged call.

So the trick is to use the default CallKind when calling into my profiler's unmanaged code. This forces the JIT not to add any stub code, so the GC will remain blocked for the duration of my call. Armed with this knowledge, let's say I'm going to try to call an unmanaged function with this prototype to do my inspection:

    void UnmanagedInspectValue(void * pv)

I should note that, in the real world, I wouldn't get very far with an inspection function like this. I'm going to want another parameter to identify the type of pv, so I can better inspect it. Maybe my inspection function will be capable of doing rich inspection of a set of known types, and then default to dumping the value in binary or string form for other types. For my purposes here, though, I'm keeping it simple, and leaving the prototype at one parameter.

Given that simple prototype, here is an example StandAloneMethodSig that uses the default CallKind and describes my inspection function:

COR_SIGNATURE unmanagedInspectValueSignature[] = 
{
	IMAGE_CEE_CS_CALLCONV_DEFAULT,        // Default CallKind!
	0x01,                                 // Parameter count
	ELEMENT_TYPE_VOID,                    // Return type
	ELEMENT_TYPE_PTR, ELEMENT_TYPE_VOID   // Parameter type (void *)
};

For this to work right, I'll need to make sure my unmanaged code uses the proper calling convention used by compiled managed code. On x86, this means using fastcall (well, sort of).

void __fastcall UnmanagedInspectValue(void * pv)
{
    // pv points to the value I will inspect
    // ...cool inspection code here...
}

This is where it might get a little confusing. First of all, I should be clear on the distinction between the CallKinds, and the native calling convention my unmanaged inspection function will actually use. You may recall my list of CallKinds above. One of those is actually named "unmanaged fastcall". I'm not saying I should use that CallKind in my StandAlondMethodSig. I really do want to use the CallKind default. But when I define my unmanaged function that will be called by the IL I rewrite, that unmanaged function needs to use the fastcall calling convention, because that's the convention used when managed code calls managed code.  And I'm trying to appear to the JIT as if that's exactly what I'm doing (even though I'm really having managed code call my unmanaged inspection function).

That brings me to the second fuzzy thing here. My unmanaged inspection function isn't really going to be fastcall. But it's pretty close! (Am I driving you mad yet?) Microsoft's CLR implementation uses a calling convention similar to fastcall for managed calls, but not exactly. The main difference arises when you have to pass more parameters than will fit into the two registers dedicated for fastcalls (ECX, EDX). Normal __fastcall has the remaining parameters pushed onto the stack right-to-left. But the CLR pushes those remaining parameters left-to-right. For more details, check out ECMA Partition II, 15.5.6.1 (be sure you're reading the Microsoft-specific version of the ECMA spec, or you won't find that section; again, this is currently on msdn at http://msdn.microsoft.com/netframework/ecma/).

So, to be completely perfect, I really should write my inspection function in assembly--or have an assembly wrapper that accepts the parameters as passed by the CLR, and then passes them to a C function that uses a more typical calling convention. However, since in my case I only have one parameter anyway, I can cheat and just call it __fastcall, and the compiler is none the wiser. Just so long as I remember that, once I add a third parameter to my function, the gig is up!

Although the original bug that spawned this post (and thus my workaround) is x86-only, it's worthwhile to mention that, on x64, the native calling convention I'd have to use is just the standard x64 calling convention used everywhere else.  In this calling convention, the first four parameters are passed in registers; the rest on the stack.  The CLR does not diverge from this in its implementation of managed calls, so that makes things a little simpler. You can read more about the x64 calling convention at http://msdn2.microsoft.com/en-us/library/ms235286.aspx.

You can also read about this sneaky trick of using calli to call native code directly in ECMA (MS-specific), Partition II, 15.5.6.3.

Now that you have some more context, it feels like it's time for another warning. The fact that the CLR uses this pseudo-fastcall convention for calling managed code is an internal implementation detail. The fact that calli works against arbitrary function pointers, and calls through to the underlying unmanaged code while continuing to block the GC is an internal implementation detail. There is no guarantee that these things will always be. And if you think you're going to call all sorts of nifty code in your unmanaged inspection function, scroll up to the yellow warning above and refresh your memory. That said, profiler writers are often at the bleeding edge and take all sorts of dependencies on internals as a matter of course. Profilers often run code at dangerous times and need to be careful about what they do and when. It's simply my job to remind you that this is one of those times.

What have I got so far?

Let me pause for a breather to summarize my progress. I'm rewriting IL in the profiled code to call my unmanaged inspection function. So far I have:

    ldc.i4 UnmanagedInspectValue (This is my fcn pointer; use the 8-byte ldc.i8 on 64 bits)
    calli callsitedescr

Geez, Dave-o, after all this, all you've got is a push and a calli?!

Yeah, well, that callsitedescr was tricky.

Boxing is for wimps?

So what's next? I know the prototype for my unmanaged inspection function, and I know the IL I'll use to call it. Well, I obviously need some IL to pass the return value that I wish to inspect to my function. And actually, I could just use the code from the previous post detailing the first workaround (where I just dup and box the return value when it's already on the IL stack for me). That would produce this rewrite:

Original:

    ret

New and improved:

    dup
    box (TypeRef / TypeDef token)
    ldc.i4(8) UnmanagedInspectValue
    calli callsitedescr
    ret

But you know what? Boxing is really the wimpy way out (please don't tell Mr. Tyson I said so). It's easy to code up, but there's overhead involved in creating a copy of the value on the GC heap, just so I can get a pointer to it. Really, I just want to do an "address-of" on that baby, and pass that in to my function. Boxing the value just for that is using a cannon to kill a mouse. Unfortunately, IL doesn't have some easy way for me to just take an address of an item on the IL stack. What is one to do?

Real programmers use locals

I can store the value into a local, and then take the address of the local and push that onto the IL stack. That would produce the following IL rewrite:

Original:

    ret
New and improved:

    dup
    stloc(.s) indx     
// Copy return value into local
    ldloca(.s) indx    
// Push address to local onto stack
    ldc.i4(8) UnmanagedInspectValue (fcn pointer)
    calli callsitedescr
    ret

Hmmm, what's that indx thing? In IL locals are referenced by their 0-based ordinal. I'd like to use a brand new local, so all I need to do is figure out the highest ordinal th