|
|
-
This is a follow-up post for JIT ETW tracing in .NET Framework 4. These are some of the possible strings that might show up in FailReason field of the MethodJitInliningFailed event. These are reasons that come from or are checked for by the VM (as compared to the JIT) and are listed in no particular order:
- "Inlinee is NoMetadata" - this means the inlinee is not normal managed code. The most common case is dynamic methods. Another possibility is one of the many kinds of stubs the CLR uses internally for things like PInvoke marshalling, delegates, remoting, etc.
- "Inlinee is debuggable" - the inlinee was compiled or loaded in such a way that it told the runtime not to optimize its code (C#'s /o- switch is one example).
- "Profiler disabled inlining globally" - a profiler passed COR_PRF_DISABLE_INLINING.
- "Profiler disabled inlining locally" - a profiler passed COR_PRF_MONITOR_JIT_COMPILATION and then set *pfShouldInline to FALSE in ICorProfilerCallback::JITInlining.
- "Inlinee is not verifiable" - the inlinee does not have SkipVerification permission and is not verifiable. This is done so that verification exceptions are easier to debug.
- "Inlinee is marked as no inline" - the inlinee is explicitly marked with MethodImplOptions.NoInlining, or the VM or JIT previously determined that the method would never be a good inline candidate and marked it as such to speed subsequent JITs that might try to inline the same inlinee.
- "Inlinee requires a security object (calls Demand/Assert/Deny)" - the inlinee is marked with mdRequireSecObject. With our current implementation, such methods need their own call frame so the corresponding Assert or Deny will end at the return of the method.
- "Inlinee is MethodImpl'd by another method within the same type" - an implementation limitation such that it can't find the right method to give to the JIT to inline (but don't worry, it still finds the right one to call). See IMetadataEmit::DefineMethodImpl for how this is done at the IL level. It is my understanding that the C# language does not allow this, but VB.NET does.
- "Targeted Patching - Method lacks TargetedPatchingOptOutAttribute" - this relates to a new feature in CLR 4 where NGEN images are more version resilient. In a nutshell, we hope to be able to apply a patch or fix to mscorlib.dll and not have to recompile all the other native images on the machine that depend upon it. This should only apply to methods in the .NET framework class libraries because they are the only assemblies that can opt into this feature. For more information, see this previous blog post: Improvements to NGen in .NET Framework 4.
- "Inlinee has restrictions the JIT doesn't want" - although this is in the code, in our current implementation this will never happen. It indicates that the VM needs to place a restriction on the inlinee (i.e. the inlinee can only be inlined if certain conditions are met), but the JIT doesn't allow any restrictions, so the VM has to refuse the inlining.
|
-
Introduction
One argument often made by those who dislike managed code is along the lines of “managed code can never be as fast as native code, because managed code has to do array bounds checks.” Of course, this isn’t precisely true – it would be more accurate to say that “managed code must ensure that any indexing outside of an array’s bounds raises an appropriate exception.” If a compiler can prove statically that an array index operation is safe, it doesn’t need to generate a dynamic test.
We’re not starting from scratch here. There’s been a lot of academic (and industrial) research on bounds check elimination, and various managed code systems have implemented some subset of these techniques. Putting “array bounds check elimination” into bing.com yielded a large number of relevant papers, many of which I’ve read and enjoyed; I’d imagine a competitor’s search site would do the same J.
This blog post will explore what the CLR’s just-in-time compilers do and do not do in this area. I’ll of course highlight the good cases, but I’m also going to be brutally honest, and expose many examples where we could potentially eliminate a range check, but don’t. The reader (and, for that matter, the author, who didn’t implement this stuff himself) should keep in mind an important constraint: these are, in fact dynamic JIT compilers, so any extra optimization that slows the compiler down must be balanced against the gains of that optimization. Even when we run the JIT “offline”, via the NGEN tool, users are sensitive to compiler throughput. So there are many things we could do, but all take CLR developer effort, and some of them use up our precious compilation time budget. That excuse being made, it’s up to us to be clever and figure out how to do some of these optimizations efficiently in the compiler, and we’ll certainly try to do more of that in the future.
The JIT compilers for x86 and x64 are currently quite different code bases, and I’ll describe the behavior of each here. The reader should note however, that we intend to unify them at some point in the not-too-distant future. The x86 JIT is faster in terms of compilation speed; the x64 JIT is slower, but does more interesting optimizations. Our plan is to extend the x86 codebase to generate x64 code, and incorporate some of the x64 JIT’s optimizations without unduly increasing compilation time. In any case, performance characteristics of JITted code on x64 platforms is likely to change significantly when this unification is achieved.
When I show examples where we don’t eliminate bounds checks, I will when possible give advice that will help you stay within boundary of idioms for which we can. I’ll discuss things we might be able to do in the future, but I’m not in a position to give any scheduling commitments on when these might be done. I can say that any reader feedback on prioritization will be taken into account.
Code Gen for Range Checks
Before we start considering when we eliminate range checks, let’s see what the code generated for a range check looks like. Here is bounds-check code generated by the CLR’s x86 JIT for an example array index expression a[i]:
IN0001: 000003 cmp EDX, dword ptr [ECX+4] // a in ECX, i in EDX
IN0002: 000006 jae SHORT G_M60672_IG03 // unsigned comparison
In the first instruction, EDX contains the array index, and ECX + 4 is the address of the length field of the array. We compare these, and jump if the index is greater than or equal to the length. The jump target, not shown, raises a System.IndexOutOfRangeException. A sharp-eyed reader might wonder: the semantics require not only that the index value is less than the array length, but also that it is at least zero. Isn’t that two checks? How did they get away with only one comparison and branch? The answer is that we (like many other systems) take advantage of the wonders of unsigned arithmetic – the x86 “jae” instruction interprets its arguments as unsigned integers (it’s the unsigned equivalent of “jge”, if that’s more familiar to some readers). The type of the length of an array, and an expression used to index into an array, is Int32, not UInt32. So the maximum value for either of these is 231-1. Further, we know that the array length will be non-negative. So if we convert the array length to a UInt32, it doesn’t affect its value. The index value, however, might be negative. If it is, casting its bit pattern to UInt32 yields a value that is at least 231. So both cases, when the index value is negative, or when it is larger than the array length, are handled by the same test.
In an NGEN image, we try to separate out code we expect to never be executed (code that is so “cold” that it’s at absolute zero!), hoping to increase working set density, especially during startup. We expect bounds-check failures to be in this category, so we put the basic blocks for failure cases on cold pages.
Bounds-check removal cases
Now we’ll examine some test cases, starting with some simple ones.
Simple cases
The good news is that we do eliminate bounds checks for what we believe to be the most common form of array accesses: accesses that occur within a loop over all the indices of the loop. So, there is no dynamic range check for the array access in this program:
static void Test_SimpleAscend(int[] a) {
for (int i = 0; i < a.Length; i++)
a[i] = i; // We get this.
}
Unfortunately, we do not eliminate the range check for the descending version of this loop:
static void Test_SimpleDescend(int[] a) {
for (int i = a.Length - 1; i >= 0; i--)
a[i] = i; // We DO NOT get this.
}
Some older programmers learned to write loops like this, because on some early architectures (e.g., DEC PDP-8, if memory serves) there was a hardware addressing mode that did auto-decrement, but not auto-increment. They may have passed this habit down to middle-aged programmers (of which I am one), and so on. There’s also a somewhat more currently-valid argument that hardware generally supports a comparison to zero without requiring the value zero to be placed in a register. In any case, while the JIT compiler should arguably eliminate the bounds check for the descending form of the loop, we don’t today, and the cost of the bounds check probably outweighs any of the other advantages. So:
· Advice 1: if you have the choice between using an ascending or descending loop to access an array, choose ascending.
I’ve put the array access on the left-hand side of an assignment in both these examples, but it works independently of the context in which the array index expression appears (as long as it’s within the loop, of course).
Do we track equalities with the length of a newly allocated array?
Here is a case in which the x86 JIT does not eliminate the bounds check:
static int[] Test_ArrayCopy1(int n) {
int[] ia = new int[n];
for (int i = 0; i < n; i++)
ia[i] = i; // We do not get this one.
return ia;
}
No excuses here: there’s no reason not to get this, the JIT compiler ought to know that n is the length of the newly allocated array in ia. In fact, the author of such code might think he was doing the JIT compiler a favor, since comparison with a local variable n might seem cheaper than comparison with ia.Length (though this isn’t really true on Intel machines). But in our system, at least today, this sort of transformation is counterproductive for the x86 JIT, since it prevents the bounds check from being eliminated. We may well extend our compiler(s) to track this sort of value equivalence in the future. For now, though, you should follow this piece of practical advice:
· Advice 2: When possible, use “a.Length” to bound a loop whose index variable is used to index into “a”.
The x64 JIT does eliminate the range checks here, by hoisting a test outside the loop, comparing n with ia.Length. If this check fails, it throws an IndexOutOfRangeException. This is somewhat problematic, since without this optimization the program would execute ia.Length iterations of the loop before throwing an exception, and strict language semantics would require those to be executed if they could possibly have a side-effect visible outside the method (which this example does not in fact have – though proving it requires your compiler to do enough escape analysis to know that the allocated array that is written to has not leaked outside the method). This semantic ambiguity is the subject of some internal debate, and we’ll eventually reach consensus on how/when to incorporate such tests in a unified JIT,or whether we need to ensure strict semantics, perhaps by generating multiple copies of loop bodies, as we’ll discuss below. (It’s interesting to note that the hoisted test and throw would be justified by assuming the CompilationRelaxationsAttribute defined in section I.12.6.4 of the ECMA CLI specification for bounds-check error exceptions everywhere – whereas the specification requires it to be given explicitly.) In any case, we should emphasize that, as far as we know, this is a “theoretical” concern only – we don’t know of any actual customer code whose correctness is affected by this issue.
Redundant array accesses
OK, while we’re slightly embarrassed by the previous “multiple names for the length” case, let’s cheer ourselves up with something we do well. We’re pretty good at eliminating redundant bounds checks. In the method:
static void Test_SimpleRedundant(int[] a, int i) {
k = a[i];
k = k + a[i];
}
bounds-check code is generated for the first instance of “a[i]”, but not the second. In fact, the x86 JIT treats it as a common subexpression, and the first result is re-used. And this works not just within “basic blocks” – it can work across control flow, as demonstrated by:
static void Test_RedundantExBB(int[] a, int i, bool b) {
k = a[i];
if (b) {
k = k + a[i];
} else {
k = k - a[i];
}
}
As before, the first “a[i]” gets a bounds check, but the two subsequent occurrences of “a[i]” do not. The x86 JIT also treats the expression as a common subexpression, re-using the result from the first read of “a[i]”.
It is not the case that bounds check elimination only works in the x86 JIT when the result is a common subexpression. Consider this variation of the first case:
static void Test_RedundantNotCse(int[] a, int i, int j) {
k = a[i];
a[j] = i;
k = k + a[i];
}
The JIT compiler obviously can’t tell whether “i” and “j” will have the same value at runtime. But it can tell that they might, and that if they do, the “a[i]” on the last line will return the value written there on the second line. So we cannot treat the “a[i]” expressions on the first and last lines of the body as common subexpressions. But the assignment on the second line can’t affect the length of the array “a,” so in fact the bounds check for the first line “covers” the “a[i]” on the third line – the generated code accesses the array without a bounds check (in both JITs).
Arrays as IEnumerable
Arrays implement the IEnumerable interface, which raises a reasonable question: if you enumerate over the elements of an array using C#’s foreach construct, do you get bounds checks? For example:
static int Test_Foreach(int[] ia) {
int sum = 0;
foreach (int i in ia) {
sum += i;
}
return sum;
}
Happily, we do eliminate the bounds checks in this case. However, there is a little quirk here: of the cases listed, this one is the only one that is sensitive to whether the original source program (C#, in this case) was compiled to IL with the /optimize flag. The default for csc, the C# compiler, is not to optimize, and in this mode it produces somewhat more verbose IL for the range check that doesn’t fit the pattern that the JIT compiler looks for. So:
· Advice 3: if you’re worried about performance, and your compiler has an optimization flag, uh, use it!
Arrays in global locations; concurrency
Here’s a case where we don’t eliminate the bounds check, but where we aren’t too embarrassed by this failure:
static int[] v;
…
static void Test_ArrayInGlobal() {
for (int i = 0; i < v.Length; i++)
v[i] = i;
}
At first glance, this seems exactly the same as our first, simplest example, Test_SimpleAscend. The difference is that Test_SimpleAscend took an array argument, whereas Test_ArrayInGlobal’s array is accessed via a static variable, accessible to other threads. This makes static elimination of the bounds check for “v[i]” at the very least dicey. Let’s say we did, and that “v” initially holds an array of length 100. On the iteration when “i” reaches (say) 80, we check “i < v.Length”, and it’s still true. Now another thread sets “v” to an array whose length is only 50. If we go ahead with the array store without a dynamic check, we’re writing off the end of the array – type-safety and security are lost, game over. (Obviously, the same reasoning would apply for an array held in any location accessible to multiple threads – an object field, element of another array, anything not local to the running thread.)
So we don’t do this, for good and solid reasons. If we cared enough, there is a technique that would allow us to eliminate these bounds checks. But it would require us to couple otherwise-unrelated optimizations. As it happens, the code for accessing a static variable in the presence of app domains can be moderately costly, so it’s good to treat those as common-subexpression candidates, and the x86 JIT does in this case (the x64 JIT does not). So the optimizer in essence synthesizes a local variable to hold the array. If we do this, then we are back in the Test_SimpleAscend situation, and the bounds-check elimination is legal. But doing the bounds-check elimination requires that the static variable be read once into a local. So it’s at least a bit complicated.
Parallel arrays
Next we consider a case that involves what are sometimes called “parallel arrays” (in the sense of their structure, not in the sense that they will be used by multiple threads):
static int Test_TwoArrays(int[] ia1, int[] ia2) {
// The programmer knows a precondition: ia1.Length == ia2.Length
int sum = 0;
for (int i = 0; i < ia1.Length; i++) {
// Below we eliminate the ia1 check, but not the one for ia2.
sum += (ia1[i] + ia2[i]);
}
return sum;
}
Much as with Test_ArrayCopy1, the x64 JIT hoists a test comparing ia2.Length and ia1.Length, immediately throwing the bounds-check exception if the test fails. If the test succeeds, range checks for both array accesses in the loop are eliminated. The same comments about the semantic issues with such a test apply. The x86 JIT takes a more “purist” approach: it does not hoist a test, so it only eliminates the bounds check for the access to the array ia1 whose Length bounds the index variable.
We could resolve the two approaches. The mechanisms proposed for this sort of problem in the research literature have the common property that they require, at least in some cases, generating code for the loop multiple times, under different assumptions, and synthesizing some sort of test to determine which version of the loop should be executed – this is essentially the test that the x64 JIT is already creating. Generally, bounds check exceptions are rare – if the programmer wrote the code above, he or she had some reason to believe that the index expression “ia2[i]” was safe. So we could synthesize a test on that basis. In our case above, if the compiler proved that neither argument variable “ia1” or “ia2” was modified in the loop, then a test “ia2.Length >= ia1.Length” (the one the x64 JIT generates) outside the loop would allow us to execute an optimized version of the loop, with no bounds checks for either array access. If this test failed, however, we’d need to execute an unoptimized version of the loop to be completely semantically correct. You’d have to evaluate this test carefully, since it’s code that doesn’t appear in the original program. In particular in this case, you’d have to worry about whether either of “ia1” or “ia2” were null. If they are, you want the null pointer exception to occur at the appropriate point in execution, not in some code the compiler made up. So the synthesized test would have to include null tests, and take the unoptimized path if either argument is null.
As we’ve discussed, the x64 JIT generates the test, but not the unoptimized version of the loop – it throws the exception “early” in that case. Under the “purist” viewpoint, this is incorrect because if the test fails, the semantics require the program to execute some number of loop iterations before throwing the exception, and those iterations might have side effects. In many cases, we might be able to prove that the loop body does not have side effects, and therefore use the x64 JIT’s strategy with semantic blessing. For example, a loop that computed the sum of the loop elements into a local would side-effect only that local variable – if the exception causes control flow to leave the method, the value of that local becomes meaningless.
Many other patterns are amenable to this sort of synthesized test. An alternative form of Test_TwoArrays might have passed the shared length of both arrays as a separate argument, and used that as the loop bound. We could do something similar, synthesizing a test of that loop bound vs. both array lengths.
Explicit Assertions
Another suggestion that has been made is to allow the programmer to provide the relevant test in the form of a contract assertion (of a flavor that would be executed in all execution modes, not just in a debug mode). This would essentially provide semantic “permission” to fail immediately if the test is violated, avoiding the need to have an unoptimized version of the loop. There are many things to be said for this sort of proposal: they can allow bounds checks to be eliminated in situations more complicated than those for which the compiler could easily infer a test, and the invariants they express are often useful program documentation as well. Still, I also worry somewhat about such proposals. In many common cases, it’s easy enough to infer the test, so we should avoid requiring the programmer to add assertions in the easy cases. More importantly, if the programmer adds an assertion expecting it to eliminate a bounds check, how does the tool chain indicate whether he or she has been successful? And, if not, why not? These sorts of issues merit some more thought.
Still another path would be to have a custom annotation like [OptimizeForSpeedNotSpace], allowing the programmer to tell us that the performance of this method is important enough that we should apply optimizations that we wouldn’t generally apply because they increase code size – i.e., especially aggressive inlining, loop unrolling, loop body replication/specialization for the reasons discussed here, or for other forms of specialization.
The right strategy in this area is obviously a little muddled. Constructive feedback is welcome!
Copy loop
Here’s another example, somewhat similar to the Test_TwoArrays case:
static int[] Test_ArrayCopy2(int[] ia1) {
// An array copy loop operation.
int[] res = new int[ia1.Length];
for (int i = 0; i < res.Length; i++) {
res[i] = ia1[i];
}
return res;
}
As you might expect from previous examples, since we use the length of “res” as the loop bound, we eliminate the bounds check for the access to “res”. But we do not eliminate the check for the access to “ia1”. As in Test_ArrayCopy1, to eliminate this we’d need to do a better job of tracking equivalences of array lengths with local variables or other array lengths. We don’t do this today, but it’s certainly a plausible enhancement we might do. The x86 JIT leaves the bounds check in for the access “ia1[i]”, while the x64 JIT hoists a bounds-check out of the loop, as discussed above (and with the same difficulties discussed above).
While it would be nice for us to eliminate the bounds checks cases like this, if you’re copying arrays, there are many reasons to use the built-in Array.Copy routine rather than writing an explicit copy loop like those that appear in these examples:
· Advice 4: when you’re copying medium-to-large arrays, use Array.Copy, rather than explicit copy loops. First, all your range checks will be “hoisted” to a single check outside the loop. If the arrays contain object references, you will also get efficient “hoisting” of two more expenses related to storing into arrays of object types: the per-element “store checks” related to array covariance can often be eliminated by a check on the dynamic types of the arrays, and garbage-collection-related write barriers will be aggregated and become much more efficient. Finally, we will able to use more efficient “memcpy”-style copy loops. (And in the coming multicore world, perhaps even employ parallelism if the arrays are big enough!)
Multi-dimensional Arrays
The CLR, and C#, support real multi-dimensional arrays – in contrast to C++ or Java, which (directly) support only one-dimensional arrays. To get two-dimensional arrays, you have to simulate them, either through classes that represent the 2-d array as a large 1-d array, and do the appropriate index arithmetic, or as an “array-of-arrays.” In the latter case, even if they are allocated originally to form a “rectangular” 2-d array, it’s hard for a compiler to prove that the array stays rectangular, so bounds check on accesses to the “inner” arrays are hard to prove.
With true multi-dimensional arrays, the array lengths in each dimension are immutable (just as the length of a regular 1-d array is). This makes removing of bounds checks in each dimension more tractable. A related advantage is that indexing calculations become easier when the array is known to be “rectangular.” (With a good optimizer and appropriately aggressive inlining, C++ template-class-based simulations of multidimensional arrays can get similar indexing calculation code.)
Unfortunately, we aren’t yet able to remove any range checks for accesses in multi-dimensional arrays, even in simple cases like this:
static int Test_2D(int[,] mat) {
int sum = 0;
for (int i = 0; i < mat.GetLength(0); i++) {
for (int j = 0; j < mat.GetLength(1); j++) {
sum += mat[i, j];
}
}
return sum;
}
The “mat.GetLength(k)” method returns the length of “mat” in the kth dimension. We’ll clearly need to eliminate bounds checks for multi-dimensional array accesses if we want to generate reasonable code for, say, a matrix multiplication.
· Advice 5: Until we get this right, I would suggest that .NET users do what many C++ numerical programmers do: write a class to implement your n-dimensional array. This would be represented as a 1-dimensional array, and the relevant accessors would convert n indices into 1 via appropriate multiplications. We almost certainly wouldn’t eliminate the bounds check into the 1-d array, but at least we’d only do one check!
Conclusions
First, let’s accentuate the positive: we do eliminate bounds checks in some very common cases, and the costs of bounds checks usually aren’t that great when we don’t eliminate them. And, as I mentioned at the beginning, we have to keep in mind that the compiler we’re talking about is a dynamic JIT compiler, so we must carefully balance adding extra optimization that slows the compiler against the gains of that optimization. Still, if we don’t eliminate a bounds check that we should have in a small, tight loop that’s important to the performance of your program, I doubt you’ll find these excuses very satisfying. I hope this blog post convinces you that we’re well aware of the problems. The future almost certainly holds some mechanism for applying extra compilation effort to methods whose performance matters a lot, either by doing the extra work in some form of offline build-lab compilation, or by using profile-directed feedback, user annotations of hot methods, or other heuristics. When we can do extra compiler work, bounds-check elimination will be one of the problems we address.
|
-
If you care about performance at a very low level, at one point you’ve asked yourself why the compiler, JIT, or runtime did or did not inline a certain method. Unless you worked on the compiler, JIT, or runtime, you really had no way of telling, other than trial and error (sort of like asking how a magic 8 ball comes up with its answers). We, on the JIT team, decided to fix that in CLR 4. Now you the end-user or programmer can see why the JIT, and in some cases the runtime decided to disallow inlining or tail calls. We also tell you when the JIT succeeded in inlining or tail calling a certain method. All of this wonderful information is made available by Event Tracing for Windows.
First a disclaimer, I’m not an expert on ETW events, and this post isn’t about ETW events in general, it is only about a couple of new events exposed by the JIT to enable performance junkies to hurt themselves. J
For Windows XP, Widows Server 2003, Windows Vista and Windows 7, you already have everything you need on your machine. For Windows 2000, you’ll need to download the Microsoft Windows 2000 Server Resource Kit. According to the gurus around here previous to Windows Vista registering an ETW provider was not cheap, so the runtime only does it when requested to. On Vista and newer OSes, it is cheap enough to do all the time, so you only need to request it on pre-Vista OSes.You request it by setting the following environment variable before running the application you are interested in:
SET COMPLUS_ETWENABLED=1
To start logging ETW events do this:
logman start clrevents -p {e13c0d23-ccbc-4e12-931b-d9cc2eee27e4} 0x1000 5 -ets
There are lots more options to tweak here, but the important part is the GUID (the CLR ETW provider GUID), the mask 0x1000 (the JitTracingKeyword), and the level 5 (everything). More information about logman.exe can be found at http://technet.microsoft.com/en-us/library/bb490956.aspx. After you’ve started ETW, run your scenario, and then stop ETW as follows:
logman stop clrevents -ets
This will create clrevents.etl. To decode it further run this:
tracerpt clrevents.etl
This will create 2 files: dumpfile.csv and summary.txt. The former has all the events, the latter gives a nice summary of the events. On Vista, tracerpt will generate dumpfile.xml instead of dumpfile.csv. I ran this on a Vista machine so I’m going to deal with the XML format, but the csv format is similar.
Here is a sample MethodJitInliningSucceeded event:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-DotNETRuntime" Guid="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" />
<EventID>185</EventID>
<Version>0</Version>
<Level>5</Level>
<Task>9</Task>
<Opcode>83</Opcode>
<Keywords>0x1000</Keywords>
<TimeCreated SystemTime="2009-04-14T14:31:52.168851900Z" />
<Correlation ActivityID="{00000000-0000-0000-0000-000000000000}" />
<Execution ProcessID="15476" ThreadID="16936" ProcessorID="3" KernelTime="90" UserTime="60" />
<Channel />
<Computer />
</System>
<UserData>
<MethodJitInliningSucceeded xmlns='myNs'>
<MethodBeingCompiledNamespace>Factorial</MethodBeingCompiledNamespace>
<MethodBeingCompiledName>Main</MethodBeingCompiledName>
<MethodBeingCompiledNameSignature>void (class System.String[])</MethodBeingCompiledNameSignature>
<InlinerNamespace>Factorial</InlinerNamespace>
<InlinerName>Main</InlinerName>
<InlinerNameSignature>void (class System.String[])</InlinerNameSignature>
<InlineeNamespace>Factorial</InlineeNamespace>
<InlineeName>fact</InlineeName>
<InlineeNameSignature>unsigned int32 (unsigned int32)</InlineeNameSignature>
<ClrInstanceID>13</ClrInstanceID>
</MethodJitInliningSucceeded>
</UserData>
<RenderingInfo Culture="en-US">
<Level>Verbose </Level>
<Opcode>JitInliningSucceeded </Opcode>
<Keywords>
<Keyword>JitTracingKeyword </Keyword>
</Keywords>
<Task>CLR Method </Task>
<Message>MethodBeingCompiledNamespace=Factorial;
MethodBeingCompiledName=Main;
MethodBeingCompiledNameSignature=void (class System.String[]);
InlinerNamespace=Factorial;
InlinerName=Main;
InlinerNameSignature=void (class System.String[]);
InlineeNamespace=Factorial;
InlineeName=fact;
InlineeNameSignature=unsigned int32 (unsigned int32);
ClrInstanceID=13 </Message>
</RenderingInfo>
</Event>
We’ll focus in on the UserData element because it looks the prettiest. First you have 3 sets of triples: MethodBeingCompiled, Inliner, and Inlinee crossed with Namespace, Name, and Signature. I think the namespace, name and signature are fairly obvious, but the reason they are separate instead of pretty printed into a simple element is a matter of performance. ETW is supposed to be lightweight. Computing these strings is not lightweight, but we felt we made a nice trade-off (just enough information to be useful, but don’t spend time on making it prettier than it needs to be). MethodBeingCompiled is the method the VM asked the JIT to generate code for. Inliner refers to the method that the JIT is trying to generate code for. If the Inliner is not the same as the MethodBeingCompiled it is because the JIT decided to try and inline the code for Inliner into the code for MethodBeingCompiled rather than generate a call to Inliner. However, just because you see a method showing up as the Inliner, it doesn’t mean the JIT has actually succeeded in inlining it. Finally the Inlinee refers to the method that the JIT is trying to inline (rather than generate a call to).
If this was a MethodJitInliningFailed event it would have 2 additional elements: FailAlways and FailReason. If FailAlways is true, it is a hint back to the JIT and VM that inlining will always fail for the given Inlinee, regardless of the situation, and so subsequent compiles will be able to abort the inlining attempt faster (this is what FailReason of “It is marked as 'INLINE_NEVER'” means). The FailReason is a free-form text field. Originally this came from some internal code we had to help us in our debugging. The text is often cryptic, and is not localized. Part of the reason we did not localize it was performance. The other part is that these strings are often so cryptic many English speakers will have a hard time with them. For us we often end up treating them like a GUID: we search the sources for the places where that reason is returned to figure out why a particular inline failed. So if English is not your native language, don’t fret too much, you’re probably not missing much. That covers the 2 most common events: MethodJitInliningSucceeded and MethodJitInliningFailed.
Now we move onto the only other 2 events for this keyword: MethodJitTailCallSucceeded and MethodJitTailCallFailed. The first 9 elements are the same (except replace Inliner with Caller, and Inlinee with Callee). The next element is the TailPrefix element. This element tells whether the tail. IL prefix was present. For x86 this currently will always be true because the x86 JIT does not generate tail calls as an optimization. See other blog posts for why tail calls are a good optimization that the x64 and IA64 JIT often performs even without the tail. il prefix. We find this useful because, although some compilers emit the tail. prefix in IL, they do not have any source syntax to expose it, so the programmer does not know looking only at the source what was in the IL.
For MethodJitTailCallSucceded, you get a TailCallType. This is an enum, which has names that should be localized down in the Message element. Due to a bug in Beta 1 you will only get the numeric values. Also the pre-Vista versions of tracerpt do not do the translation from value to string. So on post Beta 1 builds with Vista or newer, you should the pretty names, otherwise expect the numbers. I will list both, especially because the number still appear in the UserData element. OptimizedTailCall, 0, means a typical tail call, where the outgoing arguments are pushed into the incoming parameter slots, the epilog is executed but instead of returning, it ends with a jump to the new method. RecursiveLoop, 1, means a recursive tail call, where the JIT sees that the method calls itself, and replaces the call with a jump back to the start of the method (thus skipping even the prolog and epilog). HelperAssistedTailCall, 2, means that the tail call cannot be accomplished directly and must go through a helper function call. These helper-assisted tail calls are slower than normal calls and only used as a last resort when the program needs to do a tail call in order to prevent stack overflows. Thus currently you should never see a HelperAssistedTailCall with TailPrefix=false.
For MethodJitTailCallFailed, you don’t get TailCallType element, but you do get FailReason, which is similar to MethodJitInliningFailed’s FailReason. It is also cryptic english-only text. In future posts we hope to explain some of the more common reasons.
So for the few people that really need to squeeze out every last bit of performance from your code, you can now see a little bit further into how the JIT compiles your code. Occasionally this will enable you to refactor a method in such a way as to improve inlining or tail calls. If you do this remember though that every runtime has different heuristics and algorithms, and just because a methods was or wasn’t inlined in one version doesn’t mean the same will be true on a different version. If a method absolutely cannot be inlined, use MethodImplOptions.NoInlining. If it absolutely must be inlined then inline it manually (i.e. copy and paste the code). That recommendation should never change. If however you’re looking for some simple performance boosts, you can try these events and possibly learn something. One example we found internally is that moving a try/catch or try/finally up or down the call chain can often have a big performance impact because it might enable/disable inlining inside a very important loop.
Grant Richins
CLR Codegen Team
Update 10/8/2009 - replace em-dash in logman command line with a normal dash that you can paste into a command window and it will work.
|
-
First a little background reading before going into tail call improvements in CLR 4 - David Broman did an excellent job at covering the basics in his post here: http://blogs.msdn.com/davbr/archive/2007/06/20/enter-leave-tailcall-hooks-part-2-tall-tales-of-tail-calls.aspx. He also captured a mostly complete list of the restrictions as they stood for CLR 2 here: http://blogs.msdn.com/davbr/pages/tail-call-jit-conditions.aspx.
The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.
The other usage of tail calls are in recursive algorithms. We all know from our computer science theory classes that any recursive algorithm can be made into an iterative one. However, just because you can doesn’t mean you always want to. If the algorithm is naturally tail recursive, why not let the compiler do the work for you? By using a tail call, the compiler effectively turns a tail recursive algorithm into a loop. This is such an important concept that the JIT has a special path just for turning functions that call themselves in a tail recursive manner into loops. This is a nice performance win because beyond the memory improvements mentioned previously, you also save several instructions by not executing the prolog or epilog multiple times. The compiler is also able to treat it like a loop and hoist out loop invariants so they are also only executed once instead of being executed on every iteration.
In CLR 2 the 64-bit JIT focused solely on the ‘easy’ cases. That meant that it generated code that did a tail call whenever it could because of the memory benefits. However, it also meant that sometimes when the IL requested a tail call using the “tail.” instruction prefix, the JIT would not generate a tail call because it wasn’t easy. The 32-bit JIT had a general purpose tail call mechanism that always worked, but wasn’t performant enough to consider it an optimization, so it was only used when the IL requested a tail call.
For CLR 4 the fact that the x64 JIT would sometimes not honor the “tail.” prefix prevented functional languages like F# from being viable. So we worked to improve the x64 JIT so that it could honor the “tail.” prefix all of the time and help make F# a success. Except where explicitly called out, x86 and IA64 remain unchanged between CLR 2 and CLR 4, and the remainder of this document refers solely to the x64 JIT. Also this is specific to CLR 4, all other versions of the runtime or JIT (including service packs, QFEs, etc.) might change this in any number of ways.
The 64-bit calling convention is basically a caller-pops convention. Thus the ‘easy’ cases basically boiled down to where the JIT could generate code that stored the outgoing tail call arguments in the caller’s incoming argument slots, execute the epilog, and then jump to the target rather than returning. The improvements for CLR 4 came in two categories: Fewer restrictions on the optimized/easy cases and a solution for the non-easy cases. The end result is that on x64 you should see shorter call stacks in optimized code because the JIT generated more tail calls (don’t worry this optimization is turned off for debug code), and functional programmers (or any other compiler or language that relies on tail calls and uses the “tail.” prefix) no longer need to worry about stack overflows. The down side of course is that if you have to debug or profile optimized code, be prepared to deal with call stacks that look like they’re missing a few frames.
Reducing restrictions on the easy cases
The more often we can tail call, the more likely programs will benefit from the optimization. From the second link above you can see that there were a lot of restrictions on when the JIT could perform a tail call. We looked at each one of these and tried to see if we could reduce or eliminate them.
- The call/callvirt/calli is followed by something other than nop or ret IL instructions.
We reduced this restriction by also allowing a single pop opcode. This allowed the case where a method that returned void ended with a call to some other method that had a non-void return. The callee was obviously being invoked for its side-effect rather than its return value. In most cases the return value is just sitting in rax or xmm0, and does no harm by being left there, so in those cases, the tail call is legal and becomes an ‘easy’ case. However, note that having an explicit tail call, with the tail. prefix is still unverifiable in this situation, so in most cases a compiler or code generator should omit the tail. prefix and just rely on the JIT optimizing the call into a tail call.
- The caller or callee returns a value type.
We changed the JIT to recognize this case as well. Specifically the X64 calling convention dictates that for value types that don’t fit in a register (1,2,4, or 8 byte sized value types), that the caller should pass a ‘hidden’ argument called the return buffer. The CLR 2 JIT would pass a new return buffer to the callee, and then copy the data from the callee’s return buffer to the caller’s return buffer. The CLR 4 runtime now recognizes when it is safe to do so, and then just passes the caller’s return buffer into the callee as its return buffer. This means even when we don’t do a tail call, the code has at least one less copy, and it enables one more case to be ‘easy’!
- The caller is a shared generic method.
The above restriction is now removed!
- The caller is varargs
- The callee is varargs.
These have been partially removed. Specifically, we treat the caller as only having its fixed arguments and the callee as having all of its arguments. Then we apply all of the other rules. It is also worth mentioning that the CLR deviates from the native X64 calling convention by adding another ‘hidden’ argument called the vararg cookie. It stores the GC properties of the actual call (important for the variadic part). This hidden argument is also counted as part of the signature for the purposes of the other rules.
- The runtime forbids the JIT to tail call. (There are various reasons the runtime may disallow tail calling, such as caller / callee being in different assemblies, the call going to the application's entrypoint, any conflicts with usage of security features, and other esoteric cases.)
This part applies to all platforms. We were able to reduce the number of cases where the runtime refused a tail call due to security constraints. Mainly the runtime stopped caring so much about the exact assembly and instead looked at the security properties. If performing the call as a tail call would not cause an important stack frame to disappear from the call stack, it is now allowed. Previous to this change tail calls across modules or to unknown destinations (due to calli or callvirt) were rare, now they are significantly more common.
- The callee is invoked via stub dispatch (i.e., via intermediate code that's generated at runtime to optimize certain types of calls).
The above restriction is now removed!
- For x64 we have these additional restrictions:
- …
- For all of the parameters passed on the stack the GC-ness must match between the caller and callee. ("GC-ness" means the state of being a pointer to the beginning of an object managed by the GC, or a pointer to the interior of an object managed by the GC (e.g., a byref field), or neither (e.g., an integer or struct).)
The above x64 restriction is now removed!
A solution for the non-easy cases
Previously even if the IL had the “tail.” prefix, if the call did not match the requirements to be one of the optimized easy tail calls cases, it was turned into a regular call. This is disastrous for recursive algorithms. In many cases this prevented F# programs from running to completion without getting stack overflows. So for CLR 4 we added an alternate code path that allowed tail call like properties for these specific scenarios, where it was needed, but it was not easy.
The biggest problem to overcome is the x64 calling convention. Because of the calling convention if a tail call callee needs more argument space than the caller has, the code is wedged between a rock and a hard place. The solution involved a little stack trickery. The JIT generates a mostly normal call instead of a tail call. However instead of calling the target directly it calls through a special helper which works some magic on the stack. It logically unwinds the stack to be as if the caller has returned. Then it adds a fake method to the stack as if it had been called, and then jumps to the callee. This fake method is called the TailCallHelper. It takes no arguments, and as far as unwind data and the calling convention are concerned calls the callee with no outgoing arguments, but has a dynamically resizing locals area (think _alloca for C/C++ programmers, or stackalloc for C# programmers). Thus this method is still caller-pops, but the tail call helper can dynamically change how much to pop based on a given callee.
This stack magic is not cheap. It is significantly slower than a normal call or a tail call. It also consumes a non-trivial amount of stack space. However the stack space is ‘recycled’ such that if you have a tail recursive algorithm, the first tail call that isn’t ‘easy’ will erect the TailCallHelper stack frame, and subsequent non-easy tail calls may need to grow that frame to accommodate all the arguments. Hopefully once the algorithm has gone through a full cycle of recursion the TailCallHelper stack frame has grown to the maximum sized needed by all the non-easy calls involved in the recursion, and then never grows, and thus prevents a stack overflow.
As you can see the way the non-easy case is handled is also not fast. Because of that, the JIT never uses the non-easy helper unless the IL requires it with the “tail.” prefix. This was also the driving force behind trying to reduce the number of restrictions on the ‘easy’ cases.
Grant Richins
CLR Codegen Team
[Updated 6/25/2009 to make a few minor clarifications based on internal feedback]
|
-
.NET Framework 4 is our first release since we shipped FX 3.5 SP1 (FX 4 beta 1 is now available here: http://msdn.microsoft.com/en-us/vstudio/dd582936.aspx). FX 3.5 SP1 contained major changes to NGen – features that improved startup performance, security, NGen time and compilation working set – described at length in this MSDN article: http://msdn.microsoft.com/en-us/magazine/dd569747.aspx.
In FX 4 we shifted our primary focus from startup performance to framework deployment. As many of you are aware, the performance win from pre-compiling your application using NGen comes at a cost – the time taken to generate the NGen images on the end-user machine. In .NET 4 we’ve lowered that cost substantially in two ways. First, we’ve made NGEN multiproc-aware. In many cases, your assemblies (and ours) will now NGEN about twice as fast! In addition, we’ve substantially reduced the number of situations in which we need to regenerate NGen images. Our new Targeted Patching feature means that for many .NET Framework patches, we can now modify just the affected assemblies and not have to re-NGEN any other dependent assemblies. Combined, these two features mean you and your users will spend a lot less time running NGEN. Best of all, you don’t have to do anything to get these benefits – they’ll happen automatically when you use .NET 4.
We thought you may want to learn a little more about how these features work and the situations in which you’ll get these benefits. In addition, since .NET 4 is a SxS release, there is a bit of complexity under the covers to make sure assemblies get NGen-ed against the matching runtime. In general things will just work the way you would guess or expect them to, but below, I’ll go into some more detail about exactly what happens in various interesting cases.
If you have any comments or questions about any of these, we’d love to hear from you. Most of these features are available in the FX 4 beta 1 build.
· NGen SxS: Making NGen work correctly when 2 major versions of the CLR are installed side-by-side
· Multi-proc NGen: Enabling use of multiple cores/processors during NGen to make it faster
· Targeted Patching: First step towards making NGen images less “fragile” – avoids having to recompile all managed assemblies when a FX dependency is patched
· No NGen in partial trust: Deprecated support for generating and loading NGen images in partial trust applications
NGen SxS
To a large extent FX 4 is the first SxS release for NGen (the NGen infrastructure in FX 1.0 and 1.1 was rudimentary). We’ve now architected NGen such that the 4.0 ngen.exe tool can be used to generate NGen images for both 2.0 and 4.0 MSIL assemblies without having to pass any special arguments. NGen uses the same logic that is used at application runtime (this logic resides in the shim) to determine the CLR version to load and run an application against. Thus,
%WINDIR%\Microsoft.NET\Framework\v4.0.xxxxx\ngen.exe install <4.0 assembly> will generate an NGen image compiled against the 4.0 runtime.
%WINDIR%\Microsoft.NET\Framework\v4.0.xxxxx\ngen.exe install <2.0 assembly> will generate an NGen image compiled against the 2.0 runtime (if .NET Framework 2.0/3.0/3.5 is installed on the machine. Note that applications built against .NET Framework 2.0 won’t run against .NET Framework 4.0 by default, and thus by default we don’t NGen 2.0 assemblies unless CLR 2 is installed).
%WINDIR%\Microsoft.NET\Framework\v4.0.xxxxx\ngen.exe install <2.0 assembly> /ExeConfig:<Path to a 4.0 EXE> will generate an NGen image compiled against the 4.0 runtime.
%WINDIR%\Microsoft.NET\Framework\v4.0.xxxxx\ngen.exe install <2.0 EXE with a config file that indicates 4.0 as the preferred runtime> will generate an NGen image compiled against the 4.0 runtime.
%WINDIR%\Microsoft.NET\Framework\v4.0.xxxxx\ngen.exe install <2.0 assembly> /ExeConfig:<Path to a 2.0 EXE with a config file that indicates 4.0 as the preferred runtime> will generate an NGen image compiled against the 4.0 runtime.
There was also work required to make the 2.0 NGen Service (clr_optimization_v2.0.50727_32|64) work SxS with the 4.0 NGen Service (clr_optimization_v4.0.xxxxx_32|64). Only one service (the latest) is active at any time – installing FX 4 disables the 2.0 service and it is re-enabled if/when FX 4 is uninstalled.
Multiproc NGen
NGen is now aware of multiple processors/cores and can compile up to 6 assemblies in parallel (the parallelism is at the level of assemblies). To avoid impacting foreground activities, NGen only runs in this aggressive mode when assemblies are being compiled synchronously i.e. the NGen Service still runs on one processor. Synchronous NGen commands (such as ngen.exe install <assembly>, ngen.exe update, ngen.exe ExecuteQueuedItems) will now use multiple processors/cores whenever possible (we also factor in amount of RAM when determining how many cores/processors to use). Since the parallelism is at the level of assemblies, the most effective way to enable use of multiple processors for NGen is to do the following:
ngen.exe install /queue:1 <MyImportantAssembly#1>
ngen.exe install /queue:1 <MyImportantAssembly#2>
…
ngen.exe install /queue:1 <MyImportantAssembly#N>
ngen.exe install /queue:3 <MyAssembly#N+1>
….
ngen.exe install /queue:3 <MyAssembly#M>
ngen.exe ExecuteQueuedItems 1 //Synchronously compiles all important (priority 1) assemblies during set up using multiple cores whenever possible; other (priority 3) assemblies will be compiled in the background by the NGen Service at machine idle-time.
As part of this work we reworked FX 4 set up to use the NGen pattern above.
Targeted Patching
NGen images thus far have been nothing more than “cached JIT-compiled code and CLR data structures” – as a consequence they’re completely fragile; any change to the underlying CLR or to any managed dependency invalidates them and requires them to be regenerated. For example, any change to the CLR or to basic assemblies like mscorlib and System that all/most assemblies depend upon invalidates all NGen images installed on the machine. Since a machine could have several hundreds of NGen images, the cost of regenerating them after FX servicing events is high. In CLR 4 we took a first step towards making NGen images less fragile to avoid the cascaded cost associated with .NET Framework updates. In particular, FX updates that only involve fixes to bodies of existing methods (that aren’t generic, and aren’t inlined across NGen image boundaries) will no longer require recompiling dependent NGen images (the old NGen images can be used with the new serviced dependency). Although this may sound trivial (and beg the question why the system didn’t have this attribute to begin with J), since NGen images had never been architected to be anything but serialized JIT-ed code and CLR data, accomplishing this turned out to be major feat. From reworking hardbinding, doing major performance work to recover the perf impact, to writing an IL post-processing tool that normalizes metadata tokens, detects Targeted Patching-compatible vs. incompatible changes, and flags compatible changes such that existing NGen images can rewired to the updated dependency, plugging that tool into our build system, and revising NGen cross-module inlining rules, this is a major effort that several of us have been working on for several months.
Some of the work for this (such as the changes to hardbinding and corresponding the performance work) are included in the beta 1 build, but this feature will really come online once we start servicing FX 4. You can find out more in the Channel 9 video on Targeted Patching.
NGen in partial trust
In FX 2 NGen images can be generated by running commands such as ngen.exe install <Path to assembly on intranet share> and the generated image loaded in applications running in partial trust. We believe NGen images aren’t used in partial trust applications very much and our current model was broken, so we’ve disabled loading of NGen images in partial trust in FX 4. In the future (post CLR 4) we intend to rework this as part of simplifying the overall NGen story (i.e. change the model where the only way to generate NGen images is to issue commands from an elevated command prompt). If you’re using NGen in a partial trust application today, we’d like to hear from you!
Surupa Biswas
CLR Codegen Team
|
-
Long time, no blog.
Since the NetFX 3.5 Service Pack is available, now, I figured I’d put up a quick rundown of what we (the CLR CodeGen team) contributed to the package. I’m not going into nitty-gritty details, but just to give you an idea of what’s in it, and perhaps inspire you to go install it. A quick note: Unless otherwise noted, all changes impact both x86 & x64.
NGen infrastructure rewrite: the new infrastructure uses less memory, produces less fragmented NGen images with much better locality, and does so in dramatically less time. What this means to you: Installing or servicing an NGen image is much faster, and cold startup time of your NGen’ed code is better.
Framework Startup Performance Improvements: The framework is now better optimized for startup. We’ve tweaked the framework to consider more scenarios for startup, and now layout both code & data in the framework’s NGen images more optimally. What this means to you: Even your JIT code starts faster!
Better OS citizenship: We’ve modified NGen to produce images that are ASLR capable, in an effort to decrease potential security attack surface area. We’ve also started generating stacks that are always walkable using EBP-chaining for x86. What this means to you: Stack traces are more consistent, and NGen images aren’t as easily used to attack the system.
Better 32-bit code quality: The x86 JIT has dramatically improved inlining heuristics that result in generally better code quality, and, in particular, much lower “cost of abstraction”. If you want to author a data type that only manipulates a single integer, you can wrap the thing in a struct, and expect similar performance to code that explicitly uses an integer. There have also been some improvements to the ‘assertion propagation’ portion of the JIT, which means better null/range check elimination, as well as better constant propagation, and slight better ‘smarts’ in the JIT optimizer, overall. What this means to you: Your managed code should run slightly faster (and sometimes dramatically faster!). Note to 64 bit junkies: We’re working on getting x64 there, too. The work just wasn’t quite there in time.
Anyway, go forth & download!
-Kev
|
-
I was recently shown the following code and asked why the loop calling SafeAccess executed significantly faster than the second loop calling UnsafeAccess:
static int [] intarray = new int [5000];
static void SafeAccess(int a, int b)
{
int temp = intarray[a];
intarray[a] = intarray[b];
intarray[b] = temp;
}
static Unsafe void UnsafeAccess(int a, int b)
{
fixed (int* pi = &intarray[0])
{
int temp = pi[a];
pi[a] = pi[b];
pi[b] = temp;
}
}
static Unsafe void Main(string[] args)
{
for (int i = 0; i < testCount; i++)
{
SafeAccess(0, i);
}
for (int i = 0; i < testCount; i++)
{
UnsafeAccess(0, i);
}
}
Safe Loop:
I examined the code generated by the 64-bit JIT compiler for the SafeAccess loop (which was inlined into Main by the JIT). Vance Morrison posted a useful article describing how to accomplish this from within Visual Studio: http://blogs.msdn.com/vancem/archive/2006/02/20/535807.aspx
00000642`801501f0 418b08 mov ecx,dword ptr [r8]
00000642`801501f3 8b02 mov eax,dword ptr [rdx]
00000642`801501f5 418900 mov dword ptr [r8],eax
00000642`801501f8 890a mov dword ptr [rdx],ecx
00000642`801501fa 4883c204 add rdx,4
00000642`801501fe 493bd1 cmp rdx,r9
00000642`80150201 7ced jl 00000642`801501f0
There are 7 instructions and 4 memory accesses per loop iteration, with no range checks remaining inside the loop body after optimization. In this case there is no performance cost incurred for safety.
Unsafe Loop:
By contrast, the unsafe version is a mess. UnsafeAccess is larger MSIL (50 bytes vs 31) because Unsafe array accesses require more MSIL instructions than safe ones. Given an array and index on the evaluation stack, safe array accesses require only a single 1-byte instruction: ldelem. The C# compiler generates a much more complex sequence for Unsafe accesses:
IL_000c: /* 06 | */ ldloc.0 // &array[0]
IL_000d: /* D3 | */ conv.i
IL_000e: /* 02 | */ ldarg.0 // index
IL_000f: /* D3 | */ conv.i
IL_0010: /* 1A | */ ldc.i4.4
IL_0011: /* 5A | */ mul
IL_0012: /* 58 | */ add
IL_0013: /* 4A | */ ldind.i4
Ignoring the first and third instructions, which are used to get the array and index, there are six instructions (and bytes) required to load an array element. These extra instructions make UnsafeAccess larger than SafeAccess. When determining which methods should be inlined by the JIT, one of the most highly weighted factors is the size of the inlinee method. In this case UnsafeAccess was rejected for inlining, and because of this, the range check at &intarray[0] could not be removed. In fact the unsafe loop variant actually caused more runtime range checks to occur than the safe variant!
Finally, the presence of a pinned variable inhibits many optimizations in the 64-bit JIT. As a result, the generated code for UnsafeAccess is far worse than that of the safe variant. Keep in mind that the following excerpt shows only the UnsafeAccess method itself, and does not even include the the loop in Main, as the SafeAccess example above does.
image00000000_00e40000!Arrays.UnsafeAccess(Int32, Int32):
00000642`80150260 4883ec38 sub rsp,38h
00000642`80150264 448bc1 mov r8d,ecx
00000642`80150267 48c744242000000000 mov qword ptr [rsp+20h],0
00000642`80150270 48b9102e352000000000 mov rcx,20352E10h
00000642`8015027a 488b09 mov rcx,qword ptr [rcx]
00000642`8015027d 488b4108 mov rax,qword ptr [rcx+8]
00000642`80150281 4885c0 test rax,rax
00000642`80150284 7641 jbe 00000642`801502c7
00000642`80150286 488d4110 lea rax,[rcx+10h]
00000642`8015028a 4889442420 mov qword ptr [rsp+20h],rax
00000642`8015028f 4d63c8 movsxd r9,r8d
00000642`80150292 488b442420 mov rax,qword ptr [rsp+20h]
00000642`80150297 468b0488 mov r8d,dword ptr [rax+r9*4]
00000642`8015029b 4863d2 movsxd rdx,edx
00000642`8015029e 488b442420 mov rax,qword ptr [rsp+20h]
00000642`801502a3 8b0c90 mov ecx,dword ptr [rax+rdx*4]
00000642`801502a6 488b442420 mov rax,qword ptr [rsp+20h]
00000642`801502ab 42890c88 mov dword ptr [rax+r9*4],ecx
00000642`801502af 488b442420 mov rax,qword ptr [rsp+20h]
00000642`801502b4 44890490 mov dword ptr [rax+rdx*4],r8d
00000642`801502b8 48c744242000000000 mov qword ptr [rsp+20h],0
00000642`801502c1 4883c438 add rsp,38h
00000642`801502c5 f3c3 rep ret
Conclusion:
Unsafe array accesses have a lot of potential problems: correctness, GC heap fragmentation due to pinning, and as we have just seen, performance. I hope that this example will help developers understand that safety does not necessarily incur a runtime cost. Before attempting to evade a ‘safety tax’ it is a good idea to check if you are currently paying one. The first step in doing that is viewing disassembly of the optimized code
-Matt Grice
|
-
By Fei Chen
How are value types implemented in the 32-bit CLR?
Value types are the closest thing in the common language runtime model to C++ structures. An instance of a value type is simply a blob of data in memory that contains all the fields in the instance. The main difference between an instance of a value type and an instance of a reference type is that the former does not contain the type ID in its blob (see the example below), because the type information for value types is only needed at compile time.
struct PointStruct() { int x; int y; } // The memory needed for the instance of this value type is 8 bytes, with 4 bytes for each integer field.
class PointClass() { int x; int y; } // The memory need for the instance of this reference type is 12 bytes, with the first 4 bytes containing the type ID, followed by 4 bytes for each integer field.
Being a contiguous blob of data in memory, value type instances are referenced internally in the CLR using the pointer to the beginning of the blob of memory.
A value type instance can live in one of two different places – a value type local variable, or a value type field in a reference type. When a value type local variable is declared, the prolog of the jitted code reserves a piece of stack memory (large enough to hold the instance of this value type), and the pointer to this stack location is used in all the places in the jitted code where this local variable is referenced. In the case of a value type field embedded in a reference type, the memory on the heap for this object contains the memory needed for its value type field. See this example:
class PointWithColorClass() { int color; PointStruct point; }
The size of the above object is 16 bytes. The first 4 bytes is the type ID; the next 4 bytes is the color field; and the last 8 bytes is for the point value type field.
The pointer to the beginning of point field is used in all the places in the jitted code where this field is referenced. (Note that this time the pointer points to a location on the heap, instead of the stack.)
Common operations on value types
There are only a few common operations on value types. Here is how they are implemented internally.
· Field access
Given that a value type instance is referenced by the pointer to the beginning of its blob, accessing its fields is nothing more than adjusting this pointer with the corresponding field offset. In other word, if a value type local variable p of type PointStruct lives at [EBP-8], then a “mov eax, [EBP-8]” instruction reads the field x and a “mov [EBP-4], eax” instruction writes the field y.
· Initialization
Zero initialization of a value type instance is done by calling memset on this piece of memory with 0s.
· Assignment
Assignment from an instance of a value type to another is done by calling memcpy between these two pieces of memory.
· Calling the instance method
CLR supports calling instance methods on value types. This is internally done by passing the pointer to the instance as a first parameter to the target method. This should sound similar to people who are familiar with C++ instance method calls, where the “this” pointer is passed as the first parameter.
Since JIT owns the code generation for both the caller and the callee methods, it knows how to generate correct code for the value type instance method. In other words, it expects the first parameter to be the pointer to the blob of the value type instance.
· Passing as an argument by-value
Passing a value type instance as a by-value argument requires making a stack copy and then passing the pointer to this copy to the target method. Consider what needs to be done at the call site of Foo() in the following example:
static void Foo(int i, string s, PointStruct pointArg) { … }
static void Main() { PointStruct point; Foo(1, “one”, point); }
What happens at the call site can be described using the following pseudo code in C++ syntax:
PointStruct stackCopyOfPoint; // This is a stack local variable.
stackCopyOfPoint = point; // (or think of it this way) memcpy(&stackCopyOfPoint, &point, 8);
Foo(1, “one”, &stackCopyOfPoint);
The stack copy is necessary for maintaining the by-value semantics so the callee only sees the copy and hence has no way to affect the original one.
· Passing as an argument by-reference
Passing a value type instance as a by-reference argument is easy. Just pass the pointer. Now the callee and the caller see the same instance. So any change done inside the callee will affect the caller.
· Returning a value type
Returning a value type requires the caller to provide the storage. The pointer of the return storage buffer is then passed in as a hidden parameter to the callee. The callee is responsible of filling in this buffer. Consider this example,
static PointStruct Bar() { … }
static void Main() { Bar(); }
What actually happens at the call site can be described using the following pseudo code:
PointStruct tempPoint; // A temporary stack local created to hold the return value.
Bar(&tempPoint);
In the case of an embedded value type field, consider this example:
static PointStruct Bar() { … }
static void Main() { PointWithColorClass obj; obj.point = Bar(); }
What actually happens at the call site is:
Bar(&(obj.point));
Inefficiencies in the code generation with regards to value types in .NET 2.0
Code generation for value types in .NET 2.0 has several inefficiencies.
1) All value type local variables live entirely on the stack.
2) No assertion propagation optimization is ever performed on value type local variables.
3) Methods with value type arguments, local variables, or return values are never inlined.
While the original intent of supporting value types in the CLR was to provide a means for creating “lightweight” objects, the actual inefficiencies in the code generation make these “lightweight” objects not-so-light.
For bullet 1), the following code would mean 3 stack operations, one for each field access:
static void MyMethod1(int v) {
PointStruct point; // point will get a stack location.
point.x=v; point.y=v*2;
Console.WriteLine(point.x); // All 3 field accesses involve stack operations.
}
Wouldn’t it be nice if the jitted code stored both fields of point into registers and avoided allocating stack space for this value type local variable altogether?
For bullet 2), the following code would mean 19 useless memcpy’s.
static void MyMethod2() { PointStruct point1, point2, …, point20; point1.x = point1.y = 5;
point2 = point1; point3 = point2; … point20 = point19;
Console.WriteLine(point20.x + point20.y); }
Wouldn’t it be nice if the JIT could apply copy-propagation to these value type local variables and morph the above code to “Console.WriteLine(point1.x + point1.y)” instead?
For bullet 3), a simple field getter of a value type turns into an expensive method call:
struct PointStruct() { int x; int y; public int XProp { get { return x;} } }
static void MyMethod3() { PointStruct point; point.x = point.y = 5;
Console.WriteLine(point.XProp); } // point.XProp is a method call which is never inlined.
Currently the JIT does not perform any assertion propagation to local variables whose addresses have been taken. Common operations on value types, however, do involve taking their addresses.
Improving value type code generation in CLR v.Next
Improving code generation with regards to value types has always been a top customer ask according to MS Connect: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93858.
Over the past year or so, the JIT team has been working on significant improvements to value type code generation, as well as the inlining algorithm. In summary, all of the above limitations are being eliminated.
The new inliner will allow inlining methods with value type arguments, local variables or return value. This solves the issue in bullet 3).
An algorithm called “value type scalar replacement” has been implemented to address the issues in bullets 1) and 2). This algorithm is based on the observation that a value type local variable logically can be viewed as a group of independent local scalars, each representing a field in this value type local variable, if
a) there is no operation in the current method that causes any interaction between these fields;
and
b) The address of this value type local variable is never exposed outside of the current method.
When the above conditions are met, the MyMethod1() listed above can be safely transformed to
static void MyMethod1(int v) {
int x; int y; // Was “PointStruct point;”. Now replaced by x and y.
x=v; y=v*2; Console.WriteLine(x);
}
by replacing the value type local variable point with a group of independent integer local variables, namely x and y.
And the MyMethod2() listed above will be transformed to
static void MyMethod2() { int x1, y1, x2, y2, …, x20, y20; x1 = y1 = 5;
x2 = x1; y2 = y1; x3 = x2; y3 = y2; … x20 = x19; y20 = y19;
Console.WriteLine(x20 + y20); }
Furthermore, the assertion propagation algorithm and the constant folding algorithm will be applied to these scalars, since none of them have their address taken. As a result, the code will be reduced to:
static void MyMethod1(int v) { Console.WriteLine(v); }
static void MyMethod2() { Console.WriteLine(10); }
In addition, the register allocation algorithm will home the local variable v into a machine register, so no stack operation will occur in MyMethod1().
Not all value type local variables can be replaced by scalars, however. Local variables with their address taken, and exposed outside of the current method, cannot be replaced. Consider this example where SomeBigMethod() is an instance method in PointStruct that is not inlined.
static void MyMethod4() { PointStruct point; point.SomeBigMethod(); }
The address of point is taken and passed as the “this” pointer to SomeBigMethod(). What SomeBigMethod() does with this pointer is totally out of the control of MyMethod4(). In this case, point is not replaced by scalars. Another way to expose the address of a value type local variable is to pass it as a by-reference argument to another method. Taking the address of a value type local variable and storing it in a static variable, or in an object, also exposes the address.
The JIT in CLR v.Next will be able to perform value type scalar replacement optimization on the following kinds of value types whenever it thinks it will be beneficial:
1) The value type contains no more than 4 fields.
2) The types of the fields in the value type are either primitive types or object references.
3) The value type must be using [StructLayout(LayoutKind.Sequential)], which is the default.
Guidelines for using value types in the CLR
The decision around whether to use value types, or not, should be based primarily on the semantics of the program. Value types should be used when the pass-by-value semantics are the most natural, and the most frequently used in the program.
After the decision has been made to use the value type, it is time to think about the performance implications, and to determine how to help the JIT generate the best possible code. Always keep in mind that the by-value nature of value types means that a lot of copy operations might be happening under the covers. Also, nearly every operation related to a value type will be a memory operation (either operated on the stack or on the heap) if this value type is not replaced by scalars.
Developers should examine the jitted code of their hot methods under the debugger to make sure the value type stack local variables are indeed homed in registers.
Try not to create value types that contain more than 4 fields. Try not to create non-inlineable value type instance methods and call them in the hot path, because doing so will cause the address to be exposed. When a temporary value type instance is needed, try using value type local variables rather than the value type fields embedded in an object, because the latter are never replaced with scalars.
|
-
by Brian Sullivan
In Visual Studio you can set a breakpoint at any line in your source code. When you run your program Visual Studio will break and stop execution when it reaches your breakpoint. At this point you can right click on your source code and select Go To Disassembly. You will see the assembly language instructions that were created by the JIT compiler for this method.
The JIT compiler generates either Debug code or Optimized code
If this is your first time looking at the JIT assembly code you undoubtedly are looking at the Debug code generated by the JIT compiler.
This Debug code is not the high quality optimized code that the JIT compiler can generate.
Instead it is basic assembly code that does not have any optimizations applied to it. All local variables references are references into the local stack frame storage because in Debug code the JIT does not enregister any local variables. Another property of Debug code is that nop instructions are inserted before some source code statements.
You probably don’t want to spend much time looking at the Debug JIT code. Instead you probably want to look at Optimized JIT code. By looking at Optimized JIT code you can confirm that the assembly code for your important methods are well optimized. And if they are not optimized as well as you want, you can experiment to see if making some source code changes can improve the Optimized JIT code.
There are several settings that you will need to change in order for you to see the Optimized code generated by the JIT compiler.
1. The default configuration is Debug and you want to select the Release configuration.
· Select the Release configuration
2. Make sure that PDB files get created for Release builds.
· Set Generate debug info to pdb-only.
More information on why this is necessary:
Ideally you would have wanted it to be the case that simply changing the configuration in the Solution Configuration window to ‘Release’ would be sufficient. However by default the ‘Release’ builds do NOT build a program data base (PDB) file for your program. The (PDB) file is essential when using a debugger as it holds the mapping of the source code lines and local variables for your program.
To fix this go to the properties for the project (right click on the project in the Solution Explorer), select the ‘Build’ tab and click the ‘Advanced’ button all the way at the bottom (you may need to scroll to see it). In the dialog box that comes up there will be a line ‘Debug Info’. It should be set to ‘pdb-only’. This tells the compiler to still compile with optimizations, but to also create the pdb file.
3. Configure the Debugging Options in Visual Studio to allow the JIT to generate optimized code and to allow you to debug the optimized code.
Go to Tools => Options => Debugging => General
· Make sure that box labeled ‘Suppress JIT optimization on module load’ is Unchecked.
· Make sure that the box labeled ‘Enable Just My Code’ is Unchecked.
More information on why this is necessary:
The Visual Studio IDE makes debugging optimized code harder than you would like.
Whenever you launch a managed program under Visual Studio using (Start-Debugging or F5), it will by default, force the JIT to create Debug code. This is true even when you have selected the 'Release' configuration. The reason for this is to improve the debugging experience, but it also makes it impossible to see the Optimized code that you will get whenever your program is not running under the Visual Studio debugger. Your alternative of launching the program using (Start Without Debugging or Ctrl+F5) does allow the JIT to create optimized code but Visual Studio won’t set any of your break points in your code. Thus you normally won't be able to stop and examine the assembly code for a method. We actually want do some debugging with the Optimized JIT code.
Another problem is that Visual Studio 2005 has a new feature called ‘Just My Code’ in which the debugger will not step into any code that it does not believe is being developed. Instead it will step over the code. This is also to improve the debugging experience, as most typical users do not want to step into Microsoft code such as the Collections classes, etc… However any optimized code is also not considered as 'Just My Code’ so it too will be step over and is skipped. Again this makes it impossible to actually stop and see the optimized code.
Note that these are global settings as they affect all solutions; however I have found I don’t miss any of these ‘features’. That is because any code that is compiled for the ‘Debug’ configuration is still not optimized by the JIT so you will still get good debugging there.
In my next blog entry I will demonstrate some of the optimizations that we perform in the JIT compiler.
|
-
I work in the CodeGen test team and wanted to share a recent customer experience that was related to ngen. One of our Customer Service and Support (CSS) engineers in France contacted us regarding an installation delay with the latest Microsoft Exchange Server 2007 update rollup. Apparently the patch installer was spending 2 hours generating up-to-date NGen images.
The exchange installer package issues an “ngen update” command at the very end that causes all Exchange assemblies affected by the update to be recompiled synchronously. There is a known issue with this approach that had affected some customers whose machines had been updated with a .NET Framework 2.0 patch just prior to installing the Exchange patch. The .NET Framework 2.0 patch issues an “ngen.exe update /queue” command which schedules background assembly compilation at machine idle time. Then when the synchronous NGen command is issued by the Exchange installer package, it triggers compilation of all managed assemblies that don’t have up-to-date NGen images, including the .NET Framework ones. We originally thought that this might have been the reason behind the long time taken to regenerate the NGen images on this customer’s machine too. However, that was ruled out by the CSS engineer after further investigation.
We then requested the CSS engineer for the two ngen log files created by ngen and its service found at the .Net Framework redistributable installation directory (in this case, ngen.log and ngen_service.log found at %WINDIR%\Microsoft.NET\Framework64\v2.0.50727) and found an interesting anomaly. The ngen compilation worker might sometimes encounter an assembly that is already up-to-date, and the time taken to perform this validation is typically less than a second (the compilation worker exits as soon as it finds that the assembly already has a valid NGen image). But in this customer’s case, each of these assemblies was taking 10-14 seconds to determine that they already had a valid image (see below). These extra 10-14 seconds when multiplied across many different assemblies was resulting in a significant increase in patch install time.
08/24/2007 10:36:21 [5660]: Compiling assembly Microsoft.Exchange.Data.Common, Version=8.0.681.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35 ...
08/24/2007 10:36:33 [5660]: Assembly Microsoft.Exchange.Data.Common, Version=8.0.681.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35 is up to date.
Around the same time, one of our developers was helping out on a Certificate Revocation List (CRL) issue report by another customer and asked the CSS engineer to check the “Software Restriction Policies” (http://technet.microsoft.com/en-us/library/bb457006.aspx#EAAA) enabled on the customer’s machine. Sure enough, the delay was due to an Internet Explorer setting called “Check for publisher’s certificate revocation” (Options, Advanced, Security) that was checked on the customer’s machine, and was resulting in network requests to connect to crl.microsoft.com and look up the certificate revocation list at NGen time. The customer’s machine was behind a locked down firewall and did not have access to the internet and the reason for the 10-14 delay for each assembly. This option was unchecked in the IE setting and the install time was reduced to the expected 10 minutes from the 2-hour time. It was considered safe to uncheck this security option in IE since the machine was in a tightly controlled environment. Short summary of what we learned from this experience: Issuing a synchronous NGen update command in the patch installer may not be the best option. It might be better to schedule an NGen update for machine idle-time using “ngen.exe update /queue”, and to (optionally) issue separate NGen install commands for perf-critical assemblies whose NGen images need to be regenerated right away.
|
-
One of the topics we often get questions on is about when it makes sense to invest the extra effort to pre-compile assemblies via NGen instead of simply relying on the JIT compiler to generate native code on the fly at application runtime. I thought I would try to answer that question in our first "real" post.
The JIT is optimized to balance code generation time against code quality, and it works well for many scenarios. NGen is essentially a performance optimization; so just like for anything else that's performance-related, you'll want to measure performance with vs. without NGen, and then decide whether it really helps your specific application and scenario. Here are a few general guidelines.
When to use NGen:
- Large applications: Applications that run a lot of managed code at start up are likely to see wins in start up time when using NGen. Microsoft Expression Blend for example, uses NGen to minimize start up time. If a large amount of code needs to be JIT-compiled at start up, the time needed to compile the IL might be a substantial portion of the total app launch time (even for cold start up) . Eliminating JIT-compilation from start up can therefore result in warm as well as cold start up wins.
- Frameworks, libraries, and other reusable components: Code produced by our JITs cannot be shared across multiple processes; NGen images, on the other hand, can be. Therefore any code that is likely to be used by multiple applications at the same time is likely to consume less memory when pre-compiled via NGen. Almost the entire Microsoft .NET Framework for instance, uses NGen. Also Microsoft Exchange Server pre-compiles its core DLLs that are shared across various Exchange services.
- Applications running in terminal server environments: Once again NGen helps in such a scenario because the generated code can be shared across the different instances of the application – that in turn increases the number of simultaneous user sessions the server can support.
When not to use NGen:
- Small applications: Small utilities like caspol.exe in the .NET Framework aren't NGen-ed because the time spent JIT-compiling the code is typically a small portion of the overall start up time. As a matter of fact, since NGen images are substantially bigger in size than the corresponding IL assemblies, using NGen might actually result in increased disk I/O and hurt cold start up time.
- Server applications: Server applications that aren't sensitive to long start up times, and don't have shared components are unlikely to benefit significantly from NGen. In such cases, the cost of using NGen (more on this below) may not be worth the benefits. SQL-CLR for example, isn't NGen-ed.
If it seems that your application is likely to benefit from NGen, here are a few things you might want to keep in mind:
- NGen is triggered by issuing commands at install time to the ngen.exe tool in the .NET Framework redistributable. To find out more about hooking NGen into set up, see here: http://blogs.msdn.com/astebner/archive/2007/03/03/how-to-ngen-files-in-an-msi-based-setup-package-using-wix.aspx.
Note: Currently you need admin privileges to issue commands via ngen.exe.
- Just running ngen.exe on your assemblies and then re-running your performance scenarios to measure the wins is very likely not going to give you a true indication of the wins. Basically, it takes some effort to get the best performance out of NGen. For detailed guidelines on how to get the best performance via NGen, please take a look at this article in the MSDN magazine: http://msdn.microsoft.com/msdnmag/issues/06/05/CLRInsideOut/default.aspx.
- Using NGen has implications for servicing your application. Your patch installer will need to issue NGen commands to ensure that the images invalidated by the patch get regenerated. .NET Framework 2.0 supports an "ngen.exe update /queue" command that should work well in most cases (this will regenerate all invalid NGen images at machine idle time). If specific assemblies are perf-critical and need to have up-to-date NGen images immediately after the patch install, you can issue separate NGen commands for those ("ngen.exe install <CriticalDLL>"), followed by the "ngen.exe update /queue" command.
- Using NGen implies your application's disk footprint will increase (NGen images are fairly big when compared to MSIL assemblies).
- In a handful of scenarios, cold start up and/or application execution can get slower when you use NGen, so you'll definitely want to measure the performance with vs. without NGen for all scenarios, and validate that it's a win overall for your application to be using NGen.
Finally, another important thing to keep in mind is that NGen isn't a magic solution for application start up time. NGen can speed up application start up by eliminating the time needed for JIT compiling the code executed at start up.
However, regardless of whether you do or don't use NGen, it is important to ensure that your application is architected to be performant. So for instance, if you care about start up time, you should try to minimize the amount of code that needs to get loaded and executed on your application's start up path.
Below are some good resources on measuring and optimizing application start up performance: http://blogs.msdn.com/vancem/archive/tags/Perf/default.aspx
http://msdn.microsoft.com/msdnmag/issues/06/02/CLRInsideOut
http://msdn.microsoft.com/msdnmag/issues/06/03/WindowsFormsPerformance/
And some for NGen:
NGen internals: http://msdn.microsoft.com/msdnmag/issues/05/04/NGen/default.aspx
NGen Service (NGen-ing your assemblies in the background):
http://blogs.msdn.com/davidnotario/ (David isn't on our team any longer)
Thanks for visiting our blog!
|
-
This is the first blog post from the code generation feature team working on the Common Language Runtime (CLR). We're the group of individuals that make it possible to generate native code for all binaries that run on top of the Microsoft .NET Framework. Since all managed applications today are distributed in a format known as MSIL (for Microsoft Intermediate Language), the CLR is responsible for compiling MSIL to directly-executable machine code. That's where we come in -- currently we support compiling against 3 different machine architectures -- x86, x64, & IA64, and in 2 different compilation modes -- dynamic compilation using the Just-in-time compiler (JIT), and ahead-of-time/pre-compilation using NGen (for Native Generation). Two examples of design challenges in our space include balancing the quality of the generated code with the time taken to generate it, and balancing the desire to update NGen images soon after MSIL binaries are patched with the desire to keep patch install time at a minimum.
Expect posts about our JIT & NGen back-ends over the coming months on this blog. Also, please feel free to let us know what codegen-related topics you're interested in reading about.
|
|
|
|