A first-hand look from the .NET engineering teams
This post announces an updated preview of the .NET team’s new 64-bit Just-In-Time (JIT) compiler. It was written by Mani Ramaswamy, Program Manager for the .NET Dynamic Code Execution Team.
Note: RyuJIT CTP3 is available here: http://blogs.msdn.com/b/dotnet/archive/2014/04/03/the-next-generation-of-net.aspx.
The developer preview of RyuJIT, CTP1, received a thunderous response (so much so that we had to post a FAQ soon after). The two most common questions were when there would be an update, and when RyuJIT would support feature X or Y from the existing 64-bit .NET JIT compiler. CTP2 answers both. This release of RyuJIT is functionally equivalent to the existing JIT64: there aren’t any feature differences between the two at this point. RyuJIT generates code that is, on average, better than JIT64’s, while maintaining its 2X throughput win over JIT64.
The two main features that weren’t supported in CTP1 were “opportunistic” tail calls and Edit & Continue. With CTP2, both are supported. Additionally, a host of other features have been added to achieve functional parity with JIT64. Along the way, we (the .NET Code Generation team) have also added a number of performance tweaks and optimizations so that code compiled with RyuJIT is generated fast (the throughput metric) and runs fast (the code quality metric).
But why stop there? We have thrown every test at our disposal at RyuJIT and it has come out with flying colors – whether it be running common server software using IKVM.NET (a Java Virtual Machine implemented in .NET), or complex ASP.NET workloads, or even simple Windows Store apps. Thanks to everyone who tried out the first CTP of RyuJIT and filed bug reports – we’ve fixed every single one of them, and at this point, RyuJIT doesn’t have any known bugs.
We continue to look at ways to improve the overall quality of RyuJIT, and will likely discover a few more bugs along the way. Given the enthusiastic response to the first CTP, we expect to hear about a few more bugs from our early adopters, i.e. you.
When it comes to performance, CTP1 demonstrated that RyuJIT handily beats JIT64 on throughput (how fast the compiler generates code) by a factor of 2X. We’ve been careful to maintain our throughput wins, and this CTP should yield similar throughput numbers. With CTP1, the focus was on throughput and on getting early feedback, and not so much on code quality (how fast the generated code executes).
With CTP1, we were in the same ballpark as JIT64 but still 10-20% slower on code quality, with some outliers. With CTP2, we’ve addressed that: at this point, on average, we should be on par with or beating JIT64 on code quality. If, during your evaluation, you find a benchmark where RyuJIT trails JIT64 significantly, please reach out to us. By the time we’re done, RyuJIT should be producing better code than JIT64. This is not to say that there couldn’t be a few micro-benchmarks where JIT64 produces more optimal code, but rather that on average RyuJIT should be on par or better, and in the few (rare) cases where it does trail JIT64, it trails by only a few percentage points. We tried many common code quality benchmark suites internally and found that RyuJIT’s code quality is, on average, better than that of the existing .NET JIT64 compiler, so if you do find an outlier, we’re very interested.
The chart below shows our performance, relative to JIT64’s across a number of benchmarks, some very small, others fairly large. Positive numbers indicate RyuJIT performing better than JIT64. Negative numbers indicate the opposite. The gray section is the limit of “statistical noise” for each benchmark, so any bar that is within the gray area indicates effectively identical performance. Check the CodeGen blog within a day or two for a detailed description of the methodology and specifics about the benchmarks we’re running. Overall, we’re doing quite well, with only a handful of losses, and some very nice wins!
While we needed to first get all the functionality and quality metrics lined up and achieve parity on performance (code quality) with JIT64 (we’re already 2X faster on throughput, in case you forgot), our re-architecture puts us in a great place for optimizing .NET dynamic code execution scenarios. Over the next few months, you will continue to hear from us as we move forward on quality and performance.
The process to install RyuJIT remains simple and is the same as that for CTP1. While it’s not for production code yet, we look forward to hearing from you on functionality, quality and performance. Send feedback and questions to firstname.lastname@example.org. Even if you’ve figured out the issue or a workaround yourself, we want to hear from you.
You can download the RyuJIT installer now. RyuJIT only works on 64-bit editions of Windows 8.1 or Windows Server 2012 R2. Also, to keep your system isolated, RyuJIT doesn’t affect NGen.
After installation, there are two ways to turn on RyuJIT. If you just want to enable RyuJIT for one application, set an environment variable: COMPLUS_AltJit=*. If you want to enable RyuJIT for your entire machine, set the registry key HKLM\SOFTWARE\Microsoft\.NETFramework\AltJit to the string "*". Both methods cause the 64-bit CLR to use RyuJIT instead of JIT64. Both are also temporary settings: installing RyuJIT doesn’t make any permanent changes to your machine (aside from installing the RyuJIT files in a directory).
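As a rough illustration of the per-application option, the sketch below launches a program with COMPLUS_AltJit=* set only in that child process's environment, leaving the rest of the machine untouched. It is written in Python purely for illustration; the helper names `ryujit_env` and `run_with_ryujit` are hypothetical and not part of the RyuJIT installer or the .NET Framework.

```python
import os
import subprocess


def ryujit_env(base_env=None):
    """Return a copy of an environment with COMPLUS_AltJit=* added.

    The variable name and value come from the post; this helper itself
    is a hypothetical convenience wrapper.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["COMPLUS_AltJit"] = "*"
    return env


def run_with_ryujit(command):
    """Run `command` (a list of strings) with the switch set for that process only."""
    return subprocess.run(command, env=ryujit_env(), check=False)


# Hypothetical usage, assuming MyApp.exe is a 64-bit managed application:
#     run_with_ryujit(["MyApp.exe", "arg1"])
```

Because the variable lives only in the child's environment, other managed applications keep using JIT64, which matches the "temporary setting" behavior described above.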
Stay tuned to this blog for more on RyuJIT as it progresses from a CTP to become the One True .NET JIT Compiler. If you want to dive more deeply into the geeky details behind RyuJIT’s development, you can do that at the .NET CodeGen blog. For example, here's more detail on performance numbers. We’ll answer questions and explain design decisions there. And remember to send us mail at email@example.com after you download and install the CTP. We want to hear from you!
@Pop Catalin: Without making any promises, we're doing our best to be able to do that. Changing a JIT compiler is kind of scary. Back when I was working on C++ we'd ship a new C++ compiler, people picked up the new compiler, then they recompiled their code, then they tested their code, then they shipped the updated application. In a JIT world, we ship a new JIT and everyone's existing application is magically compiled with the new JIT. It's like trying to change a flat while the car is driving. It can definitely be done, it's just hard and requires a lot of coordinated effort.
What about a way to tag a method or class as "hot"? This is often not statically known, so the dev can help out.
In a hot method, inlining and loop unrolling heuristics would be turned up greatly. We no longer need to be conservative here because the developer told the JIT not to be.
This seems like quite a small engineering investment to make. So the only question is whether you want engineers to make that decision.
Given that the JIT will likely never get this completely right, and your resources are limited, I think this feature should be implemented. Devs will thank you for a reliable way to force this.
Did you consider dynamic range check elimination yet? wikis.oracle.com/.../RangeCheckElimination Loops are split into 3 loops where the middle loop has no range checks. 99% of the time is spent in the middle loop.
If this were implemented, statically removing range checks would become less important.
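To make the three-loop idea concrete, here is a minimal sketch in Python. It only models the structure: `checked_get` is a hypothetical stand-in for the implicit bounds check a managed runtime performs on every array access, and the function names are invented for this example. It is not how any particular JIT implements the transformation.

```python
def checked_get(a, i):
    """Stand-in for an implicitly bounds-checked array read."""
    if i < 0 or i >= len(a):
        raise IndexError(i)
    return a[i]


def sum_checked(a, start, stop):
    """Original loop: every iteration pays for a range check."""
    total = 0
    for i in range(start, stop):
        total += checked_get(a, i)
    return total


def sum_split(a, start, stop):
    """Same loop split into pre, middle, and post loops."""
    safe_lo = max(start, 0)                      # first index known to be in range
    safe_hi = max(min(stop, len(a)), safe_lo)    # one past the last such index
    total = 0
    # Pre-loop: indices below the safe range, still checked.
    for i in range(start, min(safe_lo, stop)):
        total += checked_get(a, i)
    # Middle loop: every index is provably in range, so no explicit check.
    # (Python still checks internally; a JIT would drop the check entirely.)
    for i in range(safe_lo, safe_hi):
        total += a[i]
    # Post-loop: indices above the safe range, still checked.
    for i in range(safe_hi, stop):
        total += checked_get(a, i)
    return total
```

For in-range inputs the pre and post loops run zero iterations, so all of the work lands in the check-free middle loop; out-of-range inputs still fail at the same index as the original.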
As André Slupik said, an option for enabling rather expensive optimizations would be nice, maybe as an assembly- and method-level attribute, or based on a profiling step that marks methods which can profit from further optimization.
@xor88 & Suchiman: a) Prior to RyuJIT, we really didn't have any opts we could add and have any certainty that things would be better. We do have "MethodImplOptions.AggressiveInlining". We'll definitely keep this one in mind. b) We added Loop Cloning, where the dynamic bounds check occurs, and if it passes, then we go down a bounds-check-free path, otherwise we do the slow version. I'll have to dig out a small example and dump it on the codegen blog. It's there in CTP2. It winds up bloating code, so we're rather conservative about it, and it's not particularly well tuned yet, but it's definitely there, and shows some real promise.
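The shape of loop cloning as described in (b) can be sketched roughly as follows. This is a Python model, not RyuJIT output: `checked_get` is a hypothetical stand-in for the runtime's per-access bounds check, and `sum_first_n_cloned` is an invented name.

```python
def checked_get(a, i):
    """Stand-in for an implicitly bounds-checked array read."""
    if i < 0 or i >= len(a):
        raise IndexError(i)
    return a[i]


def sum_first_n_cloned(a, n):
    """Sum a[0..n) using two copies of the loop, chosen by one up-front test."""
    total = 0
    if 0 <= n <= len(a):
        # Fast clone: the single dynamic check above covers every index
        # the loop will touch, so the body carries no per-iteration check.
        for i in range(n):
            total += a[i]
    else:
        # Slow clone: the original loop, checks and all.
        for i in range(n):
            total += checked_get(a, i)
    return total
```

The loop body exists twice, which is the code bloat mentioned above and the reason to be conservative about where the transformation is applied.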
Surprising that you require 2012 R2, when straight 2012 is just now widely available in the wild, where I really want to test it!
Would like to see some SIMD(!) Yes, voted it up already. Please everyone do so too(!)
I don't understand how to enable the compiler. Can someone post step-by-step instructions? That would be much appreciated!
@Kevin Frei, any plans on branch prediction (en.wikipedia.org/.../Branch_prediction) and loop-invariant-hoisting (en.wikipedia.org/.../Loop-invariant_code_motion) optimizations?
In case you guys didn't notice, the most viewed question on StackOverflow: //stackoverflow.com/q/11227809
@LKeene: If you have an application that can be invoked from a command-line window:
1) Open a command-line window (windows key->q, type cmd in the search text box, click on "Command Prompt")
2) Change directory to the path where your application is
3) Type "SET COMPLUS_AltJit=*" at the command prompt and press Enter
4) Run your application from the command line
This will ensure that your application will use RyuJIT.
If you want to go back to the default 64-bit JIT for other managed applications, close the application and close the Command Prompt, or enter "SET COMPLUS_AltJit=" in the Command Prompt.
Let me know if you want step-by-step instructions on how to set the registry key that causes all managed applications to run using RyuJIT (not recommended for a production machine), or if you want to check via a debugger that RyuJIT is getting loaded.
Hope this helps.
@Kevin Frei: "I know. And any excuse I might try and make will just sound whiny, so I'll just leave it at that :-)"
No problem, I hope I didn't sound too insistent about the issue. It's just that it seems a bit surprising that such a small (and not uncommon) piece of code can generate code with so many issues (I think I count 4 distinct ones).
@AzureSky: "Would like to see some SIMD(!) Yes, voted it up already. Please everyone do so too(!)"
I think I up-voted the Connect bug about this, but I'm seriously considering down-voting it. It seems to me that the rather high count of up-votes is the result of "SIMD is cool" thinking rather than more down-to-earth thinking: "SIMD actually works given the optimization constraints of a JIT compiler". Otherwise you can end up in situations like one I've recently seen in the VC++ compiler: doing 2 additions with a single SIMD instruction and then (unnecessarily) spilling the registers to memory and killing the perf.
@Lakshan: Thank you!
And yes, if you could, please post step-by-step instructions on setting the registry so that the RyuJIT option is set globally (as well as the debug info to verify that it's running). Much appreciated!
@mattias - It could have been compiled using IronJS, or translated to C#.
@Jerry: We're intimately aware of the branch predictor hardware. Being aware of it doesn't really make it something that we can optimize for. Intel's branch predictor is _incredibly_ complex (and incredibly capable). There are a few places where we try to take branch prediction into account, but doing it for general purpose code is just not particularly useful, because the hardware handles it quite well. Loop Invariant Code Motion is already in place in RyuJIT. We hoist loop invariant expressions out of loops. While we don't always detect what's loop invariant, the core algorithm is in place. To make it work better, we just have to make the core analysis better, which makes everyone happy, not just folks who want LICM to work :-).
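For readers unfamiliar with Loop Invariant Code Motion, the before-and-after below shows the idea by hand in Python. It is an illustrative sketch only: the function names are invented, and a JIT performs this rewrite on its internal representation rather than on source code.

```python
def scale_naive(values, factor, offset):
    """Recomputes the same expression on every iteration."""
    out = []
    for v in values:
        k = factor * factor + offset   # does not depend on v: loop invariant
        out.append(v * k)
    return out


def scale_hoisted(values, factor, offset):
    """Same result, with the invariant expression hoisted out of the loop."""
    k = factor * factor + offset       # computed once, before the loop
    out = []
    for v in values:
        out.append(v * k)
    return out
```

The hoisted expression here cannot fail or have side effects, so moving it is safe even when the loop runs zero times. Establishing that kind of fact automatically is the analysis work the reply refers to.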