NGen Primer

NGen Overview

I thought it would be useful to provide a primer on the NGen tool and pre-jitting your code for performance reasons.  In particular, there are some gotchas you must be aware of when authoring your product.  In this entry, I'm going to cover some background material on paging (which you can skip if you are an expert already).  Then we'll cover the workings of the NGen tool, some servicing implications, and finally some future directions.

Before we get started, let me keep up a Microsoft tradition and include the key takeaways right here.  If you get nothing else out of this topic or can't read the whole thing, make sure you absorb the following:

  • NGen is important for getting faster startup through better page sharing and working set reduction

  • NGen is primarily interesting for client scenarios, which need the faster startup to be responsive

  • NGen is not recommended for ASP.NET, because the assemblies NGen produces cannot be shared between App Domains

  • NGen for V1.0 and V1.1 was primarily designed for the CLR (where we've seen dramatic wins), and while it can be used for shared libraries and client apps:

  • Always measure to make sure it is a win for your application

  • Make sure your application is well behaved in the face of brittleness and servicing

Bottom line recommendation:  keep your eye on the technology, experiment with it, but plan to wait for a future version before really pulling it into your application.

Paging Primer

Windows uses a virtual address space on your machine, so for a 32-bit system you get from 0 to 4GB of addressable memory for each process.  Windows code is typically compiled into a Portable Executable file (PE file), which contains sections of code and data marked with page attributes like read, write, and execute.  When the OS loads such a file into a process, it maps the memory from your file into physical pages that can be addressed by the process.  So far so good?

On the x86, calls to methods are typically in the form of "call address", where address is an absolute value from 0 to 4GB, and tells the CPU the precise location it should transfer to.  This poses a problem for the compiler, because it means that when the user's file is loaded, it needs to know precisely where all of the methods it will call inside that file live (not just relative to the start of the file, but the absolute address in the entire process).  There are two things that kick in here to aid you:

  Base Address

This is the address you specify as a developer, either through a compiler switch (eg: /baseaddress in VB.Net or C#) or using the rebase tool, where you want your executable to be loaded.  The compiler then assumes the file will get loaded there, and can predict the absolute address of every method in the file.

   

  Relocs

Just in case your file can't be loaded at that base address (say another file is already loaded there), the compiler emits a set of relocs in the file that tell the OS where absolute addresses are located in the image.  If the file gets relocated to a new place in the process, the OS fixes up those addresses, essentially adjusting them to the new home of the code or data.  This allows flexibility, but it is also expensive; keep reading to find out why.
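To make the mechanics concrete, here is a small sketch of what a reloc fix-up amounts to.  This is purely illustrative (in real life the OS loader does this work against the PE base relocation table); the offsets and addresses are made up:

```python
# Illustrative sketch of a reloc fix-up (the OS loader does this for real).
# An image built for preferred_base holds absolute addresses at the offsets
# listed in its reloc table; if it loads elsewhere, each one is adjusted
# by the same delta.

def apply_relocs(image, preferred_base, actual_base, reloc_offsets):
    """image: dict of offset -> absolute address stored at that offset."""
    delta = actual_base - preferred_base
    if delta == 0:
        return image  # loaded at the preferred base: nothing to fix, pages stay shared
    fixed = dict(image)
    for off in reloc_offsets:
        fixed[off] += delta  # the page holding this slot is now private to the process
    return fixed

# An image based at 0x400000 that stores a call target 0x401230 at offset 0x10:
img = {0x10: 0x401230}
print(hex(apply_relocs(img, 0x400000, 0x500000, [0x10])[0x10]))  # 0x501230
```

Note that when the delta is zero nothing is touched, which is exactly why loading at your preferred base keeps the pages shareable.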

Besides allowing the compiler to stitch together your program, a base address gives you a predictable location for your file to get loaded every time it is executed.  This is important, because if sections of the file (say, all of your executable code) are read only, then we'd like to be as efficient as possible on the machine and share those pages between processes.  The OS accomplishes this if your pages are marked read only and sharable.  So if the code for strcpy from msvcrt.dll lives at location 0x70124800, then the one page of physical memory holding that code can be viewed in every process on the machine that needs it, provided those processes have loaded msvcrt.dll at the same address.

[Diagram: User Process 1, Kernel Mapped Pages, User Process 2.  The same shared read-only page is mapped into both user processes; writeable pages are unique per process.]

See the advantage?  Overall system memory pressure goes down with shared pages because only one physical page is used no matter how many times you load it.  Also, speed of loading code goes up, because chances are the system already has that file loaded in some other process on the machine.  This is typically referred to as a "warm startup", because the OS has already loaded many of the pages you need, and doesn't have to go out to disk to get them.  So bottom line, sharing of pages between processes is a GOOD THING.
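The arithmetic behind that advantage is simple.  Here is a back-of-the-envelope sketch with made-up numbers, just to show how the physical cost scales:

```python
# Back-of-the-envelope savings from page sharing: a read-only image mapped
# into N processes costs one set of physical pages instead of N copies.
PAGE = 4096  # x86 page size in bytes

def physical_cost(image_pages, processes, shared):
    """Physical bytes consumed by an image across all processes using it."""
    return PAGE * image_pages * (1 if shared else processes)

# 256 KB of read-only code (64 pages) used by 10 processes:
print(physical_cost(64, 10, shared=False) // 1024)  # 2560 (KB: one private copy each)
print(physical_cost(64, 10, shared=True) // 1024)   # 256  (KB: one shared copy)
```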

I mentioned that having to relocate a file away from its base address is a BAD THING for your shareable data, and this loss of sharing is the reason.  If you cannot load at your preferred base address, then the absolute addresses in those otherwise sharable pages are now wrong.  So the OS has to make a copy of each affected page for your process, mark it writable, and then fix up all of the invalid values.  This is bad because it takes both more time (slower load times) and more space (for the extra unshared pages).

I should point out that some pages are, of course, intended to be per-process.  Your global data for example wouldn't make much sense if you were sharing it with another running instance of your application!  But in general we try very hard to reduce the number of pages in the system because of the high cost of the extra memory pressure.

Back to Managed

Ok, all of this background is interesting, but what does this have to do with Managed code and the CLR?  First, we also use the PE file format for managed code, so your VB.Net application will be stored in the same file format as kernel32.dll.  This allows managed executables to appear anywhere you would normally expect.  For example if you want to do a CoCreateInstance on your managed code, or do a LoadLibrary directly, you can do so.  This file format choice means we have to follow the same rules for assigning base addresses.  And guess what?  We made the metadata and IL your compiler generates read only + sharable so we could use the same memory management benefits you get with unmanaged code.

Now think about what the JIT compiler does for a minute.  It just-in-time compiles your program one method at a time.  That means we allocate, on the fly, some memory and write the necessary native code for your program out to that location.  When we need to call a method, we know where we put it in the absolute address range, so we can do the same "call address" you saw for unmanaged code.  The advantage of the JIT is that it can literally stitch your program together as you go, and it only compiles the code that you actually execute.  But since this is happening on the fly, all of those pages where this code is allocated are for that process only.  We get none of the sharing advantages you got with unmanaged code in read only + sharable pages, and it also takes time to run that compiler.  We did some experiments early on in the Runtime as proof of concept for our managed C++ compiler which included recompiling Word as an IL image.  It worked great!  But it was slow.  Office is a big application, and using the JIT for this case didn't put our best foot forward.

[Diagram: User Process 1, Kernel Mapped Pages, User Process 2.  Shared pages are mapped into both user processes; writeable pages are unique per process; JIT'd code lives in the writable pages, so it cannot be shared.]

Wouldn't it be great if you could get the same page sharing advantage as unmanaged code, and not have to run the JIT every time for a big application like Office?  That's the NGen tool, and we'll drill into that in the next section.

NGen Tool Overview

NGen stands for "Native Image Generator".  The tool allows us to run the JIT compiler on all of your IL in an assembly (a PE file) at one sitting, and cache the results out to disk.  Now when you want to load and run that assembly, we can find it in the cache and load it just like an unmanaged image.  Because the code is read only + sharable, you get the same benefits of page sharing. 

So what precisely is in that image that gets created?  Let's look at the contents:

Header

All PE files contain the standard set of headers, and an NGen image is no different.

Native Code

Obviously this is the key thing we are trying to get into the image, and it makes up the bulk of the image size.  The code persisted is 100% native at this point, so the JIT does not need to get involved to execute it.

No Metadata or IL

The current NGen-produced image does not have a copy of the metadata or the IL in it.  This is significant, because it means that you will need to have both the original IL assembly and the NGen image loaded at once.  In general we try to avoid touching the metadata and IL at runtime, but you can't always avoid it.  Two examples are late bound programming (eg: Reflection, which needs name information from the metadata) and JIT'ing of non-NGen'd code (the IL is read to see if it can be inlined).

Fix-up Tables

The CLR requires more than just code to execute.  It must have access to key data structures which describe things like Class and Method layouts, and these are only known at runtime.  We want to reduce the overall writable pages in a process, so NGen stores a table of pointers to this data, which is allocated at run time.  This allows NGen to generate one version of the code that works unmodified for all processes, because there is a predictable location in the image where you can find the pointer to the dynamically allocated data (essentially a slot in this table).  However, this technique has the down sides of (1) slowing down startup to fill out the table, and (2) generating sub-optimal code which must use a pointer indirection to get the data it needs.  Finally, it also means that we cannot simply persist the output of the JIT compiler as it runs, because the code generated in the two scenarios is different.
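The indirection scheme can be sketched like this.  This is purely illustrative Python, not CLR internals; the slot numbering and data shapes are invented for the example:

```python
# Sketch of the fix-up-table indirection (illustrative, not CLR internals).
# NGen'd code can stay byte-identical in every process because it never
# embeds the address of runtime-allocated data; it reads a slot in a
# per-process table instead.

class FixupTable:
    def __init__(self, slots):
        self.slots = [None] * slots   # filled in at startup (the cost NGen pays)

    def bind(self, slot, runtime_object):
        self.slots[slot] = runtime_object

METHOD_TABLE_SLOT = 0  # a predictable location the shared code knows about

def shared_code(table):
    # One extra indirection per access, versus JIT'd code that bakes the
    # real address straight into the instruction stream.
    return table.slots[METHOD_TABLE_SLOT]["name"]

t = FixupTable(slots=4)
t.bind(METHOD_TABLE_SLOT, {"name": "MyClass"})   # done lazily at run time
print(shared_code(t))  # MyClass
```

The trade-off described above is visible here: the shared code is position-independent with respect to the runtime data, but every access pays for the table lookup, and startup pays to populate the table.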

Even with some of the trade offs mentioned here, we've seen some remarkable performance wins from this technique (and it only gets better each new release).  There are, however, some considerations you need to make before you jump on board the NGen bandwagon.  We'll cover those now.

Performance Win?

Measure, measure, measure.  You should always verify that this is a win for you.  First, you should be writing either a shared library (like the BCL itself) or a client application that would really benefit from this kind of win.  You must go try your app with and without to make sure it is worth the effort.  It may not always be.  For example in a Server scenario, where the application runs a long time, you can amortize the cost of jitting over the run of your server.  Combine that with lack of sharing across AppDomains and NGen isn't a win.

To Cache or Not to Cache

Generating an NGen image takes time.  You will be compiling all of your code at once into the final binary.  The larger the file, the longer this takes.  We do this for the .NET Framework during installation, and you can see the pause.  In our case it makes sense: all of your applications will run that much faster because of this.  You should decide if your application can handle this kind of wait.  You may not want to do this for dynamic web content in a browser for example.  Who wants to wait for the compile to finish for it to come up once?  And will you ever run the same program as is again?

Brittleness

The MSDN documentation gives you the command line arguments and usage of the tool (which comes with the distribution).  You should read very carefully through the section on brittleness.  As an example, the ngen'd image is tightly coupled to the version of the Framework you compiled against.  If that version is serviced (we ship a Service Pack for example), then your image will not be loaded, and your application will automatically fall back to jitting.  To be clear:  your code will still run, but it will not take advantage of the speed improvements you measured.  This is something we are spending a lot of time addressing in the next version of the product.

The NGen To Do List

So you've decided to ngen your image.  Now what?  This section contains some steps you should be taking:

Start with MSDN

Make sure you read all of the documentation on MSDN.  I will only pull out highlights here.

Picking a Base Address

Pick a good set of base addresses for your PE files.  The NGen'd image will get placed right behind your IL image in the process, so you need to allocate enough space between your IL PE files for this image to be loaded.  The general guideline is to allocate at least 3x the original size of the IL image (so, for example, if your IL assembly is 1 MB, you should allocate a 3 MB total range for that assembly plus its ngen'd image).  You should take a look at the size of your NGen'd images and verify you have enough space, not only for what you ship, but for some reasonable amount of growth if you ship a bug fix release of the file.
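A little helper makes the 3x guideline concrete.  The assembly names, sizes, and starting address below are all hypothetical; the point is just the spacing arithmetic:

```python
# Hypothetical layout helper following the 3x guideline: reserve three times
# the IL image size so the NGen'd image (loaded right behind it) fits, with
# room to grow across bug-fix releases.  Names and sizes are made up.

def plan_bases(start, il_sizes):
    """il_sizes: ordered dict of assembly name -> IL image size in bytes."""
    bases, addr = {}, start
    for name, size in il_sizes.items():
        bases[name] = addr
        addr += 3 * size          # total range = IL image + NGen image + growth
    return bases

layout = plan_bases(0x60000000, {"Core.dll": 0x100000, "UI.dll": 0x80000})
print(hex(layout["Core.dll"]))  # 0x60000000
print(hex(layout["UI.dll"]))    # 0x60300000
```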

When to NGen?

You need to pick when you want to invoke the tool.  For the distribution, we invoke ngen as a final step during setup.  This is the best approach in most cases, because your application will start fast from the first time it is run.  However, this will consume space on the user's machine, so if you think a particular application, or component, that you ship may not be run often (or at all), then you might consider deferring ngen to when the application starts the first time.  For example, you could schedule a windows timed task to compile it at night after the first time the code is run.

Servicing

When you release bug fixes to customers in your managed code, you will need to regenerate the ngen'd images as well.  This is pretty simple to do: just run the ngen command again.  But you need to make sure it is covered by the setup/patching feature you are shipping.

Uninstall

Remember to use ngen /delete to remove your unneeded assemblies from the cache when you uninstall your application.  Currently the CLR will remove all assemblies tied to a version of the framework on uninstall of the .NET FX, but it doesn't try to figure out when you've uninstalled just your application.
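A setup program's install and uninstall steps boil down to a couple of ngen invocations per assembly.  The sketch below only builds the command lines rather than shelling out (a real installer would execute them with ngen.exe on the path), and the assembly names are placeholders:

```python
# Sketch of the ngen steps a setup/uninstall program would run.
# It only builds the command lines; a real installer would execute them
# (e.g. as a custom action) with the v1.x ngen.exe available on the path.

def install_commands(assemblies):
    # Compile each shipped assembly to a native image at setup time.
    return [["ngen", asm] for asm in assemblies]

def uninstall_commands(assemblies):
    # Remove the cached native images when the application is removed.
    return [["ngen", "/delete", asm] for asm in assemblies]

shipped = ["MyApp.exe", "MyLib.dll"]   # placeholder names
print(install_commands(shipped))
print(uninstall_commands(shipped))
```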

Servicing Hints

As mentioned above, there are brittleness issues with ngen in V1.0 and V1.1 (aka Everett).  So you need to plan out what you will do in the face of those things changing.  As an example, we will release a service pack of the CLR at some point, and your cached ngen images will no longer load.  Your code will still work, but it will run under the jitter which will be slower (you did measure to verify you needed ngen, right?).

Right now fixing this is tricky.  Expect us to improve this situation in the future, but for now, here are some ideas on how you can address this:

Setup/Patching

Make sure your setup and patching programs are doing the right thing.  If you ship a fixed version of your IL code, you need to re-run ngen on those files for it to be up to date.

Poor Man's Service

You can periodically run a scheduled task to check your images and re-ngen them as required.  If you already have some kind of nightly enterprise script running on client machines, for example, that would be a fine time to do this maintenance.  Note:  if your images are already up to date, the NGen tool will simply report that and exit instead of doing a lot of unnecessary work.

Rocket Science

If you are really motivated, you could go find the list of natively loaded PE files in your process (use the Win32 PSAPI functions or walk the PEB) to see if your NGen'd image was actually loaded.  If it wasn't, it most likely means you need to fix it up, and your app could do so for the next run.  I might prototype this at some point, but suffice it to say it isn't a trivial thing to do.

Future Directions

At this point you've probably looked through the list of Servicing Hints and thought to yourself:  "Wow that's kinda ugly!"  And you're right.  NGen for Version 1.0 and 1.1 was primarily designed and engineered for internal use by the CLR itself.  When we install SP's of our stuff, we force a re-ngen of all of the core components, which keeps that part of your app running fast.

Going forward, Ngen is still a key foundation for our performance story.  It gives you the working set wins (better page sharing, quicker loading) that are required for starting your application faster.  It also allows for more aggressive optimizations in the compiler.  If we tried doing really aggressive optimizations every time you ran the JIT, you'd actually run slower just waiting for the compiler to finish. 

Expect us to address the clumsiness and the servicing issues in the future so your life is easier.  Here are just a few things we're thinking about:

ngen /repair

We'll be talking about a feature called "ngen /repair" at the October 2003 PDC in LA next month which dramatically simplifies fixing up the cached images.

New APIs

There are some cleaner ways we could expose the fact that your application's images are out of date.  It would be simpler to write your app if it could query this state, or force it to be corrected automatically.  We are considering these designs now.

Double Loads

As mentioned above, the current CLR loads both the IL image and your NGen image into the process.  This double loading is inefficient because it makes the OS loader do more work (slower startup time).  Look for us to try to avoid this in the future.

Indirections

As mentioned above, NGen images still contain a lot of fix-up tables for dynamic data structures.  This makes startup slower (while those tables are filled in) and generates sub-optimal code which must go through the indirection of the table to get at the data.  Look for us to get more aggressive and avoid a lot of this.

And finally, in closing, make sure to re-read those key takeaways.

More Information

There are some important links you may be interested in reading:

MSDN Documentation
Gregor's Perf Talk
Jan's Perf Talk
Rico's Perf Talk