This is a subject that has been covered before and I have no intention of writing the ultimate post on the subject. Still I think this is something that every good developer should know.
My colleagues and I are quite often asked about the necessity of knowing how the Garbage Collector (GC) works. After all, it isn't really necessary to know what's going on behind the scenes, right..? A developer shouldn't have to bother about how the framework works..? Or should he..?
Well you may be a pretty good bus driver even if you know nothing about engines. Basic stuff like not stepping on the break and the gas at the same time is probably enough. If you want to do F1 racing, on the other hand, you will want to know as much as you can about your car. So, if you're content doing small scale web applications or not so efficient winforms applications, then you probably don't have to bother. If you want to do front-line work, then you've got to get your hands dirty.
In order to be effective the managed heap is trying to figure out which objects it can disregard and only check up on occasionally. Some objects may span the entire lifecycle of the application and others may be very short lived, so if an object stays alive for a longer period of time the GC will check up on it less frequently. This is accomplished by dividing the managed heap into three generations. Generations 0, 1 and 2. An object will begin in generation 0 and if it is in use for a long period of time it will travel through generations 0 to 2. A possible metaphor for this scenario would the following:
Like I stated above; when an object is created it is put in Generation 0. The initial size limit of Generation 0 is determined by the size of the processor cache. This is dynamically changed depending on the allocation rate of the application. Once Generation 0 reaches its limit it will go through all the items in Generation 0, tag the obsolete objects for collection and remove them. This is called the Mark and Sweep phase. Everything that survives this sweep will be compacted and moved to Generation 1. The size limit of Generation 1 is also determined by the allocation rate of the application, and once it is reached a Generation 1 collection will occur. This will first mark and sweep all items in Generation 1, moving the surviving items to Generation 2, and then mark and sweep the items in Generation 0.
The healthy ratio between GC's in the different generations is approximately 100 - 10 - 1, so for 100 Generation 0 GC's you normally have 10 GC's of Generation 1 and 1 of Generation 2. A normal Generation 0 GC usually takes a few milliseconds. Performing a GC of Generation 1 rarely takes more than 30 milliseconds, but a Generation 2 collection can take quite some time depending on the application.
To top it off we also have what is called "The Large Object Heap" (LOH). This is where anything larger than 85.000 bytes will be stored. The LOH is collected each time a new segment needs to be reserved (see below), but it is not compacted like generations 0, 1 and 2. When you perform a collection on the LOH you will also perform a GC on the other generations.
Now, if the limit is at 85.000 bytes, won't most of my objects end up on the LOH? - Well not necessarily. For example: A dataset may contain lots and lots of data, but the dataset object only contains references to other objects. The data in the columns are each stored in individual strings and as long as they're not larger than 85 KB you're clear.
The managed heap will reserve memory in segments. The size of the segments depend on the configuration. If <gcServer enabled="true"/> you will reserve memory in 64 MB segments, otherwise you'll be doing it in 32 MB segments. LOH are reserved in 16 MB segments. Only Generation 2 and the LOH will span several segments.
Let's say we're performing a full GC, including the LOH. This is what happens:
This topic could off course be a lot bigger, but here are some quick suggestions.
Well off course Generation 1 is better than Generation 2, but you should aim for keeping only a select few objects in Generation 2. Those should be variables that are defined at the beginning of the application lifecycle and released at the end. In an ideal world all other variables should be of the hit-and-run variety and never leave Generation 0.
This has been said before. It will be said again, and again, and again. You should almost never call GC.Collect() manually. And by almost I mean once in a lifetime, not once per application, and certainly not once per function call. I occasionally call GC.Collect() for testing purposes just to see if memory has been released. Normally you would never call it. The GC is self-balancing and by calling GC.Collect() you are disrupting that balance. Think of it as tampering with the eco-system, pouring sugar in the gasoline or whatever metaphor you prefer. :-)
If you can stay below the 85.000 byte limit, then do so. If not, consider reusing the object. When it comes to large objects it's better to use one for a long time than to use many for short periods of time.
When your object has a finalizer the finalize method will be called when the object is no longer alive. So far so good. Unfortunately your object will be passed into the next generation, since it's not yet ready to be collected. This means that all objects with a finalizer will at least end up in Generation 1. Most likely in Generation 2.
Well, as I said: There is a lot more to cover on this. There are books to be written and songs to be sung, but you've got to draw the line somewhere. I will most certainly cover more of this in future posts.
/ Johan