This is a subject that has been covered before and I have no intention of writing the ultimate post on the subject. Still I think this is something that every good developer should know.

Why do I need to know this?

My colleagues and I are quite often asked about the necessity of knowing how the Garbage Collector (GC) works. After all, it isn't really necessary to know what's going on behind the scenes, right..? A developer shouldn't have to bother about how the framework works..? Or should he..?

Well you may be a pretty good bus driver even if you know nothing about engines. Basic stuff like not stepping on the break and the gas at the same time is probably enough. If you want to do F1 racing, on the other hand, you will want to know as much as you can about your car. So, if you're content doing small scale web applications or not so efficient winforms applications, then you probably don't have to bother. If you want to do front-line work, then you've got to get your hands dirty.

 

The different generations

In order to be effective the managed heap is trying to figure out which objects it can disregard and only check up on occasionally. Some objects may span the entire lifecycle of the application and others may be very short lived, so if an object stays alive for a longer period of time the GC will check up on it less frequently. This is accomplished by dividing the managed heap into three generations. Generations 0, 1 and 2. An object will begin in generation 0 and if it is in use for a long period of time it will travel through generations 0 to 2. A possible metaphor for this scenario would the following:

  • You're asked to call a customer. You're pretty good with numbers so you keep the phone number in your head. (Generation 0)
  • There is no reply and so you need to remember the phone number a little longer. Meanwhile you're given additional numbers to call.
  • You know from past experience that your limit for keeping numbers in your head is 10, so once you reach that critical point you decide to write down the numbers you still need on post-it notes (Generation 1)
  • As the day goes on you keep transferring numbers from memory to post-its.
  • Your desktop has now reached maximum capacity. It is cluttered with post-it notes, so you sort through them, determining which notes you still need. You throw away the ones that are obsolete and write down the remaining to your address book. (Generation 2)

Like I stated above; when an object is created it is put in Generation 0. The initial size limit of Generation 0 is determined by the size of the processor cache. This is dynamically changed depending on the allocation rate of the application. Once Generation 0 reaches its limit it will go through all the items in Generation 0, tag the obsolete objects for collection and remove them. This is called the Mark and Sweep phase. Everything that survives this sweep will be compacted and moved to Generation 1. The size limit of Generation 1 is also determined by the allocation rate of the application, and once it is reached a Generation 1 collection will occur. This will first mark and sweep all items in Generation 1, moving the surviving items to Generation 2, and then mark and sweep the items in Generation 0.

The healthy ratio between GC's in the different generations is approximately 100 - 10 - 1, so for 100 Generation 0 GC's you normally have 10 GC's of Generation 1 and 1 of Generation 2. A normal Generation 0 GC usually takes a few milliseconds. Performing a GC of Generation 1 rarely takes more than 30 milliseconds, but a Generation 2 collection can take quite some time depending on the application.

 

The Large Object Heap

To top it off we also have what is called "The Large Object Heap" (LOH). This is where anything larger than 85.000 bytes will be stored. The LOH is collected each time a new segment needs to be reserved (see below), but it is not compacted like generations 0, 1 and 2. When you perform a collection on the LOH you will also perform a GC on the other generations.

Now, if the limit is at 85.000 bytes, won't most of my objects end up on the LOH? - Well not necessarily. For example: A dataset may contain lots and lots of data, but the dataset object only contains references to other objects. The data in the columns are each stored in individual strings and as long as they're not larger than 85 KB you're clear.

 

Memory segments

The managed heap will reserve memory in segments. The size of the segments depend on the configuration. If <gcServer enabled="true"/> you will reserve memory in 64 MB segments, otherwise you'll be doing it in 32 MB segments. LOH are reserved in 16 MB segments. Only Generation 2 and the LOH will span several segments.

 

What happens during GC?

Let's say we're performing a full GC, including the LOH. This is what happens:

  • The objects on the LOH are Marked. Each item in the LOH is checked for references. If none are found it's ready to be collected.
  • The LOH is Swept. All marked objects are released from memory.
  • The LOH is not compacted.
  • Generation 2 is Marked.
  • Generation 2 is Swept.
  • Generation 2 is compacted. (Imagine removing a few books from your bookcase, and then pushing the remaining books together freeing up continuous space at the end.)
  • Generation 1 is Marked.
  • Generation 1 is Swept.
  • Everything  that survived the sweep is compacted.
  • The pointer for where Generation 2 ends is updated. Everything that survived the sweep is now Generation 2.
  • Generation 0 is Marked.
  • Generation 0 is Swept.
  • Everything that survived the sweep is compacted.
  • The pointer for where Generation 1 ends is updated. Everything that survived the sweep is now Generation 1.

 

A few quick tips

This topic could off course be a lot bigger, but here are some quick suggestions.

Try to stay out of Generation 1

Well off course Generation 1 is better than Generation 2, but you should aim for keeping only a select few objects in Generation 2. Those should be variables that are defined at the beginning of the application lifecycle and released at the end. In an ideal world all other variables should be of the hit-and-run variety and never leave Generation 0.

Don't call GC.Collect()

This has been said before. It will be said again, and again, and again. You should almost never call GC.Collect() manually. And by almost I mean once in a lifetime, not once per application, and certainly not once per function call. I occasionally call GC.Collect() for testing purposes just to see if memory has been released. Normally you would never call it. The GC is self-balancing and by calling GC.Collect() you are disrupting that balance. Think of it as tampering with the eco-system, pouring sugar in the gasoline or whatever metaphor you prefer. :-)

Avoid large objects

If you can stay below the 85.000 byte limit, then do so. If not, consider reusing the object. When it comes to large objects it's better to use one for a long time than to use many for short periods of time.

Don't use finalizers

When your object has a finalizer the finalize method will be called when the object is no longer alive. So far so good. Unfortunately your object will be passed into the next generation, since it's not yet ready to be collected. This means that all objects with a finalizer will at least end up in Generation 1. Most likely in Generation 2.

Well, as I said: There is a lot more to cover on this. There are books to be written and songs to be sung, but you've got to draw the line somewhere. I will most certainly cover more of this in future posts.

/ Johan