Believe it or not I actually spend a good amount of time thinking about my little quizzes hoping to come up with a small piece of code that illustrates a bunch of different things in a simple enough way that many people feel they can jump right in and comment.  And also there should be a nice curve ball in there somewhere.  Performance Quiz #7 is no different.  :)

Now before I jump into the analysis I want to remind you that this is just a micro-benchmark.  The job of these things is to magnify one or more phenomena so that it is easily visible for tracking, analysis, or teaching.  We can't really conclude a whole lot about best practices overall from a micro-benchmark.  But with that in mind, let's have a look at what's going on.

First let me answer the questions directly:

Q1: Which is faster?

The Generics version is considerably faster, see below.

Q2: How much faster?

On my development machine (it's running a recent post-Beta 2 CLR build) I get these numbers:

Generic foreach time: 2.42414s 
Arraylist foreach time: 6.04955s 

So the generic version is about 2.5 times faster on this machine.
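For context, the quiz code has the shape below.  This is a sketch rather than the exact test -- the Stopwatch timing, the 20-element list, and the 10,000,000 iteration count are my reconstruction from the numbers discussed in this post:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

class Bench
{
    static void Main()
    {
        List<int> genericList = new List<int>();
        ArrayList arrayList = new ArrayList();
        for (int i = 0; i < 20; i++)
        {
            genericList.Add(i);   // no boxing
            arrayList.Add(i);     // boxes the int -- 20 boxes total
        }

        const int Iterations = 10000000;
        long sum = 0;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            foreach (int x in genericList)  // struct enumerator, no allocation
                sum += x;
        sw.Stop();
        Console.WriteLine("Generic foreach time: {0}s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            foreach (int x in arrayList)    // allocates an enumerator, unboxes each element
                sum += x;
        sw.Stop();
        Console.WriteLine("Arraylist foreach time: {0}s", sw.Elapsed.TotalSeconds);

        // 2 passes * 10,000,000 iterations * (0+1+...+19) = 3,800,000,000
        Console.WriteLine("Checksum: {0}", sum);
    }
}
```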

Q3: Why is it faster?

I think the data is better at showing why the ArrayList is slower, but it's maybe just as important to discuss the things that are *not* why it is slower.  Let's look at this table of inclusive times.  Here I'm showing only functions that account for at least 1.5% of the time -- I picked that number somewhat arbitrarily because it seemed to cut off enough of the low-level stuff while leaving the bits that were useful for analysis.  This is a single run of the executable that does both tests, so you can see them compared to each other in context (you could do the same thing looking at two runs side-by-side but I found this easier).

 Exclusive  Inclusive
  Percent    Percent   Function Name (some names removed to make it easier to read)
         -     99.64 __CorExeMain
         -     99.63    ...
         -     99.21      ExecuteEXE(struct HINSTANCE__ *)
         -     99.21        SystemDomain::ExecuteMainMethod(struct HINSTANCE__ *,unsigned short *)
         -       1.60          SystemDomain::InitializeDefaultDomain(int)
         -     97.59          Assembly::ExecuteMainMethod(class PtrArray * *)
         -     97.59            ClassLoader::RunMain(...)
         -     97.59              ...
        -     97.59                ...
         -     97.59                  ...
         -     97.59                    ...
         -     97.59                      ...
         -     97.56                        NS.Test.Main(string[])
      4.93    26.72 ***                          NS.Test.GenericTest()
    18.86    18.86                            System.Collections.Generic.List`1.Enumerator.MoveNext()
      1.78      2.85                            System.Collections.Generic.List`1.GetEnumerator()
      8.99    69.23 ***                          NS.Test.ArrayListTest()
    11.23    11.26                            Virtual Dispatch Stub (unusual nesting due to tail jmp)
      9.39    23.29                            System.Collections.ArrayList.ArrayListEnumeratorSimple.get_Current()
    13.90    13.90                              Virtual Dispatch Stub (unusual nesting due to tail jmp)
    17.06    20.00                            System.Collections.ArrayList.ArrayListEnumeratorSimple.MoveNext()
      2.11      2.11                              An indirection buffer in mscorwks that I don't understand :)
      0.52      4.56                            System.Collections.ArrayList.GetEnumerator()
      0.94      1.64                              System.Collections.ArrayList.ArrayListEnumeratorSimple..ctor(...)
      0.10      1.87                              JIT_NewFast(...)
      0.03      1.64                                FastAllocateObject(class MethodTable *)
      0.04      1.61                                  Alloc(unsigned int,int,int)
      0.05      1.54                                    WKS::GCHeap::Alloc(...)

OK looking at this we can see some interesting stuff is going on here.  First off you can see that the Generic solution is only 26.72% of the run time while the arraylist version is 69.23% of the run time.  Together that's 95.94% of the time, the rest of the time we can attribute to the startup process and some taxes, notably 1.6% initializing the one and only appdomain.  If we divide those relative fractions we'll again see that the generic version is about 2.5 times faster.  So far so good (when the wall clock time is similar to the profiler time I'm a happy guy).
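If you want to check those ratios, here they are spelled out (just my arithmetic on the numbers above, nothing extra from the profiler):

```csharp
class RatioCheck
{
    static void Main()
    {
        // Wall clock: ArrayList time over Generic time
        System.Console.WriteLine("{0:F2}", 6.04955 / 2.42414);  // ~2.50

        // Profiler inclusive percentages: ArrayListTest over GenericTest
        System.Console.WriteLine("{0:F2}", 69.23 / 26.72);      // ~2.59
    }
}
```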

OK but now why?

Well let's eliminate some things.

Is it boxing?  No.  It should be no surprise that boxing isn't the dominant factor here.  After all, the arraylist contains only 20 elements, so we boxed exactly 20 times; compared to the 200,000,000 read operations we did, the boxing is just noise.  

Now you say: "OK Rico, you're splitting hairs, I didn't mean boxing, I meant unboxing of course."  Well there's plenty of that in this example, but remember unboxing an integer is basically a test to see if the boxed object is in fact a boxed integer and then copying out the integer.  That's not something to be terribly afraid of -- no heap allocations for that.  Is unboxing the key factor here?  I'd say not.
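To make the box/unbox distinction concrete, here's where each one happens in this kind of code (a sketch using the quiz's 20-element setup):

```csharp
using System.Collections;

class BoxingDemo
{
    static void Main()
    {
        ArrayList list = new ArrayList();

        // Boxing: happens once per Add -- exactly 20 heap-allocated boxes
        // for the whole run, no matter how many times we iterate later.
        for (int i = 0; i < 20; i++)
            list.Add(i);           // int -> object allocates a box

        // Unboxing: happens on every read -- a type check plus a copy,
        // but no heap allocation.
        long sum = 0;
        foreach (int x in list)    // compiler emits an unbox for each element
            sum += x;

        System.Console.WriteLine(sum);  // 0+1+...+19 = 190
    }
}
```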

What about all those allocations?  Many people noticed that the arraylist version is allocating enumerators like crazy.  That must be the problem -- 10,000,000 enumerator allocations can't be good, right?  Surely that's the dominant cost?  Well, it turns out it isn't.  This guy here, JIT_NewFast, tells the tale: only 1.87% of the run time was spent in allocations.  Now since this program is single threaded, if any collections happened they would be triggered by that call, so the collections are part of that 1.87%.  In fact, if you look at the performance counters while this program is running you'll see that the average time in the garbage collector is only about one half of one percent (0.5%), so most of that time is just doing the allocations and not collecting. 

Why is that?  Well, when the collector runs on this program it finds that exactly one object in generation zero is reachable (the now-current enumerator).  There are so few roots that it is triviality itself to walk them all (there's probably like 10ish of them total), and since dead objects don't have to be visited, all we have to do to collect is move the one live enumerator to the start of generation zero and we're done.  So we move something like 16 bytes on every collection.  That's pretty optimal!
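You can watch this from code too -- here's a sketch that counts generation zero collections with GC.CollectionCount (not part of the original test, and the exact count you see will vary by machine and runtime):

```csharp
using System;
using System.Collections;

class GcDemo
{
    static void Main()
    {
        ArrayList list = new ArrayList();
        for (int i = 0; i < 20; i++) list.Add(i);

        int gen0Before = GC.CollectionCount(0);

        long sum = 0;
        for (int i = 0; i < 1000000; i++)   // 1,000,000 enumerator allocations
            foreach (int x in list)
                sum += x;

        int gen0After = GC.CollectionCount(0);

        // Plenty of gen0 collections happen, but each one only has to
        // relocate the single reachable enumerator, so they're very cheap.
        Console.WriteLine("gen0 collections: {0}", gen0After - gen0Before);
    }
}
```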

So what *is* killing us?

Well, it's a combination of things, but the short answer is that there are many virtual function calls in the arraylist version and List<T> doesn't have those.  When we re-did our collection classes for generics in Whidbey we drastically simplified the enumeration pattern.  It doesn't need virtual dispatches (or allocations) now, and it's much more friendly to the inliner.  The main savings come from a streamlined execution path -- it's just fewer instructions.  So the main factor is simply this: the Generics way is less code.

Note: The C# foreach construct did the right thing in both cases, but in the Generics case the implementation of the enumeration pattern is simply better, so when C# does the right thing the resulting code is better.  
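To see the difference in what foreach binds to, here's roughly what the two loops expand into (hand-expanded as a sketch, and omitting the try/finally Dispose that foreach also emits; the key point is that List<T>.GetEnumerator returns a struct with non-virtual MoveNext/Current, while ArrayList hands back an IEnumerator interface reference):

```csharp
using System.Collections;
using System.Collections.Generic;

class PatternDemo
{
    static long SumGeneric(List<int> list)
    {
        // GetEnumerator returns List<int>.Enumerator -- a struct on the
        // stack.  MoveNext and Current are non-virtual calls the JIT can
        // inline, and no cast is needed on the element.
        List<int>.Enumerator e = list.GetEnumerator();
        long sum = 0;
        while (e.MoveNext())
            sum += e.Current;
        return sum;
    }

    static long SumArrayList(ArrayList list)
    {
        // GetEnumerator returns IEnumerator -- a heap-allocated object.
        // Every MoveNext and get_Current goes through a virtual (interface)
        // dispatch, plus an unbox cast on each element.
        IEnumerator e = list.GetEnumerator();
        long sum = 0;
        while (e.MoveNext())
            sum += (int)e.Current;
        return sum;
    }

    static void Main()
    {
        List<int> g = new List<int>();
        ArrayList a = new ArrayList();
        for (int i = 1; i <= 3; i++) { g.Add(i); a.Add(i); }
        System.Console.WriteLine("{0} {1}", SumGeneric(g), SumArrayList(a));  // 6 6
    }
}
```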

Now don't use this as an argument to avoid foreach -- some people like to do that.  The fact is this is a micro-benchmark, and if we were doing any real work in that loop the overhead would be considerably lower as a fraction of the total execution.  Really I only caution against using foreach on ArrayLists, and then only if the iteration is very common.  I forced that situation here with 10,000,000 loops over a small list.

Why didn't we fix ArrayList while we were at it?  Ugh... we can't.  If we "fixed" it we would break existing applications.  Because the class isn't sealed, it's not possible for us to change the virtualness of the key functions involved, so we have to live with that mistake.  But I think you'll find that generally List<Object> offers better performance than ArrayList across the board (but measure!).

Are those allocations a problem at all?  Well, yes they are, but they basically amount to a processor tax: touching all that memory causes more cache misses.  The cost manifests itself as an overall higher average cycles per instruction executed for the same code, but I wouldn't say that's the dominant factor.  The cost associated with the virtual dispatch stubs is really showcasing the true source of the problem here.

Sadly our profiler didn't automatically recognize the virtual dispatch stubs... I annotated those manually.  Oh well, there's always room for improvement.

Q4: What price do we pay for this speed?  Quantify it if you can.

Well in this case the cost is very low because List<int> was built into mscorlib anyway.  But suppose you were using a generic list in your own system, how much would that cost?

To measure this I built a version of a test application that created 200 different List<T> instantiations over 200 different enum types.  Then I did the same with 400.  I took the working set of both of those two runs and subtracted to get the overhead of adding 200 List<T> instantiations to my test harness.  I measured working set with a simple vadump command.  The key bits of code look like this:

  enum e1 { a };
  static List<e1> l1 = new List<e1>();
  enum e2 { a };
  static List<e2> l2 = new List<e2>();
  ...


The difference for 200 instances was 1676k of pages, of which 28k were private pages.  So dividing that by 200 we get 8581 bytes for a List<T> instantiation (and the enum type -- I didn't separate the costs, although that could be done too), of which 143 are private bytes.

To pay for this overhead in terms of saved boxes, figure an average of about 16 bytes saved per boxed item (this accounts for internal pointers in the boxed object, some allocation overhead, and the pointer to the box; we could haggle over whether the overhead is more like 12 bytes or more like 16 bytes, but it's my article *grin*).  So 536 instances stored in the collection would cover that -- let's round it to 500 because that's easier to remember.
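Spelling out that arithmetic (just the numbers from the measurement above, using integer division):

```csharp
class CostMath
{
    static void Main()
    {
        // Working set delta for 200 List<T> instantiations:
        int totalBytes = 1676 * 1024;     // 1676k total
        int privateBytes = 28 * 1024;     // 28k of that is private

        System.Console.WriteLine(totalBytes / 200);    // 8581 bytes per instantiation
        System.Console.WriteLine(privateBytes / 200);  // 143 private bytes

        // Break-even point in saved boxes, at ~16 bytes per boxed item:
        System.Console.WriteLine(totalBytes / 200 / 16);  // 536 items -- call it 500
    }
}
```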

So if you are going to store more than about 500 items at once, you will likely come out ahead space-wise if you use a generic; if not, you can still get better performance by using List<Object> instead of ArrayList in most cases.

Of course you'll still have to measure your own scenarios. :)