Before I worked on the CLR I spent several years in MSN.  One of the big projects I worked on was "Sidewalk" which among other things offered a yellow pages service.  A lot of this work was done under IIS with ISAPI extensions and you can imagine the fun involved in writing a lot of native code that needs to deliver great sustained performance.  It was very very difficult.  In my opinion one of the best features of ASP.NET is that performance doesn't tend to decay over time.  I bet a lot this would never have happened if we had been running on say Windows Server 2003 as the system heap is much better but it's still an interesting story. 

In Sidewalk 3.0 [Autumn 1998] we had a problem with sustaining our performance over long runs. After 10-12 hours of heavy stress we were much slower than our initial speed and we needed to go at least 72 hours, preferably much more. Heap size was growing over time, and at first we thought we were fighting a leak, but a lot of analysis revealed that the speed degrade was due to heap fragmentation -- actual allocated bytes were not growing very quickly at all.  There were leaks but they were comparatively tiny -- we could have run for months at a time with leaks of that magnitude.

Luckily, we found that the fragmentation was on private heaps associated with one of our components. This component (an OLEDB provider) was free-threaded and as a result we had already done some interesting things to the heap to help performance (we were running on Windows NT4 sp3 I believe). The way things sat, was that we had one logical heap made up of 16 private heaps we could use for allocations. We would hash the incoming thread ID and then allocate from the appropriate private heap. As a result of this method, contention on the heaps was virtually nil (we usually ran with < 16 threads). Since the thread releasing the memory might not be the same as the thread allocating the memory, we had to save the heap handle secretly in every allocation (4 byte overhead) but since the allocations were pretty big that wasn't a problem.  So we were in good shape contention wise and given the pointer to memory, we could tell what heap to free it out of.

To address the fragmentation problem, we changed this scheme just a little bit:

Instead of having 16 heaps, we doubled it. 32 heaps -- 2 sets of 16. The first 100,000 allocations happen from the first set of heaps, as usual. On the 100,001st allocation we cut over to the second set of heaps. These heaps are "pristine". Now there is a memory transition happening... allocations are happening on the 2nd set of heaps but frees are happening on the 1st set of heaps (remember memory is always freed where it was allocated by necessity). As a consequence, the older heaps are being "tidied up". By the time we reach allocation number 200,001 the original heaps are nice and tidy, virtually pristine in fact as very few objects have a lifetime long enough that they haven't already been freed. We can repeat this process indefinately as on the 300,001st allocation, just as the 1st set is getting bad again, we switch once more to the 2nd set and so on.

Using this scheme memory fragmentation was eliminated and our top speed became our sustained speed (i.e. virtually no overhead associated with the dual heap system). Our working set actually plunged compared to the original fragmented heap mechanism, and you can see that the actual physical memory required of the dual heap solution is not that much more than an ideal single heap... at any given moment really only one "heap's worth" of memory is in use in a working set sense.

I selected the number 100,000 based on our number of allocations/sec and the measured rate of slowdown.

It took me over a week to diagnose just this one bug, although it took me only about 10 minutes to fix it once I knew what was going on.  Aren't you glad our latest offerings do this better?