The garbage collection that ate the process

Published 19 December 05 03:16 PM

I had an interesting case the other day that got a lot of us puzzled. As usual, there will be no names and no pack drill. Nothing that can be linked back to an individual customer here.

The customer was getting terrible performance on a multi-CPU blade that was running their asp.net application on Windows 2003 server. It worked great on their dev systems and performed pretty well on a laptop but crawled on their live server. This caused a certain amount of unhappiness, as you might imagine.

I had a feeling that this was GC related. I tend to run with hunches because that generally works out well. Many of my colleagues will always follow logic but I regard hunches as experience that has not been formerly codified. Anyway, the Perfmon counters showed that the process was spending 95%+ of its time in GC which certainly wouldn’t be helping performance. I asked for the dumps expecting to see fragmentation and a very full finalisation queue. Instead, the heaps were very neat without too much on them and there was no work for the GC at all. The latter made some sense as the process was doing nothing but GC so there wouldn’t be a great deal to do. The allocation rates (i.e objects allocated per second) were high but they didn’t look so high that we would see this sort of problem.

I had already compared the Perfmons from a system that was working well with the failing system and the object allocation rate was slightly higher on the faster system, presumably because it was doing more work. That was certainly evidence that the GC wasn’t being overrun. So empirical testing showed that the process scaled negatively in direct proportion to the number of CPUs in the system; that is to say, more CPUs gave less throughput.

My colleague Doug cracked the case in the end and it was a stroke of luck that we happened to have a hang dump which contained a smoking gun. There was a call to SetProcessorAffinityMask which was setting the process to run on one CPU only. This might seem to explain why the process was not scaling better on multiple CPUs but the sequence of events is a little more interesting than that. Here is what happened:

1. ASP worker process wakes up and loads the CLR (server version)

2. The CLR spots that we have 4 CPUs and accordingly creates 4 GC threads and 4 heaps. Workstation CLR wouldn’t have done this but server has different priorities.

3. The custom code loads.

4. It called SetProcessorAffinityMask disabling 3 of the CPUs for that process

There are now 4 GC threads competing for one CPU. They are fairly aggressive and each get about 1/4 of the CPU. 4*25% = 100%. Nothing else in that process could run much on that CPU so the rest of the system was running on the other CPUs, pushing their utilisation up to a level where the first CPU’s utilisation was not that exceptional.

It turned out that the API call was coming from bit of third party library code that had been included for reasons not directly related to the application functionality. Fixing that problem improved performance by 1000%. The API call had been ignored on the single CPU test systems because they only had

There have been a few times in the past where we have seen odd behaviours which were caused by code that set things process wide or system wide. I like to think of them as being boorish guests in someone else’s house. You wouldn’t want a guess to change the locks in your house or give out keys to people that you don’t know – so a hosted component shouldn’t mess with security under normal circumstances. A guest is welcome to eat a meal or two but shouldn’t eat all the food so go easy on resources. A guest shouldn’t mess with things that are not intended to be shared so don’t touch the state of other components. You often see these sorts of problems with components that were designed for a desktop application but which ended up in a server component. The rules are different and subtle and they will bite you if you break them J

Signing off

Mark

by marklon

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# Ethan said on December 20, 2005 3:17 AM:
I have a similar case.
In my case GC takes over the process, but the time spent in GC shows values much lower, about 4 percent. Strange, but I found that by freeing large amount of memory (like closing VS), makes GC to releases the process. Otherwise I can wait from 15 minutes to several hours. Each time I have at least 100 MB of free memory (1 GB total).
Is there a minimum amount of free memory for GC to work properly?
# marklon said on December 20, 2005 4:23 AM:
The GC needs about 50 MB of contiguous memory to work. You get an out of memory exception if it can't find it.

Hmmm. If the time in GC is only 4%, I suspect that something else is eating the process. Have you attached a debugger, done a !runaway and looked at the stacks of the busy threads?
# Ethan said on December 25, 2005 11:49 AM:
I used process explorer and saw that there is only one thread that eats all cpu. I attached windbg to the process and watched the stack of cpu eating thread. It was doing GC.
That time I had about 70 threads running.
From the moment I have added thread pool, the problem was gone, for now.
# marklon said on December 26, 2005 7:21 AM:
Interesting. Can you describe the changes that you made in more detail please? If it is confidental then you can email me at MarkLon@Microsoft.com.
# snaveen said on January 18, 2006 10:59 AM:
How do you handle GC collection in 64 Bit system? The virtual memory is greater 2 Gb and in server GC the collection (gen 2) take really long and the other threads might be suspended which could time out the client requests.
# marklon said on January 18, 2006 12:55 PM:
That is rather a big question to answer in a comment though I may blog about it later.

It isn't as bad a problem as you might imagine. There is 1 GC thread per CPU and one managed heap per CPU. While the heaps can grow bigger, no one heap is generally going to be huge.

Does that help?

Mark
# If broken it is, fix it you should said on March 12, 2007 11:31 AM:

I thought I'd share a support story with you from a very interesting case I have. My customer is running

# If broken it is, fix it you should said on March 12, 2007 11:32 AM:

I thought I'd share a support story with you from a very interesting case I have. My customer is running

Leave a Comment

(required) 
(optional)
(required) 

  
Enter Code Here: Required

This Blog

Syndication

Page view tracker