The garbage collection that ate the process
I had an interesting case the other day that got a lot of us puzzled. As usual, there will be no names and no pack drill. Nothing that can be linked back to an individual customer here.
The customer was getting terrible performance on a multi-CPU blade that was running their asp.net application on Windows 2003 server. It worked great on their dev systems and performed pretty well on a laptop but crawled on their live server. This caused a certain amount of unhappiness, as you might imagine.
I had a feeling that this was GC related. I tend to run with hunches because that generally works out well. Many of my colleagues will always follow logic but I regard hunches as experience that has not been formerly codified. Anyway, the Perfmon counters showed that the process was spending 95%+ of its time in GC which certainly wouldn’t be helping performance. I asked for the dumps expecting to see fragmentation and a very full finalisation queue. Instead, the heaps were very neat without too much on them and there was no work for the GC at all. The latter made some sense as the process was doing nothing but GC so there wouldn’t be a great deal to do. The allocation rates (i.e objects allocated per second) were high but they didn’t look so high that we would see this sort of problem.
I had already compared the Perfmons from a system that was working well with the failing system and the object allocation rate was slightly higher on the faster system, presumably because it was doing more work. That was certainly evidence that the GC wasn’t being overrun. So empirical testing showed that the process scaled negatively in direct proportion to the number of CPUs in the system; that is to say, more CPUs gave less throughput.
My colleague Doug cracked the case in the end and it was a stroke of luck that we happened to have a hang dump which contained a smoking gun. There was a call to SetProcessorAffinityMask which was setting the process to run on one CPU only. This might seem to explain why the process was not scaling better on multiple CPUs but the sequence of events is a little more interesting than that. Here is what happened:
1. ASP worker process wakes up and loads the CLR (server version)
2. The CLR spots that we have 4 CPUs and accordingly creates 4 GC threads and 4 heaps. Workstation CLR wouldn’t have done this but server has different priorities.
3. The custom code loads.
4. It called SetProcessorAffinityMask disabling 3 of the CPUs for that process
There are now 4 GC threads competing for one CPU. They are fairly aggressive and each get about 1/4 of the CPU. 4*25% = 100%. Nothing else in that process could run much on that CPU so the rest of the system was running on the other CPUs, pushing their utilisation up to a level where the first CPU’s utilisation was not that exceptional.
It turned out that the API call was coming from bit of third party library code that had been included for reasons not directly related to the application functionality. Fixing that problem improved performance by 1000%. The API call had been ignored on the single CPU test systems because they only had
There have been a few times in the past where we have seen odd behaviours which were caused by code that set things process wide or system wide. I like to think of them as being boorish guests in someone else’s house. You wouldn’t want a guess to change the locks in your house or give out keys to people that you don’t know – so a hosted component shouldn’t mess with security under normal circumstances. A guest is welcome to eat a meal or two but shouldn’t eat all the food so go easy on resources. A guest shouldn’t mess with things that are not intended to be shared so don’t touch the state of other components. You often see these sorts of problems with components that were designed for a desktop application but which ended up in a server component. The rules are different and subtle and they will bite you if you break them J
Signing off
Mark