A customer called in. They had a Web Service running on a single IIS6. Memory usage would slowly increase and not be released. As a workaround they'd currently set the application pool to recycle at 500 MB, causing a few failed requests upon each restart.
I thought I'd describe how I went about troubleshooting this scenario.
The customer had already taken a memory dump using DebugDiag, so obviously I asked them to upload it. It was a huge dump. 1.64 GB! I opened it up in windbg and ran the following command:
I haven't mentioned !eeheap before. It gives me a nice summary of what's on the managed heap(s).
So, as we can see the managed memory is pretty evenly distributed in the 4 managed heaps. If you look down at the bottom you'll see that total memory usage is at 88 MB. (This is a figure I would have gotten by running !dumpheap -stat as well. So in case you're wondering; that would have worked just as fine.) Anyway, if only ~90 MB is in the managed heap, then the remaining 1.55 GB must be on the native heap.
Troubleshooting the native heap is a bit harder. The sos extension and windbg is a really nice team so working without them can be tough. Fortunately there is a nice feature included in DebugDiag that lets it perform an automatic analysis. This is usually the best way to get information from a native leak. Sometimes you may want to get a little additional information from windbg as well, but most of the time DebugDiag's auto analysis will get you a long way.
DebugDiag comes with a handy little dll called LeakTrack. LeakTrack attaches to the process in question and monitors memory allocations and their related call stacks. Since the dll is injected into the process it can have a slight impact on performance, but if you want to troubleshoot the process you really have no other options.
There are two ways to start monitoring a process with LeakTrack:
LeakTrack will begin monitoring from the moment you inject it. Obviously, it won't be able to get any data for events that have already occurred, so injecting LeakTrack when memory usage is already at it's peak won't really do much good. Instead I usually recycle the process and attach LeakTrack as soon as possible.
This time I chose to let the customer create a "Memory and Handle Leak"-rule using the wizard. This rule then created dumps at certain intervals which the customer uploaded to me.
So the next step is to let DebugDiag analyze the dumps. Here's how you do it:
Once DebugDiag has finished analyzing the dump you will get a new browser window with a report. The report I got showed the following right at the top:
So what does this mean? It means that LeakTrack has monitored the allocations made by a certain third party dll. It has also monitored if that memory has been properly released. All in all the .dll now has almost 150 MB of allocations that are unaccounted for. It would be safe to assume that this is our culprit. Further down in the report we have a link called "Leak analysis report". Clicking it will bring us to the following graphical representation:
As you can see the .dll in question is right at the top. But wait a minute! According to the graph mscorsvr has ~9 MB of outstanding allocations as well. Does this mean that mscorsvr is leaking too, though not as much?
The answer is no. As I mentioned before; LeakTrack will monitor all allocations and check that the memory is properly released as well. All we know is that mscorsvr has allocated 8,75 MB of memory that hasn't been released yet. There's nothing to suggest it won't release it eventually. The odds of the stray 148 MB eventually being released, on the other hand, are a lot slimmer in my opinion.
Since the dll responsible for the allocations is third party I can't find out exactly what causes the leak within the third party dll. What I can find out is what calls in the customer's code lead us down the leaking code path in the dll. LeakTrack monitors the callstacks related to memory allocations. Looking at the report I was able to find some nice callstacks that the customer should keep an eye on. It might be possible for them to tweak their code slightly in order to avoid the scenario.
Well, this is actually more or less where the road ends for me. Had the problem lied within a Microsoft component it would have been up to us to fix it. Had it been within the customer's code I would have been able to use the customers symbols to get a more detailed callstack and provide him with pointers on how to resolve it as well, but since this is a third party component the next step is to forward it to the third-party vendor.
/ Johan