If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips    by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

Back to Basics - How do I get the memory dumps in the first place? And what is SOS.dll?

Windbg.exe and its friends can be installed from http://www.microsoft.com/whdc/devtools/debugging/default.mspx

Once you have them installed on a machine, you can simply copy the directory where they are installed (usually c:\program files\debugging tools for windows) to any machine that you need them on. No other installation is really necessary.

Before we even start with how you get the dumps, you might be interested in what a memory dump actually is...

A memory dump is a snapshot of a process or a system at a given time. There are various types of memory dumps, with varying degrees of data included.

User vs. Kernel Dumps

If you take a memory dump of a process, you have taken a user dump. If you need a memory dump of a whole system, you take a kernel dump. I'm going to skip the discussion on kernel dumps completely. For the problems I deal with (hung or crashing processes, or processes with memory leaks and exceptions), I don't generally need to know what the operating system is doing at the time, so a kernel dump is a waste of space, and .net debugging in kernel dumps is almost impossible.

Degrees of data included

Usually dumps are referred to as either mini dumps or full dumps, even though this notation is not really correct, since what we refer to as full dumps are really mini dumps with extra information.

Either way, a full dump is usually a dump taken with the /ma (a for all) switch, which means full memory data, handle data, unloaded module information, basic memory information, module information, and thread and stack information including thread time information. In essence, you get everything you can want and more in one file. The size of a full dump is roughly the same as the private bytes used by the process.

A mini dump, on the other hand, is usually a dump taken with the /mdi switch, which means module, thread, stack, any memory that is referenced by a pointer on a stack, and some read-write segments. Mostly these are used to look at what threads were executing at the time the dump was taken. The size of a mini dump is usually only a few MB, so its benefit is that it's very fast to write and of course doesn't take up much space, but you can't get much .net data from it.

The switches I am talking about are switches to the .dump command in windbg.exe. To learn more about the different options, look up .dump in the windbg help files.

As a side note: when you run an application and get an application failure where you are asked if you want to send the data to Microsoft, what you normally send is a mini dump, so no real personal data is sent. I know many people avoid sending this data, fearing that big brother will now know everything about them. If you are one of these people, fear no longer:) It's a good thing to send this data; you are helping eliminate bugs so you won't have to run into them later. And the people and applications that look at these dumps don't care who you are or about anything personal, they are just looking to solve the problems.

How do you get the dumps?

There are a few tools that you can use to generate dumps. Some of you might have heard about DebugDiag, Error Reporting or Dr. Watson, but my tool of choice is windbg, or a script file that comes with it called adplus.vbs, which basically scripts windbg's command line equivalent, cdb.exe.

If you are attached to a process with windbg.exe you can create a memory dump by typing .dump /ma c:\mydumpfolder\mydump.dmp at windbg's command line.

What is more common, when we ask for dumps from customers, is that we automatically create dumps with adplus since it's much easier and nicer to deal with.

Adplus takes a number of arguments. I'm not going to bore you with all of them, but just show you some of the more common ones.

-hang creates a snapshot (full dump) of the process right now. It really has nothing to do with the process hanging; it got its name because the most common usage for these snapshots is to debug hangs.

When the problem occurs you simply run

adplus -hang -pn processname.exe

then cdb.exe attaches in non-invasive mode (so it can get out later without shutting down the process), takes a snapshot and leaves.

-crash attaches cdb.exe to the process in invasive mode and leaves it attached until either you close the debugger (generating a ctrl-c event), or until the process crashes or gets an interrupt (breakpoint). Whilst attached, it creates mini dumps for all access violation exceptions, and logs all other exceptions that it has set up in a log file... and if the process crashes it generates a full dump when it exits.

Both types of dumps are created in a directory under your debugger's directory, marked with the date and timestamp of the debug session.

-pn specifies what process you want to attach to, by process name.

-p specifies what process you want to attach to, by process ID.

-c allows you to pass a configuration file to adplus so that you can configure your own breakpoints, exceptions etc. An example of when you might want to use a configuration file can be seen in my post about debugging exceptions.

For more information on adplus usage, look up adplus in the windbg.exe help file.
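To make the switch combinations above a bit more concrete, here is a minimal Python sketch that only assembles an adplus command line from the switches described in this post (-hang, -crash, -pn, -p, -c). The function name build_adplus_args and the file name crash.cfg are hypothetical, and the sketch does not run adplus or check how a config file interacts with the mode switch; look that up in the windbg help file.

```python
def build_adplus_args(mode, process_name=None, pid=None, config_file=None):
    """Assemble an adplus argument list from the switches described in the post.

    Hypothetical helper: it only builds the list, it does not launch anything.
    """
    if mode not in ("hang", "crash"):
        raise ValueError("mode must be 'hang' or 'crash'")
    # The process is chosen either by name (-pn) or by ID (-p), not both.
    if (process_name is None) == (pid is None):
        raise ValueError("give exactly one of process_name or pid")

    args = ["adplus", "-" + mode]
    if process_name is not None:
        args += ["-pn", process_name]
    else:
        args += ["-p", str(pid)]
    if config_file is not None:
        args += ["-c", config_file]  # optional configuration file
    return args


if __name__ == "__main__":
    print(" ".join(build_adplus_args("hang", process_name="w3wp.exe")))
    # prints: adplus -hang -pn w3wp.exe
```

Selecting the process by exactly one of name or ID is a choice made for this sketch, not a rule stated by adplus.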

Scenarios

Knowing how to take dumps is 1/10th of the battle; the hard part is knowing when. That is even harder than doing the actual debugging. Looking at a dump taken at the right time in the right way is almost a breeze if you are a somewhat experienced debugger.

Crashes

This is the easy one. You attach the debugger in crash mode, leave it running until the process dies and you're done. Hmm... in most cases... The caveat here is that the process might only crash every 2 weeks, and meanwhile it might be throwing a lot of NullReferenceExceptions, generating a lot of access violation mini dumps, which makes it infeasible to run. In that case you have to configure adplus to not dump on 1st chance access violations.

The process might also be recycled once in a while for maintenance or other reasons, so you may get a false positive, so make sure to match it up with the event log to make sure that you are actually catching what you think you're catching.

And 3rd, sometimes you will only see one thread in the crash dump, so it won't give you much. In that case it might be good to create a config file that allows you to break on Kernel32!ExitProcess so that you catch the threads as the process is trying to shut down.

The hardest crashes to debug are ones where nothing in the process caused it to die, but rather an external application, or condition (no available memory for example) caused it to die.

In each case, when you get a crash dump, look at the faulting thread (the thread you are on when you open the dump), and look in the log file for what exceptions occurred right before the crash, and finally match up with any events that happened recently in the event log.

Hangs and performance issues

A real hang is usually pretty easy to catch: the web admin can grab a hang dump right before restarting the application pool.

If your server starts slowing down and you don't know if it is really a hang, or just very slow performance, then take two hang dumps one after the other, with perhaps 2 minutes in between. (Make sure that the first dump has completely finished writing before you take the 2nd one, otherwise you will get two identical dumps). That way you can compare the dumps and see if anything is moving at all.

The trickier ones are the ones where one out of 5000 requests takes 3 seconds instead of 1 second. They are so tricky that I really don't have a good answer to what to do. Mainly in these cases, if it is a web server, we try to look at the IIS logs and see if there is a specific request that seems to take longer (time-taken) at times, and try to correlate it to other things happening around the time, along with data about the user, location of the browser on the network etc. In short, this is one place where debugging may not be the best way. Once we know a bit more about the conditions, stress testing to make it happen more readily is a good strategy here.
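As a rough illustration of the IIS log check described above, here is a small Python sketch that counts, per request path, how many requests exceeded a time-taken threshold. It assumes W3C extended log lines with a #Fields: header that includes cs-uri-stem and time-taken (in milliseconds); the function name slow_requests, the 3000 ms default and the sample lines are made up for the example.

```python
from collections import defaultdict


def slow_requests(log_lines, threshold_ms=3000):
    """Return {uri: (slow_count, total_count)} for URIs with at least one slow hit.

    Assumes W3C extended log format; column positions come from the
    '#Fields:' header line, so the field order does not matter.
    """
    fields = []
    counts = defaultdict(lambda: [0, 0])  # uri -> [slow, total]
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip blanks, other directives, and data before a header
        row = dict(zip(fields, line.split()))
        if "cs-uri-stem" not in row or "time-taken" not in row:
            continue
        try:
            taken = int(row["time-taken"])
        except ValueError:
            continue  # ignore malformed rows
        entry = counts[row["cs-uri-stem"]]
        entry[1] += 1
        if taken >= threshold_ms:
            entry[0] += 1
    return {uri: (slow, total) for uri, (slow, total) in counts.items() if slow}


# Made-up sample lines, for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2006-01-01 10:00:00 GET /default.aspx 200 950",
    "2006-01-01 10:00:01 GET /report.aspx 200 3100",
    "2006-01-01 10:00:02 GET /report.aspx 200 1020",
]

if __name__ == "__main__":
    print(slow_requests(sample))
    # prints: {'/report.aspx': (1, 2)}
```

A path that is slow only some of the time (1 of 2 here) is the kind of candidate you would then try to correlate with whatever else was happening at that moment.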

Memory issues

There are three different situations here:

  • you want to know why you are using so much memory
  • you want to know why your memory is constantly rising
  • you want to know what caused an out of memory exception

In the first case, a hang dump when the memory is high is a good start.

In the second case, a number of dumps spaced 100-200 MB apart is recommended, so you can compare the memory usage.

And finally, for the out of memory exception, you can either look at a hang dump when the OOM has occurred and see what is using the memory, or you can run adplus in crash mode and set the registry keys for GCFailFastOnOOM to 2 so that the process recycles when you hit an OOM. (http://support.microsoft.com/?kbid=820745)
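For the second case (memory that keeps rising), the spacing rule can be sketched as a small Python function: given a series of private bytes readings in MB, it picks the samples at which you would take a dump. The name dump_points, the 150 MB default (the middle of the 100-200 MB range above) and the choice to treat the first sample as a baseline dump are assumptions made for this sketch, and the readings are invented.

```python
def dump_points(private_bytes_mb, step_mb=150):
    """Return the indexes of the samples at which to take a dump.

    The first sample is treated as a baseline (an assumption of this sketch);
    after that, a dump is due each time memory has grown by at least step_mb
    since the previous dump.
    """
    points = []
    last_dump_mb = None
    for index, mb in enumerate(private_bytes_mb):
        if last_dump_mb is None or mb - last_dump_mb >= step_mb:
            points.append(index)
            last_dump_mb = mb
    return points


if __name__ == "__main__":
    # Invented private bytes readings, in MB.
    samples = [200, 240, 310, 360, 420, 515, 530]
    print(dump_points(samples))
    # prints: [0, 3, 5]
```

With these readings the dumps land at 200, 360 and 515 MB, so consecutive dumps are about 155-160 MB apart, which is enough growth to make a comparison meaningful.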

I'm working on a post for debugging memory issues but it is taking some time, so stay tuned…

What is SOS.dll

You can't really talk about debugging managed processes without mentioning SOS.dll. You can load extensions in windbg and cdb that automate tasks that you would otherwise have to do manually. Some things that it helps you automate are very hard to do manually, like building the managed stack from the native stack. That is where sos.dll comes in.

You can find sos.dll in the clr10 directory under your debugger's directory (for 1.1 applications), or in the framework directory, and you load it using .load clr10\sos or .load c:\blablamypath\sos.dll

Then you can start running commands using !commandname, like !clrstack for example to get the managed stack of a thread.

The full list of commands you can run can be found by running !sos.help, and throughout my posts I am using a lot of them.

A long long time ago, and many versions of sos.dll ago, a colleague and I wrote some very basic help files for sos.dll. If you have the SDK for .net 1.0 or 1.1 installed you can find them in a directory similar to C:\Program Files\Microsoft Visual Studio .NET 2003\SDK\v1.1\Tool Developers Guide\Samples\sos\SOS.htm but beware, they are a bit raw to say the least:) and many commands have been added to sos.dll since then, so they are by no means complete help files.

It's a bit of a nostalgic trip, really, to look at them and realize how little we (as a collective) knew about managed debugging back then:) and how hard it used to be in the days before sos.dll

Snap, snap... enough writing, back to debugging some nasty deadlocks:)

  • I’ve written quite a few posts on memory issues because that is the type of problem we get most frequently...
  • Note: .net v2 sos.dll can be found in %windir%\Microsoft.net\framework\v2....\sos.dll
  • Use the following commands for loading SOS for the .net 2.0 framework...

    .cordll -u -lp %windows%\Microsoft.NET\Framework\v2.0.50727\

    .load %windows%\Microsoft.NET\Framework\v2.0.50727\sos

    Hail to the king baby.
  • Hello,

    One of the .NET sites developed using 1.1 goes down now and then with a 'Service Unavailable' error - could it be a memory problem which can be debugged as above , or do you think it could be something else more specific ?

    Regards.
  • Service unavailable means that the process crashed and you will likely see a stopped unexpectedly event in the eventlog.  Depending on if the memory usage is high or not you would debug it as a crash or a high mem issue.  For crashes run adplus -crash -pn w3wp.exe to get a dump of the process when it crashes.
  • Hi Tess,

    Our 1.1 ASP.NET application recently started to throw some 1000 events faulting mscorsvr.dll. I would like to start using the debug tools you described but since the faults only occur once in a few days and it is a production site I would like to know how much performance impact and security risk is attached to running debug tools like these in a production environment.

    Ewoud
  • Hi Ewoud,

    The impact depends a little on how many exceptions you're throwing and what config files you use.

    If you throw a lot of null reference exceptions and use the default (just -crash) you will get a minidump for each first chance exception, which might stall the process a little.  If so I would recommend running the debugger with

    adplus -crash -pn <processname.exe> -NoDumpOnFirst

    to begin with, and then monitor it for a little bit just to make sure it's not making a huge impact.

    If your app is not throwing a lot of NullReferenceExceptions you should be good to go with just -crash.

    We do use this in a production environment all the time though, that is what they are meant for, it's just a matter of getting the right configuration for your environment.

    Thanks
  • I am trying to pick up the basics of production debugging by going through one of the documents on the Patterns and Practices site about the use of DBG. In the walkthrough, i created a System Monitor report in Performance tool, for .NET Clr memory object, for aspnet_wp. When i see the counters in report view, the #Bytes in all heaps shows 19,01456, whereas the Gen 0 Heap size is 83,88608, Gen 1 is 7,99968 , Gen 2 is 1,28592 and Gen 3 is 9,72896. So the Gen0 + Gen1 + Gen 2 + Large Objects do not add up to #Bytes in all heaps - in fact they are more.

    I am using XP Professional and IIS 5.1 . Any idea on what could be wrong ?

    Thanks.
  • Hi Chakravarthy,

    Are these numbers for average or maximum or last?  

    Also what do the commas in the numbers mean? and how many bytes do you have on the loader heap?

  • Thanks for trying to help.

    These numbers are last, because i just invoked the sample program called memory.aspx. which has buttons to load large blocks of memory and other buttons to release them, none of which i have clicked yet,  and have just selected the system monitor parameters for aspnet_wp , to have the baseline numbers , like the walkthrough said. The commas are for the indian lakhs , which is one hundred thousand. Sorry, i should have presented the commas at  million and thousand positions as below (these are my latest numbers in 'Performance', after loading but not running the memory.aspx) -

    .NET CLR loading :-

    Bytes in loader heap - 1,056,768


    .NET Clr memory :-

    #Bytes in all heaps          - 1,796,856
    #Gen 0 collection             - 6
    #Gen 1 collections           -  2
    #Gen 2 collections           -  1
    % Time in GC                   -  0.021
    Gen 0 Heap size              - 8,061,040
    Gen 1 Heap size              -    642,028
    Gen 2 Heap size              -    173,740
    large Object Heap size    -    981,088

    The walkthrough had said the '#Bytes in all heaps' would add up to Gen 0, 1, 2 and Large Object heap size. So I am just trying to get that baseline right.

    Just point me in the right direction and i will resume the walkthrough anyway - i just wanted to get the basics.

    Thanks again.

    Chak.

  • I agree, this looks very odd... it is completely omitting the Gen 0 heap size

    #Bytes in all heaps          - 1,796,856
    =
    Gen 1 Heap size              -    642,028
    Gen 2 Heap size              -    173,740
    large Object Heap size    -    981,088

    I had never noticed this before because the Gen 0 size is negligible in most cases (compared to the others) and I had never explicitly added up the numbers, but this seems to be true when I tested it on my machine as well...

    Since this seemed weird it prompted me to do some investigation and the problem is not the #Bytes in all heaps, but rather the perception of what the Gen 0 counter tells us. (Btw, i was under the wrong impression about this as well:))

    Reading from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/gngrfmemoryperformancecounters.asp

    Gen 0 heap size: "Displays the maximum bytes that can be allocated in generation 0; it does not indicate the current number of bytes allocated in generation 0."

    However... the Gen 1 heap size and Gen 2 heap size counters are different... for example Gen 1

    "Displays the current number of bytes in generation 1; this counter does not display the maximum size of generation 1. Objects are not directly allocated in this generation; they are promoted from previous generation 0 garbage collections. This counter is updated at the end of a garbage collection, not at each allocation."



    Having said this... What you see in "#Bytes in all heaps" is very close to current reality, so do trust it...
  • You mentioned doing a crash dump by breaking on Kernel32!ExitProcess.  I need to do this for an ASP.NET worker process.  Can you provide the adplus command line to do this?

    Something like:

    adplus -crash -pn aspnet_wp.exe ...?

    thanks!
  • Never mind that request... I came across the config file from John Robbins that does the trick...

    <!-- John Robbins - Bugslayer Column, MSDN Magazine   -->
    <!-- Default Crash Options                            -->
    <ADPlus>
       <Settings>
           <!-- Set the mode to CRASH                    -->
           <RunMode>
                     CRASH
           </RunMode>
           <!-- Snap the dumps, don't tell me about it   -->
           <Option>
                     Quiet
           </Option>
       </Settings>
       <Exceptions>
           <!-- Don't dump on first-chance exceptions    -->
           <Option>  
                     NoDumpOnFirstChance  
           </Option>
       </Exceptions>
       <Breakpoints>
           <NewBP>
               <!-- Set the breakpoint on ExitProcess    -->
               <Address>
                        kernel32!ExitProcess
               </Address>
               <!-- A normal breakpoint                  -->
               <Type>
                        BP
               </Type>
               <!-- When hit, walk the stacks and do a   -->
               <!-- mini dump with full heap.            -->
               <Actions>
                       FullDump;
                       Stacks
               </Actions>
               <!-- After doing the actions, just        -->
               <!-- continue                             -->
               <ReturnAction>
                        G
               </ReturnAction>
           </NewBP>
       </Breakpoints>
    </ADPlus>
  • If you are experiencing Memory Dumps where your system crashes and show you the blue screen of death, I suggest you go to

    http://www.barrett.net/xpmemorydump.html

    print off the 5 instruction pages and then work through them.  You should then have cured the problem
