PDC 2008 happened not long ago so I get to write another “what’s new in GC” blog entry. For quite a while now I’ve been working on a new concurrent GC that replaces the existing one. And this new concurrent GC is called “background GC”.
First of all let me apologize for having not written anything for so long. It’s been quite busy working on the new GC and other things.
Let me refresh your memory on concurrent GC. Concurrent GC has existed since CLR V1.0. For a blocking GC, i.e., a non-concurrent GC, we always suspend managed threads, do the GC work, then resume managed threads. Concurrent GC, on the other hand, runs concurrently with the managed threads to the following extent:
• It allows you to allocate while a concurrent GC is in progress.
However you can only allocate so much – for small objects you can allocate at most up to the end of the ephemeral segment. Remember that if we don’t do an ephemeral GC, the total space occupied by the ephemeral generations can be as big as a full segment allows. So as soon as you reach the end of the segment you need to wait for the concurrent GC to finish, which means managed threads that need to make small object allocations are suspended until then.
• It still needs to stop managed threads a couple of times during a concurrent GC.
During a concurrent GC we need to suspend managed threads twice to do some phases of the GC. These phases could possibly take a while to finish.
We only do concurrent GCs for full GCs. A full GC can be either a concurrent GC or a blocking GC. Ephemeral GCs (ie, gen0 or gen1 GCs) are always blocking.
Concurrent GC is only available for workstation GC. In server GC we always do blocking GCs for any GCs.
Concurrent GC is done on a dedicated GC thread. This thread times out if no concurrent GC has happened for a while and gets recreated next time we need to do concurrent GC.
When the program activity (including making allocations and modifying references) is not really high and the heap is not very large concurrent GC works well – the latency caused by the GC is reasonable. But as people start writing larger applications with larger heaps that handle more stressful situations, the latency can be unacceptable.
Background GC is an evolution of concurrent GC. The significance of background GC is that we can do ephemeral GCs while a background GC is in progress if needed. As with concurrent GC, background GC is only applicable to full GCs, ephemeral GCs are always done as blocking GCs, and a background GC is done on its own dedicated GC thread. The ephemeral GCs done while a background GC is in progress are called foreground GCs.
So when a background GC is in progress and you’ve allocated enough in gen0, we will trigger a gen0 GC (which may stay as a gen0 GC or get elevated as a gen1 GC depending on GC’s internal tuning). The background GC thread will check at frequent safe points (ie, when we can allow a foreground GC to happen) and see if there’s a request for a foreground GC. If so it will suspend itself and a foreground GC can happen. After this foreground GC is finished, the background GC thread and the user threads can resume their work.
Not only does this allow us to get rid of dead objects in young generations, it also lifts the restriction of having to stay in the ephemeral segment – if we need to expand the heap while a background GC is going on, we can do so in a gen1 GC.
We also made some performance improvements in background GC: it does more of its work concurrently, so the time we need to suspend managed threads is also shorter.
We are not offering background GC for server GC in V4.0. It’s under consideration – we recognize how important it is for server applications (which usually have much larger heaps than client apps) to benefit from lower latency, but the work did not fit in our V4.0 timeframe. For now, for server applications, I would recommend you look at the full GC notification feature we added in .NET 3.5 SP1. It’s explained here: http://msdn.microsoft.com/en-us/library/cc713687.aspx. Basically you register to get notified when a full GC is approaching and when it’s finished. This allows you to do software load balancing between different server instances – when a full GC is about to happen in one of the server instances, you can redirect new requests to other instances.
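As a rough sketch of how the notification feature can be wired up, the class below uses the GC.RegisterForFullGCNotification, GC.WaitForFullGCApproach and GC.WaitForFullGCComplete APIs. The FullGcWatcher class, the threshold values you pass in, and the redirectRequests/acceptRequests callbacks are hypothetical names used only for illustration; they stand in for whatever your load balancer integration looks like.

```csharp
using System;

// Illustrative sketch only: FullGcWatcher and its callbacks are hypothetical.
public static class FullGcWatcher
{
    // True only when the awaited notification was actually raised.
    public static bool IsSignaled(GCNotificationStatus status)
    {
        return status == GCNotificationStatus.Succeeded;
    }

    // Thresholds must be between 1 and 99. Registration throws
    // InvalidOperationException when concurrent GC is enabled.
    public static bool TryRegister(int gen2Threshold, int lohThreshold)
    {
        try
        {
            GC.RegisterForFullGCNotification(gen2Threshold, lohThreshold);
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }

    // Call this on a dedicated thread, and only after TryRegister returned true.
    public static void WatchLoop(Action redirectRequests, Action acceptRequests,
                                 Func<bool> keepRunning)
    {
        while (keepRunning())
        {
            // Poll with a timeout so the loop can notice keepRunning() turning false.
            if (!IsSignaled(GC.WaitForFullGCApproach(1000)))
                continue;

            redirectRequests();   // a full GC is likely to happen soon

            // Wait until the full GC has finished (or we are asked to stop).
            while (keepRunning() && !IsSignaled(GC.WaitForFullGCComplete(1000)))
            {
            }

            acceptRequests();     // take traffic again
        }
        GC.CancelFullGCNotification();
    }
}
```

A server instance would typically call TryRegister once at startup and, if it succeeds, run WatchLoop on a background thread.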
As 64-bit machines become more common, the problems we need to solve also evolve. In this post I’d like to talk about what it means for the GC and the applications’ memory usage when we move from 32-bit to 64-bit.
One big limitation of 32-bit is the virtual memory address space - as a user mode process you get 2GB, and if you use large address aware you get 3GB. A few years ago these seemed like giant numbers, but as more and more people start using the .NET Framework I've seen the sizes of managed heaps go up at quite a high rate. I remember when I first started working on GC (which was late 2004 I think) we were talking about hundreds of MBs of heaps - 300MB seemed like a lot. Today I am seeing managed heaps easily of GBs in size - and yes, some of them (and more and more of them) are on 64-bit - 2 or 3GB is just not enough anymore.
And along with this, we are shifting to solving a different set of problems. In CLR 2.0 we concentrated heavily on using the VM space efficiently. We tried very hard to reduce the fragmentation on the managed heap so when you get a hold of a chunk of virtual memory you can make very efficient use of it. So people don't see problems like they have N managed heap segments, are running out of VM, yet many of these segments are quite empty (meaning having a lot of free space on them).
Then you switch to 64-bit. Now suddenly you don't need to worry about VM anymore - you get plenty there. It is practically unlimited for many applications (of course it’s still limited – for example if you are running out of physical memory to even allocate the data structures for virtual pages then you still can’t reserve those pages). What kind of differences will you see in your managed memory usage?
First of all, your process consumes more memory - I am sure all of you are already aware of this - the pointer size is bigger - it's doubled on 64-bit, so if you don't change anything at all, your managed heap (which undoubtedly contains references) is now bigger. Of course being able to manipulate memory in QWORDs instead of DWORDs can also be beneficial – our measurements show that the raw allocation speed is slightly higher on 64-bit than on 32-bit, which can be attributed to this.
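To make the pointer size point concrete, here is a small sketch that counts only the bytes taken by the reference slots of an array of object references. PointerSizeDemo and ReferenceSlotBytes are illustrative names, and the calculation deliberately ignores object headers and the objects being pointed to.

```csharp
using System;

public static class PointerSizeDemo
{
    // Bytes used by the reference slots alone in an array of `count` references.
    public static long ReferenceSlotBytes(long count, int pointerSize)
    {
        return count * pointerSize;
    }

    public static void Main()
    {
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
        Console.WriteLine("Pointer size in this process: {0} bytes", IntPtr.Size);

        const long count = 1000000;
        Console.WriteLine("32-bit reference slots: {0} bytes", ReferenceSlotBytes(count, 4));
        Console.WriteLine("64-bit reference slots: {0} bytes", ReferenceSlotBytes(count, 8));
    }
}
```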
There are other factors that could make your process consume more memory - for example the module size is bigger (mscorwks.dll is about 5MB on x86, 10MB on x64 and 20MB on ia64), instructions are bigger on 64-bit and what have you.
Another thing you may notice - if you have looked at the performance counters under .NET CLR Memory - is that you are now doing a lot fewer GCs on 64-bit than what you used to see on 32-bit.
The curious minds might have already noticed one thing - the managed heap segments are much bigger in size on 64-bit. If you do !SOS.eeheap -gc you will now see way bigger segments.
Why did we make the segment size so much bigger on 64-bit? Well, remember that in Using GC Efficiently Part 2 we talked about how we have a budget for gen0, and when you've allocated more than this budget a GC will be triggered. When you have a bigger budget it means you’ll need to do fewer GCs, which means your code will get more chance to run. From this perspective you should get a performance gain when you move to 64-bit - I want to emphasize the “this perspective” part because in general things tend to run slower on 64-bit. The perf benefit you get because of GC may very well be obscured by other perf degradations. In reality many people are not expecting a perf gain when they move to 64-bit; rather, they are happy with being able to use more memory to handle more work load.
Of course we also don’t want to wait for too long before we collect – we strive for the right balance between memory (how much memory your app consumes) and CPU (how often user threads run).
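If you want to see the effect of the gen0 budget without perfmon, one option is to count gen0 GCs around a fixed amount of allocation using GC.CollectionCount and compare the number between a 32-bit and a 64-bit run of the same program. The sketch below assumes nothing about the actual budget values; Gen0CountDemo and its parameters are made up for illustration.

```csharp
using System;

public static class Gen0CountDemo
{
    // Keeps the most recent allocation reachable so the allocations are not optimized away.
    private static byte[] s_sink;

    // Returns how many gen0 GCs happened while allocating the requested amount.
    public static int CountGen0GCs(int allocations, int bytesEach)
    {
        int before = GC.CollectionCount(0);
        for (int i = 0; i < allocations; i++)
        {
            s_sink = new byte[bytesEach];
        }
        return GC.CollectionCount(0) - before;
    }

    public static void Main()
    {
        // Roughly 1GB of small, short-lived allocations.
        int gcs = CountGen0GCs(1000000, 1024);
        Console.WriteLine("Pointer size: {0} bytes, gen0 GCs observed: {1}", IntPtr.Size, gcs);
    }
}
```

The absolute numbers depend on the machine and the GC's tuning, so only the comparison between runs is meaningful.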
I was making some code changes today and thought this was interesting to share. As you know, the WeakReference class has a getter and a setter method to get and set the Target which is what the weakref points to. See Using GC Efficiently – Part 3 for more details on WeakReference.
Note that the code below is only for illustration purposes – it is not necessarily what’s in the production code.
So let’s say the code used to look like this in the WeakReference class:
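internal IntPtr m_handle;
public Object Target
{
    get
    {
        IntPtr h = m_handle;
        // The handle was already freed (the weakref has been finalized).
        if (IntPtr.Zero == h)
            return null;
        Object o = GCHandle.InternalGet(h);
        // Re-read the handle in case the finalizer zeroed it in the meantime.
        h = Thread.VolatileRead(ref m_handle);
        // Keep the weakref object itself alive up to this point.
        GC.KeepAlive (this);
        return (h == IntPtr.Zero) ? null : o;
    }
    ...
}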
m_handle is the weak GCHandle that we create to implement the weakref functionality. It’s a weak handle that points to the object that you want your weakref object to point to.
The problem is that Thread.VolatileRead is kind of a heavyweight thing - not very performant (there was a reason why we used this API in the first place…not necessarily a good one but it’s what we ended up with).
First of all let’s take a look at the old code. Notice that we have a GC.KeepAlive in there. Why would we do that? I mean if during the call of Thread.VolatileRead, the weakref object is dead and its finalizer sets m_handle to 0, we’d just read 0. That’s fine right?
But imagine this code:
Object o = new Object();
while (true)
{
    WeakReference wr = new WeakReference (o);
    if (wr.Target == null)
    {
        ...
    }
}
o.GetHashCode();
We know that o is live during the while loop, which means it would be really nice if wr.Target is never null (some would consider it a bug if wr.Target was ever null in this case). If we didn’t have the KeepAlive, object wr could be considered dead as soon as its address is passed in to the Thread.VolatileRead call as the argument. Its finalizer could then run and set m_handle to 0, so h could be 0 and we’d return null. So we want to make sure that if the object the weakref points to is guaranteed to be live during the getter call, you will always get back that object instead of null.
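The guarantee described above can be observed from user code with the public WeakReference API. This is only a sketch of the expected behavior, not the production getter: WeakRefDemo is an illustrative name, and GC.KeepAlive(o) is there to make sure o stays reachable across the forced collection regardless of how the JIT tracks its lifetime.

```csharp
using System;

public static class WeakRefDemo
{
    // Returns true if the weak reference still yields its target after a full GC
    // while a strong reference to that target is kept alive.
    public static bool TargetSurvivesWhileReferenced()
    {
        object o = new object();
        WeakReference wr = new WeakReference(o);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        bool sameObject = ReferenceEquals(wr.Target, o);
        GC.KeepAlive(o);   // o is guaranteed reachable up to this call
        return sameObject;
    }

    public static void Main()
    {
        Console.WriteLine("Target survived: {0}", TargetSurvivesWhileReferenced());
    }
}
```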
You might ask why we don’t simply read m_handle again in the last statement instead of calling Thread.VolatileRead and GC.KeepAlive. Wouldn’t that achieve the same effect, since m_handle is an instance data member and if it’s used, the *this* object should be kept alive up to where the last statement is?
Well, actually since m_handle is not volatile, the JIT could generate code that stores the value of m_handle in a register, so it would not need to read the value of m_handle again when it gets to that last statement.
Another interesting thing about this code is that it does the IntPtr.Zero check right after it first reads the value of m_handle. But how can m_handle be 0? m_handle is only set to 0 in the WeakReference class’s finalizer code. If we are already KeepAlive-ing the weakref object, it means the object should be live, therefore by definition the finalizer should not have run, right?
Well, unfortunately there is a case where m_handle can be 0 while we are in the getter, which is when the getter is called in an object’s finalizer. Imagine you have this object hierarchy:
public class ObjectA
{
    WeakReference wr;
    public ObjectA(WeakReference wr0)
    {
        ...
        wr = wr0;
    }
    ...
    // C# finalizers cannot carry an access modifier.
    ~ObjectA()
    {
        ...
        if (wr.Target == null)
        {
            ...
        }
        ...
    }
}
ObjectA a = new ObjectA(wr0);
When a is not used anymore, at some point both a’s and wr0’s finalizer (assuming a is the only object that contains a reference to wr0) will be put on the finalize queue.
Now if a’s finalizer gets to run first, we are fine ‘cause when we are in wr0’s getter, m_handle is still valid. But if wr0’s finalizer gets to run first, then when a’s finalizer is run, m_handle is already set to 0. Of course as we’ve been saying that a finalizer should do no more than releasing native resources, this shouldn’t be a common scenario.
So, the idea is to change m_handle to volatile and eliminate the need for calling Thread.VolatileRead. The resulting code looks like this:
internal volatile IntPtr m_handle;
public Object Target
{
    get
    {
        IntPtr h = m_handle;
        if (IntPtr.Zero == h)
            return null;
        Object o = GCHandle.InternalGet(h);
        return (m_handle == IntPtr.Zero) ? null : o;
    }
    ...
}
Notice that KeepAlive is gone because the last statement reads a volatile value, which means the weakref object will be kept alive throughout the getter. We still need to check if h is IntPtr.Zero at the beginning because we can still be called from another object’s finalizer.
Some people don’t use volatiles for fear of losing performance ‘cause volatile accesses can’t be optimized by the compiler. In reality though, if you read the volatile into a local when you need to access it frequently, and you are fine with using the cached value, there is no reason to be afraid of using volatiles.
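The "read the volatile into a local" advice can be sketched like this. Sampler, m_scale and SumScaled are made-up names for illustration; the point is only that the loop performs a single volatile read and then works with the cached copy.

```csharp
using System;

public sealed class Sampler
{
    // May be changed by another thread at any time.
    private volatile int m_scale = 1;

    public void SetScale(int scale)
    {
        m_scale = scale;
    }

    public long SumScaled(int[] values)
    {
        // One volatile read; the loop below is free to keep `scale` in a register.
        int scale = m_scale;
        long sum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            sum += (long)values[i] * scale;
        }
        return sum;
    }

    public static void Main()
    {
        Sampler s = new Sampler();
        s.SetScale(3);
        Console.WriteLine(s.SumScaled(new int[] { 1, 2, 3 }));
    }
}
```

This is only appropriate when a slightly stale value is acceptable for the duration of the loop, which is the condition stated above.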
.NET CLR Memory\% Time in GC counter and !runaway on thread(s) doing GC.
The 2 common ways people use to look at the time spent in GC are the % Time in GC performance counter under .NET CLR Memory, and the CPU time displayed by the !runaway debugger command in cdb/windbg. What do they mean exactly? % Time in GC is calculated like this:
When the nth GC starts (i.e., after the managed threads are suspended and before the GC work starts), we record the timestamp at that time. Let’s call this TA(n);
When the nth GC ends (i.e., after the GC work is done and before we resume the managed threads), we record another timestamp. Let’s call this TB(n);
So Time spent in this GC is TB(n) – TA(n). And the time since the last GC ended is TB(n) – TB(n-1). So % Time in GC is (TB(n) – TA(n)) / (TB(n) – TB(n-1)).
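As a worked example of the formula, the sketch below plugs in made-up timestamps (in arbitrary ticks): the previous GC ended at 900, the current GC started at 980 and ended at 1000. TimeInGcDemo is an illustrative name, not anything in the CLR.

```csharp
using System;

public static class TimeInGcDemo
{
    // tbPrev = TB(n-1), ta = TA(n), tb = TB(n), all in the same time unit.
    public static double PercentTimeInGC(long tbPrev, long ta, long tb)
    {
        return 100.0 * (tb - ta) / (tb - tbPrev);
    }

    public static void Main()
    {
        // GC took 20 ticks out of the 100 ticks since the previous GC ended.
        Console.WriteLine("{0}%", PercentTimeInGC(900, 980, 1000));
    }
}
```

With these numbers the GC accounts for 20 of the 100 ticks since the previous GC ended, so the counter would read 20%.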
Since we only record the timestamps we don’t actually discount the time when the thread was switched out – for example, if you are on a single proc machine and another process has a thread of the same priority (as the thread that’s doing GC) that’s also ready to run, it may take away some time from the thread that’s doing the GC. Nonetheless it’s a good approximation.
One common scenario where it’s not a good approximation is when paging occurs. In this case you will see a very high % Time in GC but really the time that’s actually spent doing GC work is low ‘cause most of the time is spent doing IO. To verify if you hit this case you can look at the Memory\Pages/sec to see how much it’s paging.
!runaway is more accurate in the sense that it does record the actual time spent on the threads. However I did observe 2 common mistakes when using !runaway to look at time spent in GC.
1) mistaking “time spent on the GC thread(s)” for “time spent in GC”.
Let’s take Server GC as an example. When a GC is needed the GC threads first perform the suspension work. Obviously this takes time. Sometimes it can take quite a bit of time if you have many threads or some threads get stuck and it takes a long time to suspend them.
2) using the current !runaway output to judge how much time GC has taken without taking into account that some user threads have died. Besides looking at the User/Kernel Mode time you may want to also look at Elapsed Time. The following output is from one of the issues I looked at:
Elapsed Time
Thread Time
0:2094 0 days 11:57:10.406
1:2098 0 days 11:57:10.152
2:2044 0 days 11:57:09.898
3:27cc 0 days 11:57:09.882
4:20c0 0 days 11:57:09.597
13:810 0 days 11:57:09.565
12:21d4 0 days 11:57:09.565
11:2308 0 days 11:57:09.565
10:1d24 0 days 11:57:09.565
9:23e0 0 days 11:57:09.565
8:24d0 0 days 11:57:09.565
7:2168 0 days 11:57:09.565
6:2134 0 days 11:57:09.565
5:2124 0 days 11:57:09.565
14:570 0 days 11:57:08.582
15:10fc 0 days 11:57:00.000
16:1ac4 0 days 11:56:58.256
17:1900 0 days 11:56:58.176
18:1624 0 days 11:56:57.494
19:11f8 0 days 10:35:01.881
21:1620 0 days 0:41:45.017
22:1cc8 0 days 0:37:30.496
23:2de0 0 days 0:29:01.820
24:2e10 0 days 0:29:01.711
25:2d88 0 days 0:22:26.988
26:1668 0 days 0:19:26.175
27:2cb0 0 days 0:16:16.814
29:2d1c 0 days 0:11:53.779
28:1de8 0 days 0:11:53.779
30:2ed4 0 days 0:11:35.466
32:2030 0 days 0:11:22.544
31:2084 0 days 0:11:22.544
33:12b0 0 days 0:07:53.510
35:28e4 0 days 0:03:13.801
34:2654 0 days 0:03:13.801
36:2d68 0 days 0:02:31.988
The GC threads are among the threads that were created about 12 hours ago. The user threads near the bottom of the list were created only 2 to 40 minutes ago. So naturally, at the current state, these user threads couldn’t have accumulated much time.
Both the !SOS.gchandles command (added in CLR 2.0) and the .NET CLR Memory\# GC Handles counter show you the number of GC handles you have in your process.
The # GC Handles counter is one of the rare counters in the .NET CLR Memory category that doesn’t get updated at the end of each GC. Rather we update it in the handle table code: for example, when some code in the CLR calls the function to create a GC handle (possibly because the user has requested to create a handle via managed code), we increase this counter value by one. For performance reasons we don’t use interlocked operations when we need to increase or decrease this value. This means the value can get changed concurrently by multiple threads, so an update can occasionally be lost and the counter can drift from the true number. For this reason you should always trust the value returned by the !SOS.gchandles command if you ever doubt the counter value. The SOS command is accurate because we always walk the handle table when you issue the command, so it returns the # of handles truthfully.
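The difference between a plain increment and an interlocked one is easy to demonstrate in isolation. The sketch below is not the handle table code; CounterDemo and its fields are made up. Several threads bump two counters, one with ++ and one with Interlocked.Increment, and only the interlocked one is guaranteed to end up at the exact total.

```csharp
using System;
using System.Threading;

public static class CounterDemo
{
    private static int s_plain;        // updated without synchronization
    private static int s_interlocked;  // updated atomically

    public static int Plain { get { return s_plain; } }
    public static int Exact { get { return s_interlocked; } }

    public static void Run(int threads, int incrementsPerThread)
    {
        s_plain = 0;
        s_interlocked = 0;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < incrementsPerThread; i++)
                {
                    s_plain++;                                 // racy: updates can be lost
                    Interlocked.Increment(ref s_interlocked);  // never loses an update
                }
            });
            workers[t].Start();
        }
        foreach (Thread w in workers)
        {
            w.Join();
        }
    }

    public static void Main()
    {
        Run(4, 1000000);
        Console.WriteLine("plain = {0}, interlocked = {1}", Plain, Exact);
    }
}
```

On a multi-proc machine the plain counter will often come out lower than the interlocked one, which is the same kind of drift the perf counter is exposed to.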
Managed Heap Size
We have both .NET CLR Memory perf counters and SoS extensions that report managed heap size related data.
Difference 2
There are a few .NET CLR Memory counters that are related to the managed heap size:
# Total Committed Bytes
# Total Reserved Bytes
I explained what these counters mean here.
Now, how are they related to the values you see when you do a !SOS.eeheap –gc?
0:003> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x01245078
generation 1 starts at 0x0124100c
generation 2 starts at 0x01241000
ephemeral segment allocation context: (0x0125a900, 0x0125b39c)
segment begin allocated size
001908c0 793fe120 7941d8a8 0x0001f788(128904)
01240000 01241000 0125b39c 0x0001a39c(107420)
Large object heap starts at 0x02241000
segment begin allocated size
02240000 02241000 02243250 0x00002250(8784)
Total Size 0x3bd74(245108)
------------------------------
GC Heap Size 0x3bd74(245108)
The allocated column indicates the end of the last live object on the segment. So for gen0 and LOH it changes as the managed threads allocate, unlike the .NET CLR Memory counters, which only reflect the end of the last live object on the segment as of the end of the last GC.
The # Bytes in All Heaps counter under the .NET CLR Memory category is kind of misleading. The explanation says it’s the bytes in “all heaps” but really it’s gen1+gen2+LOH in CLR 2.0; it doesn’t include gen0 ‘cause most of the time gen0 size is 0 right after a GC. So if you break into your process under the debugger right after a GC and use the !sos.eeheap –gc command you will most likely get a value that’s the same as this counter. But if you break between 2 GCs the value you get from !eeheap –gc will be larger than the value of this counter, because it also includes what has been allocated since the last GC.
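You can check how the columns in the !eeheap -gc output relate to each other with a bit of arithmetic: each segment's size is its allocated address minus its begin address, and the sizes add up to the reported GC Heap Size. The sketch below just replays the numbers from the output above; SegmentSizeDemo is an illustrative name.

```csharp
using System;

public static class SegmentSizeDemo
{
    // size = allocated - begin, as shown in the "size" column.
    public static long SegmentSize(long begin, long allocated)
    {
        return allocated - begin;
    }

    // Sum of the three segments listed in the sample output.
    public static long TotalSize()
    {
        return SegmentSize(0x793fe120, 0x7941d8a8)   // 0x1f788 (128904)
             + SegmentSize(0x01241000, 0x0125b39c)   // 0x1a39c (107420)
             + SegmentSize(0x02241000, 0x02243250);  // 0x2250  (8784)
    }

    public static void Main()
    {
        Console.WriteLine("GC Heap Size 0x{0:x}({0})", TotalSize());
    }
}
```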
If you are concerned with the true amount of memory that the managed heap commits (which is usually what you need to worry about) you should use the # Total Committed Bytes counter.
So, there are many perf tools and some of them report either the same or the same type of data. I want to talk about various differences between the ones related to managed heap investigation. This is not supposed to cover everything, just the ones I think people use frequently.
Managed Heap Size
We have both .NET CLR Memory perf counters and SoS extensions that report managed heap size related data.
Difference 1
The values reported by the .NET CLR Memory perf counters are affected by
1) how often they are collected - that is, the sampling interval you give the tool you use (most people use perfmon; internally I know many groups' test teams collect them as part of their automation). The most frequent interval perfmon gives you is once per second.
2) how often they are updated. Most .NET CLR Memory perf counters are updated only at the end of each GC.
So for example, assuming you are using perfmon to collect at one sample per second and if more than one GC happened in the past second, you'll be missing the intermediate values.
And since these perf counters are updated only every so often, before a value is refreshed it stays the same value as it was updated last time. For example, if you are looking at % Time in GC and say the last value was 80% and if no GC happens for 10 seconds, this counter will stay at 80% but really during the past 10 seconds no GCs were happening.
On the other hand, SOS data is only collected when you request it, which means you get the value as of the time you ran the SOS command (unless of course the command specifically tells you it reflects a value updated at some specific point in the past).
The WKS:: and SVR:: namespaces were introduced in CLR 2.0. For example, the GCHeap::GcCondemnedGeneration symbol is WKS::GCHeap::GcCondemnedGeneration for Workstation GC and SVR::GCHeap::GcCondemnedGeneration for Server GC (this matters if you are reading the Investigating Memory Issues article in the recent MSDN magazine and are trying out some of the debugger commands mentioned in there).
If you are using CLR 1.1 or prior, the Workstation version lives in mscorwks.dll while the Server version lives in mscorsvr.dll so the symbol names are not prefixed with WKS:: or SVR::. So the breakpoint
bp mscorwks!WKS::GCHeap::RestartEE "j (dwo(mscorwks!WKS::GCHeap::GcCondemnedGeneration)==2) 'kb';'g'"
should be
bp mscorwks!GCHeap::RestartEE "j (dwo(mscorwks!GCHeap::GcCondemnedGeneration)==2) 'kb';'g'"
Many people know Patrick Dussud by his outstanding work on Garbage Collection. But did you know he was one of the founders of the CLR? In his intro blog entry he talks about how the CLR came to life. I am sure it will be a great read for those of you who are curious about it.
Last time I talked about the hang scenario where your process is taking 0 CPU and the CPU is taken by other process(es) on the same machine.
The next scenario is your process is taking 0 CPU and the CPU is barely used by other processes.
As one of the readers correctly pointed out, this is very likely because you have a deadlock. Usually debugging deadlocks is relatively straightforward – you look at what the threads are waiting on and figure out which other threads are holding the lock(s). And there are plenty of online resources that talk about debugging deadlocks. If you use the Windows Debugger package there are built-in debugger extension dlls that help you with this, with commands such as !locks. If you are debugging a managed app the SoS debugger extension has commands that will aid you - !SyncBlk shows you managed locks (for CLR 2.0 there’s also !Dumpheap –thinlock for objects locked with ThinLocks instead of SyncBlk’s).
Another possibility is that your process is not doing any CPU related activities. A common activity is IO – for example if the process is heavily paging you will see almost 0 CPU usage but it appears hung because the memory it needs is getting loaded from the disk, which is really slow. A very useful tool that shows you what processes are doing is Process Monitor. Yesterday a program on my machine paused periodically – very annoying. So I used Process Monitor, which showed me that this program periodically checks if I am logged onto my account in another program and, since I am not, it would log me on, do a little bit of stuff, then log me off. And the hang was due to waiting on network IO. So to make it happy I logged myself on and the annoying periodic hang disappeared.
Now if your process is indeed taking CPU it can also appear to hang – as I mentioned last time this means different things for different people. If you have a UI app this can mean the UI is not getting drawn; if you have a server app this can mean your app is not processing requests. So you’ll have to define what hang means to you. I will use server app not processing requests as an example. Usually server applications run on dedicated machines. So let’s assume that’s the case here – you run a server on a machine and the server could consist of multiple processes. You measure the server performance by throughput. One scenario is the CPU usage is high (perhaps even higher than usual) but the throughput is lower than usual.
The easiest case, as one of the readers pointed out, is an infinite loop – very easy to debug. You break into the debugger a few times and see a thread is taking all the CPU and that thread can not exit some function – so there goes your infinite loop. And if your process is pretty much the only process that uses CPU at the time this is super obvious. It gets a bit more complicated if you have multiple CPUs and other processes are also using CPUs. But still since it’s an infinite loop the nice thing is it will always be executing if you don’t interfere so it’s always available to you to investigate as soon as it happens.
It becomes hard if the hang only repros sporadically and when it repros it only lasts for a little while. Time to whip out a CPU profiler. As another reader pointed out, Process Explorer is a useful tool to get you started. It shows you which processes are “active” – meaning they are using CPU. Personally I start with collecting appropriate performance counters, partially because pretty much all test teams in the product groups at Microsoft have some sort of automated testing procedure that collects perf counters, so requesting them is easy. And because of the low overhead you can collect them for a long period of time so you have a history to look at.
These are the counters I usually request (comments in []’s):
Processor\% Processor Time for _Total and all processors
[This is so I have an idea what kind of CPU usage I am looking at and if there are particular processors that get used more than the rest]
Process\% Processor Time for all processes, or all except the ones that you already know cannot be the problem
Thread\% Processor Time for all processes, or all except the ones that you already know cannot be the problem
[The above counters will tell you which threads are using the CPU so you know which threads to look at]
[Since I usually look at GC related issues I request all counters under .NET CLR Memory]
.NET CLR Memory counters for all managed processes, or all except the ones that you already know cannot be the problem
[If you are looking at other things you should add appropriate counters – for example, ASP.NET counters for apps that use ASP.NET]
[If you know the kind of activities your processes do you can add appropriate counters for them. For me I often request memory related counters like:]
Memory\% Committed Bytes In Use
Memory\Available Bytes
Memory\Pages/sec
Process\Private Bytes for processes I am interested in
…
At this point I can look at the results and concentrate on the interesting parts – for example when the CPU is unusually high. I will have an idea which threads in what processes are consuming the CPU and the aspects of them that interest me (usually GC and other memory activities). Then I can request more detailed data on those processes/threads. For example I can ask the person to use a sampling profiler so I can see what functions are executing in the part I am interested in (along with other info – this depends on what the profiler you are using is capable of).
Some people prefer to take memory dumps when the process hangs. This doesn’t necessarily work (when it works it’s great): if the hang is related to timing or how threads are scheduled, the threads can easily behave differently when you interrupt them in order to take memory dumps, so the hang may not repro anymore. If you do have consecutive dumps from one hang then you can use the !runaway command to see which threads have been consuming CPU. One dump is hardly useful for debugging hangs because it only tells you how the process behaves at one point in time.
!address is a very powerful debugger command. It shows you exactly what your VM space looks like. If you already got the output from the !sos.eeheap -gc command (refer to this article for usage on sos), for example:
0:003> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x01245078
generation 1 starts at 0x0124100c
generation 2 starts at 0x01241000
ephemeral segment allocation context: (0x0125a900, 0x0125b39c)
segment begin allocated size
001908c0 793fe120 7941d8a8 0x0001f788(128904)
01240000 01241000 0125b39c 0x0001a39c(107420)
Large object heap starts at 0x02241000
segment begin allocated size
02240000 02241000 02243250 0x00002250(8784)
Total Size 0x3bd74(245108)
------------------------------
GC Heap Size 0x3bd74(245108)
you can correlate the segments with the output of !address to get a better view of them. For this specific case here's the excerpt of the output from the !address command:
0:003> !address
[omitted]
01232000 : 01232000 - 0000e000
Type 00000000
Protect 00000001 PAGE_NOACCESS
State 00010000 MEM_FREE
Usage RegionUsageFree
01240000 : 01240000 - 00052000
Type 00020000 MEM_PRIVATE
Protect 00000004 PAGE_READWRITE
State 00001000 MEM_COMMIT
Usage RegionUsageIsVAD
01292000 - 00fae000
Type 00020000 MEM_PRIVATE
Protect 00000000
State 00002000 MEM_RESERVE
Usage RegionUsageIsVAD
02240000 - 00012000
Type 00020000 MEM_PRIVATE
Protect 00000004 PAGE_READWRITE
State 00001000 MEM_COMMIT
Usage RegionUsageIsVAD
02252000 - 00fee000
Type 00020000 MEM_PRIVATE
Protect 00000000
State 00002000 MEM_RESERVE
Usage RegionUsageIsVAD
03240000 : 03240000 - 73050000
Type 00000000
Protect 00000001 PAGE_NOACCESS
State 00010000 MEM_FREE
Usage RegionUsageFree
76290000 : 76290000 - 00001000
Type 01000000 MEM_IMAGE
Protect 00000002 PAGE_READONLY
State 00001000 MEM_COMMIT
Usage RegionUsageImage
FullPath C:\WINDOWS\system32\IMM32.DLL
76291000 - 00015000
Type 01000000 MEM_IMAGE
Protect 00000020 PAGE_EXECUTE_READ
State 00001000 MEM_COMMIT
Usage RegionUsageImage
FullPath C:\WINDOWS\system32\IMM32.DLL
[omitted]
-------------------- Usage SUMMARY --------------------------
[omitted]
-------------------- State SUMMARY --------------------------
TotSize Pct(Tots) Usage
0275c000 : 1.92% : MEM_COMMIT
7b20a000 : 96.20% : MEM_FREE
0268a000 : 1.88% : MEM_RESERVE
Largest free region: Base 03240000 - Size 73050000
This says that the 2 segments (starting from 01240000 and 02240000) are adjacent to each other - part of them are committed, the rest is still reserved memory. Before and after the 2 segments we got some free space there. As I mentioned below it’s very unlikely that the managed heap is fragmenting the VM because we are good about requesting large chunks at a time and usually the OS is not bad at giving us addresses that are pretty contiguous if possible. One of the very few cases where you would see managed heap fragmenting VM is if you have temporary large object segments and GC needs to frequently acquire and release VM chunks. Those chunks could be scattered in the VM space especially considering there are other things that consume VM as well at the same time.
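The adjacency claim can be verified from the !address excerpt itself: for each segment, base + committed size + reserved size should land exactly on the base of the next region. The sketch below replays that arithmetic with the addresses shown above; RegionDemo is an illustrative name.

```csharp
using System;

public static class RegionDemo
{
    // End address of a segment made of a committed part followed by a reserved part.
    public static long SegmentEnd(long baseAddress, long committedSize, long reservedSize)
    {
        return baseAddress + committedSize + reservedSize;
    }

    public static void Main()
    {
        long firstEnd = SegmentEnd(0x01240000, 0x00052000, 0x00fae000);
        long secondEnd = SegmentEnd(0x02240000, 0x00012000, 0x00fee000);

        // The first segment ends where the second (LOH) segment begins,
        // and the second ends where the large free region begins.
        Console.WriteLine("first segment ends at  {0:x8}", firstEnd);
        Console.WriteLine("second segment ends at {0:x8}", secondEnd);
    }
}
```

Each segment spans 0x1000000 bytes (16MB) of reserved address space, of which only the first part is committed.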
We have a new MSDN article out in the November issue that talks about investigating managed memory issues.
http://msdn.microsoft.com/msdnmag/issues/06/11/CLRInsideOut/default.aspx?loc=en
Take a look and let me know what you think.
Oh, and it's also in 6 other languages (German, Spanish, French, Russian, Portuguese and Chinese) for readers that prefer one of those languages.
Defining “hang” is a good place to start.
When people say “hang” they could mean all sorts of things. When I say “hang” I mean the process is not making progress – the threads in the process are either blocked (e.g. deadlocked, or not scheduled because of threads from other processes) or executing code (madly) but not doing useful work (e.g. an infinite loop, or busy spinning for a long time without doing useful work). The former uses no CPU while the latter uses 100% CPU. When a UI developer says “hang” he could mean “the UI is not getting drawn”, so essentially he means the UI threads are not working – other threads in the process could be doing lots of work but since the UI is not getting updated it appears hung. So clarifying what you mean when you say “hang”, which requires you to look at your process and its threads, is the first step.
If you start Task Manager (taskmgr.exe) it shows you how much CPU each process is using currently. If you don’t see a CPU column you can add it by clicking View\Select Columns and check the “CPU Usage” checkbox.
Note that if you have multiple CPUs, the CPU usage is at most 100. Let’s say you have 4 CPUs and your process has one thread that’s running and taking all the CPU it can you will see the CPU column for your process 25 – since your process can only use one CPU (at most to its full) at any given time.
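To make the arithmetic concrete, here is a minimal sketch in Python. The function name and its parameters are made up for illustration, and it assumes every busy thread is fully CPU-bound and nothing else competes for the processors.

```python
def task_manager_cpu_percent(busy_threads, cpu_count):
    """Rough CPU column value for a process with fully busy threads.

    Hypothetical helper: assumes each busy thread saturates one CPU
    and that a thread can only run on one CPU at any given time.
    """
    # No more threads can run at once than there are CPUs.
    running = min(busy_threads, cpu_count)
    # The column is scaled so that all CPUs busy reads as 100.
    return 100 * running / cpu_count


print(task_manager_cpu_percent(1, 4))  # one spinning thread on 4 CPUs -> 25.0
print(task_manager_cpu_percent(8, 4))  # more busy threads than CPUs -> 100.0
```

So on a 4 CPU machine a process stuck in a single-threaded infinite loop reads as 25, not 100.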
The CPU usage for a process is calculated as the CPU usage used by all the threads that belong to the process. Threads are what get to run on the CPUs. They get scheduled by the OS scheduler which decides when to run what thread on which processor. I won’t cover the details here – the Windows Internals book by Russinovich and Solomon covers it.
If you see your process is taking 0 CPU, that would explain why it’s hung (for the period of time when the CPU keeps being 0) – no threads are getting to run in your process! The next thing to look at is the CPU usage of other processes. If you see one or multiple other processes that take up all the CPU that means the threads in your process simply don’t get a chance to run – this is because the threads in those other processes are of higher priorities (or temporarily of higher priorities due to priority boosting) than the threads in your process. The possible causes are:
1) there are threads that are marked as low priority which acquired locks that other threads in your process need in order to run, and the low priority threads are preempted by other (normal or high) priority threads from those other processes. This happens when people mistakenly use low priority threads to do “unimportant work” or “work that doesn’t need to be done in a timely fashion” without realizing that it’s nearly impossible to avoid taking locks on those threads. I’ve heard many people say “but I am not taking a lock on my low priority threads”, which is not a valid argument because the APIs you call or the OS services you use can take locks in order to run your code – allocating on the native NT heap can take locks; even triggering a page fault can take locks (which is not something an application developer can control in his code).
2) the threads in your process are of normal priority but those other processes have high priority threads – this should be relatively easy to diagnose (and unless some process is simply a bad citizen this rarely happens) – you can take a look at what those processes are doing (again looking at their threads’ callstacks is a good place to start).
That’s all for today. Next time I will talk about other hang scenarios and techniques to debug them.
So far I’ve never written a blog entry that gives out philosophical advice on doing performance work. But lately I thought perhaps it’s time to write such an entry because I’ve seen enough people who looked really hard at some performance counters (often not the correct ones) or some other data and asked tons of questions such as “is this allocation rate too high? It looks too high to me.” or “my gen1 size is too big, right? It seems big…”, before they had enough evidence to even justify such an investigation and such questions.
Now, if you are just asking questions to satisfy your curious mind, that’s great. I am happy to answer questions or point you at documents to read. But for people who are required to investigate performance related issues, especially when the deadline is close, my advice is “understand the problem before you try to find a solution”. Determine what to look at based on evidence, not based on your lack of knowledge in the area, unless you have already exhausted the areas that you do know about. Before you ask questions related to GC, ask yourself if you think GC is actually the problem. If you can’t answer that question it really is not a good use of your time to ask questions related to GC.
I’ve seen many cases where something seems to go wrong in a managed application and people immediately suspect GC without any evidence to support that suspicion. Then they start to ask questions – often very random – hoping that they would somehow find the solution to fix their problem without understanding what the problem is. That’s not logical, is it? So stop doing it!
So how do you know the right problems to solve? I would recommend the following:
1) Having knowledge about fundamentals really helps.
What are fundamentals? Performance in general comes down to 2 things – memory and CPU. Knowing basics of these 2 areas helps a lot in determining which area to look at. Obviously this involves a lot of reading and experimenting. I will list some memory fundamentals to get you started:
Some fundamentals of memory
· Each process has its own separate virtual address space; all processes on the same machine share the physical memory (plus the page file if you have one). On 32-bit each process has a 2GB user mode virtual address space by default.
· You, as an application author, work with virtual address space – you don’t ever manipulate physical memory directly. If you are writing native code usually you use the virtual address space via some kind of win32 heap APIs (crt heap or the process heap or the heap you create) – these heap APIs will allocate and free virtual memory on your behalf; if you are writing managed code, GC is the one who allocates/frees virtual memory on your behalf.
· Virtual address space can get fragmented – in other words, there can be “holes” (free blocks) in the address space. When a VM allocation is requested, the VM manager needs to find one free block that’s big enough to satisfy that allocation request – if you only have several smaller free blocks whose sizes add up to enough, it won’t work. This means even though you have 2GB, you can’t necessarily use all 2GB.
· VM can be in different states – free, reserved and committed. Free is easy. The difference between reserved and committed is what confuses people sometimes. First of all, you need to recognize that they are different states. Reserved is saying “I want to make this region of memory available for my own use”. After you reserve a block of VM that block cannot be used to satisfy other reserve requests. At this point you cannot store any of your data in that block of memory yet – to be able to do that you have to commit it, which means you have to back it with some physical storage so you can store stuff in it. When you are looking at the memory via perf counters and whatnot, make sure you are looking at the right things. You can get out of memory if you are running out of space to reserve, or space to commit.
· If you have a page file (by default you do) you can be using it even if your physical memory pressure is very low. What happens is the first time your physical memory pressure gets high and the OS needs to make room in the physical memory to store other data, it will back up some data that’s currently in physical memory in the page file. And that data will not be paged in until it’s needed so you can get into situations where the physical memory load is very low yet you are observing paging.
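To illustrate the fragmentation point from the list above, here is a toy model in Python. The block layout, sizes and function names are all invented for illustration; this is not how the OS or the GC actually tracks virtual memory, only a sketch of why total free space and the largest free block are different questions.

```python
MB = 1024 * 1024

# Hypothetical address space layout: (start, size, state) per block.
# States follow the ones described above: free, reserved, committed.
address_space = [
    (0 * MB, 300 * MB, "free"),
    (300 * MB, 200 * MB, "committed"),
    (500 * MB, 400 * MB, "free"),
    (900 * MB, 100 * MB, "reserved"),
    (1000 * MB, 300 * MB, "free"),
]


def total_free(blocks):
    """Sum of all free blocks, contiguous or not."""
    return sum(size for _, size, state in blocks if state == "free")


def largest_free_block(blocks):
    """Size of the biggest single free block (0 if there is none)."""
    return max((size for _, size, state in blocks if state == "free"), default=0)


def can_reserve(blocks, request):
    """A reserve request needs ONE free block that is big enough."""
    return largest_free_block(blocks) >= request


print(total_free(address_space) // MB)          # 1000
print(largest_free_block(address_space) // MB)  # 400
print(can_reserve(address_space, 600 * MB))     # False
```

In this made-up layout 1000MB is free in total, yet a 600MB request fails because no single hole is larger than 400MB.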
2) Knowing what your performance requirements are is a must.
If you are writing a server application, it’s very likely that you want to use all the memory and CPU that’s available because people dedicate the machine completely to running your app, so why waste resources? If you are writing a client application, it’s a totally different story – you’ll have to know how to cope with other applications running on the same machine. There are no universal rules such as “you have to make your app use as little memory as possible”.
When you’ve decided that there is a problem, dig into it instead of guessing what might be wrong. If your app is using too much memory, look at who is using the memory. If you’ve decided that the managed heap is using too much memory, look at why. The managed heap using too much memory generally means too much survives in your app. Look at what is holding on to those survivors.