@TessFerrandez
In December I blogged about a little tool that i wrote to analyze hangs in dumps, and i showed the following output, but didnt really get into the details of why the process was stuck in here...
____________________________________________________________________________________________________________________ GC RELATED INFORMATION ____________________________________________________________________________________________________________________ The following threads are GC threads: 18 19 The following threads are waiting for the GC to finish: 14 16 24 26 27 28 30 31 36 37 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 62 63 64 65 66 67 68 69 70 71 72 73 74 77 78 The GC was triggered by thread: 75 The GC is working on suspending threads to continue with garbage collection The following threads can't be suspended because preemptive GC is disabled: 23 25 33 34 35 38 56 61 The Finalizer (Thread 20) is not blocked
The issue the customer is running into is a hang during heavy load. The only way to get out of the hang is to recycle the process (IISReset).
Debugging the issue:
I have seen this issue before on a few occations and although as you will see later it was currently fixed in the framework, but in my earlier cases on this we ended up not needing a fix since the customers I worked with made code changes that made it such that they no longer were subject to the issue.
So what is going on here?
Thread 75 triggered a garbage collection by making an allocation that would have made Gen 0 go over its allocation budget.
0:075> kb 2000 ChildEBP RetAddr Args to Child 1124de6c 7c822124 77e6bad8 000002e8 00000000 ntdll!KiFastSystemCallRet 1124de70 77e6bad8 000002e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 1124dee0 79e718fd 000002e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 1124df24 79e718c6 000002e8 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199 1124df74 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117 1124df84 7a0d0d0f ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17 1124dfa8 7a0d5289 ffffffff 000d4558 106cb970 mscorwks!SVR::gc_heap::wait_for_gc_done+0x99 1124dfcc 7a0d5fa2 00000000 00000000 00000020 mscorwks!SVR::GCHeap::GarbageCollectGeneration+0x267 1124e058 7a0d691f 106cb970 00000020 00000000 mscorwks!SVR::gc_heap::try_allocate_more_space+0x1c0 1124e078 7a0d7ecc 106cb970 00000020 00000000 mscorwks!SVR::gc_heap::allocate_more_space+0x2f 1124e098 7a08bd32 106cb970 00000020 00000002 mscorwks!SVR::GCHeap::Alloc+0x74 1124e0b4 79e7b43e 00000020 00000000 00080000 mscorwks!Alloc+0x60 1124e180 79e8f41c 79157f42 1124e230 00000001 mscorwks!AllocateArrayEx+0x1d1 1124e244 7937f5c2 064f60b8 064f60b8 064f60b8 mscorwks!JIT_NewArr1+0x167 1124e27c 5088a509 00000000 00000000 00000000 mscorlib_ni!System.Reflection.RuntimeMethodInfo.GetParameters()+0x4a ...
Nothing strange there, allocation and garbage collection happens all the time... however a GC is usually extremely fast and in this case we have 45 threads waiting for the GC to finish...
0:014> kb ChildEBP RetAddr Args to Child 01a0fc74 7c822124 77e6bad8 000002e4 00000000 ntdll!KiFastSystemCallRet 01a0fc78 77e6bad8 000002e4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 01a0fce8 79e718fd 000002e4 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 01a0fd2c 79e718c6 000002e4 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199 01a0fd7c 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117 01a0fd8c 7a0851cb ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17 01a0fd9c 79f40e96 00000000 13d43bb8 0e874858 mscorwks!SVR::GCHeap::WaitUntilGCComplete+0x32 01a0fdd8 79e7385b 00000001 7a0e607b 00000001 mscorwks!Thread::RareDisablePreemptiveGC+0x1a1 01a0fde0 7a0e607b 00000001 13d43838 00000102 mscorwks!GCHolder<1,0,0>::GCHolder<1,0,0>+0x2d 01a0fe2c 7a0e673e 00000000 7a393704 7a114dea mscorwks!Thread::OnThreadTerminate+0x53 01a0fe38 7a114dea 0e874858 13d4385c 00000000 mscorwks!DestroyThread+0x43 01a0fe94 79f79c4f 00000000 00000000 00000000 mscorwks!ThreadpoolMgr::CompletionPortThreadStart+0x33d 01a0ffb8 77e6608b 000c4c00 00000000 00000000 mscorwks!ThreadpoolMgr::intermediateThreadProc+0x49 01a0ffec 00000000 79f79c09 000c4c00 00000000 kernel32!BaseThreadStart+0x34
so for some reason the GC appears to be taking some time...
There are 2 GC threads (one per logical processor), thread 18 and 19 in this case... Thread 19 is simply waiting for work
0:019> kb ChildEBP RetAddr Args to Child 01e2fd68 7c822124 77e6bad8 000002d0 00000000 ntdll!KiFastSystemCallRet 01e2fd6c 77e6bad8 000002d0 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 01e2fddc 79e718fd 000002d0 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 01e2fe20 79e718c6 000002d0 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199 01e2fe70 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117 01e2fe80 7a0d8898 ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17 01e2fea8 7a0d8987 01e2ff00 77e60eb5 01e2fec8 mscorwks!SVR::gc_heap::gc_thread_function+0x58 01e2ffb8 77e6608b 000d5050 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b 01e2ffec 00000000 7a0d88eb 000d5050 00000000 kernel32!BaseThreadStart+0x34
But interestingly enough thread 18 is waiting to suspend all managed threads in order to continue the GC (to avoid for anyone to allocate any more data while it is performing the GC)
0:018> kb ChildEBP RetAddr Args to Child 01defb10 7c821524 77e98ef4 00000f64 01defb34 ntdll!KiFastSystemCallRet 01defb14 77e98ef4 00000f64 01defb34 01defe04 ntdll!NtGetContextThread+0xc 01defb24 7a0de046 00000f64 01defb34 00010002 kernel32!GetThreadContext+0x11 01defe04 7a0defc1 00000f64 1069d328 13aa3858 mscorwks!EnsureThreadIsSuspended+0x3f 01defe4c 7a0e290a 00000000 00000000 13aa3874 mscorwks!Thread::SuspendThread+0xd0 01defe9c 7a086e76 00000000 13aa399c 00000000 mscorwks!Thread::SysSuspendForGC+0x5a6 01deff88 7a0d867b 00000001 00000000 000d4368 mscorwks!SVR::GCHeap::SuspendEE+0x16c 01deffa8 7a0d8987 00000000 13aa39ac 01deffec mscorwks!SVR::gc_heap::gc_thread_function+0x3b 01deffb8 77e6608b 000d4368 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b 01deffec 00000000 7a0d88eb 000d4368 00000000 kernel32!BaseThreadStart+0x34
Normally suspending all managed threads happens in a matter of nanoseconds, and pretty much the only thing that could cause the process to be stuck while suspending is if some thread has disabled preemptive GC, i.e. told the GC that it is in a state where it can't be disturbed...
Yun Jin describes PreemptiveGC like this in one of his posts
Preemptive GC: also very important. In Rotor, this is m_fPreemptiveGCDisabled field of C++ Thread class. It indicates what GC mode the thread is in: "enabled" in the table means the thread is in preemptive mode where GC could preempt this thread at any time; "disabled" means the thread is in cooperative mode where GC has to wait the thread to give up its current work (the work is related to GC objects so it can't allow GC to move the objects around). When the thread is executing managed code (the current IP is in managed code), it is always in cooperative mode; when the thread is in Execution Engine (unmanaged code), EE code could choose to stay in either mode and could switch mode at any time; when a thread are outside of CLR (e.g, calling into native code using interop), it is always in preemptive mode.
In our case, as we can see from the output from my tool the following threads have preemptive GC disabled (which is very uncommon)
23 25 33 34 35 38 56 61
In the !threads output it looks like this (notice the PreEmptive GC column)
0:018> !threads ThreadCount: 59 UnstartedThread: 0 BackgroundThread: 59 PendingThread: 0 DeadThread: 0 Hosted Runtime: yes PreEmptive GC Alloc Lock ID OSID ThreadOBJ State GC Context Domain Count APT Exception 16 1 1008 000d0828 1808220 Enabled 06b79310:06b79db0 000f0728 1 MTA (Threadpool Worker) 20 2 1e7c 000d62d0 b220 Enabled 00000000:00000000 000cd190 0 MTA (Finalizer) 21 3 1d3c 000db120 1220 Enabled 00000000:00000000 000cd190 0 Ukn 22 4 1c38 000ed3c8 80a220 Enabled 00000000:00000000 000cd190 0 MTA (Threadpool Completion Port) 23 5 19ac 0013a520 180b222 Disabled 06c17754:06c18bd8 00158e70 2 MTA (Threadpool Worker) 24 6 1fd4 0014d1a0 b220 Enabled 00000000:00000000 000f0728 1 MTA 25 7 1568 0e81aae0 180b222 Disabled 02cea570:02cec4a8 00158e70 2 MTA (Threadpool Worker) 26 8 1ad0 0e82bd58 b220 Enabled 00000000:00000000 00158e70 0 MTA 27 9 1f04 0e829310 b220 Enabled 032b0cb8:032b2938 00158e70 0 MTA 28 a ffc 0e820d18 b220 Enabled 070173e4:070193c0 00158e70 1 MTA 14 b 1ee8 0e874858 1800220 Enabled 02b34750:02b34c78 000cd190 0 Ukn (Threadpool Worker) 30 d 1080 0e8aff40 b220 Enabled 00000000:00000000 0e87c310 0 MTA 31 e 11ec 0e8d1f28 8801220 Enabled 02b418e4:02b42c78 000cd190 0 MTA (Threadpool Completion Port) 33 f 1f1c 0e8e6420 180b222 Disabled 06bc1fa0:06bc3dcc 00158e70 2 MTA (Threadpool Worker) 34 10 1a6c 0e8e8b20 180b222 Disabled 02c449b8:02c45d74 00158e70 2 MTA (Threadpool Worker) 35 11 710 0e8ee550 180b222 Disabled 06cdde68:06cdea48 00158e70 2 MTA (Threadpool Worker) 36 c 1ae0 0e875228 180b220 Enabled 072bd88c:072bf508 000cd190 0 MTA (Threadpool Worker) 37 12 19c8 00147290 180b220 Enabled 06bb4b70:06bb5dcc 000f0728 1 MTA (Threadpool Worker) 38 13 18f0 0e8901d8 180b222 Disabled 02c49ce8:02c49d74 00158e70 2 MTA (Threadpool Worker) 39 14 19b8 0e8f7338 180b220 Enabled 073c6fbc:073c6fe8 00158e70 1 MTA (Threadpool Worker) 40 15 1308 0e8f8610 180b220 Enabled 02af10a4:02af10ac 00158e70 3 MTA (Threadpool Worker) 41 16 1f38 0e8f99e8 180b220 Enabled 070c7154:070c761c 000cd190 0 MTA (Threadpool Worker) 42 17 1be0 0e8facc0 180b220 Enabled 06ad3a74:06ad4cc8 000cd190 0 MTA (Threadpool Worker) 43 18 1efc 0e8fbf98 180b220 Enabled 0321b11c:0321c938 000cd190 0 MTA (Threadpool Worker) 44 19 1470 0e8fd160 1801220 Enabled 06ef6770:06ef7178 000cd190 0 MTA (Threadpool Worker) 45 1a 150 0e8fe438 180b220 Enabled 02a15a34:02a161bc 000cd190 0 MTA (Threadpool Worker) 46 1b 1b10 0e8ff710 180b220 Enabled 06b216b8:06b22d00 000f0728 1 MTA (Threadpool Worker) 47 1c 1c8c 10690ac0 180b220 Enabled 02be2430:02be3b00 000f0728 1 MTA (Threadpool Worker) 48 1d 16c8 10691ad8 180b220 Enabled 06b31520:06b3189c 000f0728 1 MTA (Threadpool Worker) 49 1e 1a5c 10692e20 180b220 Enabled 02ab3a48:02ab50ac 000f0728 1 MTA (Threadpool Worker) 50 1f 1908 10694168 180b220 Enabled 06b686fc:06b69db0 000f0728 1 MTA (Threadpool Worker) 51 20 284 106954d0 180b220 Enabled 0319d1e4:0319e900 000cd190 0 MTA (Threadpool Worker) 52 21 1d74 10696708 180b220 Enabled 06bac634:06baddcc 000f0728 1 MTA (Threadpool Worker) 53 22 58c 106979c8 180b220 Enabled 02c14784:02c15d58 000f0728 1 MTA (Threadpool Worker) 54 23 1860 10698d10 180b220 Enabled 072e1984:072e3508 00158e70 1 MTA (Threadpool Worker) 55 24 1c9c 1069ae00 180b220 Enabled 071a4784:071a5508 000cd190 0 MTA (Threadpool Worker) 56 25 15c4 1069d328 180b222 Disabled 06bc7174:06bc7dcc 00158e70 2 MTA (Threadpool Worker) 57 26 1968 106a09e0 180b220 Enabled 0321da48:0321e938 000cd190 0 MTA (Threadpool Worker) 58 27 1e10 106a3608 180b220 Enabled 070172c8:070173c0 000cd190 0 MTA (Threadpool Worker) 59 28 1a18 106a6418 180b220 Enabled 02ac036c:02ac10ac 000f0728 1 MTA (Threadpool Worker) 60 29 1d2c 106a9148 180b220 Enabled 06a9b5a4:06a9cc90 000cd190 0 MTA (Threadpool Worker) 61 2a 1b50 106ac0a8 180b222 Disabled 06cdaf94:06cdbda8 00158e70 2 MTA (Threadpool Worker) 62 2b 96c 106ae8d0 8801220 Enabled 02c00980:02c01d3c 000cd190 0 MTA (Threadpool Completion Port) 63 2c 1b18 106afa50 180b220 Enabled 0322214c:03222938 000cd190 0 MTA (Threadpool Worker) 64 2d 1d78 106b2d48 180b220 Enabled 06f26e78:06f26eb4 000cd190 0 MTA (Threadpool Worker) 65 2e 198c 106b5cb8 180b220 Enabled 06ed8418:06ed8c98 000cd190 0 MTA (Threadpool Worker) 66 2f 1fcc 106b8b28 180b220 Enabled 0728d400:0728d508 00158e70 1 MTA (Threadpool Worker) 67 30 1958 106bb998 180b220 Enabled 0304e7e8:03050420 000cd190 0 MTA (Threadpool Worker) 68 31 1b48 106beb58 180b220 Enabled 031a5b98:031a691c 000cd190 0 MTA (Threadpool Worker) 69 32 1d80 106c19c8 180b220 Enabled 06b13444:06b14d00 000f0728 1 MTA (Threadpool Worker) 70 33 17cc 106c4838 180b220 Enabled 0700d468:0700f3a4 000cd190 0 MTA (Threadpool Worker) 71 34 1424 106c8890 180b220 Enabled 06b6f8bc:06b6fdb0 000f0728 1 MTA (Threadpool Worker) 72 35 1e00 106bd870 180b220 Enabled 06b8dcbc:06b8ddb0 000f0728 1 MTA (Threadpool Worker) 73 36 1d08 106c95e0 180b220 Enabled 071a081c:071a1508 000cd190 0 MTA (Threadpool Worker) 74 37 1fdc 106ca680 180b220 Enabled 02c05810:02c05d3c 000cd190 0 MTA (Threadpool Worker) 75 38 1b8c 106cb930 180b220 Enabled 0335f888:0335f88c 00158e70 2 MTA (Threadpool Worker) 76 39 1928 106cc900 880b220 Enabled 072bc2f0:072bd508 000cd190 0 MTA (Threadpool Completion Port) 77 3a 1fd0 106d7270 8801220 Enabled 02c18ca4:02c19d58 000cd190 0 MTA (Threadpool Completion Port) 78 3b 1640 106d7908 180b220 Enabled 0294474c:02945a7c 000cd190 0 MTA (Threadpool Worker)
Why have these threads disabled preemptive GC, not allowing the GC to suspend them, and ultimately blocking our process?
All the blocked threads are sitting in this type of callstack... trying to enter a lock (JIT_MonTryEnter). Normally you would see a thread either owning the lock or waiting in an awarelock like in this post, but here it is just spinning trying to enter the lock...
0:056> kb 2000 ChildEBP RetAddr Args to Child 10b0f4d8 0eb65afc 06bc5448 06bc53f8 22a9c796 mscorwks!JIT_MonTryEnter+0xad WARNING: Frame IP not in any known module. Following frames may be wrong. 10b0f504 69918f30 10b0f560 06bc5448 06bc5448 0xeb65afc 00000000 00000000 00000000 00000000 00000000 System_Web_Services_ni+0x28f30
unfortunately the managed stack is not very helpful in telling us where the lock was taken
0:056> !clrstack OS Thread Id: 0x15c4 (56) ESP EIP 10b0f8f8 79e73eac [ContextTransitionFrame: 10b0f8f8] 10b0f948 79e73eac [GCFrame: 10b0f948] 10b0faa0 79e73eac [ComMethodFrame: 10b0faa0]
and if we run !syncblk we have no active sync blocks, so no help there when it comes to finding out who owns this lock...
0:056> !syncblk Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner ----------------------------- Total 165 CCW 4 RCW 1 ComClassFactory 0 Free 8
seems like we are stuck between a rock and a hard place here...
I have shown the command !dumpstack before... it shows a mixture of a managed and native callstack but it is a raw stack, meaning that it will pretty much just display any addresses on the stack that happen to be pointing to code. This means that !dumpstack will not give a true stack trace (i.e. like !clrstack and kb where if functionA calls funcitonB we will see functionA directly below functionB in the stack) and anything shown by !dumpstack may or may not be correct. Anyways, with that little warning, here comes !dumpstack:)
0:056> !dumpstack OS Thread Id: 0x15c4 (56) Current frame: mscorwks!JIT_MonTryEnter+0xad ChildEBP RetAddr Caller,Callee 10b0f47c 638c7552 (MethodDesc 0x63a539a0 +0x62 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)), calling mscorwks!JIT_MonTryEnter 10b0f480 793463bb (MethodDesc 0x7923bbc8 +0xdb System.Collections.Hashtable..ctor(Int32, Single)), calling mscorwks!JIT_Dbl2IntSSE2 10b0f498 638b36f6 (MethodDesc 0x63a505f0 +0xa6 System.Xml.Schema.SchemaInfo..ctor()), calling (MethodDesc 0x7923bbc8 +0 System.Collections.Hashtable..ctor(Int32, Single)) 10b0f49c 638b3719 (MethodDesc 0x63a505f0 +0xc9 System.Xml.Schema.SchemaInfo..ctor()), calling mscorwks!JIT_Writeable_Thunks_Buf 10b0f4d8 0eb65afc (MethodDesc 0xeb756f0 +0x5c CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage)), calling (MethodDesc 0x63a539a0 +0 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)) 10b0f504 69918f30 (MethodDesc 0x699c4798 +0x3c System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean)) 10b0f518 699227b3 (MethodDesc 0x699c5250 +0x2cb System.Web.Services.Protocols.SoapServerProtocol.Initialize()), calling (MethodDesc 0x699c4798 +0 System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean)) ...
!dumpstack is a bit tricky to deal with, but in the stack above if we start from the bottom of the part of the stack that i've shown we have SoapServerProtocol.Initialize() which calls SoapMessage.RunExtensions...
this we can trust because !dumpstack is actually telling us that Initialize is calling RunExtensions in this case (assuming that we believe that Initialize was called).
RunExtensions doesn't tell us what it is calling but that is because it is calling into a custom component CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage.
This is in turn calling into XmlSchemaSet.Add which is calling into JIT_MonTryEnter. So long story short... the real stacktrack should look like this
mscorwks!JIT_MonTryEnter+0xad System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage)System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean)System.Web.Services.Protocols.SoapServerProtocol.Initialize()...
!sos.clrstack just wasn't able to rebuild it because of the state it was in
Looking through the output from ~* e !clrstack I find one stack that is currently in XmlSchemaSet.Add and thus is probably the one holding the lock we are trying to enter here
OS Thread Id: 0x1308 (40) ESP EIP 104ce878 7c82ed54 [HelperMethodFrame: 104ce878] 104ce8d0 638b46bf System.Xml.Schema.SchemaNames..ctor(System.Xml.XmlNameTable) 104cebd4 638ca6fb System.Xml.Schema.XmlSchemaSet.GetSchemaNames(System.Xml.XmlNameTable) 104cebe0 638c96fc System.Xml.Schema.XmlSchemaSet.PreprocessSchema(System.Xml.Schema.XmlSchema ByRef, System.String) 104cebf8 638c86b7 System.Xml.Schema.XmlSchemaSet.Add(System.String, System.Xml.Schema.XmlSchema) 104cec04 638c7667 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet) 104cec60 0eb65afc CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage) 104cec8c 69918f30 System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean) 104ceca4 699227b3 System.Web.Services.Protocols.SoapServerProtocol.Initialize() 104cece8 6990d904 System.Web.Services.Protocols.ServerProtocolFactory.Create(System.Type, System.Web.HttpContext, System.Web.HttpRequest, System.Web.HttpResponse, Boolean ByRef) 104ced28 699263ab System.Web.Services.Protocols.WebServiceHandlerFactory.CoreGetHandler(System.Type, System.Web.HttpContext, System.Web.HttpRequest, System.Web.HttpResponse) 104ced64 69926329 System.Web.Services.Protocols.WebServiceHandlerFactory.GetHandler(System.Web.HttpContext, System.String, System.String, System.String) 104ced88 65fc057c System.Web.HttpApplication.MapHttpHandler(System.Web.HttpContext, System.String, System.Web.VirtualPath, System.String, Boolean) 104cedcc 65fd58cd System.Web.HttpApplication+MapHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() 104ceddc 65fc1610 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef) 104cee1c 65fd32e0 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception) 104cee6c 65fc0225 System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object) 104cee88 65fc550b System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest) 104ceebc 65fc5212 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest) 104ceec8 65fc3587 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32) 104cf078 79f35ee8 [ContextTransitionFrame: 104cf078] 104cf0c8 79f35ee8 [GCFrame: 104cf0c8] 104cf220 79f35ee8 [ComMethodFrame: 104cf220] 0:040> kb ChildEBP RetAddr Args to Child 104ce5d4 7c822124 77e6bad8 000002e8 00000000 ntdll!KiFastSystemCallRet 104ce5d8 77e6bad8 000002e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 104ce648 79e718fd 000002e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 104ce68c 79e718c6 000002e8 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199 104ce6dc 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117 104ce6ec 7a0d0d0f ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17 104ce710 7a0d5dfa ffffffff 00001037 000d4368 mscorwks!SVR::gc_heap::wait_for_gc_done+0x99 104ce788 7a0d691f 0e8f8650 00000014 00000000 mscorwks!SVR::gc_heap::try_allocate_more_space+0x17 104ce7a8 7a0d7ecc 0e8f8650 00000014 00000000 mscorwks!SVR::gc_heap::allocate_more_space+0x2f 104ce7c8 7a08bd32 0e8f8650 00000014 00000002 mscorwks!SVR::GCHeap::Alloc+0x74 104ce7e4 79e754ff 00000014 00000000 00080000 mscorwks!Alloc+0x60 104ce824 79e755c1 639f59e0 02382edc 02af0090 mscorwks!FastAllocateObject+0x38 104ce8c8 638b46bf 02af0e60 00000000 00000000 mscorwks!JIT_NewFast+0x9e 104ce8cc 02af0e60 00000000 00000000 00000000 System_Xml_ni+0x1146bf WARNING: Frame IP not in any known module. Following frames may be wrong. 104ce8d0 00000000 00000000 00000000 00000000 0x2af0e60
But unfortunately it can't give up this lock and finish what it is doing because it is waiting for the GC to finish, so effectively we are in a deadlock.
Thread 40 holds a lock but to release it it needs the GC to complete. The GC can't complete because it can't suspend thread 56 (and other threads) that have preemptive GC disabled. Thread 56 can't enable preemptive GC until it gets the lock owned by thread 40.
I should add, that the only times I have seen this it has been in the lock in XmlSchemaSet.Add, and under heavy load when multiple threads were trying to access the same XmlSchemaSet.
Solution:
So what can we do about this? Well, a hotfix just came out (KB946644) that will fix this problem so if you are running into this issue you can call into support and ask for that hotfix.
In my earlier cases, the customers have locked around the XmlSchemaSet in the custom component since according to the msdn documentation any instance members of XmlSchemaSet are not guaranteed to be threadsafe. This resolved the issue...
Laters,
Tess
ASP.NET ASP.NET Wiki Beta [Via: Scott Hanselman ] Sharepoint Dev Tip: The SharePoint University of...
Link Listing - February 10, 2008
Tess, is that a CLR hotfix or an XML hotfix? We are seeing the same root problem (gc thread cant suspend due to other threads w/ preemptive disabled waiting on crit secs owned by gc thread) but XML is not in the stacks at all.
this hotfix is specifically for this issue... you might be running into the gc/loaderlock deadlock described here though
http://blogs.msdn.com/tess/archive/2007/03/12/net-hang-case-study-the-gc-loader-lock-deadlock-a-story-of-mixed-mode-dlls.aspx
there are two versions of it, 1. you load up mixed mode (c++) dlls and you either have managed code in dllmain or other managed entry points like static arrays etc. or 2. you are loading up a dll that references a strong named mixed mode dll and you block in the policy resolution, and the load is done using a native loading method like createobject (something that grabs the loaderlock)
in case #1 you have to recompile the dll with /noentry and follow the articles referenced in that post to make sure you have no entry points. in case #2 you should manually load the referenced dll using assembly.load in application_start for example
Sorry for the convoluted answer, the space is a little short in the comments:) I can write a post on case #2 soon if you think you are running into that, but hopefully this helps you in the meantime...
From looking at the stacks, I know 1 thread (#31) is in gc trying to suspend 2 other threads (#17,30) that are in preemptive disabled mode that are trying to enter crit secs owned by the gc thread (#31), hence the deadlock. (note: the non-gc threads waiting on the crit secs are in exception handlers at the time). This is happening in Microsoft CRM 3.0 (which is an asp.net 1.1 app). I am trying to better understand if this can be caused by the crm app code or if that hotfix was for the CLR itself. It looks like an EE crit sec, and my understanding of preemptive is that it can only be set from the CLR unmanaged code. So based on that I am guessing its a CLR bug, but I could be missing something. There is a US MS Case open, would love if you could take a look at the dump I uploaded :) Waiting for TAMs to get thier acts together.
sounds interesting, the support engineer you are working with probably debugs as much as I :) and is probably specializing in CRM, but send me the case number and support engineers name or email address via the contact me section and I'll talk to him or her to see if they need a second opinion...
Tess, we are having exactly the same issue as described in your article and are also working with MS support to resolve it. I'll send you the details and would appreciate if you could give the person we're working with your opinion as they don't seem to be able to find the hotfix you mentioned in the article and we're running out of options to try on our side. This is how the stack looks in our case:
2f8384b4 0a06f23b System_Xml!System.Xml.Schema.SchemaNames..ctor(System.Xml.XmlNameTable)+0x6d1
0a45e7fc 0a043bf0 System_Xml!System.Xml.Schema.XmlSchemaSet.GetSchemaNames(System.Xml.XmlNameTable)+0x43
0a45e7fc 0a043b98 System_Xml!System.Xml.Schema.XmlSchemaSet.PreprocessSchema(System.Xml.Schema.XmlSchema ByRef, System.String)+0x20
0a45e840 0a04f35a System_Xml!System.Xml.Schema.XmlSchemaSet.Add(System.String, System.Xml.Schema.XmlSchema)+0x28
0a45e840 09f6209a System_Xml!System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)+0x19a
As you can see it's almost the exact replica of what you have in the article.
Hi Paul,
The article doesn't seem to be public yet but the hotfix is available, if they search on the KB 946644, if they can't find it they can contact me.
Hi Tess,
It seems like our support person already talked to you and indeed, the hotfix for KB 946644 helped in our case (SP1 by itself didn't). I can't say that the situation is completely reolved though as we're till getting strange errors related to schema validation (102 errors out of 33k+ requests) that look like the following:
System.Xml.Schema.XmlSchemaValidationException: The 'Service' element is not declared. at System.Xml.Schema.XmlSchemaValidator.SendValidationEvent(XmlSchemaValidationException e, XmlSeverityType severity)
at System.Xml.Schema.XmlSchemaValidator.SendValidationEvent(String code, String arg)
at System.Xml.Schema.XmlSchemaValidator.ThrowDeclNotFoundWarningOrError(Boolean declFound)
at System.Xml.Schema.XmlSchemaValidator.ValidateElement(String localName, String namespaceUri, XmlSchemaInfo schemaInfo, String xsiType, String xsiNil, String xsiSchemaLocation, String xsiNoNamespaceSchemaLocation)
at System.Xml.XsdValidatingReader.ProcessElementEvent()
at System.Xml.XsdValidatingReader.ProcessReaderEvent()
at System.Xml.XsdValidatingReader.Read()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
at System.Xml.XmlDocument.Load(XmlReader reader)
Surely enough, the 'Service' element IS declared and the same request is executed just fine right before and right after the one that fails. The element may be different in different errors, but it's always the same error and it's just few lines of code away from the place where we had problems with XMLSchemaSet.Add() call previously. Can it be possibly related to the hotfix? It appears that the schema object gets corrupted in some rare cases (internally), which causes it to loose information about some of the elements that are properly declared in the schema.
Also, we do see abnormal GC patterns that were not observed under 1.1 on the same box with the same application (last time tested three weeks ago). This includes large "% time in GC" (even without any external requests), number of Gen 0 to Gen 1 collections is close to 2:1 rather than to recommended 10:1, Gen 2 Heap size is significantly larger than Gen 0 or Gen 1, number of allocated bytes / sec and number of Gen 0 promoted bytes / sec is large even with zero users and no incoming requests and so on. I sent all the details with graphs and correlations to Michael Noto, who is handling our case (SRX080xxxxx0861).
This difference between 1.1 and 2.0 appears to be somewhat similar to what was reported long time ago by "philippe" in one of the comments to your High CPU in GC article (http://blogs.msdn.com/tess/archive/2006/06/22/643309.aspx). Any insight as to what might be causing it? Thanks much.
Paul.
off the top of my head I can't say that I've seen this before, and I am currently on vacation so I don't have access to my usual resources.
Based on what I know about the hotfix I don't think it is related though as the hotfix deals with blocking the thread for GCs but of course you can never be sure if it causes sideeffects.
You might want to set up a breakpoint and dump on the first occurrence of this System.Xml.Schema.XmlSchemaValidationException so that you can examine the structures, but I would suggest that it is set up in such a way that the breakpoint is disabled after it's been hit once if they are very frequent.
I very frequently get emails like the one I got this morning: "Tess, It sounds like the hotfix for kb946644
I tried to open a support case to get hotfix KB 946644, and was told that this hotfix has been discontinued. Do you know why this would be the case, or if the problem is being addressed in some other way in a future hotfix?
It would be nice to try the hotfix, at least for experimental purposes, to see whether the problem it addresses really is the cause of our symptoms. Is there any other way for us to get a copy?
I had a look and I can't see that it has been discontinued, from the looks of it it will also be included in SP1. I think the confusion might be in that the kb is not released yet, but the hotfix still seems to be there. Unfortunately I can't post the hotfix here, but if you contact your support specialist again they can contact me if they want.