I’ve run into two cases which fell down the same path and figured we should have some information out there regarding it. The issue was that the customer’s instance of Reporting Services would just spike to 100% CPU usage. During this time the end users did not notice any interruption in service. This occurred because Reporting Services was trying to unload an AppDomain and it timed out. The timeout that occurred ended up putting us into an infinite loop within the CLR ThreadPool Manager. The only way to correct the issue at that point was to recycle the Reporting Services Service. In both cases, the issue came down to custom code that they had introduced into the Reporting Services process. So, if you are just using Reporting Services out of the box and haven’t introduce any custom extensions (Data, Security, Renderer), then this probably won’t help you much.
In the first case, they had a custom security extension that ended up purging a cache periodically. When this one particular method was called in their extension, they called Thread.Sleep(30000). This caused our thread to wait for 30 seconds. The default timeout for the AppDomain Unload is 10 seconds. Under normal circumstances with wouldn’t be an issue. I’ll get into that in a bit.
In the second case, we actually hit several different items. I’ll just highlight the first item. This ended up waiting on a System.Socket/Winsock call. This wait went beyond the 10 second AppDomain timeout.
The challenge to this issue was that at the time we go into the CPU spin, the issue had already occurred. If we get a memory dump, we can recognize the pattern, but we won’t be able to identify the root cause at that point. The AppDomain timeout is really what we need to focus on, and look at what the other threads are doing at the time the AppDomain timeout occurs. Unfortunately, I can’t go deep into that on this blog due to the fact that you need private symbols to accomplish certain aspects of that.
What I did in both cases was to get a dump when the CPU spin was occurring. At that point, I provided a breakpoint that allowed us to capture a dump at the time the AppDomain timeout occurred. I can’t tell you what the breakpoint will be for every case as it depends on the module’s base address.
*** NOTE *** All debugger output listed in this blog post makes use of public symbols.
0:000> lmvm mscorwks
start end module name
79e70000 7a400000 mscorwks (pdb symbols) d:\symbols\public\mscorwks.pdb\37AFE5AF09D54705B6B685CBCD2208FC2\mscorwks.pdb
Loaded symbol image file: mscorwks.dll
Image path: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
Image name: mscorwks.dll
Timestamp: Fri Jul 25 08:58:48 2008 (4889DC18)
File version: 2.0.50727.3053
Product version: 2.0.50727.3053
In the example above, 79e70000 is the base address of mscorwks.dll within the given memory dump. The breakpoint was set at an offset of this address.
If we unassemble that address, we can see that it is a call to mscorwks!RealCOMPlusThrow within the AppDomain::StopEEandUnwindThreads method. This is what we want.
0:000> u 7a02b334
7a02b334 e8bee30600 call mscorwks!RealCOMPlusThrow (7a0996f7)
7a02b339 cc int 3 <—this happens to be our Debugger Break
7a02b33a 397510 cmp dword ptr [ebp+10h],esi
7a02b33d 0f849f15eaff je mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x5f (79ecc8e2)
7a02b343 c7850cfdffff06000000 mov dword ptr [ebp-2F4h],6
7a02b34d e91fb4f4ff jmp mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x84 (79f76771)
7a02b352 33c0 xor eax,eax
7a02b354 e98cb4f4ff jmp mscorwks!AppDomain::RaiseAssemblyResolveEvent+0xf8 (79f767e5)
This breakpoint lined up with where the timeout occured based on the module base address. We had applied a CLR hotfix which was for an issue that could produce the same end result. This KB will put you at a CLR build of 2.0.5727.4043.
FIX: The CPU usage increases significantly when you run a .NET Framework 2.0-based Multi-AppDomain application that uses objects whose finalizers start ThreadPool methods
After this fix was applied, we can see how it altered the base address of mscorwks.
0:054> lmvm mscorwks
start end module name
79e70000 7a400000 mscorwks (pdb symbols) d:\symbols\public\mscorwks.pdb\B6E2DA61DC2E4A038DD1DD1819C4E51B2\mscorwks.pdb
Loaded symbol image file: mscorwks.dll
Image path: c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
Image name: mscorwks.dll
Timestamp: Mon Jun 01 23:00:46 2009 (4A24A3EE)
File version: 2.0.50727.4043
Product version: 2.0.50727.4043
So, in this case, the breakpoint I had before for address 7a02b334 looks like the following.
0:054> u 7a02b334
7a02b334 3b7a10 cmp edi,dword ptr [edx+10h]
7a02b337 7426 je mscorwks!SystemDomain::NotifyProfilerStartup+0xf6 (7a02b35f)
7a02b339 a19c333b7a mov eax,dword ptr [mscorwks!SystemDomain::m_pSystemDomain (7a3b339c)]
7a02b33e 8b800c030000 mov eax,dword ptr <Unloaded_rtingServicesFileShareDelive>+0x2cb (0000030c)[eax]
7a02b344 8b0da44a3b7a mov ecx,dword ptr [mscorwks!g_profControlBlock+0x8 (7a3b4aa4)]
7a02b34a 8b31 mov esi,dword ptr [ecx]
7a02b34c 6a00 push 0
7a02b34e 50 push eax
That won’t get us what we need. What also made this easier for us was that we were running on Windows 2003. Starting in Vista/Windows 2008, a change was introduced in the kernel that will randomize a module’s base address, which would make this breakpoint inconsistent from each process start. This was called Address Space Load Randomization and is discussed in the following article here.
Enter SQL OS
Starting with Reporting Services 2008, we are using the SQL OS bits that we use within SQL Engine. This allows us to have better control over resource management. This includes thread scheduling, memory management, endpoint hosting, etc… We also host the .NET CLR with SQL OS. This in combination with our endpoint hosting with SQL Network Interface (SNI) allowed us to split from IIS.
SQL OS is really why end users don’t appear to be affected when we hit the high CPU issue. The Thread Manager within SQL OS allows threads to not be starved (assuming we haven’t gone pre-emptive). So, we still switch to other threads to allow them to process work.
Going back to the Thread.Sleep issue, when an AppDomain goes to unload, it will try to wake a thread up through the Alertable mechanism (APC Queue). However, SQL OS wraps the call to Sleep and does not use the Alertable mechanism which is why the threads stick around and cause the AppDomain to timeout. But be aware, that this is not the only way to enter into this situation. Other calls could enter into a wait type call (Sleep, WaitForSingleObject, WaitForMultipleObjects, etc…) and have nothing to do with SQL OS as was evident in the Winsock issue that we had.
How do you recognize that this is the issue you are hitting?
Once we have a dump, you will see a spinning pattern that looks similar to the following:
Child-SP RetAddr Call Site
00000000`0815f820 00000642`7f4a6e1a mscorwks!PerAppDomainTPCountList::GetAppDomainIndexForThreadpoolDispatch+0x13
00000000`0815f880 00000642`7f4a289a mscorwks!ThreadpoolMgr::ExecuteWorkRequest+0xe
00000000`0815f8e0 00000000`0110efcf mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
00000000`0815f980 00000000`011102ca ReportingServicesService!CallProtectorImpl::CallExternalTrustedFull<ClrEntryPointCallTraits,unsigned long,FunctionCallBinder_1<unsigned long,unsigned long (__cdecl*)(void * __ptr64),void * __ptr64> >+0x13f
00000000`0815fa20 00000000`01110849 ReportingServicesService!CHostTask::TaskExecute+0xca
00000000`0815faa0 00000000`01087dda ReportingServicesService!CHostTask::TaskProcWrapper+0x9
00000000`0815fad0 00000000`01088a49 ReportingServicesService!SOS_Task::Param::Execute+0x14a
00000000`0815fbe0 00000000`0108b93d ReportingServicesService!SOS_Scheduler::RunTask+0x99
00000000`0815fc40 00000000`010b5b88 ReportingServicesService!SOS_Scheduler::ProcessTasks+0x1ad
00000000`0815fca0 00000000`010b72ac ReportingServicesService!SchedulerManager::WorkerEntryPoint+0x278
00000000`0815fd70 00000000`010b829f ReportingServicesService!SystemThread::RunWorker+0xcc
00000000`0815fda0 00000000`010b9875 ReportingServicesService!SystemThreadDispatcher::ProcessWorker+0x44f
00000000`0815fe60 00000000`781337d7 ReportingServicesService!SchedulerManager::ThreadEntryPoint+0x275
00000000`0815ff20 00000000`78133894 msvcr80!endthreadex+0x47
00000000`0815ff50 00000000`77d6b71a msvcr80!endthreadex+0x104
00000000`0815ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
Child-SP RetAddr Call Site
00000000`084bf758 00000000`77d5d6df ntdll!NtQueryPerformanceCounter+0xa
00000000`084bf760 00000642`7f593c6e kernel32!QueryPerformanceCounter+0xf
00000000`084bf790 00000642`7f8a5692 mscorwks!getTimeStamp+0x16
00000000`084bf7c0 00000642`7f848696 mscorwks!__DangerousSwitchToThread+0x72
00000000`084bf830 00000642`7f4a289a mscorwks!`string'+0xaebe6
00000000`084bf8e0 00000000`0110efcf mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
00000000`084bf980 00000000`011102ca ReportingServicesService!CallProtectorImpl::CallExternalTrustedFull<ClrEntryPointCallTraits,unsigned long,FunctionCallBinder_1<unsigned long,unsigned long (__cdecl*)(void * __ptr64),void * __ptr64> >+0x13f
00000000`084bfa20 00000000`01110849 ReportingServicesService!CHostTask::TaskExecute+0xca
00000000`084bfaa0 00000000`01087dda ReportingServicesService!CHostTask::TaskProcWrapper+0x9
00000000`084bfad0 00000000`01088a49 ReportingServicesService!SOS_Task::Param::Execute+0x14a
00000000`084bfbe0 00000000`0108b93d ReportingServicesService!SOS_Scheduler::RunTask+0x99
00000000`084bfc40 00000000`010b5b88 ReportingServicesService!SOS_Scheduler::ProcessTasks+0x1ad
00000000`084bfca0 00000000`010b72ac ReportingServicesService!SchedulerManager::WorkerEntryPoint+0x278
00000000`084bfd70 00000000`010b829f ReportingServicesService!SystemThread::RunWorker+0xcc
00000000`084bfda0 00000000`010b9875 ReportingServicesService!SystemThreadDispatcher::ProcessWorker+0x44f
00000000`084bfe60 00000000`781337d7 ReportingServicesService!SchedulerManager::ThreadEntryPoint+0x275
00000000`084bff20 00000000`78133894 msvcr80!endthreadex+0x47
00000000`084bff50 00000000`77d6b71a msvcr80!endthreadex+0x104
00000000`084bff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
On the threads in question, you could also see high Kernel time along with high user mode time. You’ll notice on the stack above kernel32!QueryPerformanceCounter. This is why we are seeing the Kernel time. It just happens to be called as part of the code path in the ThreadManager. It is not really part of the issue.
From a testing perspective, you would also force the AppDomains within Reporting Services to recycle more frequently than the default. The default is 12 hours. This is controlled by the RecycleTime property within the rsreportserver.config file and is explained in more detail in the following Books Online article.
Application Domains for Report Server Applications
In the second case, we ended up setting this for 5 minutes to get it to reproduce more quickly.
The point I wanted to get out, was that when you are introducing your own code into Reporting Services, be aware of what that code is doing. If it introduces any type of blocking situation at the native level that cannot be interrupted, you could get into this situation. CSS can help to identify what may be causing the issue within your code at which point you may have to introduce a code change on your end.
This issue is purely a timing issue. Meaning that it may be inconsistent and hard to reproduce. The RecycleTime setting listed above could help from a testing perspective or to try and repro it after you have identified it, but it is not a guarantee. Also, anything introduced to the process could alter the timing of the application. This could be anything from new/updated code to hardware – disk I/O for example or load on the CPU.
If you do encounter this issue, I would also recommend you apply KB 971988, mentioned above, to rule out that issue. If you call into CSS, we are probably going to ask you to apply it as well, if you haven’t already.
This issue was corrected in the following Hotfix.
A .NET Framework 2.0-based Multi-AppDomain application stops responding when you run the application
Adam W. Saxton | Microsoft SQL Server Escalation Services