CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

High CPU with Reporting Services 2008

High CPU with Reporting Services 2008

  • Comments 4

I’ve run into two cases which fell down the same path and figured we should have some information out there regarding it.  The issue was that the customer’s instance of Reporting Services would just spike to 100% CPU usage.  During this time the end users did not notice any interruption in service. This occurred because Reporting Services was trying to unload an AppDomain and it timed out. The timeout that occurred ended up putting us into an infinite loop within the CLR ThreadPool Manager.  The only way to correct the issue at that point was to recycle the Reporting Services Service.  In both cases, the issue came down to custom code that they had introduced into the Reporting Services process.  So, if you are just using Reporting Services out of the box and haven’t introduce any custom extensions (Data, Security, Renderer), then this probably won’t help you much.

In the first case, they had a custom security extension that ended up purging a cache periodically.  When this one particular method was called in their extension, they called Thread.Sleep(30000).  This caused our thread to wait for 30 seconds. The default timeout for the AppDomain Unload is 10 seconds.  Under normal circumstances with wouldn’t be an issue. I’ll get into that in a bit.

In the second case, we actually hit several different items. I’ll just highlight the first item.  This ended up waiting on a System.Socket/Winsock call.  This wait went beyond the 10 second AppDomain timeout.

The challenge to this issue was that at the time we go into the CPU spin, the issue had already occurred.  If we get a memory dump, we can recognize the pattern, but we won’t be able to identify the root cause at that point.  The AppDomain timeout is really what we need to focus on, and look at what the other threads are doing at the time the AppDomain timeout occurs.  Unfortunately, I can’t go deep into that on this blog due to the fact that you need private symbols to accomplish certain aspects of that.

What I did in both cases was to get a dump when the CPU spin was occurring.  At that point, I provided a breakpoint that allowed us to capture a dump at the time the AppDomain timeout occurred.  I can’t tell you what the breakpoint will be for every case as it depends on the module’s base address. 

*** NOTE ***  All debugger output listed in this blog post makes use of public symbols.

0:000> lmvm mscorwks
start    end        module name
79e70000 7a400000   mscorwks   (pdb symbols)          d:\symbols\public\mscorwks.pdb\37AFE5AF09D54705B6B685CBCD2208FC2\mscorwks.pdb
    Loaded symbol image file: mscorwks.dll
    Image path: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
    Image name: mscorwks.dll
    Timestamp:        Fri Jul 25 08:58:48 2008 (4889DC18)
    CheckSum:         00597CC4
    ImageSize:        00590000
    File version:     2.0.50727.3053
    Product version:  2.0.50727.3053

In the example above, 79e70000 is the base address of mscorwks.dll within the given memory dump.  The breakpoint was set at an offset of this address. 

bp 7a02b334

If we unassemble that address, we can see that it is a call to mscorwks!RealCOMPlusThrow within the AppDomain::StopEEandUnwindThreads method.  This is what we want.

0:000> u 7a02b334
mscorwks!AppDomain::StopEEAndUnwindThreads+0x407:
7a02b334 e8bee30600      call    mscorwks!RealCOMPlusThrow (7a0996f7)
7a02b339 cc              int     3 <—this happens to be our Debugger Break
7a02b33a 397510          cmp     dword ptr [ebp+10h],esi
7a02b33d 0f849f15eaff    je      mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x5f (79ecc8e2)
7a02b343 c7850cfdffff06000000 mov dword ptr [ebp-2F4h],6
7a02b34d e91fb4f4ff      jmp     mscorwks!AppDomain::RaiseAssemblyResolveEvent+0x84 (79f76771)
7a02b352 33c0            xor     eax,eax
7a02b354 e98cb4f4ff      jmp     mscorwks!AppDomain::RaiseAssemblyResolveEvent+0xf8 (79f767e5)

 

This breakpoint lined up with where the timeout occured based on the module base address. We had applied a CLR hotfix which was for an issue that could produce the same end result.  This KB will put you at a CLR build of 2.0.5727.4043.

FIX: The CPU usage increases significantly when you run a .NET Framework 2.0-based Multi-AppDomain application that uses objects whose finalizers start ThreadPool methods
http://support.microsoft.com/default.aspx?scid=kb;EN-US;971988

After this fix was applied, we can see how it altered the base address of mscorwks.

0:054> lmvm mscorwks
start    end        module name
79e70000 7a400000   mscorwks   (pdb symbols)          d:\symbols\public\mscorwks.pdb\B6E2DA61DC2E4A038DD1DD1819C4E51B2\mscorwks.pdb
    Loaded symbol image file: mscorwks.dll
    Image path: c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll
    Image name: mscorwks.dll
    Timestamp:        Mon Jun 01 23:00:46 2009 (4A24A3EE)
    CheckSum:         005974CF
    ImageSize:        00590000
    File version:     2.0.50727.4043
    Product version:  2.0.50727.4043

So, in this case, the breakpoint I had before for address 7a02b334 looks like the following.

0:054> u 7a02b334
mscorwks!SystemDomain::NotifyProfilerStartup+0xcb:
7a02b334 3b7a10          cmp     edi,dword ptr [edx+10h]
7a02b337 7426            je      mscorwks!SystemDomain::NotifyProfilerStartup+0xf6 (7a02b35f)
7a02b339 a19c333b7a      mov     eax,dword ptr [mscorwks!SystemDomain::m_pSystemDomain (7a3b339c)]
7a02b33e 8b800c030000    mov     eax,dword ptr <Unloaded_rtingServicesFileShareDelive>+0x2cb (0000030c)[eax]
7a02b344 8b0da44a3b7a    mov     ecx,dword ptr [mscorwks!g_profControlBlock+0x8 (7a3b4aa4)]
7a02b34a 8b31            mov     esi,dword ptr [ecx]
7a02b34c 6a00            push    0
7a02b34e 50              push    eax

That won’t get us what we need.  What also made this easier for us was that we were running on Windows 2003.  Starting in Vista/Windows 2008, a change was introduced in the kernel that will randomize a module’s base address, which would make this breakpoint inconsistent from each process start.  This was called Address Space Load Randomization and is discussed in the following article here.

Enter SQL OS

Starting with Reporting Services 2008, we are using the SQL OS bits that we use within SQL Engine.  This allows us to have better control over resource management.  This includes thread scheduling, memory management, endpoint hosting, etc…  We also host the .NET CLR with SQL OS.  This in combination with our endpoint hosting with SQL Network Interface (SNI) allowed us to split from IIS.

SQL OS is really why end users don’t appear to be affected when we hit the high CPU issue.  The Thread Manager within SQL OS allows threads to not be starved (assuming we haven’t gone pre-emptive). So, we still switch to other threads to allow them to process work.

Going back to the Thread.Sleep issue, when an AppDomain goes to unload, it will try to wake a thread up through the Alertable mechanism (APC Queue).  However, SQL OS wraps the call to Sleep and does not use the Alertable mechanism which is why the threads stick around and cause the AppDomain to timeout.  But be aware, that this is not the only way to enter into this situation.  Other calls could enter into a wait type call (Sleep, WaitForSingleObject, WaitForMultipleObjects, etc…) and have nothing to do with SQL OS as was evident in the Winsock issue that we had. 

How do you recognize that this is the issue you are hitting?

Once we have a dump, you will see a spinning pattern that looks similar to the following:

0:040> k
Child-SP          RetAddr           Call Site
00000000`0815f820 00000642`7f4a6e1a mscorwks!PerAppDomainTPCountList::GetAppDomainIndexForThreadpoolDispatch+0x13
00000000`0815f880 00000642`7f4a289a mscorwks!ThreadpoolMgr::ExecuteWorkRequest+0xe
00000000`0815f8e0 00000000`0110efcf mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
00000000`0815f980 00000000`011102ca ReportingServicesService!CallProtectorImpl::CallExternalTrustedFull<ClrEntryPointCallTraits,unsigned long,FunctionCallBinder_1<unsigned long,unsigned long (__cdecl*)(void * __ptr64),void * __ptr64> >+0x13f
00000000`0815fa20 00000000`01110849 ReportingServicesService!CHostTask::TaskExecute+0xca
00000000`0815faa0 00000000`01087dda ReportingServicesService!CHostTask::TaskProcWrapper+0x9
00000000`0815fad0 00000000`01088a49 ReportingServicesService!SOS_Task::Param::Execute+0x14a
00000000`0815fbe0 00000000`0108b93d ReportingServicesService!SOS_Scheduler::RunTask+0x99
00000000`0815fc40 00000000`010b5b88 ReportingServicesService!SOS_Scheduler::ProcessTasks+0x1ad
00000000`0815fca0 00000000`010b72ac ReportingServicesService!SchedulerManager::WorkerEntryPoint+0x278
00000000`0815fd70 00000000`010b829f ReportingServicesService!SystemThread::RunWorker+0xcc
00000000`0815fda0 00000000`010b9875 ReportingServicesService!SystemThreadDispatcher::ProcessWorker+0x44f
00000000`0815fe60 00000000`781337d7 ReportingServicesService!SchedulerManager::ThreadEntryPoint+0x275
00000000`0815ff20 00000000`78133894 msvcr80!endthreadex+0x47
00000000`0815ff50 00000000`77d6b71a msvcr80!endthreadex+0x104
00000000`0815ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

0:041> k
Child-SP          RetAddr           Call Site
00000000`084bf758 00000000`77d5d6df ntdll!NtQueryPerformanceCounter+0xa
00000000`084bf760 00000642`7f593c6e kernel32!QueryPerformanceCounter+0xf
00000000`084bf790 00000642`7f8a5692 mscorwks!getTimeStamp+0x16
00000000`084bf7c0 00000642`7f848696 mscorwks!__DangerousSwitchToThread+0x72
00000000`084bf830 00000642`7f4a289a mscorwks!`string'+0xaebe6
00000000`084bf8e0 00000000`0110efcf mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
00000000`084bf980 00000000`011102ca ReportingServicesService!CallProtectorImpl::CallExternalTrustedFull<ClrEntryPointCallTraits,unsigned long,FunctionCallBinder_1<unsigned long,unsigned long (__cdecl*)(void * __ptr64),void * __ptr64> >+0x13f
00000000`084bfa20 00000000`01110849 ReportingServicesService!CHostTask::TaskExecute+0xca
00000000`084bfaa0 00000000`01087dda ReportingServicesService!CHostTask::TaskProcWrapper+0x9
00000000`084bfad0 00000000`01088a49 ReportingServicesService!SOS_Task::Param::Execute+0x14a
00000000`084bfbe0 00000000`0108b93d ReportingServicesService!SOS_Scheduler::RunTask+0x99
00000000`084bfc40 00000000`010b5b88 ReportingServicesService!SOS_Scheduler::ProcessTasks+0x1ad
00000000`084bfca0 00000000`010b72ac ReportingServicesService!SchedulerManager::WorkerEntryPoint+0x278
00000000`084bfd70 00000000`010b829f ReportingServicesService!SystemThread::RunWorker+0xcc
00000000`084bfda0 00000000`010b9875 ReportingServicesService!SystemThreadDispatcher::ProcessWorker+0x44f
00000000`084bfe60 00000000`781337d7 ReportingServicesService!SchedulerManager::ThreadEntryPoint+0x275
00000000`084bff20 00000000`78133894 msvcr80!endthreadex+0x47
00000000`084bff50 00000000`77d6b71a msvcr80!endthreadex+0x104
00000000`084bff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

On the threads in question, you could also see high Kernel time along with high user mode time.  You’ll notice on the stack above kernel32!QueryPerformanceCounter.  This is why we are seeing the Kernel time.  It just happens to be called as part of the code path in the ThreadManager.  It is not really part of the issue.

From a testing perspective, you would also force the AppDomains within Reporting Services to recycle more frequently than the default.  The default is 12 hours.  This is controlled by the RecycleTime property within the rsreportserver.config file and is explained in more detail in the following Books Online article.

Application Domains for Report Server Applications
http://technet.microsoft.com/en-us/library/bb934330.aspx

In the second case, we ended up setting this for 5 minutes to get it to reproduce more quickly.

Summary

The point I wanted to get out, was that when you are introducing your own code into Reporting Services, be aware of what that code is doing.  If it introduces any type of blocking situation at the native level that cannot be interrupted, you could get into this situation. CSS can help to identify what may be causing the issue within your code at which point you may have to introduce a code change on your end.

This issue is purely a timing issue.  Meaning that it may be inconsistent and hard to reproduce.  The RecycleTime setting listed above could help from a testing perspective or to try and repro it after you have identified it, but it is not a guarantee.  Also, anything introduced to the process could alter the timing of the application.  This could be anything from new/updated code to hardware – disk I/O for example or load on the CPU.

If you do encounter this issue, I would also recommend you apply KB 971988, mentioned above, to rule out that issue.  If you call into CSS, we are probably going to ask you to apply it as well, if you haven’t already.

UPDATE:

This issue was corrected in the following Hotfix.

A .NET Framework 2.0-based Multi-AppDomain application stops responding when you run the application
http://support.microsoft.com/kb/979744

Adam W. Saxton | Microsoft SQL Server Escalation Services

Leave a Comment
  • Please add 4 and 3 and type the answer here:
  • Post
  • Adam,

    Thank you for the post.  We have been experiencing a similar issue but we are using MSRS 2008 pretty much straight out of the box.  We upgraded from MSRS 2005 in October 2009 and we have had four incidents in the last two weeks where the CPU jumps to 100% and remains there until the server is rebooted or the service is restarted.  

    Although the CPU spikes start at a time when reports are running, the CPU will remain at 100% even after the reports have finished and there is nothing in the "queue" (ReportServer..Event table).  The first time (before we had a monitor in place), the 100% CPU condition lasted for 72 hours.  Task manager shows that the ReportingServicesService.exe process is the culprit, consuming 99-100% of the CPU.

    We are running Reporting Services in a multi-tier topology with the Reporting Services server separated from the RSDB report catalog server.  The sources of the reporting data are varied and exist on remote servers, including SQL Server 2000, 2005, and Oracle.

    Both Reporting Services servers are Intel Xeon 2.5-2.83 GHz dual quad core servers with 64-bit Windows 2008 Enterprise (SP1) without Hyper-V.  Each server has 32 GB of memory.  The SQL Server with the ReportServer databases is SQL 2008 (SP1) standard edition.  The Reporting Services installation is 2008 (SP1) enterprise edition.

    Since this has happened three times this week, I expect it to happen again.  Do you have any steps that you would recommend to help me troubleshoot the issue?

    Thanks

  • Same symptom right out of the box - development laptop, Win7pro 64bit, VS2008sp1+updates, SQLServer2008 updated to SP1. Just finished installing all the software (MSDN/BizSpark) and getting all the Windows Update patches up to Feb.1, reboots included.

    On next startup, from suspend, the fan noise prompted me to check - had to drill from Task Manager to Resource Monitor and there was ReportingServicesService.exe pegging the dual core AMD CPU for no apparent reason.

    Just another data point, hope you can track it down. Thanks.

  • Just hit this (or something with the same symptoms) on a Win7ent/Sql2008Dev 32-bit after resuming from suspend.  I don't have any custom anything on Report Services though.

  • I have a notebook PC running windows 7 professional 32 bit. I recently downloaded the free version of SQL Server 2008 with all the extras and examples, but have only been playing around with it, writing a few views and database procedures to get the hang of it.

    I have never used the reporting service, and rarely use SQL Server, but I frequently find that the reporting service is using all the available CPU on my machine, even though I have not started any application that uses SQL Server. It is easy enough to stop the process using task manager, but I would prefer this problem not to happen in the first place.

    Please could you tell me what I can do to stop it, short of uninstalling the reporting service.

    Alan Watson

Page 1 of 1 (4 items)