Just thought I would summarize the concepts of the SOS Scheduler and a Deadlocked Scheduler, since many have asked over the years for a brief overview of these topics.
UMS/SOS Scheduler in SQL Server
Since SQL Server 7.0, SQL Server has used its own scheduling mechanism, called UMS (User-Mode Scheduler) in 7.0 and 2000 and later renamed to SOS (SQL on OS scheduler). UMS/SOS is a cooperative (non-preemptive) scheduler, which means that it relies on threads to voluntarily give up the CPU - yield - to the next thread waiting in line. That means there are locations in the SQL Server code where the Microsoft developers built in yield points, causing execution to "pause" and gracefully let another citizen of the SQL kingdom exercise its right to execute. Of course, once a thread "leaves" the SQL Server SOS kingdom, it is still handed off to the Windows preemptive scheduler (ruled by another master, so to speak). The goal of SOS is for SQL Server to expose only one thread at a time per CPU and thus minimize competition among SQL threads at the OS level. The ultimate goal is to reduce the very expensive kernel-mode context switching from one thread to another.
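You can actually see these schedulers and their one-per-CPU mapping by querying the sys.dm_os_schedulers DMV. Here is a minimal example (the DMV and columns are standard; 'VISIBLE ONLINE' filters to the schedulers that run user requests):

SELECT scheduler_id, cpu_id, status,
       current_tasks_count, runnable_tasks_count   -- tasks assigned / waiting to run
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';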
Since SQL Server uses a cooperative scheduler, it relies on the good heart of each developer who writes code in SQL Server to call a Yield() function of some kind that prevents the thread from monopolizing the CPU. But even the best-intentioned developer can make a mistake (and introduce a bug in the code), or a SQL thread can be at the mercy of some external component - like calling into code outside of SQL Server. Because of that, SOS has a dedicated thread - the Scheduler Monitor - that periodically checks the state of each scheduler and reports any "irregularities". Examples of irregularities include a thread not yielding voluntarily (a non-yielding scheduler) or all schedulers being "stuck" and not processing any requests (deadlocked schedulers). Typically the Scheduler Monitor thread reports problems every 60 seconds, though it checks the IOCP thread that accepts connections every 15 seconds.
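For a non-yielding worker, for example, you will find a message like this in the Errorlog (error 17883; the wording below is approximate and the values are placeholders - both vary by version and occurrence):

Process 0:0:0 (0x1a2c) Worker 0x00000000038EA160 appears to be non-yielding on Scheduler 1. Thread creation time: 13046821271525. Approx Thread CPU Used: kernel 0 ms, user 65406 ms. Process Utilization 50%. System Idle 45%. Interval: 70078 ms.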
An entire white paper could be written about how the SOS scheduler works, but for this discussion I will limit the information to some summary points. Actually, a white paper has already been written - see the SOS Scheduler white paper.
A deadlocked scheduler scenario is reported when ALL SOS schedulers are found to be "stuck". Technically, this means that since the last Scheduler Monitor check - say, about 60 seconds earlier - nothing has changed and no work is being processed. Here are some of the things checked:
If any one of these conditions is met (not all of them), the Scheduler Monitor declares the schedulers stuck - deadlocked.
A note on NUMA: SQL Server can report Deadlocked Schedulers if all the SOS schedulers on a single NUMA node are stuck, even if other NUMA nodes are processing tasks just fine.
When the Scheduler Monitor detects a "deadlocked schedulers" condition, it reports an error in the Errorlog and triggers a memory dump, generated by SQLDumper.exe against the SQL Server process.
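The Errorlog entry reported for this condition is error 17884. It typically reads like the following (the wording is approximate and varies by version; the node number is a placeholder):

Error: 17884, Severity: 1, State: 0.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 60 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries.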
What "Deadlocked Schedulers" is not?
The phrase "deadlocked schedulers" has confused many, and rightly so. The term does NOT imply a classic Lock-Manager deadlock where multiple sessions try to acquire locks in the opposite order and permanently block each other. The term applies to SOS schedulers and the fact that none of them are processing queries, logins, etc. So if you are thinking of "traditional" lock deadlocks, you can steer away from such thoughts.
Most Common Causes of Deadlocked Schedulers:
Even though many conditions can cause all schedulers to be stuck at the same time, in 95% of the memory dumps I have looked at, the reason has been that the majority of worker threads were tied up waiting for a lock resource. Yes, the good ol' blocking chain with a head blocker that you are so familiar with is the most common example. This causes the pool of available SQL worker threads to be exhausted: the existing workers are tied up waiting for locks, and when new tasks (work) are submitted to SQL Server, there are no worker threads available to pick them up and execute them. Of course, locks are not the only resource you will see contention on; other resources - latches and spinlocks, for example - have caused similarly long "blocking" (I use the term loosely here) chains.
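A quick way to see whether the worker pool is exhausted is to compare the configured worker thread ceiling to the number of workers currently in use. Here is a minimal sketch using standard DMVs (run it from the DAC if normal connections hang):

SELECT i.max_workers_count,                                    -- configured ceiling
       (SELECT COUNT(*) FROM sys.dm_os_workers) AS workers_in_use
FROM sys.dm_os_sys_info AS i;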
How to Troubleshoot Deadlocked Schedulers
Some deadlocked scheduler scenarios will resolve by themselves - as in the case of parallel queries, a long-held I/O latch or, rarely, blocking. Orphaned structures like spinlocks or latches require a SQL Server restart. In most realistic cases, you will become aware of the issue only after it has been going on for a while or after it has resolved itself; in those cases, you will be focusing on post-mortem analysis of the problem. However, in rare cases, you may be able to take action while the problem is happening and resolve it (see Using the DAC connection below).
There are two ways to find the root cause of deadlocked schedulers:
1. Using the DAC connection and examining some DMVs
2. Examining a memory dump
1. Troubleshooting Deadlocked Schedulers using the DAC connection
In essence, the sole purpose of the Dedicated Admin Connection (DAC) is to troubleshoot server problems like deadlocked schedulers - that is, if you catch the issue "live", while it is happening. You can find details on how to connect to SQL Server using DAC here. DAC is most effective for investigating and resolving deadlocked scheduler problems when the issue is caused by blocking. Luckily, that is the most common cause, so DAC can be an effective tool.
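For example, from a command prompt on the server itself (assuming Windows authentication and a default instance - substitute your own server name):

sqlcmd -S MyServer -E -A

The -A switch requests the dedicated admin connection, and -E uses Windows authentication.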
So, once you connect with DAC (I would suggest using sqlcmd -A instead of SSMS, since the latter may attempt multiple connections to SQL Server), you can investigate a few things (a sample query covering all three follows the list):
1. The state of all schedulers - sys.dm_os_schedulers - and specifically whether there is any queued work waiting to be picked up.
2. Whether there are hundreds of blocked requests - sys.dm_exec_requests with wait_type = LCK_* (any type of lock wait).
3. The head blocker - identified from sys.dm_exec_requests by tracing which session is blocking all the others - which you can then terminate with the KILL command.
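Here is a minimal sketch that covers all three checks (the DMVs and columns are standard; the last query simply looks for sessions that block others but are not blocked themselves):

-- 1. Scheduler state: is work queued up with no worker to run it?
SELECT scheduler_id, current_tasks_count, runnable_tasks_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- 2. How many requests are stuck on lock waits?
SELECT COUNT(*) AS lock_waiters
FROM sys.dm_exec_requests
WHERE wait_type LIKE 'LCK%';

-- 3. Candidate head blockers: blocking others, not blocked themselves
SELECT DISTINCT r.blocking_session_id AS head_blocker
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
  AND r.blocking_session_id NOT IN
      (SELECT session_id FROM sys.dm_exec_requests WHERE blocking_session_id <> 0);

Once you have confirmed the head blocker, KILL <spid> will release the chain - keep in mind that KILL rolls back the head blocker's open transaction, which itself can take a while.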
Note: If the issue is caused by an orphaned latch or orphaned spinlock, then the only way to get out of this situation is to restart SQL Server.
2. Troubleshooting Deadlocked Schedulers by Examining a Memory Dump
Most commonly, the best way to find the root cause of deadlocked schedulers is to figure out what the majority of worker threads are stuck behind. And the way to do this is by examining the memory dumps that SQL Server automatically generates for this purpose.
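If you have not done this before, a typical starting point in WinDbg looks like this (a sketch - the symbol path below points at the public Microsoft symbol server with a local cache at c:\symbols):

.sympath srv*c:\symbols*https://msdl.microsoft.com/download/symbols
.reload /f
~* kn

The ~* kn command dumps the call stack of every thread, which lets you spot the pattern that the majority of workers share.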
Here are some examples of stacks that you will observe if you were to examine a memory dump (details on how to do so using public symbols can be found here).
A typical call stack of a thread that is stuck on a lock wait will include the LockOwner::Sleep() function.
You will observe that most threads that are stuck behind a latch have the LatchBase::Suspend() call in their stack. Here is an example:
ntdll!ZwWaitForSingleObject+0xa
kernel32!WaitForSingleObjectEx+0x130
sqlservr!SOS_Scheduler::SwitchContext+0x623
sqlservr!SOS_Scheduler::SuspendNonPreemptive+0xc5
sqlservr!EventInternal<Spinlock<149,1,0> >::Wait+0x428
sqlservr!LatchBase::Suspend+0x42c
sqlservr!LatchBase::AcquireInternal+0xfa
sqlservr!CTraceController::FAcquireLock+0x53
sqlservr!CTraceController::CAutoLock::FAcquire+0x18
sqlservr!CTraceController::ProduceRecord+0xb58456
sqlservr!CRpcTraceHelper::TracePostRPC+0x347
sqlservr!CRpcTraceHelper::TracePostExec+0xfc8b44
sqlservr!CRPCExecEnv::OnExecFinish+0xaf
sqlservr!process_request+0x399
sqlservr!process_commands+0x1ba
sqlservr!SOS_Task::Param::Execute+0x11b
sqlservr!SOS_Scheduler::RunTask+0xca
sqlservr!SOS_Scheduler::ProcessTasks+0x95
sqlservr!SchedulerManager::WorkerEntryPoint+0x110
sqlservr!SystemThread::RunWorker+0x60
sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
msvcr80!_callthreadstartex+0x17
msvcr80!_threadstartex+0x84
kernel32!BaseThreadStart+0x3a
Here is what a stack might look like from a thread stuck behind a spinlock - note the SpinlockBase::Sleep() call:
ntdll!ZwDelayExecution+0xa
kernel32!SleepEx+0xaf
sqlservr!SpinlockBase::Sleep+0x15d5a1
sqlservr!SpinlockBase::SpinToAcquire+0x11b
sqlservr!TSyncHashTable_EntryAccessor<...>::TSyncHashTable_EntryAccessor<...>+0x76
sqlservr!CQSIndexStatsMgr::UsedFor+0x68
sqlservr!CQueryScan::Uncache+0x4788a6
sqlservr!CXStmtQuery::SetupQueryScanAndExpression+0x252
sqlservr!CXStmtQuery::ErsqExecuteQuery+0x232
sqlservr!CXStmtSelect::XretExecute+0x110
sqlservr!CExecStmtLoopVars::ExecuteXStmtAndSetXretReturn+0x13
sqlservr!CMsqlExecContext::ExecuteStmts<1,0>+0x9e5
sqlservr!CMsqlExecContext::FExecute+0x4c5270
sqlservr!CSQLSource::Execute+0x36d
sqlservr!ExecuteSql+0x700
sqlservr!CSpecProc::ExecuteSpecial+0x4c7
sqlservr!CSpecProc::Execute+0x1d5
sqlservr!process_request+0x312
sqlservr!process_commands+0x1c4
sqlservr!SOS_Task::Param::Execute+0xee
sqlservr!SOS_Scheduler::RunTask+0xc9
sqlservr!SOS_Scheduler::ProcessTasks+0xb4
sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
sqlservr!SystemThread::RunWorker+0x59
sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
sqlservr!SchedulerManager::ThreadEntryPoint+0x128
msvcr80!_callthreadstartex+0x17
msvcr80!_threadstartex+0x84
kernel32!BaseThreadStart+0x3a