Put another way, can Windows preempt our worker thread and perform a context switch even though it has been “scheduled” by the SQL OS (SOS) and SQL thinks it is running? Of course it can.
SQL Server implements a cooperative scheduling mechanism to make the most efficient use of the CPUs as it can. However, Windows ultimately schedules the thread and uses a general purpose pre-emptive scheduler and does not give special considerations to SQL Server when the dispatcher selects a thread – there all just threads to Windows (though they can have priority). So if a higher priority thread comes in from an interrupt or if we’ve exhausted our quantum at the OS level and another ready thread is waiting from another process to run it’s timeslice, then our worker thread will get preempted, regardless of what SQL wants. SQL controls this behavior to the best of it’s ability amongst it’s own worker threads by making sure no more threads are viable for scheduling (from within SQL) than the number of CPUs available to SQL Server. Also, SQL Server realizes there is no sense in scheduling a thread that is waiting on a database resource – like a transactional lock – something that Windows doesn’t understand. If a thread reaches its quantum at the OS and no other threads are viable to the Windows dispatcher, we can get another quantum and maximize SQL Server’s use of the CPU. In this way, we hope to reduce context switches – pure overhead from an execution standpoint. SQL controls the scheduling of it’s own worker threads by placing them into wait states with APIs like WaitForSingleObject. Then when a thread is ready to run because a lock has been freed, a timer has expired, or a task / request has come in from a user, SQL Server signals it to come out of the wait state and execute. Even as it sits in the RUNNABLE queue, it is actually in one of these wait APIs (though the SQL status will be RUNNABLE). As the “running” thread leaves, part of it’s “good-citizen” cooperative scheduling responsibilities is to “signal” the next worker thread in the RUNNABLE queue so that it can get it’s 15ms of fame.
This is a good reason to limit as much as possible what other software executes on your SQL Server. However, let’s see an example stack taken from my machine where just this scenario happened.
First, let’s look at a “normal” example:
So here we have a thread that is part of a parallel query that goes into a wait. Ever see those CXPACKET waits? Well, you’re looking at one now. So whlie we wait on CXPACKET or any other wait type – we can’t do anything. Let’s “switch context” and have the SQL OS (SOS) take us off the scheduler – let someone else have their time – and we can be signaled when it’s time to go again. As we “switch”, the SOS scheduler puts us in a wait state (waiting on an event) with WaitForSingleObject. At the frame that reads ZwWaitForSingleObject, you are seeing the transition from user mode into kernel mode as Windows takes over. Then at the top of the stack, the Windows dispatcher records our context record and switches us off the CPU and finds which thread is ready to run – which is very likely NOT another SQL Server thread in *this case* – but on a dedicated SQL Server we’d like it to be – and most of the time it should be.
Now, let’s see if SQL can have the rug pulled out from under it… Here’s one:
Notice at the top – Windows has performed the context switch and let someone else run. Once again, Windows will save our execution context in something known as a context record (so it knows where it was and what we were doing when we finally do get back on the CPU) and let another thread execute – very likely a thread from a different process. Also, notice that the last frame from SQL Server (sqlservr!*) is more parallelism code. There is no call to sleep, no call to SwitchContext, nothing from the SQL OS at all. We are just rudely interrupted by the kernel (ntoskrnl.exe) and told – you’re done. Actually, at this point it was processing the APC queue (part of the kernel) and went into it’s own WaitForSingleObject call as a part of the context switch – but not one invoked by SQL Server. Also, kernel stacks work the same as user mode stacks – so other kernel code could have already run and been “popped” off the stack at this point. We’ll get more time in a matter of milliseconds, but the more this happens, the more your SQL code waits, the more the duration goes up – and yet no progress is made.
So you really want to limit how much non-SQL server software you execute on your server, SQL Server can’t control other processes or Window’s scheduling of those processes.
Finally, how did I get this? First, I used Mark Russinovich’s excellent Process Explorer utility to capture both the user and kernel mode stacks in one nice combined stack trace. To easily “create” this mess and find an offending thread, I ran a very busy query (a cross join) and also ran an external utility I wrote that does nothing but burn cpu, it’s literally called “EatCPU”.