Jake Oshins wanted to write about IRQLs and I am gladly letting him use my blog as a platform.  Here it is…

I’ve found myself explaining IRQL a lot lately, sometimes to people who are trying to write Windows drivers and sometimes to people who are accustomed to Linux or some other variant of Unix and want to know why something like IRQL is required in Windows when those systems so clearly get by without it.

Penny Orwick covered this topic before, in the following two papers, with a lot of help from me and some others:

http://www.microsoft.com/whdc/driver/kernel/irql.mspx

http://www.microsoft.com/whdc/driver/kernel/locks.mspx

I’ll try to do it a little more briefly here.

Computers have many things within them that can interrupt a processor.  These include timers, I/O devices, other processors, internal processor performance counters, etc.  All processors have some instruction for disabling interrupts, but that instruction (cli on x86 and x64 processors) isn’t selective about which interrupts it disables.

The people who built DEC’s VMS operating system also helped design the processors that DEC used, and many of them came to Microsoft and designed Windows NT, which was the basis for modern versions of Windows, including Windows XP and Windows 7.  These guys wanted a way to disable (very quickly) just some of the interrupts in the system.  They considered it useful to hold off interrupts from some sources while servicing interrupts from other sources.

They also realized that, just as you must acquire locks in the same order everywhere in your code to avoid deadlocks, you must also service interrupts with the same relative priority every time.  It doesn’t work if the clock interrupts are sometimes more important than the IDE controller’s interrupts and sometimes they aren’t. 

Interrupts are frequently called “Interrupt ReQuests” and the priority of a specific IRQ is its Level.  These letters, all run together, are IRQL.

So if you lay out all the interrupt sources in the system and create a priority for each one, or sometimes a priority for each group, you can start to do interesting things. 
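The core rule can be sketched as a small user-mode C model.  This is only an illustration: sim_cpu, irql_t and the sim_ functions are made-up names, not WDK definitions, and the real kernel does this work with KeRaiseIrql and KeLowerIrql plus help from the interrupt controller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one current IRQL per processor, and every
   interrupt source carries a fixed level. */
typedef unsigned char irql_t;

typedef struct {
    irql_t current_irql;   /* level the processor is running at right now */
} sim_cpu;

/* An interrupt is delivered only if its level is strictly higher than
   the processor's current IRQL; otherwise it stays pending. */
bool sim_can_deliver(const sim_cpu *cpu, irql_t source_level)
{
    return source_level > cpu->current_irql;
}

/* Raise to a new level and hand back the old one so it can be restored. */
irql_t sim_raise_irql(sim_cpu *cpu, irql_t new_level)
{
    irql_t old = cpu->current_irql;
    assert(new_level >= old);          /* raising must never lower */
    cpu->current_irql = new_level;
    return old;
}

/* Restore a previously saved level. */
void sim_lower_irql(sim_cpu *cpu, irql_t old_level)
{
    assert(old_level <= cpu->current_irql);   /* lowering must never raise */
    cpu->current_irql = old_level;
}
```

In this model, a processor that has raised itself to level 5 holds off sources at levels 5 and below, while a source at level 13 still gets through.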

Consider a spinlock.  Spinlocks (at least in the traditional sense) are implemented by having a processor spin in a tight loop trying to atomically modify a variable.  The cache coherency hardware guarantees that only one processor can do that at a time, so lock acquisition goes only to the processor that succeeds.  Other processors keep spinning until they succeed.
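For readers who haven’t seen one, here is a minimal sketch of that traditional spinlock using C11 atomics in user mode.  The sim_spinlock type and its functions are hypothetical; the kernel’s own KSPIN_LOCK is opaque and isn’t claimed to look like this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-mode spinlock built on a single atomic flag. */
typedef struct {
    atomic_bool held;
} sim_spinlock;

void sim_spin_init(sim_spinlock *lock)
{
    atomic_init(&lock->held, false);
}

/* One attempt: atomically write "held" and look at the previous value.
   If it was clear before, this caller is now the owner. */
bool sim_spin_try_acquire(sim_spinlock *lock)
{
    return !atomic_exchange_explicit(&lock->held, true, memory_order_acquire);
}

/* Spin in a tight loop until an attempt succeeds. */
void sim_spin_acquire(sim_spinlock *lock)
{
    while (!sim_spin_try_acquire(lock)) {
        /* burn processor time; nothing useful happens here */
    }
}

/* Only the owner may release; the release ordering publishes its writes. */
void sim_spin_release(sim_spinlock *lock)
{
    assert(atomic_load_explicit(&lock->held, memory_order_relaxed));
    atomic_store_explicit(&lock->held, false, memory_order_release);
}
```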

The processor that “owns” the lock needs to release the lock as soon as possible, as the other (waiting) processors are burning up processor time waiting to acquire the lock.  So you really don’t want to interrupt that processor and schedule some other thread for execution, causing all the waiters to spin until the owning thread is rescheduled.

In this situation, some operating systems encourage the owner of the spinlock to disable all interrupts so that the code can’t be interrupted.  (Note, too, that interrupts really need to be disabled before trying to acquire the lock, or the thread might be interrupted between acquiring the lock and disabling interrupts.)

The designers of VMS and NT decided that they didn’t want to disable all interrupts just because some code somewhere acquired a spinlock.  Some things shouldn’t wait.  TLB flushes are a good example.  So if only some interrupts are disabled while a spinlock is held, then you can still briefly interrupt the code that owns the lock for much more important tasks.  Perhaps even more importantly, you can interrupt the processors which are spinning, waiting to acquire a spinlock, for these important tasks, causing them to do something useful instead of just spinning.

Note that this means that every spinlock has an associated IRQL, and you have to use that IRQL consistently, or the machine will deadlock.  In NT, by default, every spinlock has the same IRQL, called DISPATCH_LEVEL.  DISPATCH_LEVEL means, essentially, that the interrupts which can cause a thread to stop running are disabled.  (More about that later.)
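The ordering matters, so here is a rough user-mode model of it: raise IRQL first, then spin, and undo the two steps in the opposite order.  In the real kernel, KeAcquireSpinLock and KeReleaseSpinLock pair these steps for you; sim_cpu, sim_lock and the two functions below are invented for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { SIM_PASSIVE_LEVEL = 0, SIM_DISPATCH_LEVEL = 2 };

typedef struct { int irql; } sim_cpu;           /* hypothetical per-processor state */
typedef struct { atomic_bool held; } sim_lock;  /* hypothetical lock word */

/* Raise first, then spin, so the owner can never be pre-empted between
   taking the lock and masking the dispatcher.  Returns the old IRQL. */
int sim_acquire_at_dispatch(sim_cpu *cpu, sim_lock *lock)
{
    int old_irql = cpu->irql;
    assert(old_irql <= SIM_DISPATCH_LEVEL);   /* never lower IRQL to take a lock */
    cpu->irql = SIM_DISPATCH_LEVEL;
    while (atomic_exchange_explicit(&lock->held, true, memory_order_acquire)) {
        /* spinning; higher-level interrupts could still be serviced here */
    }
    return old_irql;
}

/* Release in the opposite order: drop the lock, then restore the IRQL. */
void sim_release_from_dispatch(sim_cpu *cpu, sim_lock *lock, int old_irql)
{
    atomic_store_explicit(&lock->held, false, memory_order_release);
    cpu->irql = old_irql;
}
```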

Here’s a table of all IRQLs, as defined in the Windows NT header files (easily seen in the WDK).

| IRQL | X86 IRQL Value | AMD64 IRQL Value | IA64 IRQL Value | Description |
|---|---|---|---|---|
| PASSIVE_LEVEL | 0 | 0 | 0 | User threads and most kernel-mode operations |
| APC_LEVEL | 1 | 1 | 1 | Asynchronous procedure calls and page faults |
| DISPATCH_LEVEL | 2 | 2 | 2 | Thread scheduler and deferred procedure calls (DPCs) |
| CMC_LEVEL | N/A | N/A | 3 | Correctable machine-check level (IA64 platforms only) |
| Device interrupt levels (DIRQL) | 3-26 | 3-11 | 4-11 | Device interrupts |
| PC_LEVEL | N/A | N/A | 12 | Performance counter (IA64 platforms only) |
| PROFILE_LEVEL | 27 | 15 | 15 | Profiling timer for releases earlier than Windows 2000 |
| SYNCH_LEVEL | 27 | 13 | 13 | Synchronization of code and instruction streams across processors |
| CLOCK_LEVEL | N/A | 13 | 13 | Clock timer |
| CLOCK2_LEVEL | 28 | N/A | N/A | Clock timer for x86 hardware |
| IPI_LEVEL | 29 | 14 | 14 | Interprocessor interrupt for enforcing cache consistency |
| POWER_LEVEL | 30 | 14 | 15 | Power failure |
| HIGH_LEVEL | 31 | 15 | 15 | Machine checks and catastrophic errors; profiling timer for Windows XP and later releases |

For driver writers, the only IRQLs that are usually interesting are 0 through 2 and DIRQL.  It’s worth mentioning, though, that the NT kernel itself internally has spinlocks at DISPATCH_LEVEL and all the levels above that.

So, now for a tour of interesting IRQLs:

PASSIVE_LEVEL

This is the level at which threads run.  In fact, if you look at the specific definition of “thread” in NT, it pretty much only covers code that runs in the context of a specific process, at PASSIVE_LEVEL or APC_LEVEL.  Deferred Procedure Calls (DPCs) are not threads, in that sense.

Any interrupt can occur at PASSIVE_LEVEL.  User-mode code executes at PASSIVE_LEVEL.

APC_LEVEL

Windows NT has an interesting mechanism for getting into a certain thread context.  You can queue an interrupt to a thread, so that your function will run on that thread’s stack, with that thread’s address space, with that thread’s local storage.  This is useful for I/O completion.  When I/O completes, you queue an APC back to the requesting thread which does the last part of I/O completion in the initiator’s address space.  It’s a neat way to solve a bunch of problems.

If you want to disable these interrupts to your thread, you raise to APC_LEVEL.  At least that was the original design.  APCs and the rules around them have grown much more complicated over the years.  At this point, the best that you can say is that if you care to disable APCs, call KeEnterCriticalRegion (http://msdn.microsoft.com/en-us/library/ms801955.aspx) or KeEnterGuardedRegion (http://msdn.microsoft.com/en-us/library/ms801643.aspx).

Your code generally won’t need to run at APC_LEVEL at all, unless you use Fast Mutexes (http://msdn.microsoft.com/en-us/library/aa490219.aspx).  Fast Mutexes are somewhat faster than Mutexes (http://msdn.microsoft.com/en-us/library/aa490228.aspx) or other dispatcher objects because, among other things, they hold off APCs by raising to APC_LEVEL.

APC interrupts, by the way, are sent by a processor, either to itself or to another processor.  No external device is involved.
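To make the “queue it now, run it when the thread allows” idea concrete, here is a deliberately simplified user-mode model.  Everything in it (sim_thread, the sim_ functions, the fixed-size queue) is hypothetical, and it ignores the distinctions between the different kinds of APCs that the real kernel makes.  It only shows that a queued routine is held off while the target thread is at or above APC_LEVEL or inside a critical region, and runs once neither is true.

```c
#include <assert.h>
#include <stdbool.h>

enum { SIM_PASSIVE_LEVEL = 0, SIM_APC_LEVEL = 1, SIM_MAX_APCS = 8 };

typedef void (*sim_apc_fn)(void *context);

typedef struct {
    int        irql;               /* current level of the thread */
    int        apc_disable_depth;  /* nesting count of critical regions */
    int        queued;             /* number of pending routines */
    sim_apc_fn fn[SIM_MAX_APCS];
    void      *ctx[SIM_MAX_APCS];
} sim_thread;

/* Run pending routines, but only if nothing is holding them off. */
void sim_deliver_apcs(sim_thread *t)
{
    if (t->irql >= SIM_APC_LEVEL || t->apc_disable_depth > 0)
        return;
    for (int i = 0; i < t->queued; i++)
        t->fn[i](t->ctx[i]);
    t->queued = 0;
}

/* Queue a routine to the thread; returns false if the queue is full. */
bool sim_queue_apc(sim_thread *t, sim_apc_fn fn, void *ctx)
{
    if (t->queued == SIM_MAX_APCS)
        return false;
    t->fn[t->queued] = fn;
    t->ctx[t->queued] = ctx;
    t->queued++;
    sim_deliver_apcs(t);
    return true;
}

void sim_enter_critical_region(sim_thread *t)
{
    t->apc_disable_depth++;
}

void sim_leave_critical_region(sim_thread *t)
{
    assert(t->apc_disable_depth > 0);   /* enter and leave must be paired */
    t->apc_disable_depth--;
    sim_deliver_apcs(t);                /* anything held off runs now */
}

/* Example routine: counts how many times it has run. */
void sim_count_apc(void *context)
{
    (*(int *)context)++;
}
```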

DISPATCH_LEVEL

Windows NT doesn’t have a “scheduler” in the sense that most Unix variants do.  There is no process that decides which other processes should run.  Each processor “dispatches” itself by looking at runnable threads and deciding which one to run next.  This is a scheduler, of sorts, but not the same thing that many people coming from Linux will imagine.

The dispatcher is interrupt driven, in that it won’t allow a thread to run longer than its quantum before scheduling another thread.  But the scheduling clock doesn’t generate dispatcher interrupts directly.  The clock interrupt fires at CLOCK_LEVEL, somewhat more frequently than the thread scheduling quantum.  Various housekeeping tasks happen as a result of the clock interrupt, and one of them is that a dispatcher interrupt is generated by the processor to itself.  (Actually, this internal self-interrupt is often optimized away, but the architectural result is the same as if an interrupt were generated.)

If your code raises IRQL to DISPATCH_LEVEL, you have disabled the dispatcher on that processor, and only on that processor.  This means that your thread will not be pre-empted by another thread and it will not be moved to another processor until you lower IRQL.
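A toy model of that sequence, in user-mode C, might look like the following.  The structure, the function names and the three-tick quantum are all assumptions made for the sketch; the point is only that the clock tick requests a dispatch, and the request sits pending until the processor drops below DISPATCH_LEVEL.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    SIM_PASSIVE_LEVEL  = 0,
    SIM_DISPATCH_LEVEL = 2,
    SIM_QUANTUM_TICKS  = 3    /* made-up quantum length, in clock ticks */
};

typedef struct {
    int  irql;              /* current level of this processor */
    int  quantum_left;      /* ticks remaining for the running thread */
    bool dispatch_pending;  /* self-requested dispatcher interrupt */
    int  context_switches;  /* how often the dispatcher actually ran */
} sim_cpu;

/* The dispatcher interrupt is taken only below DISPATCH_LEVEL. */
void sim_check_dispatch(sim_cpu *cpu)
{
    if (cpu->dispatch_pending && cpu->irql < SIM_DISPATCH_LEVEL) {
        cpu->dispatch_pending = false;
        cpu->quantum_left = SIM_QUANTUM_TICKS;  /* next thread, fresh quantum */
        cpu->context_switches++;
    }
}

/* Clock interrupt: it does not switch threads itself, it only asks
   for a dispatch once the quantum has run out. */
void sim_clock_tick(sim_cpu *cpu)
{
    if (cpu->quantum_left > 0 && --cpu->quantum_left == 0)
        cpu->dispatch_pending = true;
    sim_check_dispatch(cpu);   /* on the way out of the clock interrupt */
}

/* Lowering IRQL is the moment a held-off dispatch finally happens. */
void sim_lower_irql(sim_cpu *cpu, int new_irql)
{
    assert(new_irql <= cpu->irql);
    cpu->irql = new_irql;
    sim_check_dispatch(cpu);
}
```

A processor that stays at DISPATCH_LEVEL across several ticks accumulates a pending request but never switches threads; the switch happens the instant it lowers IRQL.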

Resolving a page fault may require reading the page back in from disk, which is I/O.  Since, as noted above, I/O completion depends on code running at APC_LEVEL, and since APC_LEVEL code won’t run while the processor is at DISPATCH_LEVEL, page faults can’t be resolved at DISPATCH_LEVEL.  So code that holds a DISPATCH_LEVEL lock (like a spinlock) can’t reference memory which might be paged out.

Furthermore, most of the locking primitives that the NT kernel provides are what are called “dispatcher objects” (http://msdn.microsoft.com/en-us/library/aa490210.aspx).  You can wait on dispatcher objects until they are signaled and, while your code is waiting, the processor is free to get other work done, on behalf of other threads.  This is nice, because, in contrast with the spinlock, which consumes the processor doing no useful work while it’s waiting, dispatcher objects allow the dispatcher to find other work until the reason for waiting can be satisfied.

What this means to you, though, is that you can’t wait on a dispatcher object at DISPATCH_LEVEL.  You’ve already disabled the dispatcher.  Your only choice at DISPATCH_LEVEL is a spinlock.
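These rules boil down to a couple of comparisons against DISPATCH_LEVEL.  The following user-mode sketch writes them down; the function names and the sim_lock_kind enumeration are invented for illustration, and the three-way choice is a simplification rather than anything the kernel exposes.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    SIM_PASSIVE_LEVEL  = 0,
    SIM_APC_LEVEL      = 1,
    SIM_DISPATCH_LEVEL = 2,
    SIM_HIGH_LEVEL     = 31   /* x86 value of HIGH_LEVEL */
};

typedef enum {
    SIM_USE_DISPATCHER_OBJECT,   /* waiting is allowed */
    SIM_USE_SPINLOCK,            /* DISPATCH_LEVEL spinlock */
    SIM_USE_INTERRUPT_SPINLOCK   /* lock tied to a device's DIRQL */
} sim_lock_kind;

/* Waiting needs the dispatcher, so it is only possible below DISPATCH_LEVEL. */
bool sim_may_block(int irql)
{
    assert(irql >= SIM_PASSIVE_LEVEL && irql <= SIM_HIGH_LEVEL);
    return irql < SIM_DISPATCH_LEVEL;
}

/* A page fault ends in a wait for I/O, so the same limit applies. */
bool sim_may_touch_pageable(int irql)
{
    return sim_may_block(irql);
}

/* Pick a lock type from the highest IRQL at which the data is touched. */
sim_lock_kind sim_choose_lock(int highest_irql)
{
    if (sim_may_block(highest_irql))
        return SIM_USE_DISPATCHER_OBJECT;
    if (highest_irql == SIM_DISPATCH_LEVEL)
        return SIM_USE_SPINLOCK;
    return SIM_USE_INTERRUPT_SPINLOCK;
}
```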

DIRQL

“DIRQL” is the shorthand that many people (internal to Microsoft and external) use when they mean “the IRQL that the PnP manager assigned to my device’s interrupt, and the associated interrupt spinlock and interrupt service routine.”  When a bus driver requests an interrupt for a device (as when the PCI driver finds the Interrupt Pin register set to some non-zero value, or when it discovers an MSI-X table) it tells the PnP manager two things.  First, it says that the device needs to register an ISR or a set of ISRs.  Next it says something about how the device is attached to any interrupt controllers present in the machine.  The PnP manager picks a processor to attach the interrupt to and picks the IRQL for that interrupt.  Sometimes that choice is constrained by the way the wires are laid out on the motherboard, sometimes not.  That topic is too big for this post.  (I might go into it later.  I wrote the code.)

As you can see from the table above, there is more than one DIRQL.  Unless your device generates more than one interrupt, you don’t really have to care.  Just pass along the values that you were given.  Your interrupt spinlock’s IRQL is that which was assigned to you.  The only thing you have to know about it is that acquiring that lock means that you’ve pre-empted everything happening at lower IRQL.  You haven’t pre-empted things like TLB updates, though, as those still come in at higher IRQL.

If your device does generate more than one interrupt, and if you need one spinlock that is used for both interrupt sources, you need to register your interrupt service routines with the highest of your DIRQLs as the SynchronizeIrql, which will avoid deadlocks by guaranteeing that all your interrupt-related code runs at the highest necessary IRQL.
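Picking that value is just a maximum over the IRQLs you were handed.  Here is a small sketch; sim_synchronize_irql is a hypothetical helper, and in a real driver the result would be what you pass as the SynchronizeIrql argument when connecting each interrupt (for example, to IoConnectInterrupt).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char irql_t;   /* stand-in for the kernel's KIRQL */

/* Return the highest of the DIRQLs assigned to a device's interrupts.
   Using this one level for every ISR and for the shared lock means no
   ISR of the device can interrupt another one that holds the lock. */
irql_t sim_synchronize_irql(const irql_t *dirqls, size_t count)
{
    assert(dirqls != NULL && count > 0);
    irql_t highest = dirqls[0];
    for (size_t i = 1; i < count; i++) {
        if (dirqls[i] > highest)
            highest = dirqls[i];
    }
    return highest;
}
```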

In summary, IRQL is a concept that was intended to allow spinlocks to be sorted into more-important and less-important buckets, so that some interrupts can occur while other interrupts are disabled.

Most people agree that this is fairly complex to work with.  Whether you believe this was a necessary addition to the driver model is the source of a debate that’s been raging on the ‘net since before Windows NT actually existed.

- Jake Oshins