Larry Osterman's WebLog

Confessions of an Old Fogey

Why does Win32 even have Fibers?

Raymond's had an interesting series on fibers (starting here), and I thought I'd expand on them a bit.

Fibers were added to Windows (in NT 3.51 SP3, IIRC) because some customers (not just SQL server) believed that they could improve the performance of their server applications if they had more control over their threading environment.

But why on earth did these customers want fibers?

Well, it's all about scalability, especially on MP systems.  On a multitasking system, it's easy to forget that a single processor can only do one thing at a time.

The ramifications of this are actually quite profound.  If you've got two tasks currently running on your system, then your operating system will have to switch between them.  That switch is called a context switch, and it can be expensive (just for the sake of argument, let's say a context switch takes 5000 instructions).  In a context switch, the operating system has to (at a minimum):

  1. Enter Kernel Mode
  2. Save all the old thread's registers.
  3. Acquire the dispatch spinlock.
  4. Determine the next thread to run (if the next thread is in another process, this can get expensive)
  5. Leave the dispatch spinlock.
  6. Swap the old thread's kernel state with the new thread's kernel state.
  7. Restore the new thread's registers.
  8. Leave Kernel Mode

That's a fair amount of work to perform (not outrageous, but not trivial).

The OS won't do this unless it has to.  In general, there are three things that will cause the OS to perform a context switch (there are others, like page faults, but these are the big ones):

  1. When your thread goes to sleep (either by calling Sleep() or calling WaitFor[Single|Multiple]Object[s])
  2. When your thread calls SwitchToThread() or Sleep(0) (this is a special case of the Sleep() API that is identical to SwitchToThread())
  3. When your thread's quantum elapses.

A thread's quantum is essentially the amount of time that the OS will dedicate to a thread if there's another thread in the system that can also run.  A quantum is something like 5-10 ticks on a workstation and 10-15 on a server, and each tick is typically somewhere between 10 and 15 milliseconds, depending on the platform.  In general, your thread will get its full quantum unless there is a higher priority runnable thread in the system (please note: this is a grotesque simplification, but it's sufficient for the purposes of this discussion).
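
As an aside, if you're curious what the clock tick actually is on your machine, you can ask the OS directly.  Here's a minimal sketch (my own illustration, not part of the original post): GetSystemTimeAdjustment reports the clock interrupt interval in 100-nanosecond units.

    #include <windows.h>
    #include <stdio.h>

    /* Minimal sketch: print the length of a clock tick on this machine.
       GetSystemTimeAdjustment reports the interval between periodic clock
       interrupts in 100-nanosecond units; divide to get milliseconds. */
    int main(void)
    {
        DWORD adjustment, increment;
        BOOL disabled;

        if (GetSystemTimeAdjustment(&adjustment, &increment, &disabled)) {
            printf("Clock tick: %.2f ms\n", increment / 10000.0);
        }
        return 0;
    }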

The thing is, for a highly scalable application, context switches are BAD.  They represent CPU time that the application could be spending working for the customer, but which is instead spent on what is essentially OS bookkeeping.  So a highly scalable application REALLY wants to reduce the number of context switches.  If you ever have a service that's performing poorly, one of the first things to look for is the number of context switches/second - if it's high (for some value of high), then there's invariably a scalability issue in the application that needs to be addressed.
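
If you'd rather sample that counter from code than from perfmon, here's a rough sketch (again, my own illustration, not from the original post) that uses the PDH API to read the system-wide "Context Switches/sec" counter.  Rate counters need two samples, so the sketch collects, waits a second, and collects again; error handling is mostly omitted for brevity.

    #include <windows.h>
    #include <pdh.h>
    #include <stdio.h>
    #pragma comment(lib, "pdh.lib")

    int main(void)
    {
        PDH_HQUERY query;
        PDH_HCOUNTER counter;
        PDH_FMT_COUNTERVALUE value;

        if (PdhOpenQuery(NULL, 0, &query) != ERROR_SUCCESS)
            return 1;

        /* The counter path assumes an English-locale counter name. */
        PdhAddCounter(query, TEXT("\\System\\Context Switches/sec"), 0, &counter);

        PdhCollectQueryData(query);      /* first sample */
        Sleep(1000);
        PdhCollectQueryData(query);      /* second sample, one second later */

        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        printf("Context switches/sec: %.0f\n", value.doubleValue);

        PdhCloseQuery(query);
        return 0;
    }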

So why fibers?  Because for highly scalable applications, you want each of your threads to get its full quantum - in other words, you want the only reason for a context switch to be reason #3 above.

Remember the first cause of context switches: Calling WaitFor*Object.  What that means is that if you call EnterCriticalSection on a critical section with contention, then you're highly likely to cause a context switch. The same thing happens when you wait for an I/O to complete, etc.  You absolutely want to avoid calling any Win32 APIs that might block under the covers.
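
As an aside (and independent of fibers), one common way to reduce the context switches caused by briefly-held, contended critical sections is to give the critical section a spin count, so that a waiter spins in user mode for a little while before falling back to a kernel wait.  A minimal sketch, with an arbitrary illustrative spin count:

    #include <windows.h>

    CRITICAL_SECTION g_lock;

    void InitLock(void)
    {
        /* A contended-but-briefly-held lock can often be acquired by
           spinning instead of by blocking in the kernel (which would
           force a context switch).  4000 is just an illustrative value. */
        InitializeCriticalSectionAndSpinCount(&g_lock, 4000);
    }

    void DoWork(void)
    {
        EnterCriticalSection(&g_lock);   /* spins before waiting in the kernel */
        /* ... touch shared state briefly ... */
        LeaveCriticalSection(&g_lock);
    }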

So fibers were created to resolve this blocking problem.  Switching fibers effectively removes steps 1, 3, 5 and 8 from the context switch steps above: switching from one fiber to another just saves the old register state and restores the new register state.  It's up to the application to determine which fiber runs next, etc.  But the application can make its own choices.  As a result, a server application could have a dozen or more "tasks" running on each thread, and they'd radically reduce their context switch overhead, because saving and restoring registers is significantly faster than a full context switch.  The other thing that fibers allow is the ability to avoid the dispatcher spin lock (see John Vert's comment below about context switches being serialized across all processors).  Any global lock hurts your scalability, and fibers allow an application to avoid one of the global locks in the system.
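
For reference, here's a minimal sketch of the fiber primitives involved (ConvertThreadToFiber, CreateFiber, SwitchToFiber).  The "scheduling" here is deliberately trivial - each fiber just switches back to the main fiber - since, as noted below, a real application has to supply its own scheduler.

    #include <windows.h>
    #include <stdio.h>

    static LPVOID g_mainFiber;

    static VOID CALLBACK FiberProc(LPVOID param)
    {
        int id = (int)(INT_PTR)param;
        for (;;) {
            printf("fiber %d running\n", id);
            SwitchToFiber(g_mainFiber);   /* cooperatively yield to the main fiber */
        }
    }

    int main(void)
    {
        /* The calling thread must become a fiber before it can switch fibers. */
        g_mainFiber = ConvertThreadToFiber(NULL);

        LPVOID worker1 = CreateFiber(0, FiberProc, (LPVOID)(INT_PTR)1);
        LPVOID worker2 = CreateFiber(0, FiberProc, (LPVOID)(INT_PTR)2);

        /* Hand-rolled "scheduling": run each worker fiber twice.  No kernel
           transition or dispatcher lock is involved in these switches. */
        for (int i = 0; i < 2; i++) {
            SwitchToFiber(worker1);
            SwitchToFiber(worker2);
        }

        DeleteFiber(worker1);
        DeleteFiber(worker2);
        ConvertFiberToThread();
        return 0;
    }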

Ok, so why have fibers remained obscure?

They've remained obscure first because of the reasons Raymond mentioned in his fibers caveat here - using fibers is an all-or-nothing thing, and it's not possible to use fibers from a shared library.  As Rob Earhart pointed out in this comment on Raymond's post, some of the idiosyncrasies of the fiber APIs have been resolved in the current versions of Windows.

They're also HARD to deal with - you essentially have to write your own scheduler.

Raymond also left off a couple of other gotchas: For example, if you're using fibers to improve your app's scalability, you can't call ANY Win32 APIs that might block (including filesystem APIs), because all the Win32 blocking APIs also have thread affinity (not surprisingly :)).  So if you're running 20 fibers on a single thread, when any of the fibers blocks, your thread blocks (however, the fibers can be run from another thread, because fibers don't have thread affinity, so if you have a spare thread around, that thread can run the fibers).

The other reason that fibers have remained obscure is more fundamental.  It has to do with Moore's law (there's a reason for the posts yesterday and the day before).

Back when fibers were first implemented, CPUs were a lot slower.  Those 5000 instructions for the context switch (again, this is just a guess) took .05 millisecond (assuming one cycle/instruction) to execute on a 100MHz machine (which would have been a pretty fast machine in 1995).  Well, on a 2GHz machine, that .05 millisecond drops to .0025 millisecond - more than an order of magnitude smaller.  The raw cost of a context switch has gone down dramatically.  In addition, there has been a significant amount of work in the base operating system to increase the scalability of the dispatcher spinlock - nowadays, the overhead of the dispatcher lock is essentially nonexistent on many MP machines (you start to see contention issues on machines with a large number of CPUs, for some value of "large").

But there's another cost that has gone up dramatically, and that's the cost of blowing the CPU cache.

As processors have gotten smarter, the performance of the CPU cache has become more and more critical to their speed - because main memory is painfully slow compared to the speed of the processor, if you're not getting your data from the CPU's cache, you're paying a huge hidden penalty.  And fibers don't fix this cost - when you switch from one fiber to another, you're going to blow the CPU cache.

Nowadays, the cost of blowing the cache has leveled the playing field between OS context switches and fibers - these days, you don't get nearly the benefit from fibers that you did ten years ago.

This isn't to say that fibers won't become useful in the future - they might.  But they're no longer as useful as they once were.

Btw, it's important to note that fibers aren't the ONLY solution to the thread quantization issue mentioned above.  I/O completion ports can also be used to limit context switches - the built-in Win32 thread pool uses them (that's also what I used in my earlier post about thread pools).  In fact, the recommendation is that instead of spending your time rewriting your app to use fibers (and it IS a rewrite), it's better to rearchitect your app to use a "minimal context" model - instead of maintaining the state of your server on the stack, maintain it in a small data structure, and have that structure drive a small one-thread-per-CPU state machine.  You'll still have the issue of unexpected blocking points (you call malloc, and malloc blocks on the heap critical section), but that issue exists regardless of how your app's architected.
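
To make the "minimal context" idea concrete, here's a rough sketch of the shape such a design usually takes (the struct and state names are purely illustrative): per-request state lives in a small structure hung off an OVERLAPPED, and a handful of worker threads (typically one per CPU) pull completions off an I/O completion port and advance the state machine, instead of parking a thread per request.

    #include <windows.h>

    typedef enum { STATE_READ_REQUEST, STATE_SEND_REPLY, STATE_DONE } REQUEST_STATE;

    typedef struct REQUEST_CONTEXT {
        OVERLAPPED    Overlapped;    /* recovered via CONTAINING_RECORD below */
        REQUEST_STATE State;
        /* ... handles, buffers, and whatever else this request needs ... */
    } REQUEST_CONTEXT;

    DWORD WINAPI WorkerThread(LPVOID param)
    {
        HANDLE completionPort = (HANDLE)param;

        for (;;) {
            DWORD        bytes;
            ULONG_PTR    key;
            LPOVERLAPPED overlapped;

            if (!GetQueuedCompletionStatus(completionPort, &bytes, &key,
                                           &overlapped, INFINITE)) {
                if (overlapped == NULL)
                    break;           /* the port itself was closed or failed */
                /* otherwise an individual I/O failed; let the state machine
                   handle the error for that request */
            }

            REQUEST_CONTEXT *ctx =
                CONTAINING_RECORD(overlapped, REQUEST_CONTEXT, Overlapped);

            switch (ctx->State) {
            case STATE_READ_REQUEST:
                /* parse the request, issue the next async I/O on this port,
                   and set the state the completion should resume in */
                ctx->State = STATE_SEND_REPLY;
                break;
            case STATE_SEND_REPLY:
                ctx->State = STATE_DONE;
                /* free or recycle ctx */
                break;
            default:
                break;
            }
        }
        return 0;
    }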

If you're designing a scalable application, you need to architect it to minimize the number of context switches, so it's critical that you not add unnecessary context switches to your app (like queuing a request to a worker thread and then blocking on the request, which forces the OS to switch to the worker and then back to the original thread).

Significant Edit (1/10/2005): Fixed several issues pointed out by the base performance team.

 

  • I first found out about fibers from reading Chris Brumme's excellent but lengthy blog post on CLR hosting.
    http://blogs.msdn.com/cbrumme/archive/2004/02/21/77595.aspx
    I'm glad to see some more information being posted.

    It's amazing how much I feel I've learned through these blogs in the last year. Yours and Raymond's especially. There seems to be very little information on these low-level details of the OS.

    I have a copy of the new edition of Windows Internals on its way to me; it should be an interesting book to dip into for those who want to understand more about how things work under the covers.
    http://www.amazon.com/exec/obidos/tg/detail/-/0735619174/102-2480567-1461732?v=glance
  • On a somewhat historical note, the real reason we added fibers was not to give people control over their threads. Heck, it's your thread, you've got control, just allocate some memory for a stack then set ESP to point to it. Lots of people tried this, but it's not quite that simple. There are a lot of subtleties to get wrong. For instance, they would switch the stack base, but not update the stack limit. Or not switch the exception handler list. Or forget to swap the FP state. And down the line these little gotchas would bite them in some subtle unforeseen way.

    Since people kept getting it wrong (and in fact, it was not possible to get it right without doing some fairly undocumented stuff) and causing untold grief, we figured it was time to build it into the OS and let everybody use it.

    Sometimes this is how APIs get created. Customers keep trying to do something that should be simple but is really much harder than it appears at first. So we make an API that encapsulates the subtle hard stuff.

    Another BIG reason context switches are bad is because they are serialized. So even if you have a 32-CPU machine only one CPU can be context-switching at a time. The more context switches your app does the worse it scales. Fiber switches do not acquire any locks, therefore they will scale much better.
    -John (ex-NT kernel dev without a blog)
  • John,
    Thanks for the update, it's appreciated.
  • This is enlightening; I was under the impression that fibers were added to Windows to help with porting UNIX server applications, in particular applications that already have their own threading logic.

    It's pretty clear I'm wrong about that, and it's nice to know the real reasons.

    Thanks for the enlightening post.

    --Fox Cutter
  • > just for the sake of argument, let's say a
    > context switch takes 5000 instructions

    As you say that's just a guess, which is no problem. My guess is faster by an order of magnitude though. The rough estimate that I had heard was that servicing a page fault takes about 5000 instructions, and that's doing a lot more than other kinds of context switches.

    By the way Microsoft Japan doesn't distinguish between page faults (ordinary happenings which can be remedied properly) and invalid page faults (resulting from bugs, resulting from pointers that got corrupted by various famous means or occasionally by actual bugs in the same program that was actually crashing). I think engineers do know the difference though.
  • Norman, I believe that you're off by several orders of magnitude on the cost of a page fault. Page fault instruction counts can run from the thousands of instructions to the millions of instructions (depending on what had to happen).

    After all, a page fault is likely to have to hit the disk (assuming that the page wasn't available in the transition pages list), and when that happens, you might be looking at 2-200ms of time. There's a LOT of instructions in 200ms, even on a 100MHz processor.

  • Norman, Who is 'Microsoft Japan' you are talking about?
    Is there an official MS document stating for all MS Japan workers 'thou shall not distinguish between page faults and invalid page faults' or is it someone working at MS Japan who did not distinguish between these two?

    Sam
  • 1/5/2005 8:24 PM Larry Osterman

    > After all, a page fault is likely to have to
    > hit the disk

    Of course but that's elapsed time not instruction count. The CPU still gets to do useful work for other purposes during that time.

    1/6/2005 2:00 AM Sam

    > Norman, Who is 'Microsoft Japan' you are
    > talking about?

    If you're not satisfied with the English name for that company then hypothetically I should type the name in Japanese, but the server that hosts this blog has corrupted Japanese words that I have typed in the past. Sorry.

    > Is there an official MS document stating for
    > all MS Japan workers

    I don't know, the visible appearance of Windows is publicly visible but a lot of official MS documents aren't publicly visible. Even official MS documents which prove that OEMs are supposed to be responsible for supporting customers who get hit (sometimes very seriously) by MS bugs in MS Windows, aren't publicly visible. So sorry, I don't know.
  • Norman, actually it IS instruction count - if there are other runnable tasks on the system, they'll be executing instructions.

    I stand by my claim that context switches are less expensive than page faults (and I'm sure that the perf experts on the base team will correct me if I'm out of line :))
  • This is slightly off topic, but would you please comment on SwitchToThread being identical to Sleep(0)? The documentation suggests that the latter leaves the thread in a 'ready' state, whereas the former yields to another thread that's ready to run and returns an indication of whether it's successful.

    So far, so similar, but a specific consequence of using Sleep, as I read it, is that it will only relinquish to threads of an equal priority whereas SwitchToThread makes no such claim (for some definition of 'ready to run'). I'm also confused about the specification of "equal priority": my assumption is that this means that threads may be affected by the dynamic priority boost assigned when returning from wait functions?

    I know that I've found the public documentation to be scattered and seemingly inconsistent. I prefer not to use these functions anyway, but a clarification or pointer would be greatly appreciated now that you've reminded me of reading about them :-)
  • >> It's somewhere between 10 and 15 milliseconds, depending on the platform. <<

    I'm not sure I understand this. That would imply, assuming a hundred threads, that a thread runs for 15 milliseconds and then sleeps for nearly a second. This doesn't seem right and certainly wouldn't make for great application responsiveness. I must be missing something. Any chance of a clarification? Thanks.
  • Twiddle: Yup, that's just about right. You're making a HUGE assumption though: That all 100 threads are runnable.

    To verify this, simply write an application that creates 100 threads that each do some CPU bound operation (like counting). You'll find that each thread only gets to run about once a second or so. You'll also find that the system responsiveness is pretty mediocre.

    The key thing to keep in mind is that on most systems, 99% of the threads on the system are not runnable.
  • Ahha, that makes sense. Thanks for the clarification. With this in mind, it is probably not a good idea to create more than a few CPU intensive threads. This also means that, in certain circumstances, the minimum Sleep() could be a lot longer than the usually cited 10-15ms.
  • 1/6/2005 5:08 PM Larry Osterman

    > Norman, actually it IS instruction count -
    > if there are other runnable tasks on the
    > system, they'll be executing instructions.

    The instruction count spent servicing a page fault is instruction count that is unavailable for serving other useful tasks. The elapsed time waiting for the disk to deliver is elapsed time (with a completely different instruction count, sure but so what) where the CPU sure does serve other useful tasks.

    > I stand by my claim that context switches
    > are less expensive than page faults

    Of course. Who said otherwise? It's just the overall estimates that we're guessing at. You heard a rumour that context switches take somewhere around 5000 instructions on average (where the CPU cannot serve other useful tasks), and yes page faults take more. I heard a rumour that page faults take around 5000 instructions on average (where the CPU cannot serve other useful tasks), and yes simpler kinds of context switches take less. We're both guessing, and there is nothing wrong with labelling guesses as guesses. I only commented because the rumours we heard differ by maybe an order of magnitude.
  • On server (e.g. compute-bound) applications, those 5000 instructions matter a whole lot. We're talking about workloads where people have considered writing unified filesystem-through-controller stacks to avoid the CPU penalty of dispatching through the layers of the filesystem to the partitioning layer to the class driver to the port driver.

    For a typical interactive workload, you're right, this would be lost in the noise.