Oversubscription: a Classic Parallel Performance Problem

Oversubscription: a Classic Parallel Performance Problem

  • Comments 2

One of the most important things to pay attention to when tuning a multithreaded application is its performance pattern.  There is a set of common poor performance patterns that most developers of multithreaded applications will encounter.  These include, among other things, patterns such as oversubscription, serialization, lock convoys, and uneven workload distribution.  We have documented some of these poor parallel performance patterns in what we call the Rogues Gallery. We plan to extend this list based on input from the developer community as customers identify and share new patterns.

In this post, I will discuss oversubscription.  In the case of oversubscription, the number of threads trying to run exceeds the number of available logical cores.  This over-decomposition of the workload leads to additional unnecessary context switches.  Context switches have a nonzero cost and are especially costly when they cross cores.

To make this more clear, I will present a simple example.  In this case, I solve the same workload using four threads and then again with ten threads.  My machine has four cores so I know that creating ten threads is unnecessary.  The first two images depict the activity of four threads executing along with stats in the Visible Timeline Profile.  The second two images show the behavior when ten threads are used to solve the same problem as well as the associated Visible Timeline Profile.

Using four threads (3.41 sec):

image

image

Using ten threads (4.84 sec):

image

image

After running several trials for both numbers of threads, I timed both scenarios and found that on average, using four threads solved the problem in roughly 3.41 seconds while using ten threads took roughly 4.84 seconds.  In this example, we can see from the threads view that there is a significant amount of additional preemption caused by introducing more threads.  A substantially larger portion of thread time is spent waiting for resources to be made available. 

The active legend shows 4% of preemption when using four threads and 49% of preemption when using ten threads. There is a bit of preemption even with four threads due to the CPU contention of other processes running on the system.  Due to the increased context switching caused by additional threads, the ten-thread implementation ran more slowly.  If your algorithm lends itself to parallelization and it isn’t necessary to create more threads than there are cores, it is best to avoid oversubscription.

This is just one of the many possible poor performance patterns that one must keep in mind when writing multithreaded applications.  It is important to be aware of whether your application is affected by such a poor performance pattern and the Concurrency Visualizer is a great way to quickly identify if your application is afflicted by such a pattern.

James Rapp - Parallel Computing Platform

  • Oversubscription is indeed a classic problem, but we should be careful about oversimplifications. For example, the one thread per CPU idea that seems to be suggested by this article is valid for CPU-bound computations only.

    Even back in the unicore days, we used multiple threads to use resource efficiently when a program interleaves IO with computation. Of course, asynchronous IO alows doing this with less threads too, but it often requires the kind of complex scheduling that becomes available only now with the TPL.

    With synchronous IO, if a thread spends a fraction f of its time waiting for IO, the optimal number of threads may be closer to n/(1-f), where n is the number of hardware threads.

    Anyway, all I'm saying is that CPU-bound computation is a special case, and we should be careful to draw general conclusions based on a special case. In the end, as you correctly pointed out, only careful measurement can provide you with the answers you're after when tuning an application.

    Kris Vandermotten

  • I've written a multi-threaded internet bot that collects data scraped off of web pages. Nothing very computationally intensive, but lots of waiting for network I/O. I've found that on a logical 16 core box I can improve performance up to around 200 total threads in my application and above that it seems to crash. I wish I understood what was going on enough to use the right number of threads.

Page 1 of 1 (2 items)
Leave a Comment
  • Please add 3 and 7 and type the answer here:
  • Post