Larry Osterman's WebLog

Confessions of an Old Fogey

Concurrency, Part 10 - How do you know if you've got a scalability issue?

Well, the concurrency series is finally running down (phew, it's a lot longer than I expected it to be)...

Today's article is about how to tell whether you've got a scalability problem.

First, a general principle: all non-trivial, long-lived applications have scalability problems.  It's possible that those scalability issues don't matter to your application.  For example, if your application is Microsoft Word (or mIRC, or Firefox, or just about any other application that interacts with the user), then scalability isn't likely to be an issue for your application: the reality is that the user isn't going to try to make your application faster by throwing more resources at it.

As I wrote the previous paragraph, I realized that it describes the heart of scalability issues: if the user of your application feels it's necessary to throw more resources at your application, then your application needs to worry about scalability.  It doesn't matter whether the resources being thrown at your application are disk drives, memory, CPUs, GPUs, blades, or entire computers; if the user decides that your system is bottlenecked on a resource, they're going to try to throw more of that resource at your application to make it run faster.  And that means that your application needs to be prepared to handle it.

Normally, these issues only matter for server applications living in data farms, but we're starting to see the "throw more hardware at it" idea trickle down into the home space.  As usual, the gaming community is leading the way.  The Alienware SLI machines are a great example of this: to improve your 3D graphics performance, simply throw more GPUs at the problem.

I'm not going to go into diagnosing bottlenecks in general; there are loads of resources available on the web for that (my first Google hit was this webcast from 2003).

But for diagnosing CPU bottlenecks related to concurrency, there's a relatively straightforward way to determine whether you've got a scalability issue associated with your locks: look at the "Context Switches/sec" perfmon counter.  There's an article on how to measure this in the Windows 2000 resource kit here, so I won't go into the details, but in a nutshell, you start the perfmon application, select all the threads in your application, and look at the context switches/sec for each thread.

You've got a scalability problem related to your locks if the context switch rate for a thread is somewhere above 2000 per second or so.
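A "/sec" counter like this one is, roughly, the change in a cumulative count divided by the length of the sampling interval. The sketch below applies the per-thread rule of thumb to two such samples. It assumes you have already collected each thread's cumulative context switch count at two points in time (how you read those counts is outside the sketch), and the `thread_sample` struct and the function names are hypothetical, not part of any Windows API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: a thread's cumulative context switch count,
   sampled at the start and end of a measurement interval. */
typedef struct {
    unsigned long thread_id;
    unsigned long long switches_start;
    unsigned long long switches_end;
} thread_sample;

/* Rule-of-thumb threshold from the text, per thread (not system-wide).
   The right number depends on your circumstances. */
#define SUSPICIOUS_SWITCHES_PER_SEC 2000.0

/* Context switches per second over the sampling interval. */
double context_switch_rate(const thread_sample *s, double interval_sec)
{
    assert(interval_sec > 0.0);
    assert(s->switches_end >= s->switches_start); /* cumulative counts only grow */
    return (double)(s->switches_end - s->switches_start) / interval_sec;
}

/* Number of threads whose rate exceeds the threshold; these are the
   ones worth investigating for lock contention. */
size_t count_suspicious_threads(const thread_sample *samples, size_t n,
                                double interval_sec)
{
    size_t flagged = 0;
    for (size_t i = 0; i < n; i++) {
        if (context_switch_rate(&samples[i], interval_sec) >
            SUSPICIOUS_SWITCHES_PER_SEC) {
            flagged++;
        }
    }
    return flagged;
}
```

For example, a thread whose count rose from 1,000 to 6,000 over one second is running at 5,000 switches/sec and would be flagged; one whose count rose from 0 to 500 over the same second would not.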

And that means you need to dig into your code to find the "hot" critical sections.  The good news is that it's not usually too hard to detect which critical section is "hot": hook a debugger up to your application, start your stress test, and put a breakpoint on the ntdll!RtlEnterCriticalSection routine.  You'll get a crazy number of hits, but if you look at the call stacks, the "hot" critical sections will start to show up.  It sounds tedious (and it is, somewhat), but it is surprisingly effective.  There are other techniques for detecting the "hot" critical sections in your process, but they are not guaranteed to work on all releases of Windows (and will make Raymond Chen very, very upset if you use them).
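The debugger approach works even on code you didn't write. If you own the locks, a complementary technique (a different one from the breakpoint method described above) is to instrument them: try to acquire without blocking first, and count how often that attempt fails because another thread holds the lock. A lock with a high contended-to-total ratio is a good candidate for a "hot" lock. This is a sketch using POSIX threads for portability; on Windows the analogous calls would be TryEnterCriticalSection and EnterCriticalSection. The `counted_lock` type and its functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical wrapper: a mutex plus counters recording how often an
   acquire found the lock already held. */
typedef struct {
    pthread_mutex_t mutex;
    const char *name;          /* label for reporting, e.g. "cache lock" */
    atomic_ulong acquisitions; /* total successful acquires */
    atomic_ulong contended;    /* acquires that had to wait */
} counted_lock;

void counted_lock_init(counted_lock *l, const char *name)
{
    int rc = pthread_mutex_init(&l->mutex, NULL);
    assert(rc == 0);
    (void)rc;
    l->name = name;
    atomic_init(&l->acquisitions, 0);
    atomic_init(&l->contended, 0);
}

void counted_lock_acquire(counted_lock *l)
{
    /* Try without blocking first; failure means another thread holds it,
       so this acquire is contended and will likely block. */
    if (pthread_mutex_trylock(&l->mutex) != 0) {
        atomic_fetch_add(&l->contended, 1);
        pthread_mutex_lock(&l->mutex);
    }
    atomic_fetch_add(&l->acquisitions, 1);
}

void counted_lock_release(counted_lock *l)
{
    pthread_mutex_unlock(&l->mutex);
}

/* Fraction of acquires that had to wait; a "hot" lock scores high. */
double counted_lock_contention(counted_lock *l)
{
    unsigned long total = atomic_load(&l->acquisitions);
    return total ? (double)atomic_load(&l->contended) / (double)total : 0.0;
}

void counted_lock_destroy(counted_lock *l)
{
    pthread_mutex_destroy(&l->mutex);
}
```

Wrap each lock you suspect, run your stress test, and print each lock's name and contention ratio at the end. The extra trylock and atomic increments add some overhead, so this fits better as a diagnostic build option than as something left on everywhere.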

Sometimes, your CPU bottleneck is simply that you're doing too much work on a single thread.  If it just takes too much time to calculate something, then you need to start seeing whether it's possible to parallelize your code.  At that point you're back in the realm of making your code go faster and out of the realm of concurrent programming.  Another option you might have is OpenMP, a set of language extensions for C and C++ that lets the compiler parallelize parts of your code for you.
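As a sketch of the OpenMP route, assuming a compiler with OpenMP support: the pragma below asks the compiler to split the loop's iterations across threads, and the `reduction(+:sum)` clause gives each thread a private partial sum that is combined at the end, so the threads don't contend on a shared variable or lock. The `sum_of_squares` function is a hypothetical stand-in for independent per-element work. Compile with `-fopenmp` (GCC, Clang) or `/openmp` (MSVC); without it, the pragma is ignored and the loop simply runs on one thread.

```c
#include <assert.h>
#include <stddef.h>

/* Each iteration is independent, so the loop can be split across threads.
   reduction(+:sum) gives each thread its own partial sum and adds the
   partial sums together when the loop finishes. */
double sum_of_squares(const double *values, size_t n)
{
    assert(values != NULL || n == 0);
    double sum = 0.0;
    /* Signed loop index: older OpenMP versions (e.g. 2.0) require one. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        sum += values[i] * values[i];
    }
    return sum;
}
```

Because floating-point addition isn't associative, for general inputs the parallel result can differ in the last bits from the serial one.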

But even if you do all that and ensure that your code is bottleneck-free, you can still have scalability issues.  That's for tomorrow.

Edit: Fixed spelling mistakes.


  • This series has been very good and much appreciated for sure... I just had to comment on something you said today.

    > For example, if you application is
    > Microsoft Word (or mIRC, or Firefox,
    > or just about any other application
    > that interacts with the user)) then
    > scalability isn't likely to be an
    > issue for your application

    I don't know...I would hazard a guess that Word is doing a lot in the background.

    * Saving a backup copy.
    * Checking for typos (AutoCorrect, spelling, grammar, etc.).
    * Rendering images in the document.
    * Being a container for embedded objects.

    Firefox (or any other browser) as well:

    * Downloading content off servers.
    * Rendering content.
    * Being a container for plugins (Flash, Shockwave, PDF, etc.).

    These things all depend on the horsepower of the machine. Those of us who have run Word on XP on a 500MHz box with 256M of RAM know very well how important scalability is on a local user basis. :)
  • A valid point. But I doubt that anyone's going to believe that adding more disk drives to their machine is going to speed up Firefox.

    Similarly, they're not likely to add more network adapters or more CPUs.

    On the CPU issue, there's an easy check: Look at your CPU usage when you're running Firefox (or mIRC, or Word, or whatever). If it's at 100% in the application's process, then there might be an issue, but if it isn't at 100% it's not likely there's a huge problem.

    If your application interacts with a user, then (with the exception of games), the application is going to be spending 99% of its time waiting on the user.
  • I was just wondering how many context switches is too many. On an idle Win2k3 server I see system-wide switches at about 500 or 600/sec. Under load that skyrockets to 6000+, which imho is way too many, but that number is the total number of context switches. Your estimate of 2000 being too many: is that per thread? Or for the whole system?
  • Jerry, the 2000-3000 is per thread, not system-wide.

    That's the problem with providing real numbers - any number I provide may be too low depending on your circumstances.

    And if I gave a number that was safe (if, for instance I said that 10,000 is too high), people would assume they were ok if they had 9,000. Which might not be the case.
  • This definitely has been a great series, and I know .NET isn't your area of expertise, but would you have any idea how, if we have a managed app, we can go in and look at what is locking what? I am just beginning to think this is an issue with one of my earlier .NET apps, as it is now growing huge and is pretty much asynchronous, so something in there is taking some time.

    I had another question I was going to ask you about this; I thought about it late last night. How does multithreading affect multiple devices, like when dealing with hardware? OK, I recently bought a Zen Micro, and I love that it just plugs into Windows Media Player and syncs and everything. However, last night I bought a new CD, and I went to rip it through WMP and then sync it to my device. Now, I noticed that WMP only rips one track at a time. While ripping, it is playing the song as well, and I can use the UI too. So yeah, there are a few threads going on there. But why is it only ripping one track at a time? Is that because the CD-ROM can only be read from a single thread at a time? If so, how are you playing: are you playing what was already written to disk? Or is the CD-ROM actually allowing multiple reads from multiple threads, and then is this really a good idea if I am playing track 1 while recording track 17?

    My question isn't Windows Media Player specific; you can answer however you wish. I am more curious about devices and threading to them. But last night that is what made me specifically think of it.
  • Jeff,
    Your question dovetails quite neatly into tomorrow's article; I'll try to make sure that I discuss it.
  • Maybe your series is so long because you keep cc'ing things in the title?
  • CC'ing?
  • "CC'ing?"

    A subtle dig at the typo in the title?

    Great series though, very interesting stuff.
  • Jeff Parker:
    More than one thread can read from a CD-ROM at the same time. The ReadFile API doesn't care much about the device type. But the CD-ROM device itself has only one laser beam for reading data. Also, the physical layout of a CD is optimized for playing audio, so you've got one long spiral of data. Andrew Tanenbaum's "Modern Operating Systems"[1] gives a very detailed discussion of how a CD-ROM works :). Knowing this, you can see that WMP would just hurt performance if it tried to rip more songs at the same time. You can try this and see it for yourself too: insert a CD-ROM with at least two big files on it and try to copy both at the same time to a HDD. It should be quicker to do it sequentially rather than in parallel.

    Great articles, Larry! Looking forward to reading the next one :)

  • > First, a general principal

    Palese! ^_^

    3/3/2005 11:36 AM Anonymous Coward

    > I would hazard a guess that Word is doing a
    > lot in the background.

    It is indeed, but a lot of it isn't in your list and isn't anything I've guessed. One time I had Word 2000 displaying a document, just sitting there doing nothing with a portion of the document sitting there on the screen, no animations or funny stuff like that, using 99% of the CPU time. (Windows Task Manager was surely using the other 1% to display its green rectangle.) Eventually I left it sitting there on one computer and scrolled it occasionally, and used a different computer to type a translation. After about 8 hours it had used about 7 hours 59 minutes of CPU time. I have a feeling that throwing CPUs at it wouldn't have helped.
  • Ingrid, good catch, I missed that one.

    Larry, you made another principal/principle mistake too ;-)
  • Norman made me wonder: is it possible for one hung thread stuck in an infinite loop to hose all the CPUs on a standard (not virtualized) SMP system? Or would it kill one CPU, or flip from CPU to CPU every few quanta? Whoa.
  • Unless you've set the processor affinity (I KNEW there was another API set I missed yesterday), then it'll bounce from one CPU to another.
  • I would have thought that NT would by default try to maintain some form of affinity per thread anyway? Obviously it won't try very hard unless you set the processor affinity, but I would have thought that the scheduler would at least give it a go.