Larry Osterman's WebLog

Confessions of an Old Fogey

Concurrency, part 11 - Hidden scalability issues


So you're writing a server.  You've done your research, and you've designed your system to be as scalable as you possibly can.

All your linked lists are interlocked lists, your app uses only one thread per CPU core, you're using fibers to manage your scheduling so that you make full use of your quanta, you've set each thread's processor affinity so that it's locked to a single CPU core, etc.

So you're done, right?

Well, no.  The odds are pretty good that you've STILL got concurrency issues.  But they were hidden from you because the concurrency issues aren't in your application, they're elsewhere in the system.

This is what makes programming for scalability SO darned hard.

So here are some of the common places where scalability issues hide.

The biggest one (from my standpoint, although the relevant people on the base team get on my case whenever I mention it) is the NT heap manager.  When you create a heap with HeapCreate, unless you specify the HEAP_NO_SERIALIZE flag, the heap will have a critical section associated with it (and the process heap is a serialized heap).

What this means is that every time you call LocalAlloc() (or HeapAlloc, or HeapFree, or any other heap APIs), you're entering a critical section.  If your application performs a large number of allocations, then you're going to be acquiring and releasing this critical section a LOT.  It turns out that this single critical section can quickly become the hottest critical section in your process.   And the consequences of this can be absolutely huge.  When I accidentally checked in a change to the Exchange store's heap manager that reduced the number of heaps used by the Exchange store from 5 to 1, the overall performance of the store dropped by 15%.  That 15% reduction in performance was directly caused by serialization on the heap critical section.

The good news is that the base team knows that this is a big deal, and they've done a huge amount of work to reduce contention on the heap.   For Windows Server 2003, the base team added support for the "low fragmentation heap", which can be enabled with a call to HeapSetInformation.  One of the benefits of switching to the low fragmentation heap (along with the obvious benefit of reducing heap fragmentation) is that the LFH is significantly more scalable than the base heap.

And there are other sources of contention that can occur below your application.  In fact, many of the base system services have internal locks and synchronization structures that could cause your application to block - for instance, if you didn't open your file handles for overlapped I/O, then the I/O subsystem acquires an auto-reset event across all file operations on the file.  This is done entirely under the covers, but can potentially cause scalability issues.
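To opt out of that per-file serialization, open the handle for overlapped I/O; the overlapped structure then carries the file offset explicitly, and multiple operations can be in flight on the same handle. A sketch (the filename `example.dat` is just a placeholder):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_OVERLAPPED: the I/O manager no longer serializes
     * operations on this handle behind an internal event. */
    HANDLE h = CreateFileA("example.dat", GENERIC_READ,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    OVERLAPPED ov = {0};        /* Offset/OffsetHigh give the file position */
    char buf[512];
    DWORD bytesRead = 0;

    BOOL ok = ReadFile(h, buf, sizeof(buf), NULL, &ov);
    if (ok || GetLastError() == ERROR_IO_PENDING)
        GetOverlappedResult(h, &ov, &bytesRead, TRUE);  /* wait for completion */

    printf("read %lu bytes\n", bytesRead);
    CloseHandle(h);
    return 0;
}
```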

And there are scalability issues that come from physics as well.  For example, yesterday, Jeff Parker asked about ripping CDs from Windows Media Player.  It turns out that there's no point in dedicating more than one thread to reading data from the CD, because the CDROM drive has only one head - it can't read from two locations simultaneously (and on CDROM drives, head motion is particularly expensive).  The same laws of physics hold true for all physical media - I touched on this in the answers to the "What's wrong with this code, part 9" post - you can't speed up hard disk copies by throwing more threads or overlapped I/O at the problem, because file copy speed is ultimately limited by the physical speed of the underlying media - and with only one spindle, the drive can only service one operation at a time.

But even if you've identified all the bottlenecks in your application, and added disks to ensure that your I/O is as fast as possible, there STILL may be bottlenecks that you've not yet seen.

Next time, I'll talk about those bottlenecks...

  • with respect to "one laser on a CD drive. . ."

    I've always wondered why hard drives didn't have more than one head per platter. . . one for reading, one for writing.

    Back in my DOS days, I always ordered my machines with two identical hard drives. My processing went like: Step 1: read file from Drive C:, write file to Drive D:. Step 2: read file from Drive D:, write file to Drive C:. Etc. This made dramatic improvements to throughput.

    This was when a 512 Megabyte Drive was HUGE!

    :)


  • "you can't speed up hard disk copies by throwing more threads or overlapped I/O at the problem, because file copy speed is ultimately limited by the physical speed of the underlying media - and with only one spindle, it can only read or write to the drive one operation at a time."

    I thought with RAID, byte striping, SAN and the like you can't really assume what your underlying physical reality is anymore. I once heard this from a member of the technical staff at a high-end RAID provider: "anything you can throw at the SCSI bus, their RAID box will be able to write". Not really sure if it was true, but their disks are really fast.

  • Mark, you're right, from the point of view of the application. But from the point of view of the system administrator (who ultimately is the person trying to make the system go faster), they absolutely DO know the topology - heck, they're staring at the disk, they should be able to figure it out :)
  • "One of the benefits of switching to the low fragmentation heap (along with the obvious benefit of reducing heap fragmentation) is that the LFH is significantly more scalable than the base heap."

    In what way is the LFH more scalable?
  • Daniel,
    I've been told that the LFH doesn't have the issues of the global heap critical section.

    Having no concrete experience with the LFH, I can't confirm that however.