March, 2005

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    What is a BUGBUG?

    One of the internal software engineering traditions here at Microsoft is the "BUGBUG".

    Bugbug's are annotations that are added to the source code when the developer writing the code isn't sure if the code they're writing is "correct", or if there's some potential issue with the code that the developer feels needs further investigation.

    So when looking through source code, you sometimes find things like:

        // BUGBUG: I'm sure these GUIDs are defined somewhere but I'm not sure which library contains them, so defining them here.
        DEFINE_GUID(IID_IFoo, 0x12345678,0x1234,0x1234,0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12);

    The idea behind a BUGBUG annotation is that a BUGBUG is something that you should fix before you ship, but that won't necessarily hold shipping the product for - as in the example above, it's not the end of the world if the definition of IFoo is duplicated in this module, but it IS somewhat sloppy.  Typically every component has a P1 bug in the database to remove all the BUGBUG's - either turn them into real bugs, or remove them, or ensure that unit tests exist to verify (or falsify) the bugbug.

    As far as I know, the concept of a BUGBUG's was initially created by Alan Whitney, who was my first manager at Microsoft - I know he's the first person who explained their use to me.  Lately they've fallen out of favor in favor of more structured constructs, but conceptually, I still like them.

  • Larry Osterman's WebLog

    Still more misinformation about virtual memory and paging files


    The wired network in my building's being unusually flakey so I'm posting this from my laptop, sorry for the brevety..

    Slashdot had a front page story today about an article be Adrian Wong posted in his Rojak Pot: "Virtual Memory Optimization Guide".

    I've not finished reading it (the site's heavily slashdotted), but his first paragraph got me worried:

    Back in the 'good old days' of command prompts and 1.2MB floppy disks, programs needed very little RAM to run because the main (and almost universal) operating system was Microsoft DOS and its memory footprint was small. That was truly fortunate because RAM at that time was horrendously expensive. Although it may seem ludicrous, 4MB of RAM was considered then to be an incredible amount of memory.

    4MB of RAM?  Back in the "good old days" of 1.2MB floppy disks (those were the 5 1/4" floppy drives in the PC/AT) the most RAM that could be addressed by a DOS based computer was 1M.  If you got to run Xenix-286, you got a whopping 16M of physical address space.

    I was fuming by the time I'd gotten to the first sentence paragraph of the first section:

    Whenever the operating system has enough memory, it doesn't usually use virtual memory. But if it runs out of memory, the operating system will page out the least recently used data in the memory to the swapfile in the hard disk. This frees up some memory for your applications. The operating system will continuously do this as more and more data is loaded into the RAM.

    This is SO wrong on so many levels.  It might have been be true for an old (OS8ish) Mac, but it's not been true for any version of Windows since Windows 95.  And even for Windows 1.0, the memory manager didn't operate in that manner (it it was a memory manager but it didn't use virtual memory (it was always enabled and active swapping data in and out of memory, but the memory manager didn't use the hardware (since there wasn't any hardware memory management for Windows 1.0))).

    It REALLY disturbs me when articles like this get distributed.  Because it shows that the author fundimentally didn't understand what he's writing about (sort-of like what happens when I write about open source :) - at least nobody's ever quoted me as an authority on that particular subject)

    Edit: I'm finally at home, and I've had a chance to read the full article.  I've not changed my overall opinion of the article, as a primer on memory management, it's utterly pathetic (and dangerously incorrect).  Having said that, the recommendations for improving the performance of your paging file are roughly the same as I'd come up with if I was writing the article.  Most importantly, he differentiates between the difference between having a paging file on a partition and on a separate drive, and he adds some important information on P-ATA and RAID drive performance characteristics that I wouldn't have included if I was writing the article.  So if you can make it past the first 10 or so pages, the article's not that bad.


  • Larry Osterman's WebLog

    Open Source and Hot Rods


    I was surfing the web the other day and ran into someone linking to this article by Jack Lanier from Edmunds (the automotive newsletter people).

    The article's entitled "Friends Don't Let Friends Modify Cars".

    From the article:

    Today, it's difficult to make cars better and extremely easy to make them worse. Or dangerous.

    As a journalist driving modified cars, I've been sprayed with gasoline, boiling coolant, super-heated transmission fluid and nitrous oxide. (The latter was more entertaining than the former.) Several have burst into flames. Throttles have stuck wide open, brake calipers snapped clean off, suspensions ripped from their mounts and seatbelt mounting hardware has dropped into my lap. All this is on top of the expected thrown connecting rod, blown head gasket, exploded clutch, disintegrated turbocharger and broken timing belt.

    The vast majority of these vehicles were built by professionals. Many were from big-name tuners. Most performed as if they were constructed in a shop class at a high school with a lax drug policy. Once, after a suspension component fell off a car from a big-name tuner, the car actually handled better.

    For every modified and tuner car that performed better than stock, I've driven numerous examples that were slower. If they were quicker, it was often in an area that can't be used on the street. What's the use of gaining 0.2 second in the quarter-mile if the car is slower 0-60 mph? And costs $10,000 more?


    Recently, I autocrossed a pair of Subaru WRXs. One was a dead-stock WRX. The other, a tricked-out STi lowered with stiffer springs, shocks and bars and an exhaust kit and air filter. The STi is supposed to have an advantage of some 70 horsepower. Maybe the exhaust and filter moved the power up in the rev band where it couldn't be used. The lowered, stiffened STi regularly bottomed against its bump stops. When a car hits its bump stops, the spring rate goes to infinity and tire grip drops to near zero. This caused the STi to leap into the worst understeer I've experienced with inflated front tires. Meanwhile, in the unmodified WRX, I could be hard in the throttle at the same point. The result: The dead-stock WRX was at least as quick as the STi and far easier to drive. Easy to make worse, harder to make better

    I read this article and was struck by the similarities between this and the open source vs COTS model.

    COTS (Commercial Off The Shelf) software is equivalent to a stock automobile.  They're built by professional engineers, and tested as a whole.  But you don't get to mess with the system.

    On the other hand, open source gives you the ability to join the software equivalent of the tuner/modified market - you can tweak the system to your hearts content.  You may make it go faster, but you're not totally sure what it's going to do to the overall quality of the system.

    In fact, I constantly read that that's one of the huge benefits of open source - on an open source project, if you don't like how something works, you can just step in and fix it, while with COTS you don't have that ability.

    Software engineering is software engineering, whether it's open source or closed source.  Having the ability to tweak code (or an automobile) doesn't automatically mean that the tweak will be higher quality than what it's replacing.  It's entirely possible that it either won't be better, or that the improvement won't really matter.  On the IMAP mailing list, I CONSTANTLY see people submitting patches to the U.W. IMAP server proposing tweaks to fix one thing or another (even though it’s the wrong mailing list, the patches still come in).  And Mark Crispin shoots them down all the time, because the person making the patch didn’t really understand the system – their patch might have fixed their problem and their configuration, but it either opened up a security hole, or broke some other configuration, etc.

    Btw, the same thing holds true for system modifications.  Just because you can put a window into a hard disk doesn’t mean that the hard disk is going to work as well afterwards as it did before you took it apart.

    Just like in the automotive world, simply because you CAN modify something, it doesn't mean that it's a good idea to modify it.

  • Larry Osterman's WebLog

    Missed Anniversaries


    Somehow (and I'm not sure how I managed to do this), I missed the fact that last week was the one year anniversary of this blog!

    Since I'm usually pretty obsessive about dates, I can't really explain it, but it is.

    Last year on March 15th, this blog went live, since then, I've posted 303 posts.  I'll be honest and say that I didn't think I would be able to pull it off - it's been a challenge, but I'm reasonably proud that I've only missed one (or maybe two) days of posting since I started (given that if I'm not at work, I'm not posting).

    It's been a pretty wild year, I've survived a slashdotting (and surprisingly was pleased with the results), and a couple of very near calls.  I hit my 20th anniversary here during that time, and ended up writing a heck of a lot more than I'd ever have imagined.

    And I've never been called on the carpet for anything I've said, and I've only had to issue a retraction on something I said once (or rather, a significant correction, not a retraction).

    I'm looking forward to the next year :)

    And I want to especially thank everyone who reads my blatherings, I appreciate it.

  • Larry Osterman's WebLog

    Concurrency, part way too many3: Concurrency and Windows

    Work's heading up right now, so this one will be briefer than expected.

    Windows has its own idea of threading models.  In general, each window on the system is owned by a thread, and that thread is responsible for ensuring that the thread is constantly pumping messages for that window.

    This last can cause some real difficulties.  As an example, just recently someone posted on an internal mailing list that they were having problems with their application hanging.

    They'd started a really long process on their windows application's main thread, and they wanted to provide feedback to the user.  So they created a new thread, and had the thread put up a progress dialog while the main thread was busy.  They were upset because their dialog box didn't make progress, instead it seemed to get frozen just after coming up.

    Raymond immediately came to the rescue with the right answer: The new window was trying to send a WM_SETFOCUS to the old window to tell the old window that it was losing focus, and because the old window's thread wasn't pumping messages, the new window hung.

    This is why windows design books keep talking about having a "UI thread" and a "work thread" when designing your applications - it's absolutely critical to keep windows messages pumping on your main thread.

    So if your thread ever creates a window, you must pump messages for that window regularly.  The other critical thing to remember is that there are certain functions that require interaction with the windowproc for the window (by calling SendMessage), this is especially true for windows controls (which are typically built from one or more windows).  And there are other functions (like DestroyWindow) that MUST be called from the thread that owns the window.

    In general, you can't ever call SendMessage send a message to a window that's owned by another thread (you might get away with it but your chances of deadlocking are pretty high).  There are a couple of ways around this.  First, you could use the GetWindowThreadProcessId API to see if your current thread is the thread that owns the window (this is what the .Net framework "Control.InvokeRequired" method does).  If your thread owns the window, you can simply call SendMessage, if it doesn't, you can post a custom message to the window and do your work on that thread (more-or-less, that's what "Control.Invoke" does under the covers).

    Another method is to use the SendNotifyMessage API.  Of course, if you use SendNotifyMessage to send a message to another thread, then then you can't use this API to retrieve information from the destination window (like issuing a WM_GETITEM call).


    While writing this article, I found an old but still relevant article on MSDN here that describes all of this fairly well.

    Edit1: Fixed typo (thanks Ryan).

    Edit2: Added comment about DestroyWindow

  • Larry Osterman's WebLog

    Concurrency, part way too many2 - Concurrency and the Win32 API set

    As I mentioned the other day, I can't quite let go of the concurrency bug, so the 2nd of my 3 add-on posts to the concurrency series.

    One thing I didn't talk about explicitly in the original concurrency series is the Win32 API set.

    I implicated the concurrency issues when I mentioned that all DLLs need to be multi-thread safe, and the Win32 API set is no exception, but that really doesn't express the full state of the concurrency "issues"` w.r.t. the Win32 APIs. 

    In general, you can assume that all the Win32 APIs can be called from separate threads (please note, I'm NOT including the "User", "GDI" or "Shell" APIs in this list - I'm just talking about the functions in Kernel32.dll and Advapi32.dll - the "Windows" APIs have different threading requirements that I'll talk about tomorrow).

    But each family of APIs handle concurrency slightly differently (the lack of consistancy is simply a reflection of the underlying architecture of each API).

    For instance, the file I/O APIs handle concurrency issues by simply ignoring concurrency issues.  There are internal locks to protect the data structures but if you attempt to perform multiple I/Os on an ordinary file handle, the results are "undefined" - all bets are off.  You CAN use file handles from multiple threads simultaneously, but the system makes no guarantees of working correctly.  If you do want to do I/O on multiple threads, you need to tell the I/O manager that you're going to do that by specifying the FILE_FLAG_OVERLAPPED flag in the CreateFile call.  When you make this call, you're telling the I/O system that you're prepared to deal with the consequences of accessing files from multiple threads.  For example, when you specify overlapped I/O, the SetFilePointer API ceases to function in a reasonable fashion - the API won't fail, but it also won't necessarily set the current file offset - that's because files opened for overlapped I/O effectively don't have a current file offset, you need to specify the file offset on each read or write operation.

    Another example is the NT heap (which I mentioned before).  The NT heap manager (by default) protects its data structures by acquiring a lock across all heap accesses.  The only option you have for modifying the heap serialization behavior is to set the HEAP_NO_SERIALIZE flag, which tells the heap manager that you're prepared to own the serialization for that particular heap.  In that situation, the heap manager disables its internal locks and hands the responsibility for concurrency to your application.

    The Windows MME APIs (waveOutXxx, waveInXxx, mixerXxx, etc) also protect their internal data structures.  But they also support callback functions, and those callback functions are called with internal locks held (this has to happen to protect the integrity of the internal data structures in the winmm DLL).  As a result, there are some huge caveats on what you can do in the callback function (waveOutProc).  For the MME APIs, the reality is that the waveOutProc is (IMHO) sufficiently limited that the other callback options (CALLBACK_EVENT, CALLBACK_THREAD, or CALLBACK_WINDOW) are vastly better - they allow you to control the context in which the APIs called. 

    The good news is that MSDN does a pretty good job of laying out the concurrency issues associated with the particular APIs.  For example, the documentation for StartService calls out the fact that there's a lock held by the service control manager during service startup, and thus that a service cannot start another service in its startup routine before it's reported that it's started to the service controller.

    For COM, the concurrency contract for the COM objects is explicitly called out in the objects threading model.  I've written about this here, but here's a really quick summary: Apartment threaded objects can only be called from the thread which created the object, "free" threaded objects can be called from any thread (and thus internally deal with concurrency issues), and "both" objects can be called from either apartment model threads.

  • Larry Osterman's WebLog

    Concurrency, part way too many - concurrency and the C runtime library.

    For some reason, I can't seem to let this concurrency thing go, I keep on thinking of more and more relevant topics - there are a lot of issues surrounding concurrency.

    I realized today that my concurrency series didn't talk at all about libraries and concurrency, and I want to address that issue in a couple of articles.  To help avoid feature creep, here a brief outline of the articles:

    1. Concurrency and the C runtime library
    2. Concurrency and the Win32 API set
    3. Concurrency and Windows (GDI and User)

    That's it, I'm going to do my best to limit myself to those three articles - given my recent history on this topic, I'm not sure I'll succeed, but I'm going to try.

    Ok, concurrency and the C runtime library (to be specific, the Microsoft C runtime library, YMMV for others)...

    The C language specification is intentionally agnostic with respect to threads - there are no mentions of multithreading issues in the C specification at all, it's left up to the implementation.

    To deal with potential multithreading issues, Microsoft produces three different versions of the C runtime library (actually there are 6 different versions - the three variants I'm describing come in debug and retail versions).  Those three versions are:

    • Non thread safe C runtime library (-ML)
    • Thread safe C runtime library (-MT)
    • DLL C runtime library (-MD)

    The reason that there are three different libraries is that the performance characteristics of the three are different.  The non thread safe C runtime library is faster than the thread safe library because (a) it doesn't do locking internally, and (b) the C runtime library contains several functions (like strtok and errno) that are stateful (in other words, they rely on the results of previous operations that occurred on the thread).  The second is actually more important than the first - the overhead of locks for a single threaded application is quite small, but the stateful operations need to be implemented differently - they can't rely on global variables to hold their state, and instead have to rely on thread local storage.

    The thread safe C runtime library exists to resolve the above issues - in the non thread-safe C runtime library, errno is a global variable, in the thread-safe C runtime library, it's a call to the function _errno() - the differences are hidden by a macro in the multithread case.  If you're writing a multithreaded application, it's critical that you use the thread safe C runtime library instead of the non thread safe library.

    The third library (the DLL C runtime library) is also thread safe, and has the additional value of sharing code with all other applications that use the C runtime library in your process.  I've written about the DLL C runtime library before, so I'll just include that by reference.  I strongly recommend using the DLL C runtime library, simply because of the code sharing opportunities.

    Microsoft has a set of articles describing multithreading and the C runtime library in more detail here.

    Now, having said that, what about C++ and/or ATL?  In particular, what about the STL and ATL collection classes?  It turns out that they are explicitly NOT thread safe.  This is largely an artifact of their implementation - the ATL and STL template libraries are built in-line, they're not in a runtime library, and it's not trivial to be able to ensure concurrency (it's possible, for example, the iostream classes guarantee thread safety if you're using the thread safe runtime library, and the STL collections have specific semantics regarding thread safety that can be found here).

    I've separately been informed that the ATL collection libraries for ATL 7.0 (CStringT, CAtlMap, CAtlMap, CAtlArray, CRBMap and CRBMultiMap at a minimum) have the same thread safety guarantees as the STL - a single object is safe for either one writer or many readers, but it's up to the user of the template to ensure that the guarantee is met.  And, as Pavel Lebedinsky pointed out the rationale for this can be found here.


    Edit: Added clarification of the concurrency rules for ATL collections.


  • Larry Osterman's WebLog

    Keeping Kids safe on the internet, revisited.

    This is sort-of a follow-up for the article "Keeping kids safe on the internet" that I wrote a while ago.

    First off, I want to point to the excellent article "What you don't know can hurt kids".  Based on that article, my comments were hopelessly naive.  I'm still hoping that my methods (only public computers get internet access) work but I'm a bit worried.

    But the dangers of allowing a kid unsecured internet access are very, very real.

    Recently, I upgraded the firmware on our hardware firewall (I'm kind-of a nut about keeping stuff up-to-date).  Unfortunately, that means that my filters were erased, and thus the kids computers had access to the internet.

    Daniel didn't figure it out, but Sharron noticed it.  And she did what a 10 year old would do - she started surfing the net.

    She found all sorts of cool game sites on the net, and clicked on a bunch of surveys, downloaded the free icons for emails, etc.

    About 2 days later, Valorie noticed that Sharron was doing this and let me know about it - I fixed the filter issue immediately and downloaded the Antispyware beta.  It found 4 different pieces of spyware installed on her machine, and I scrubbed them off.

    Sharron was quite embarrassed about this, and she clearly had learned her lesson.  Problem solved, right?

    Well, no.

    We recently got our phone bill.  Valorie noticed it was about $30 higher than usual, so she went digging.  On the bill, there was a charge:

    "Billing for ILD Teleservices
    The following charges appear on your Verizon bill as a service to ILD Teleservices.
    Direct your billing questions to the phone number on the right.
    Billing on behalf of Member's Edge, LLC.
    NOTICE: New Service Provider This Month
      Billing Questions call 1 800 433-4518"

    Following that were two charges for "Members Edge EMail" - one for setup and another for a monthly fee

    I'm actually happy with Verizon for pointing out that this was a new charge, and for giving us the billing dispute number - that was pretty cool.

    So yesterday I called the number listed above.  They asked me for my phone number, and then the service rep and I had a conversation:

    Service Rep: Who is "Sharron Osterman"?
    Me: She's my 10 year old daughter.
    Service Rep: Well, she filled out an online survey and said that it was ok to bill this number for the charges.
    Me: Oh.
    Service Rep: You said she was 10?
    Me (letting a bit of the ire sneak into my voice): Yes.
    Service Rep (realizing what they'd done): Oh.
    Service Rep:  I'll cancel the account right away.  We'll process a refund for this and it'll be credited to your account in one to two billing cycles.
    Me: Thank you.

    I've got to say that ILD Teleservices was quite professional and they didn't give me any hassle.  Of course the fact that they may have violated COPPA may have had something to do with their contrition.

  • Larry Osterman's WebLog

    What's wrong with this code, part 10 - the answers


    Yesterday's article was a bit of a trick question, but was a real world example.  Our group encountered this in some code we were testing last week (in some pre-production code - it was not part of any product).

    Somewhat surprisingly, it turns out that the code, as written, wasn't incorrect.  The hr = ERROR_NOT_SUPPORTED line was in fact correct - the code was using the fact that it was a success error code as a signal to a higher level component.  As many of you pointed out, that looked like an absolute error, and I was bitten by it as well: I was making an unrelated modification to the routine and noticed the hr = ERROR_NOT_SUPPORTED; expression and fixed it to be hr = HRESULT_FROM_WIN32(ERROR_NOT_SUPPORTED);

    But that turned out to be a HORRIBLE mistake.  Even though the code had passed my unit tests, and several other developers had tested the code, one of our testers encountered a fairly reliable test failure on his machines.  One of the developers in our group (after spending a long day debugging) finally chased the problem down to this change.  It turns out that on certain audio cards, PerformAnOperation() was returning a failure (but only on those audio cards).

    And my change propagated a failure code out of the routine (which used to be a SUCCESS return, with an informational error code).  And, because I'd changed the SUCCESS error code to a FAILED error code, the caller of this routine didn't handle the error correctly.

    So what was wrong?  The mistake was that the semantics associated with returning a success return code was not called out anywhere in the code.  The code was correct (or rather, functioning as intended), but some of the assumptions associated with the code were not documented.  So someone (me) came along later and "fixed" the code and thus introduced a bug.  Correcting the function was as simple as documenting the assumption...

    // ----------------------------------------------------------------------
    // Function:
    // CThing1::OnSomethingHappening()
    // Description:
    // Called when something happens
    // Return:
    // S_OK if successful.  If an error occurs while performing an operation, returns a success
    // error code with the error number set to ERROR_NOT_SUPPORTED
    // ----------------------------------------------------------------------
    HRESULT CThing1::OnSomethingHappening()
        HRESULT hr;

        <Do Some Stuff>
        // Perform some operation...
        hr = PerformAnOperation();
        if (FAILED(hr))
            hr = ERROR_NOT_SUPPORTED;  // Swallow the error from PerformAnOperation and return an indicator to the caller.
        IF_FAILED_JUMP(hr, Error);

        return hr;

        goto Exit;

    If that had been done, it would have saved us a great deal of pain.

    And, as I mentioned above, this was NOT a problem in any production systems.  And we're reviewing our unit tests to see how we can improve them to ensure that problems like this issue get caught even earlier.

    Ok, Kudos and comments...

    Michael Ruck spotted the root cause of the error: That the failure was not specified, and thus the next poor schlump (me) coming along into the code fixes what appears to be a simple bug and thus breaks something.

    Some other comments: Skywing was, of course the first person to notice the ERROR_NOT_SUPPORTED (he/she was also the first person to post) - I sort-of used him as bait because I intentionally didn't comment that the hr = ERROR_NOT_SUPPORTED was not an error - upon reflection, I realize that that was a mistake.

    A number of people were concerned about the IF_FAILED_JUMP macro, that's just a convention we use in my group, I should have edited it out since it complicated the problem, that was my bad.

    Another group of people were concerned about the goto Error/goto Exit paradigm.  This is actually a convention I picked up several years ago - you move all the error handling to the end of the function (out-of-line) and then jump back to a common "cleanup" or "exit" label.  That allows the "normal" case to be in-line and isolates the error handling logic to one location.

    Skywing also pointed out that returning ERROR_NOT_SUPPORTED STILL isn't correct, the facility should have been set to FACILITY_WIN32, he's absolutely right.

    A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information.

    And again, several people mentioned throwing exceptions, the code's neutral w.r.t. exception handling (although I disagree with the whole "throw an exception to return an error" paradigm - that's a different issue).


  • Larry Osterman's WebLog

    How to destroy the earth


    I've seen this twice so far this morning, and it'll get a lot more play RSN, but I'm going to violate my normal "don't repeat a meme" policy because this is just too darned cool...


    Several places have pointed out this site which gives detailed instructions on how to destroy the earth.  Very fun reading.

    From the introduction:

    You've seen the action movies where the bad guy threatens to destroy the Earth. You've heard people on the news claiming that the next nuclear war or cutting down rainforests or persisting in releasing hideous quantities of pollution into the atmosphere threatens to end the world.


    The Earth was built to last. It is a 4,550,000,000-year-old, 5,973,600,000,000,000,000,000-tonne ball of iron. It has taken more devastating asteroid hits in its lifetime than you've had hot dinners, and lo, it still orbits merrily. So my first piece of advice to you, dear would-be Earth-destroyer, is: do NOT think this will be easy.

    This is not a guide for wusses whose aim is merely to wipe out humanity. I (Sam Hughes) can in no way guarantee the complete extinction of the human race via any of these methods, real or imaginary. Humanity is wily and resourceful, and many of the methods outlined below will take many years to even become available, let alone implement, by which time mankind may well have spread to other planets; indeed, other star systems. If total human genocide is your ultimate goal, you are reading the wrong document. There are far more efficient ways of doing this, many which are available and feasible RIGHT NOW. Nor is this a guide for those wanting to annihilate everything from single-celled life upwards, render Earth uninhabitable or simply conquer it. These are trivial goals in comparison.

    This is a guide for those who do not want the Earth to be there anymore.

  • Larry Osterman's WebLog

    What's wrong with this code, part 10

    Ok, time for another "what's wrong with this code".  This one's trivial from a code standpoint, but it's tricky...

    // ----------------------------------------------------------------------
    // Function:
    // CThing1::OnSomethingHappening()
    // Description:
    // Called when something happens
    // Return:
    // S_OK if successful
    // ----------------------------------------------------------------------
    HRESULT CThing1::OnSomethingHappening()
        HRESULT hr;

        <Do Some Stuff>
        // Perform some operation...
        hr = PerformAnOperation();
        if (FAILED(hr))
            hr = ERROR_NOT_SUPPORTED;
        IF_FAILED_JUMP(hr, Error);

        return hr;

        goto Exit;

    Not much code, no?  So what's wrong with it?

    As usual, answers and kudos tomorrow.

  • Larry Osterman's WebLog

    NPR Has a FASCINATING profile of Donald Knuth this morning...


    I'm listening to NPR right now (getting ready for work/school) and I realized they had an article on Knuth.

    Very cool, and worth listening.

    And he's just about done with Volume 4!

  • Larry Osterman's WebLog

    Microsoft's director of Shared Source is now blogging!


    Busy weekend, I'm about to go out-of-town to a seminar with Daniel, and Sharron's got another riding competition (the first one I won't be able to see :().  So a quick note instead of an article


    I just noticed (via Rick Schaut's blog) that Jason Matusow, Microsoft's director of the Shared Source program is blogging!

    As Rick points out, his first post is a doozy.


  • Larry Osterman's WebLog

    Concurrency, Part 15 - Wrapping it all up.

    Today I want to wrap up my concurrency series (finally).  There are some more topics I'll be covering in the future (like the debugging concurrency issues I talked about yesterday :)), but I think I've said a reasonable amount about the issue (and frankly, I'd like to go on to other topics).

    The most important thing to realize about concurrency is that while programming for concurrency is harder than programming in a single threaded environment, if you follow a relatively straightforward set of rules, it's not that much harder.

    My series started off with a discussion about concurrency and what concurrency is.

    I next introduced my principles of concurrent programming:

    1: If your data is never accessed on more than one thread, then you don't have to worry about concurrency.

    2: Critical sections can be your best friends (unless they're your worst enemy).

    3: Know your lock order and never, ever violate it.

    4: Don't call into other objects with locks being held, and make sure that you reference count all your objects.

    5: Reference counting is hard if you're not really careful

    Next, I talked about some of the reasons for introducing concurrency into your application as an introduction to discussing scalability

    Finally, I discussed what it means to use concurrency as a mechanism of achieving application scalability, and I talked about some of the APIs that are useful when writing scalable applications.

    And then I talked a bit about determining when you've got a CPU based scalability issue.

    Finally, I spent two articles talking about hidden scalability issues - what happens when components out of your control have scalability issues, and what happens when your code collides with the design of the computer itself to cause scalability issues.

    Those articles led to my final principle: If you're looking to concurrency to make your application scalable, you need to be really, really smart - there are bottlenecks in places you didn't think about.

    I also included a short article about the CLR and concurrency, and some other odds and ends.

    Other resources I've come up with over the course of this series:

    Eric Lippert had a great example of how you can get concurrency bottlenecks here:

    There's an ancient, but awesome (and still relevant) article on MSDN written by John Vert here:

    Jeff Parker pointed out the following posts about the CLR's threading model by Rick Brewster: Article 1 and Article 2.

  • Larry Osterman's WebLog

    Concurrency, Part 14 - Odds and Ends

    Ok, the light's at the end of the tunnel.  I couldn't figure out where to slot this in, so it got stuck at the end...

    Someone (I'm forgetting who) asked about what happens when someone's producing data to be handed to a pool of threads.  The simplest example of this is a network packet handler handing messages to worker functions where the worker functions run on different threads. 

    So the packet handler takes requests from the network, validates the message, and sticks it into a worker thread pool.  The worker threads pick up the items and process them.

    This works great, until you run into protocol elements that have ordering requirements.  For example, consider an abstract network filesystem protocol.  The protocol has verbs like "open", "read", "write" and "close" and when an open request is received, the worker thread handler for the "open" verb opens a local file.  Similarly, the "read" and "write" verb handlers will read and write from the file handle.  For efficiencies sake, the protocol also supports combining these messages into a single network message (protocols do exist that have these semantics, (trust me :)))

    In this hypothetical example, when the protocol handler receives one of these combined requests, it breaks up these combined high level messages into low level messages, and drops them into the respective lower level protocol handlers.

    Now what happens when an "Open&Read" request is received.  The open request is dispatched to worker thread 1.  And the read request is dispatched to worker thread 2.

    And the scheduler decides to schedule thread 2 before thread 1.


    So when you're dealing with this kind of situation, you need to ensure that your architecture handles it reasonably.  But this can be tricky, because you don't want to hurt scalability just because of this one corner case.  There are lots of solutions, the first that came to mind is to have a manual reset event associated with the file and when the file was opened, set the event to the signaled state.  Before each read or write (or close) wait on the event with an infinite timeout.  But this penalizes all read and write operations after the open - they've got to make a system call (even if its a relatively inexpensive system call) for each operation after the open's succeeded.

    Another solution would be to add the ability to queue requests to other requests, and when a request completed queue the chained requests to the worker thread pool.  This has some real advantages (and would be the solution I'd chose) because it allows chaining of arbitrary operations.  For example, you could chain an open, read, and close into a single packet - chain the close to the read, and the read to the open.  Or if you had open, read, write, read, you could chain the read and write to the open, and re-queue both the read and write once the open completes.

    Tomorrow, I want to talk about debugging concurrency issues, then it'll be wrap-up time :)

  • Larry Osterman's WebLog

    Concurrency, Part 13 - Concurrency and the CLR

    I think I've got two more articles in this series left, I didn't really intend for this to overwhelm my blog for an entire month, but it sort-of got out of hand - I'd originally planned for a total of 6 topics and here I am on the lucky 13th.

    Anyway, all throughout this series, people have been asking for me to write about how the CLR handles concurrency.  This is a bit sticky for me, since I'm not a CLR expert by any stretch of the imagination.  So I can't write authoritatively on the subject, which puts me in a rather uncomfortable state.  Instead I'll point to Rico Mariani's blog for general purpose CLR performance tips - he's a developer on the CLR perf team, and he DOES know what he's talking about.

    On the other hand, I can talk a bit about some of the features that the CLR provides for enabling concurrent programming, although what I write won't be very insightful :(.

    The first (and biggest) thing that the CLR provides is automatic lifetime management of resources.  So if you're using the CLR, all the issues I pointed out in part 6 aren't relevant - the CLR manages all the reference counting for your objects, so you don't have to.  The CLR also manages heap lifetime, and it does a pretty good job of it - Rico's got a bunch of articles that pretty clearly demonstrate that the CLR's garbage collector performance is going to be better than anything that you're likely to come up with.  It also resolves the heap scalability issues I discussed in my previous articles.

    The .Net framework also has support for threads, although the CLR threading model is subtly different from the Win32 threading model (no, I don't know what the differences are, I just know they're subtly different).  The easiest place to see the differences in the threading model is to consider the three different timers available in the CLR (System.Timers, System.Threading.Timers, and System.Windows.Forms.Timers.  This MSDN article spends a bit of time talking about the differences.  The biggest difference between the various timers is that the "server timers" as exposed by the System.Timers class are intended to be used in a high performance multithreaded environment (like under IIS).  System.Timers timers can run on any one of a set of threads.  System.Threading.Timers, on the other hand are effectively a wrapper around the Win32 timer queue APIs.  And, of course, System.Windows.Forms.Timers are a wrapper around the User32 window message based timers (WM_TIMER).

    Most of the concurrency support in the CLR is provided by the System.Threading namespace, it provides semantics for things like interlocked operations, thread creation and destruction, reader/writer locks and monitors (monitors are closely linked to Win32 critical sections, reader/writer locks are not provided by Win32 directly).

    The CLR supports monitors instead of the more traditional Win32 critical sections because in many ways, they're more flexible than Win32 critical sections - a monitor can be attached to any variable and can serialize access to that variable.  C# actually promotes the use of monitors as first class language features via the "lock" statement - under the covers, the lock statement creates a monitor object for the locked variable, and enters the monitor.  You can see this in the language specification definition of lock.

    Reader/Writer locks are actually fascinating beasts.  They are used to handle the multiple consumers single producer design pattern - you often see this with databases, for example - there are hundreds of clients reading the data in the database, but only one client writing data into it.  The reader/writer lock provides support for this pattern, it can be extremely useful.  The biggest problem with reader/writer locks is that they're very prone to deadlocks and resource starvation issues - for instance multiple readers can starve writers if the author of the reader/writer lock isn't very careful.

  • Larry Osterman's WebLog

    Concurrency, part 12 - Hidden scalability issues, part 2


    Last time, I mentioned that even when you get everything right, you can STILL have scalability issues...  To explain what's going on, a digression.

    One of the major features of Exchange 2000 was the ability for the store to access multiple databases (before, the store could only access one database at a time).

    My dev lead and I had a long series of discussions about whether it would be better to implement multiple databases by running multiple store.exe processes with one database loaded into each process, or to do the work to load multiple databases into a single store.exe process.

    Our concern with the multiple process solution was that the context switch overhead of switching from one process to another would overwhelm the performance of the system (since more work is involved in switching from one process to another than is involved in switching from one thread to another).

    To resolve the discussion, I wrote a small test application.  It would spawn 100 threads doing some "work" (where "work" was defined as incrementing a counter as many times as was possible during a 5 minute period).  It would either do the work by spawning 100 threads in one process or by spawning 10 processes with 10 threads in each process.

    So in both cases, 100 threads would be created, the only difference would be how many processes would be used.  Our belief was that the 10 process scenario would be slower than the 1 process (due to the address space switch issue).

    Well, I ran the test, and discovered, much to my surprise the multithreaded version was significantly slower than the multiprocess version.

    This was utterly counter-intuitive, so I asked the base team what was going on.

    They asked to look at my code, and immediately spotted the problem.

    You see, in order to avoid the CPU lock issues associated with InterlockedIncrement (because I knew it would mess the numbers up), I allocated an array with one DWORD for each thread.  Each thread incremented its value (once again, using my first principle (hah, I got it right :))), and after the 5 minute period had expired, the main thread of the application would add up all the values in the array.  For the multiprocess version, each child process wrote to a shared memory region which was again summed by the main thread.

    At this point, the people who have worked on scalability are all smirking at my utterly boneheaded mistake.

    You see, my mistake was to assume that there would be no side-effects associated with dedicating an entry in the array for each thread.  I assumed that each thread could read and write to its dedicated slot with impunity.

    And on modern processors, this simply isn't true. 

    You see, on modern computers, the system memory is significantly slower than the processor - while the CPU might be running at 3GHz, the memory bus might only be running at 1GHz or so... But computers spend a LOT of time accessing memory.  If they had to spend all their time accessing system RAM, then the performance of the processor would be massively throttled. 

    To resolve this, the processors have several levels of cache.  The first level (cleverly named L1) of cache physically sits on the chip.  The level 1 cache is typically fairly small (for the Pentium Pro, it was 8K, for the Pentium M, it's 32K (for the P4, they've changed the architecture and replaced the L1 cache with a 16K trace cache)).  The thing about L1 cache is that it's REALLY, REALLY fast - the L1 cache operates at full CPU speed.  Usually, the L1 cache is per-cpu core, there are no cache coherency issues to deal with (principle 1 - if it's your data, you don't have to protect it from anyone else)

    But the L1 cache is really intended for predictive branching, etc - it's really small and blindingly fast because it's about pulling instructions into the processors core.  There's a need for a second, higher level cache.  This higher level cache is cleverly named the level 2 cache, or L2.

    Now, the L2 cache is MUCH bigger than the L1 cache - it's typically hundreds of kilobytes in size (the P4 extreme edition has 2M of L2 cache, for example).  But the Level2 cache may be shared between multiple CPU cores (this depends on the processor configuration).  As I mentioned before, if you have a resource that's shared, you need to have some kind of locks to protect that resource.  And the CPU is no different from any other system.

    Internally, the x86 cpu cache divides memory up into "cache lines", or cache blocks (the cache line size varies by CPU).  When you perform an interlocked operation, it locks the cache line, writes the memory, and releases the cache line.

    So in my little test application, what was really happening was that my application was thrashing the CPU cache - on the machine I was testing, the cache line was 32 bytes in size, so each block of 8 threads was contending for the same cache line, so they were effectively serialized by the processor.    On the multi-process version, instead of having one 400 byte array (100 threads, 4 bytes/thread) occupying 13 cache lines, each process had a 40 byte array (10 threads, 4 bytes/thread) occupying two.  In the single process version, there were 96 threads, each contending for 12 cache lines and 4 threads contending for the 13th. And in the multi-process version, there were 8 threads contending for one cache line, but only two threads contending for the other cache line (aggregated over the 10 processes, there were 80 threads contending for 10 cache lines, and 20 threads contending for 10 cache lines).

    So if you look at this, the cache line thrashing was less on the multi-process version  of the test (although it still existed).

    When I rewrote the application to remove the cache line thrashing, the differences between the multi-thread and multi-process version of the test application totally disappeared - the multi-process version had exactly the same throughput as the multi-thread version.

    Of course, this was a stupid example, with an unreasonable work load, it was totally context switch bounded (because I had way more threads than CPUs), etc.  But fundamentally, the measurement paradigm was sound - since the aggregate number of threads in both cases was the same, the only difference would be the cost of switching address spaces, and that turned out to be negligible.

    So, the next principle of concurrency: "If you're looking to concurrency to make your application scalable, you need to be really, really smart - there are bottlenecks in places you didn't think about".

    Edit: megabytes->kilobytes, multithreaded->multiprocess in example.


  • Larry Osterman's WebLog

    Concurrency, part 11 - Hidden scalability issues

    So you're writing a server.  You've done your research, and you've designed your system to be as scalable as you possibly can.

    All your linked lists are interlocked lists, you're app uses only one thread per CPU core, you're using fibers to manage your scheduling so that you make full use of your quanta, you've set your thread's processor affinity so that it's locked to a single CPU core, etc.

    So you're done, right?

    Well, no.  The odds are pretty good that you've STILL got concurrency issues.  But they were hidden from you because the concurrency issues aren't in your application, they're elsewhere in the system.

    This is what makes programming for scalability SO darned hard.

    So here are some of the common issues where scalability issues are hidden.

    The biggest one (from my standpoint, although the relevant people on the base team get on my case whenever I mention it) is the NT heap manager.  When you create a heap with HeapCreate, unless you specify the HEAP_NO_SERIALIZE flag, the heap will have a critical section associated with it (and the process heap is a serialized heap).

    What this means is that every time you call LocalAlloc() (or HeapAlloc, or HeapFree, or any other heap APIs), you're entering a critical section.  If your application performs a large number of allocations, then you're going to be acquiring and releasing this critical section a LOT.  It turns out that this single critical section can quickly become the hottest critical section in your process.   And the consequences of this can be absolutely huge.  When I accidentally checked in a change to the Exchange store's heap manager that reduced the number of heaps used by the Exchange store from 5 to 1, the overall performance of the store dropped by 15%.  That 15% reduction in performance was directly caused by serialization on the heap critical section.

    The good news is that the base team knows that this is a big deal, and they've done a huge amount of work to reduce contentions on the heap.   For Windows Server 2003, the base team added support for the "low fragmentation heap", which can be enabled with a call to HeapSetInformation.  One of the benefits of switching to the low fragmentation heap (along with the obvious benefit of reducing heap fragmentation) is that the LFH is significantly more scalable than the base heap.

    And there are other sources of contention that can occur below your application.  In fact, many of the base system services have internal locks and synchronization structures that could cause your application to block - for instance, if you didn't open your file handles for overlapped I/O, then the I/O subsystem acquires an auto-reset event across all file operations on the file.  This is done entirely under the covers, but can potentially cause scalability issues.

    And there are scalability issues that come from physics as well.  For example, yesterday, Jeff Parker asked about ripping CDs from Windows Media Player.  It turns out that there's no point in dedicating more than one thread to reading data from the CD, because the CDROM drive has only one head - it can't read from two locations simultaneously (and on CDROM drives, head motion is particularly expensive).  The same laws of physics hold true for all physical media - I touched on this in the answers to the Whats wrong with this code, part 9 post - you can't speed up hard disk copies by throwing more threads or overlapped I/O at the problem, because file copy speed is ultimately limited by the physical speed of the underlying media - and with only one spindle, it can only read or write to the drive one operation at a time.

    But even if you've identified all the bottlenecks in your application, and added disks to ensure that your I/O is as fast as possible, there STILL may be bottlenecks that you've not yet seen.

    Next time, I'll talk about those bottlenecks...

  • Larry Osterman's WebLog

    Concurrency, Part 10 - How do you know if you've got a scalability issue?

    Well, the concurrency series is finally running down (phew, it's a lot longer than I expected it to be)...

    Today's article is about determining how you know if you've got a scalability problem.

    First, a general principle: All non trivial, long lived applications have scalability problems.  It's possible that the scalability issues don't matter to your application.  For example, if you application is Microsoft Word (or mIRC, or Firefox, or just about any other application that interacts with the user)) then scalability isn't likely to be an issue for your application - the reality is that the user isn't going to try to make your application faster by throwing more resources at the application.

    As I write wrote the previous paragraph, I just realized that it describes the heart of scalability issues - if the user of your application feels it's necessary to throw more resources at your application, then your application needs to have to worry about scalability.  It doesn't matter if the resources being thrown at your application are disk drives, memory, CPUs, GPUs, blades, or entire computers, if the user decides that hat your system is bottlenecked on a resource, they're going to try to throw more of that resource at your application to make it run faster.  And that means that your application needs to be prepared to handle it.

    Normally, these issues are only for server applications living in data farms, but we're starting to see the "throw more hardware at it" idea trickle down into the home space.  As usual, the gaming community is leading the way - the AlienWare SLI machines are a great example of this - to improve your 3d graphics performance, simply throw more GPUs at the problem.

    I'm not going to go into diagnosing bottlenecks in general, there are loads of resources available on the web for it (my first Google hit on was this web cast from 2003).

    But for diagnosing CPU bottlenecks related to concurrency issues, there's actually a relatively straightforward way of determining if you've got a scalability issue associated with your locks.  And that's to look at the "Context Switches/sec" perfmon counter.  There's an article on how to measure this in the Windows 2000 resource kit here, so I won't go into the details, but in a nutshell, you start the perfmon application, select all the threads in your application, and look at the context switches/sec for each thread.

    You've got a scalability problem related to your locks if the context switches/second is somewhere above 2000 or so.

    And that means you need to dig into your code to find the "hot" critical sections.  The good news is that it's not usually to hard to detect which critical section is "hot" - hook a debugger up to your application, start your stress and put a breakpoint in the ntdll!RtlEnterCriticalSection routine.  You'll get a crazy number of hits, but if you look at your call stacks, then the "hot" critical will start to show up.  It sounds tedious (and it is somewhat) but it is surprisingly effective.   There are other techniques for detecting the "hot" critical sections in your process but they are not guaranteed to work on all releases on Windows (and will make Raymond Chen very, very upset if you use them).

    Sometimes, your CPU bottleneck is simply that you're doing too much work on a single thread - if it simply takes too much time to calculate something, then you need to start seeing if it's possible to parallelize your code - you're back in the realm of making your code go faster and out of the realm of concurrent programming.  Another option that you might have is the OpenMP language extensions for C and C++ that allow the compiler to start parallelizing your code for you.

    But even if you do all that and ensure that your code is bottleneck free, you still can have scalability issues.  That's for tomorrow.

    Edit: Fixed spelling mistakes.


  • Larry Osterman's WebLog

    Concurrency, Part 9 - APIs that enable scalable programming

    Sorry, I just noticed that this didn't get posted yesterday - I don't know what happened, but...

    As I mentioned the other day, Win32 provides a number of tools for writing scalable applications.

    Today I want to catalog some of them.  Remember, that the #1 goal you have when attempting to design MP scalable applications is to avoid blocking your thread.  You absolutely never, ever want to let your thread go through the CPU scheduler, because that means that your processor is spending time switching threads when it could be doing work.  So what are some of the tools that Win32 provides for this?

    First off, the two I mentioned yesterday, Fibers and Interlocked operations.  Fibers allow an application to switch between tasks without exhausting the currently running threads' quantum, but their cost can be quite high in terms of complexity - you essentially have to write your own scheduler to effectively use them.

    The Interlocked functions, however are really, really neat.  You can't really appreciate the power of the APIs from their description.

    Interlocked Increment and Decrement are straightforward - they increment and decrement a long, returning the new value of the variable.

    InterlockedExchangeAdd is again straightforward - it adds (or subtracts) from a 32bit value.  Again, straightforward.  So is InterlockedExchangePointer - it swaps two pointer values.

    The magic of the interlocked functions starts with InterlockedCompareExchangePointer - It turns out that if you're clever, you can use InterlockedCompareExchangePointer to implement a LIFO queue that doesn't require any external synchronization structures.  I'd write down how to make this work, but the good news is that I don't have to, because the NT base team formalized it with the Interlocked Singly Linked List primitives.  These allow you to add and remove entries to a list, again without requiring a critical section.

    The next tools that are available for scalability are, somewhat paradoxically, the Win32 critical section structures.  It turns out that there are two big issues with critical sections.  One is that critical sections can block on contention.  The other is that if you use a critical section to protect a commonly accessed variable (like the head of your work queue), a phenomenon known as a lock convoy.  Lock convoys are horribly bad when you're designing an application to be scalable, because all your threads eventually wind up being serialized on a single critical section.  So a great deal of effort has been put into reducing the likelihood of lock convoys in critical sections.  The first of the changes made (for NT4 SP3 (and Windows 98+)) was to add the InitializeCriticalSectionAndSpinCount API.  This API allows the user to specify a "spin count" for critical sections - essentially, when you attempt to enter the critical section, if the critical section is owned by another thread, instead of simply blocking (and throwing away your quantum), the EnterCriticalSection API will "spin" trying repeatedly to acquire the critical section.  So if your critical section is "hot" (meaning it's a high contention critical section, and the critical section is held for a very short period of time), then the odds are very good that the critical section will be released while the other thread is spinning.  And that means that the thread that's trying to grab the critical section won't have to give up its quantum.

    Now ICSASC isn't a panacea - if you own the critical section for any length of time, then spinning on contention just wastes time.  But for some critical sections, it can be a huge win.

    In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended tor resolve many of the lock convoy issues.  Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first -in, first-out) basis.  Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads.  If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics.

    As I mentioned in my "So you want a worker thread pool" article, another incredibly useful way of achieving scalability is to use an I/O completion port for a work queue - because the completion port queue is protected by the same kernel structure that protects the scheduler in NT, if there are any work items on the queue, your thread won't block when it calls GetQueuedCompletionStatus.

    And, in addition to all these, there's one other facility - Process Affinity.  The SetProcessAffinityMask, SetThreadAffinityMask and SetThreadIdealProcessor APIs can be used to "pin" a thread to a CPU.  Otherwise the thread will be run on whichever processor is available.  Switching a thread from one processor to another can have serious negative consequences, since there is internal per-CPU state that gets tossed when a thread is moved.

    There may be other APIs and facilities for improving the scalability of your application in Win32, but (honestly) they're not coming to mind immediately :)

    Tomorrow, I want to talk about a couple of the real-world problems that you start encountering when you build highly scalable applications.

    Edit: Added a paragraph about the Win2003 SP1 EnterCriticalSection behavior changes (I just got approval to post that behavior change).

    Edit2: Fixed InterlockedIncrement/Decrement description, thanks Dave

Page 1 of 1 (20 items)