Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

Riffing on Raymond - When caches hurt

Riffing on Raymond - When caches hurt

  • Comments 12

On Tuesday, Raymond posted "A cache with a bad policy is another name for a memory leak".

Of course that immediately reminded me of a war story from back in the days of Exchange 5.0.  One of the major features of Exchange 5.0 was an NNTP server. 

NNTP (and POP3) are rather interesting protocols.  In general, clients don't tend to stay connected to the server for a long time, instead they tend to connect, authenticate, perform a couple of operations nd terminate the connection.  Many protocols have similar usage patterns (HTTP comes to mind immediately). 

Of course, when you're building a server, you need to do performance analysis to determine how fast the server can go (or in the case of an email/news server, how many clients it can support).  For the purposes of performance analysis, we measure requests in terms of "transactions".  For the POP3 or NNTP server, each of these connect/authenticate/operations/terminate sequences was is considered a "transaction".  By measuring the amount of time it takes for each of these transactions to execute, you can easily determine the scalability of your server.  If each transaction takes <n> milliseconds, then you can only perform 1000/<n> transactions per second.  Obviously if you can perform <m> transactions at the same time, then the number of transactions per second becomes (1000/<n>)*<m>.  But <m> is typically a relatively small number (if you're CPU bound, then <m> is typically the number of CPUs on your machine, for instance).  Calculating <m> can be quite complicated, and is way out of scope for this post.

Obviously if you increase <m> or decease <n>, you'll improve your performance, and when doing perf analysis, you do both.  When we started breaking down <n>, we realized that there were a couple of huge items that were effecting <n>.  The first cost was authentication, the other was reading information about the newsgroup from the Exchange DS (that was required during the "operations" step).  The combined cost of these two items was about 500 milliseconds of the 510 total cost per transaction.

There were clear benefits here, so I built two separate caches, one for the directory lookups and the other for the authentications.  Both worked flawlessly, and our performance jumped dramatically.

Things went great until we started getting client stress tests run against the NNTP server.  All of a sudden, our testers started reporting that the NNTP server was leaking memory like crazy.  It was consuming hundreds of megabytes (on a machine with 64M of physical memory), and the machine was essentially unusable.

Not a problem, the Exchange store had extensive memory leak tracking logic, we just stopped the server to let the leak detection logic do its stuff and tell us where the leak was.

Of course you know what happened: The leak detection logic reported that there were no leaks.  So, being the typical arrogant developers that I am, I went back to the testers and said "There are no leaks, you must be dreaming".  Needless to say, that didn't go over too well, so I dug deeper.

I hooked a debugger up to the process and started poking randomly at allocated memory, and discovered that the DS cache was huge.  I dug a bit deeper and discovered that the problem was buried deep in the DS cache logic.  It turns out that the there was an uninitialized variable in the hash function that I was using in my cache, which caused all the entries in the DS cache to have their own buckets.

So yes, a cache with a bad policy (or a bad cache function) is undistinguishable from a memory leak.

 

  • "So, being the typical arrogant developers that I am"...

    Thats how you get so much done, you're several people!  Seriously though, I rather enjoy hearing these stories. They make me think about things to do, and to remember to be a bit more humble.

    -Jeff
  • Is there something in the water that makes developers think that "using a lot of memory" and "leaking memory" are the same thing?

    Also, when testers report a memory leak it's probably a good idea to ask them how they know that there's a memory leak. When they say that they see a huge number for memory usage in the task manager, then you would be able to work out that there's an alternative explanation.

    Sure, a broken cache is a bad thing, but it's still not actually a memory leak.
  • I don't get it: No, "using a lot of memory" is different from "leaking memory" is different from "using too much memory".

    But the second and third are bugs, and the first may or may not be a bug.

    On my Vista box, the indexing service uses about 100M of virtual memory (not surprising, I have many megabytes of email and a ton of documents and music on my hard disk).  But it's "using a lot of memory".  It's not "using too much memory" or "leaking memory".  In fact, the search service only uses 10M of working set - the remaining memory is allocated to hold indexes, but unless it's actively indexing my machine or being searched, those pages just take up space in the paging file - they don't affect my machine at all.
  • Until the issue is identified, there is no point even trying to characterize something as a memory leak or using too much memory.  As far as the user is concerned, he is out of memory and the machine is running very poorly.  It is a memory leak.

    It is like asking your grandmother talk to you in tech jargon when all she knows is that she can't see those nice needlepoint patterns on that site she likes.

    Larry's mistake was that he took what the user said too literally.  (Not trying to speak for Larry)  We all do this, some people more than others.
  • 1) If I drive a Hummer to work, I'm gonna use a lot of gas.  That's by design.

    2) If I drive a moped with a hole in the tank, I'll still end up using less gas, but it's a bug.  (That is, it can be patched.)

    3) If I take the engine out of my Hummer, put it on my moped, and expect it to hang together, that's user error.  (Don't try to run Exchange on a cell phone.)
  • > It turns out that the there was an uninitialized variable in
    > the hash function that I was using in my cache

    If your code was written in some language approximating and/or based on C or Pascal or Fortran or Basic or something along those lines, then even though compilers are not required[*] to warn you about bugs such as uninitialized variables, modern compilers generally do offer to warn you because modern compiler makers generally do want to help programmers fix programmers' bugs.

    Please tell us what compiler you were using, so that mortals among us can know what compiler to avoid.

    > It was consuming hundreds of megabytes (on a machine
    > with 64M of physical memory), and the machine was
    > essentially unusable.

    Hey, I had a machine just like that!  I solved it by reverting that machine to Windows 2000.  Then it only consumed 1 hundred of megabytes on a machine with 64M of physical memory.  I'm wondering whether to revert it to Windows 98 so it won't thrash the pagefile so much, but then I won't have any practical use for it in the kinds of experiments I do.  Sigh.

    Friday, May 05, 2006 12:19 PM by Maurits
    > 1) If I drive a Hummer to work, I'm gonna use a lot of gas.
    > That's by design.

    Yes, and it provides a good example of why bugs in design are as bad as bugs in execution.  Thank you.

    [* Exception, if I recall correctly:  Pascal compilers are required to offer a mode in which they will diagnose every violation of the standard, though of course arrogant programmers permanently disable such modes.]
  • Norman: If the cache was written in C++ (likely) and was moderately object-oriented, it's quite likely that the leak in question could have been caused by an uninitialised member variable.

    Most C++ compilers tend not to warn about uninitialised members in constructors, even if the normal "uninitialised variable" warnings are turned on. I figure this is probably because if you've got a moderately complex class with multiple c'tors (where this sort of warning is most needed) then I've found it quite common for a private "init()" function to do the actual initializing, and for each c'tor to call it to avoid code duplication.

    With this pattern, it's pretty much impossible for a compiler to figure out whether or not member vars are initialized, and most members will look like they're not giving you spurious compiler warnings. Which you'll want to turn off so you don't miss "real" errors.

    Just a thought.
  • >So yes, a cache with a bad policy (or a bad cache function) is undistinguishable from a memory leak.
    Until it is flushed, of course.
  • Yuhong,
     I believe that you're making the assumption that the purpose of the cache is to delay writes (I'm assuming that's the relevance of being flushed).

     In the example above, the cache is never flushed (until the service is stopped) because it was allowed to grow unbounded.

     All a cache is is a mechanism to avoid doing something expensive.
  • Monday, May 08, 2006 4:53 AM by Adam
    > If the cache was written in C++ (likely)

    Of course.  That is included in my phrase "approximating and/or based on C or [...]".

    > Most C++ compilers tend not to warn about uninitialised
    > members in constructors, even if the normal "uninitialised
    > variable" warnings are turned on.

    I hadn't thought about that.

    > I've found it quite common for a private "init()" function to
    > do the actual initializing, and for each c'tor to call it to avoid
    > code duplication.

    Of course.

    I must take your posting as a warning that I might have some uninitialized member variables in some of my own classes without ever having been warned about them.

    After thinking about it the way that you did (or at least the way that you persuaded me to ^_^), it does seem pretty hard for a compiler to guess which member functions will be called in what order.  A compiler won't know if some function will use a member variable that some other function "tried" to initialize but the caller "forgot" to call the other function first.
  • > Most C++ compilers tend not to warn about uninitialised
    > members in constructors, even if the normal "uninitialised
    > variable" warnings are turned on.
    Because uninitialised members are supposed to be zero by the standard.
    There are compilers (gcc versions comes in mind), that theated such variables as uninitialized. I don't know how is the state with Visual C++.
  • >>So yes, a cache with a bad policy (or a bad cache function) >is undistinguishable from a memory leak.
    >Until it is flushed, of course.
    Or otherwise emptyed.
Page 1 of 1 (12 items)