On Tuesday, Raymond posted "A
cache with a bad policy is another name for a memory leak".
Of course that immediately reminded me of a war story from back in the days
of Exchange 5.0. One of the major features of Exchange 5.0 was an NNTP server.
NNTP (and POP3) are rather interesting protocols. In general, clients
don't tend to stay connected to the server for a long time, instead they tend to
connect, authenticate, perform a couple of operations and terminate the
connection. Many protocols have similar usage patterns (HTTP comes to mind).
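For example, a typical NNTP transaction might look something like this (an illustrative transcript with made-up names; response codes are from RFC 977 and the RFC 2980 AUTHINFO extension):

```
S: 200 news.example.com NNTP server ready
C: AUTHINFO USER alice
S: 381 Password required
C: AUTHINFO PASS secret
S: 281 Authentication accepted
C: GROUP comp.lang.c
S: 211 1234 3000 4233 comp.lang.c
C: ARTICLE 4233
S: 220 4233 <id@example.com> article follows
   ...
C: QUIT
S: 205 Goodbye
```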
Of course, when you're building a server, you need to do performance analysis
to determine how fast the server can go (or in the case of an email/news server,
how many clients it can support). For the purposes of performance
analysis, we measure requests in terms of "transactions". For the POP3 or
NNTP server, each of these connect/authenticate/operations/terminate sequences
is considered a "transaction". By measuring the amount of time it
takes for each of these transactions to execute, you can easily determine the
scalability of your server. If each transaction takes <n> milliseconds,
then you can only perform 1000/<n> transactions per second. Obviously if
you can perform <m> transactions at the same time, then the number of
transactions per second becomes (1000/<n>)*<m>. But <m> is typically a
relatively small number (if you're CPU bound, then <m> is typically the number
of CPUs on your machine, for instance). Calculating <m> can be quite
complicated, and is way out of scope for this post.
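The arithmetic above can be captured in a couple of lines (a sketch of my own; `n_ms` and `m` correspond to the <n> and <m> in the text):

```python
def transactions_per_second(n_ms: float, m: int = 1) -> float:
    """Throughput given a per-transaction latency of n_ms milliseconds
    and m transactions executing concurrently: (1000 / n) * m."""
    return (1000.0 / n_ms) * m

# e.g. 50 ms per transaction with 4 concurrent transactions:
rate = transactions_per_second(50, 4)  # 80 transactions/second
```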
Obviously if you increase <m> or decrease <n>, you'll improve your
performance, and when doing perf analysis, you do both. When we started
breaking down <n>, we realized that there were a couple of huge items that were
affecting <n>. The first cost was authentication, the other was reading
information about the newsgroup from the Exchange DS (that was required during
the "operations" step). The combined cost of these two items was about 500
milliseconds of the 510 total cost per transaction.
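Plugging the post's numbers into the transactions-per-second arithmetic shows why those two items were worth attacking:

```python
# Numbers from the post: ~510 ms per transaction, of which ~500 ms was
# authentication plus the DS lookup.
n_before = 510            # ms per transaction, uncached
n_after = 510 - 500       # ms per transaction once both results are cached

tps_before = 1000 / n_before   # roughly 2 transactions/second
tps_after = 1000 / n_after     # 100 transactions/second
```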
There were clear benefits here, so I built two separate caches, one for the
directory lookups and the other for the authentications. Both worked
flawlessly, and our performance jumped dramatically.
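Neither cache's internals appear in this post, but a minimal version of such a lookup cache might look like the sketch below. The class name and size limit are my own; the point is that a bounded eviction policy is exactly the "policy" Raymond's quote is about:

```python
from collections import OrderedDict

class BoundedCache:
    """Hypothetical sketch of a lookup cache with an LRU eviction
    policy; the bound keeps it from growing into a memory leak."""

    def __init__(self, max_entries: int = 1024):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)   # mark as most recently used
            return self._data[key]
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```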
Things went great until we started getting client stress tests run against
the NNTP server. All of a sudden, our testers started reporting that the
NNTP server was leaking memory like crazy. It was consuming hundreds of
megabytes (on a machine with 64M of physical memory), and the machine was
paging itself to death.
Not a problem: the Exchange store had extensive memory leak tracking logic,
so we just stopped the server to let the leak detection logic do its stuff and
tell us where the leak was.
Of course you know what happened: The leak detection logic reported that
there were no leaks. So, being the typical arrogant developer that I am,
I went back to the testers and said "There are no leaks, you must be dreaming".
Needless to say, that didn't go over too well, so I dug deeper.
I hooked a debugger up to the process and started poking randomly at
allocated memory, and discovered that the DS cache was huge. I dug a bit
deeper and discovered that the problem was buried deep in the DS cache logic.
It turns out that there was an uninitialized variable in the hash function
that I was using in my cache, which caused every entry in the DS cache to
land in its own bucket.
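Here's a sketch of that failure mode (a hypothetical reconstruction; the real Exchange DS cache code is not shown here). The hash function mixes in a value that was never initialized, simulated below with random garbage, so the same key hashes differently on every call: every lookup misses, every miss inserts a duplicate entry, and the "cache" grows without bound:

```python
import random

NBUCKETS = 1 << 16

class LeakyCache:
    """Cache whose hash function reads an uninitialized variable."""

    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]
        self.entries = 0

    def _bad_hash(self, key: str) -> int:
        h = random.getrandbits(32)  # stands in for uninitialized stack garbage
        for ch in key:
            h = (h * 31 + ord(ch)) & 0xFFFFFFFF
        return h % NBUCKETS

    def lookup_or_insert(self, key, value):
        bucket = self.buckets[self._bad_hash(key)]
        for k, v in bucket:
            if k == key:
                return v             # almost never reached: wrong bucket
        bucket.append((key, value))  # "miss", so cache the entry again
        self.entries += 1
        return value

cache = LeakyCache()
for _ in range(1000):
    cache.lookup_or_insert("comp.lang.c", "group metadata")
# One key, looked up 1000 times -- yet the cache now holds close to
# 1000 separate entries, indistinguishable from a memory leak.
```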
So yes, a cache with a bad policy (or a bad hash function) is
indistinguishable from a memory leak.