Virtual Memory and Performance

There’s an interesting article about the performance of server apps in the July 2010 Communications of the ACM, somewhat provocatively titled “You’re Doing It Wrong”.  In it, Poul-Henning Kamp, the architect of an HTTP cache called Varnish, describes the “ah ha!” moment (on a night train to Amsterdam, no less) when he realized that traditional data structures “ignore the fact that memory is virtual”.

The problem is that very large address spaces encourage very large “in-memory” data structures (“The user paid for 64 bits of address space,” Kamp writes, “and I am not afraid to use it”), but those structures are not really in (physical) memory.  Depending on the virtual memory “pressure” (i.e. how many of the program’s logical VM pages are actually on disk rather than in RAM), the cost of a memory access can vary by a factor of ten or more (if a page has to be retrieved from disk, even a very fast disk cannot match the performance of RAM).  So algorithms that pretend that all memory accesses are equal can be quite inefficient compared to the algorithm Kamp describes for the Varnish cache, which “actively exploits” the fact that it is running in a VM environment.
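To make the point concrete, here is a rough sketch in Python that counts how many distinct VM pages a single leaf-to-root walk touches in a classic array-backed binary heap, and compares that with a layout where each page holds one complete subtree.  This is a simplified model for illustration, not Kamp's actual B-heap layout; the function names, the page size of 512 items (a 4096-byte page of 8-byte entries), and the heap size are all my own assumptions.

```python
def pages_touched_binary_heap(leaf_index: int, items_per_page: int) -> int:
    """Count distinct pages on the path from a node to the root of a
    classic 1-indexed array-backed binary heap (parent of i is i // 2)."""
    pages = set()
    i = leaf_index
    while i >= 1:
        pages.add(i // items_per_page)  # page that holds array slot i
        i //= 2                         # step up to the parent
    return len(pages)


def pages_touched_packed_subtrees(leaf_index: int, items_per_page: int) -> int:
    """Estimate pages on the same path if every page stored one complete
    subtree (a simplified, hypothetical page-aware layout)."""
    levels_on_path = leaf_index.bit_length()
    # Tallest complete subtree (2**h - 1 nodes) that fits in one page.
    levels_per_page = (items_per_page + 1).bit_length() - 1
    return -(-levels_on_path // levels_per_page)  # ceiling division


if __name__ == "__main__":
    ITEMS_PER_PAGE = 512      # assumed: 4096-byte page, 8-byte entries
    LEAF = 1_000_000          # assumed: a node deep in a large heap
    print(pages_touched_binary_heap(LEAF, ITEMS_PER_PAGE))      # 12
    print(pages_touched_packed_subtrees(LEAF, ITEMS_PER_PAGE))  # 3
```

Under these assumptions the classic layout touches twelve pages for one walk and the page-aware one touches three.  If most of those pages are on disk, the number of pages touched, not the number of comparisons, is what dominates the running time.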

There is a bigger point here, which is that all the layers of abstraction we’ve built (virtual machines, IL, class libraries, etc.) can be quite seductive.  I made an entire class serializable into XML yesterday by adding one line of code (“[Serializable]” to the class declaration).  But part of me thought, hmmm, I wonder what the compiler actually generates now to do that?  It’s important that we stay aware of what abstraction costs, especially in hotspots, and actively look for ways to reduce that cost.

Many of the comments about this article on the ACM site can be summed up as “ho hum – you think you’ve discovered something new?”  But others are more in agreement with my perspective: I have been a practicing computer scientist (i.e. I write and design computer software for a living) for thirty years, and I have not seen anything as clearly written as this article about how to write code to exploit a VM environment.  The fact that some can point to theoretical pieces that have been written just shows how removed from practice that theory is.  But it is also a sign of how much the CACM has changed that it is running articles like this one that are interesting to practitioners.