Sometimes it’s interesting to get some context about how a research project started out, what its original goals were, and how they changed over time. This post goes into a bit of the history of the Dryad project.
From mid-2003 until early 2005, I took some time away from most of my research to work with what was then called the MSN Search product group, helping to design and build the infrastructure for serving queries and managing the datacenter behind the first internally-developed Microsoft web search engine. Towards the end of 2004 I started thinking seriously about what research project I wanted to work on when V1 of the search engine shipped. The MapReduce paper had just appeared at OSDI, and I thought it would be interesting to come up with a generalization of that computational model that would allow more flexibility in the algorithms and dataflow while still keeping the scalability and fault-tolerance. It was clear to me at the time that the Search product group would need such a system to do the kind of large-scale data-mining that is crucial for relevance in a web-scale search engine. I was also interested in medium-sized clusters, and it was an explicit goal from the start to explore the idea of an “operating system” or “runtime” for a cluster of a few hundred computers (the kind of size that a research group, or small organization, could afford), one that would make the cluster act more like a single computer from the perspective of the programmer and administrator.
There is no top-down research planning at Microsoft. In my lab (MSR Silicon Valley, run by Roy Levin) we have a flat structure, and projects usually get started whenever a few researchers get together and agree on an interesting idea. I sent out an email with my thoughts on what I was calling a “platform for distributed computing,” and was able to convince a few researchers including Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly to join in what we ended up calling the Dryad project, which got started in earnest early in 2005.
In order to have a working system we knew we needed not only an execution engine but also a storage system and a cluster-management infrastructure. Fortunately, by this time the search product group had set up a team to design and build their data-mining infrastructure, and we decided on a very close collaboration, with the product team building all the cluster services and the storage system, and the researchers delivering the execution engine. This is, in my experience, a pretty unusual arrangement at Microsoft. I think it was possible because I had recently worked closely with the search team, and so they knew I was capable of shipping code that worked (and adhered to their coding conventions, etc.), and because the product team was small and on a very tight deadline, so they were glad of the help. As a consequence, Dryad was always designed to be a ‘production’ system, and we didn’t cut some of the corners one normally might when building an initial research prototype. This had advantages and disadvantages: we didn’t get a running system nearly as soon as we could have done, but on the other hand the system we built has survived, with minor modifications, for three years in production and scaled to over 10,000 computers in a cluster, ten times the initial goal. At this point we are ready to refactor and simplify the code based on lessons learned, but overall it worked out very well.
Once we got the initial Dryad system running, we sat back and waited for users, but they turned out to be hard to convince. We had supplied an API for building graphs, and claimed in our paper that it was a wonderful thing, but nobody really wanted to use it. Meanwhile our goal of making a cluster look like a single computer had received little attention compared to the pressing problem of delivering a working, scalable, high-performance execution engine to the search team. At this point our agenda diverged a little from that of the product group. They needed to get their developers using the available search data as fast as possible, and they needed to share their clusters between lots of different users and ensure that nobody would write a program that would starve out other vital production jobs. They therefore set off down a path that led to SCOPE, a query language that sits on top of Dryad and is well suited to the highest-priority applications for their team, but is deliberately a little restrictive in the constructs, transforms and types it supports. We, on the other hand, were eager to get back to our original agenda of letting people write programs in the same sequential, single-machine style they were used to, while executing them transparently on our clusters.
Fortunately, just when we were wondering what to replace Dryad’s native graph-building API with, Yuan discovered the LINQ constructs, which had just been added to .NET. He saw that LINQ operators could be used to express parallel algorithms within a high-level programming language, while still keeping the algorithm abstract enough to allow the system to perform sophisticated rewrites and optimizations to get good distributed performance. Armed with this insight, he started the DryadLINQ project, and fairly rapidly had a prototype running on our local cluster.
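To see why operators like these leave room for a system to optimize, consider a toy word-count written as a chain of declarative operators. This is an illustrative sketch in Python, not actual DryadLINQ code (real DryadLINQ programs are written in C# against the .NET LINQ operators such as SelectMany, GroupBy, and Select); the point is that because the programmer states *what* transformation each stage performs rather than *how* to iterate, a system is free to rewrite the plan, for example partitioning the groupings across machines, without changing the program’s meaning.

```python
from itertools import groupby

# Input: a small collection of lines. In a distributed setting this
# would be a partitioned dataset spread across the cluster.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# SelectMany analogue: flatten each line into its words.
words = (word for line in lines for word in line.split())

# GroupBy + Select analogue: group equal words and count each group.
# (itertools.groupby requires sorted input to collect equal keys.)
counts = {key: len(list(group))
          for key, group in groupby(sorted(words))}

print(counts)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Nothing in the pipeline pins down a single-machine execution order, which is exactly the abstraction gap a distributed runtime can exploit.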
This time the reaction of our colleagues in the lab was very different: instead of ignoring DryadLINQ they came back and asked for improvements, because they actually liked using it. The subsequent development of DryadLINQ has been closely informed by the features our users have found useful, and we have gradually gained a reasonable understanding of what programs are and are not suited to the Dryad computational model, which again will inform our work on the systems that come after Dryad.
By late 2008 we were beginning to have a minor success disaster in that a lot of people were sharing our research cluster running DryadLINQ programs, and they started to complain when one job hogged too many computers for too long, and they couldn’t get their work done. Up until that time, the logic for sharing the cluster between concurrent jobs was primitive, to say the least, and we realized that it was time to seriously look at centralized scheduling and load-balancing (not coincidentally, another key requirement to make a cluster act like a single computer). Another group of researchers got together to look at this problem, and we ended up building the Quincy scheduler that is the subject of a recent SOSP paper (and a forthcoming blog post). Since deploying that scheduler in March, we have essentially stopped getting complaints about issues of sharing, except of course when people find bugs in Quincy…
A number of other research projects have developed around the Dryad ecosystem, involving debugging, monitoring, caching, and file systems. We are still a long way from the goal of making the cluster “feel” like a single big parallel machine, but we think we are still steadily moving in that direction. Overall I think Dryad and its associated projects have been a strong validation of the old systems research dictum to ‘build what you use and use what you build.’ It does take more effort, and you can’t write as many papers as quickly, if you are trying to build a system that is solid enough for other people to depend on. But it’s very clear to us that many of the most interesting research ideas that have come out of these projects would never have occurred to us if we hadn’t had the focus that comes from helping real users address their practical problems.