Kate Keahey is a fellow at the Computation Institute at the
University of Chicago and works as a scientist at Argonne National Laboratory.
She is the creator of Nimbus, an open source toolkit for
turning a cluster into an infrastructure as a service (IaaS) cloud, primarily
targeted at making IaaS available to researchers and scientists. Her past
positions included being a technical staff member at Los Alamos National
Laboratory. More information about Kate and her work is available at www.mcs.anl.gov/~keahey and www.nimbusproject.org.
In this interview, we discuss:
Robert Duffner: Kate,
can you take a minute to introduce yourself and the Nimbus project?
Kate: Sure. I'm a scientist at Argonne National Lab, and I'm a fellow at
the Computation Institute at the University of Chicago. My background is grid
computing, and many years ago, I realized that one barrier to computation for
scientists using grids is that they cannot control the environment on grid
resources. For many people that was a deal breaker, because their code is
very complex and hard to port. I started working on combining virtualization and
distributed computing at King Lab with the idea of deploying virtual machines
After a few years of research deploying remote resources, we
released something called the Workspace Service, which right now is the
infrastructure-as-a-service part of Nimbus. A year later, Amazon EC2 came
online, which was a lot of fun because it enabled us to experiment with the
concept at larger scales.
Nimbus today is a cloud computing toolkit, one part of which
is just an open source infrastructure-as-a-service implementation. It includes
rough equivalents of the compute and storage cloud components EC2 and S3. Another
goal of Nimbus is to make it possible for people to take advantage of infrastructure-as-a-service
So for example, we have a tool called Context Broker that
takes virtual machines deployed on Nimbus clouds, EC2 clouds, Rackspace clouds,
or whatever, and brokers a configuration and security context between those virtual
machines. In other words, after the context broker is done, you get a virtual
cluster, which most scientists are used to getting.
Another very important goal of Nimbus is to provide for open
source implementation and community. By open source, I am not referring to something
that we wrote and just put up a link so everybody can download it. We are
committed to building the community and developing software in an extensible
That effort includes creating a thoughtful design that is extensible
in various directions and providing a framework that allows people to easily
test what they have in the larger context of the cloud. That part of Nimbus has
been going particularly well. We have managed to attract significant
Right now, there are four developers funded by the
University of Chicago and Argonne National Lab working on Nimbus. There are
three other committers on Nimbus and there are a total of maybe 15
There are other projects that are clearly finding this
infrastructure useful and they are finding that it makes sense for them to
experiment with it and extend it, which I think is particularly important at
this time in cloud computing.
As a cloud computing evangelist for the scientific community,
I have worked with many projects and tried to make it easier for them to use in
particular infrastructure-as-a-service clouds and to figure out what their
needs are and what the obstacles to adoption are.
Robert: One of
the goals of Nimbus is to enable turnkey
virtual clusters, so can you take a minute to break that down? What are the
Kate: Say you deploy
some virtual machines in Amazon's cloud, for example, and several others in
some Nimbus cloud, or some other infrastructure-as-a-service place. Those sets
of virtual machines are unconnected, and they don't know about each other. They
don't share a security context. If you look at a typical scientific cluster,
you'll see that it's connected in some ways.
In other words, when you run, let's say, MPI on that cluster,
you can do a send from one node to another without having to type a password, because those nodes
have exchanged host certificates.
But when you deploy virtual machines on somebody's data
center, how do they exchange those host certificates? If somebody from the
Internet that you don't know about comes to you and says, "Here's my host
certificate, can you put it in your office file?" it might not happen
because you don't have a trusted relationship with that person.
Therefore, there needs to be some root of trust, and
somebody needs to broker that trusted relationship for various members of the
cluster. Those also become configuration conflicts. For example, in your
typical scientific cluster based on MPI, it would be configured with something
like NFS: Network File System.
The nodes that are clients to NFS need to know information about
the NFS server like the location of the volume. The server also needs to know
about the clients. There's a concept in MPI called MPI_COMM_WORLD that defines
which nodes you're going to be communicating with. If you've got a barrier, you
need to contact all those nodes and synchronize. That information has to be
exchanged between those nodes somehow.
That's what the context broker does. It sets up a configuration
exchange, provides a trust root, and establishes a trusted security exchange
between the nodes of a cluster. The result is a cluster that shares a security
context and a configuration context, so that your typical scientist can come in
and say, "All right. This cluster has NFS, it has PBS, it has whatever
else I need to do my job." They can just treat that cluster as if it were
a cluster in their computational center.
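The exchange Kate describes can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the `ContextBroker` class, its method names, and the role strings are hypothetical and are not the Nimbus Context Broker API. It assumes every virtual machine already trusts the broker, reports its own role and host key to it, and later asks for the information it needs about its peers.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBroker:
    """Toy broker (hypothetical, not the Nimbus API) acting as the shared trust root."""
    expected: int                                # number of VMs in the virtual cluster
    members: dict = field(default_factory=dict)  # hostname -> (role, host_key)

    def register(self, hostname, role, host_key):
        """A VM reports its own role and host key to the broker it trusts."""
        self.members[hostname] = (role, host_key)

    def ready(self):
        """True once every expected VM has checked in."""
        return len(self.members) >= self.expected

    def context_for(self, hostname):
        """Return what one node needs to know about the rest of the cluster."""
        if not self.ready():
            raise RuntimeError("cluster context is not complete yet")
        peers = {h: m for h, m in self.members.items() if h != hostname}
        servers = [h for h, (role, _) in self.members.items() if role == "nfs-server"]
        return {
            # host keys of all other nodes, so password-less access can be set up
            "known_hosts": {h: key for h, (_, key) in peers.items()},
            # where NFS clients should mount from (None if no server registered)
            "nfs_server": servers[0] if servers else None,
            # the NFS server needs the list of its clients
            "nfs_clients": sorted(h for h, (role, _) in peers.items() if role != "nfs-server"),
        }


broker = ContextBroker(expected=3)
broker.register("head", "nfs-server", "key-head")
broker.register("w1", "worker", "key-w1")
broker.register("w2", "worker", "key-w2")
print(broker.context_for("w1"))
```

In this sketch no node has to trust a stranger's host key directly; each one only trusts the broker, which is the role of the trust root in the description above.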
Robert: The University of Chicago and several other universities are offering science clouds. Can you talk a
little bit about that?
Kate: That actually began as a grassroots movement. About three
years after we released Nimbus, we did some experiments with various scientific
applications and Amazon. At some point, the University of Chicago said,
"All right, this is an interesting project. Let's give them a little
partition on one of our existing clusters." Some other universities also thought
it was a good idea, so the University of Florida and Purdue University
configured clusters on their own resources.
The one at Purdue is particularly interesting, because they
configured a cloud on TeraGrid resources. TeraGrid is a large national
infrastructure that typically runs the traditional grid software, but Purdue
decided to set up one as a cloud. That was the first cloud within TeraGrid, and
they've been doing some interesting experiments with that lately.
This is a very loose federation of various universities.
There is no obligation as to policies or anything like that, and there is no
common governance between them. People just know each other, and if some users need
to have accounts on multiple clouds for some reason, they get recommended and
People can use those clouds for any kind of scientific
purpose. You can't use them to run computations for your startup, but those
clouds are open to anything that is academically viable. Actually, they are probably
being made obsolete right now by a new project called FutureGrid.
It's also a national infrastructure, TeraGrid type of
project, but this one has been specifically set up to experiment with new
technologies and new paradigms such as cloud computing. And some of the
machines on FutureGrid are configured with Nimbus to form clouds. Traffic is
slowly moving from the science clouds, which are very small, toward the clouds
set up in FutureGrid.
Robert: My understanding of FutureGrid is that there's a lot to it other than cloud
computing. Could you speak to that a bit?
Kate: Sure. It
was set up to provide an experimental environment. Infrastructures like
TeraGrid and Open Science Grid cater to the needs of production files. In other
words, if oceanographers, astrophysicists, or high energy physicists have some
simulations or other computations they want to run, they can do so on those
That arrangement has one problem, though, which is that it's
very difficult for computer scientists to use, because computer science
experiments typically inject some failures into the system or experiment with
new technologies like cloud computing that create instabilities. Those factors
are fundamentally incompatible with the needs of production environments.
There is a test bed in France called Grid'5000 that is
specifically built for experimental computer science, and it has been running
for many years now. They have worked out how the governance of a test bed like
that should work, what mode of usage patterns is required, and so forth.
At some point, it became clear that it would be good to have
something like that in the US, and that's essentially what FutureGrid is. It
gives us an excellent opportunity to experiment with new paradigms, including
cloud computing but also, for example, new networking paradigms. It also has a
private network that you can inject delays and failures into, so it's a very
interesting experimental tool, and it's coming online this fall.
It's possible now to get accounts on FutureGrid. In fact,
there are quite a few people who did, and they are running interesting projects
on Nimbus and other setups on the FutureGrid. It's not supporting all modes of
usage yet, but within a few months, we're hoping to support minimal users and
Robert: Here at
Microsoft, we have a pretty healthy research arm, and Dan
Reed of Microsoft Research said that the cloud may be the sweet spot for
many scientists. He also talked about this idea of democratizing research. What
do you see as the potential for the cloud to enhance research through things
like the STAR Experiment, ALICE (the ion
collider experiment), Ocean
Observatory Initiative, and others?
Kate: I think there is enormous potential. I alluded earlier to the problems that people
were encountering with grids, when they couldn't control an environment on a
remote resource. That issue went away with virtualization, since if you can
deploy a virtual machine that you have configured yourself, it's going to
support exactly the environment that you want. That is a very, very powerful
thing, because often in science, you don't need computation on an ongoing basis,
but only on demand.
You mentioned STAR. They ran a simulation that's fairly
famous by now, where they really needed to get results in time for a paper
deadline. There was one last simulation they needed to squeeze in, and we
helped them run their experiment on Amazon using hundreds of nodes.
Of course, paper deadlines are fairly trivial in the scheme
of things, but we have also worked with epidemiologists at the University of
Utah. You can imagine that there could be a far more significant incentive to
get something done very quickly in the case of monitoring an epidemic.
At the Ocean Observatory Initiative, they are monitoring
earthquakes. That data sometimes gets delayed, and then it comes out in big
bursts. It has to be processed as soon as possible, because if there are earthquakes
happening somewhere, we want advance information on that as soon as possible.
There are a lot of events that benefit from timely
processing. For example, in the case of an oil spill, you want to run
simulations about projected movement of the spilled oil, effects on marine
life, and so on. Hurricanes, tsunamis, and a wide spectrum of fairly sudden
phenomena make it very powerful to be able to go and get resources at the drop of
a hat and then come back to your normal processing needs.
Another factor is that in many branches of science, there
are periodic rapid changes. For example, in bioinformatics, the cost of
sequencing machines went down dramatically. It used to be that many centers
simply could not afford those machines and therefore were not producing that
data, but now there's a sudden explosion of data, and people's computational
needs are growing very, very dramatically.
Given that those patterns are very difficult to predict, how
do you accommodate that growth? Why not outsource the computation? There are
many economic and convenience factors; scientists simply don't want to run
They want to do the climate or the physics, or whatever
other discipline they are good at. They would prefer not to spend their resources
acquiring expertise to run these computation labs on site, so they are happy to
outsource the problem, if possible.
Another important issue has to do with democratization:
creating the computing middle class, if you will. It used to be that if you
started a research team, you needed some computational resources to back you
up. You needed to buy a cluster. Well, not necessarily anymore.
Many of the same factors that make cloud computing
compelling for business also make it compelling to academia. In academia, one
huge usage pattern is coming to the fore right now, and that's the ability to
deal with unpredictable phenomena.
And this is what we're doing in the Ocean Observatory
Initiative. People are putting sensors in the ocean, and based on the
information that those sensors return, we need to run simulations. We need to
scale elastically and on demand, and sometimes we need to scale in very short
amounts of time.
This implementation of that pattern didn't use to be
possible. Now it is, and it's changing things dramatically. We'll have to see
how the availability of that pattern will change things further and whether it
will speed up this process of outsourcing computation. It's certainly a very
Another pattern that we see emerging is that not all of
these scientists have the skills to configure their own virtual machines to run
on the cloud infrastructure. As a result, there is a new type of role emerging
for specialists who take care of creating the right virtual machines for the
community, validate them in special ways, and maintain them.
The stuff that sys-admins used to do for the communities is now
being done not on a per-cluster basis, but on a per-community basis. Since it's
done on a per-community basis, they are putting all kinds of stuff on those
virtual machines and performing all kinds of maintenance tasks that cluster
administrators didn't do before, because they were doing it on a per-cluster basis.
We collaborate with a bioinformatics project at the University
of Maryland called CloVR that essentially developed a set of mechanisms that
make cloud computing easier, customized to their specific community. I predict
that there will be many more projects like that emerging, and it will lead to a
new technical role that we didn't really have before.
Robert: We recently ran an Azure academic pilot to
start building relationships with researchers and better understand their cloud
computing environments. We wound up striking a deal with the National Science
Foundation to provide free cloud resources for NSF projects. What would be your
message to an organization like Microsoft that's interested in providing
compute resources to researchers and students?
Kate: The only serious obstacle I can see is that the prevalent
platforms in the scientific community are Linux or Unix-based. From the
perspective of many researchers, particularly those that have been established
for a long time, transitioning to Windows is a major obstacle. In order to go
to Azure, they would not only have to port their code, but they would also have to take
on maintenance responsibilities on a platform other than their primary one.
Working with newer communities that do not have those strong
legacy requirements, bringing them in while they're developing computational
capabilities, and providing them with encouragement and help while showcasing Azure's
features is probably the way to go. It's very hard to move a community that has
already been developing things on a Unix-like platform for many years onto a
Robert: That's definitely a great opportunity for Microsoft, and I think you bring up some
very good points. As we start to realize fully the benefits of offering a
platform-as-a-service, where you're just paying for access to the data center, I
wonder whether these distinctions between various platforms start to go away. Do
you have any thoughts on that?
Kate: From my
perspective, infrastructure-as-a-service is the most flexible computing model,
because it truly gives people the freedom of doing whatever they want to do. If
you go further up the stack and provide a platform, which is what Azure is, on
one hand you provide something that is potentially more convenient, but on the
other hand, you restrict some degree of control that people have on the
And I think you're perfectly right that if there is a useful
platform and people really find the specific paradigm useful, it eventually
becomes a utility. For the transitional period and for applications that do not
easily lend themselves to specific computational patterns, and I personally
think there are many such applications, the infrastructure-as-a-service paradigm
is going to be more interesting in the long run. And then the choice of an
operating system platform is going to matter.
Robert: That's a
fascinating point of view. We talked a lot about researchers and scientists,
but tell us about the student side. What aspects of cloud computing seem to be
the most interesting to the students that you interact with?
Kate: Well, students are typically interested in their degrees. I think that cloud
computing from their perspective is like a gold rush, because all of a sudden,
the paradigm changed. And when a paradigm changes, that means that many
pathways emerge that were not explored before. So from the perspective of
somebody looking for a thesis topic, all of a sudden there are all these thesis
topics there for the taking.
To give you an example, you very often just blast a lot of
copies of virtual machine images to nodes. Then you also snapshot them so in a
given time slot, you have to take those images and store them on your storage
In principle, this is nothing new. We have done data
management and storage management and distributed computing in general before,
but perhaps we focused on something else. Maybe we didn't focus on the specific
pattern where one image, let's say a five- to 15-gigabyte image or file, goes
out to so many nodes.
Now, by optimizing this pattern, we have some interesting
issues there that we can explore. There are intellectually challenging problems
that this change in paradigm has all of a sudden exposed. So for computer
science students in particular, I think it's wonderful. And we've been working
with quite a few under various initiatives.
We've had great success working with Google Summer of Code,
which sponsors students to do open source involvement in the summer. Would
Microsoft consider doing something like that?
Robert: I think
it's a very interesting idea, and there have actually been conversations within
Microsoft regarding a similar approach. I can't really provide you with any specifics,
but there's definitely a lot of interest. There are certainly differences
because of our historical approach to developing software, but we've been
engaging with open source communities more and more. It's absolutely not out of
the realm of possibility for us to move rapidly in that direction.
Kate: It's certainly wonderful for students, because they find a lot of interesting,
challenging problems. Throwing a lot of young minds at specific problems could
be very interesting.
Robert: From a
strategy perspective, Microsoft wants the Windows Azure platform to be the most
open platform out there. Whether you want to develop in .NET, C#, Java, PHP,
Ruby on Rails, Python, or whatever, we want to be able to run all of those in
Windows Azure. That's clearly a stated direction for Microsoft, because ultimately
we believe that whoever's cloud is the easiest to leave will be the most successful
Kate: I totally
agree with that.
Robert: To slightly shift gears, in a talk called "My
Other Computer is a Data Center," Robert Grossman of the Open Cloud
Consortium said, "A programmer can develop a program to process a
container full of data, in this case, a shipping container, with less than a
day of training using MapReduce." How much hardcore programming does a
science student need to know to be able to take advantage of the cloud?
Kate: I think
we've got two issues here. One is the issue of taking advantage of resources
that are external to your lab. And then the other issue is the issue of a
convenient paradigm. Certainly MapReduce has proved to be a very convenient
paradigm, maybe not for physics, but for bioinformatics, certainly, or other
biological sciences or any kind of problem that is in some way, shape or form
similar to search.
So the paradigm is convenient, and then you get the easy
access to external resources, in a sense as a bonus. Easy access to external
resources has to be provided to you via a paradigm that is easy to use. Sometimes
it will be a platform paradigm, and sometimes it will be an infrastructure
There was a huge area where the cloud computing paradigm was
hard and now it is easy. And that makes a very, very significant difference to
Robert: What are
some of the other uses of Nimbus that you're seeing outside of the scientific
clouds are being used fairly extensively for education. I don't think it has
really been significantly adopted in industry. We talked earlier on to some
commercial companies, but at this point, there are many other projects that
cater to larger companies like that, in particular projects that provide a
better business model for support and so forth. And our target is also
primarily science and education.
Robert: So can
you talk a little bit about the future of Nimbus? What's on the road map?
Kate: We've got a couple of releases planned, one of which is going to include fast
propagation, so we'll have an improved tool to support the need to suddenly
deploy hundreds of images, which I referred to earlier.
One of my key members developed something called LANTorrent,
which works roughly like BitTorrent, but on a LAN, and it just streams
images. A node becomes a source and streams images to other nodes, which significantly
reduces the deployment time for large clusters.
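One way to picture that streaming idea is a chain in which every node saves each chunk it receives and immediately passes it on, so each receiver is also a source for the next node. The Python sketch below is an in-memory illustration under that assumption; the `relay` and `propagate` functions are hypothetical and do not reflect how LANTorrent is actually implemented.

```python
def relay(chunks, store):
    """Save each incoming chunk locally, then pass it on to the next node."""
    for chunk in chunks:
        store.append(chunk)
        yield chunk


def propagate(image_chunks, node_count):
    """Stream one image through a chain of nodes; return each node's local copy."""
    stores = [[] for _ in range(node_count)]
    stream = iter(image_chunks)
    for store in stores:
        # each node reads from the previous node's output stream
        stream = relay(stream, store)
    for _ in stream:
        # draining the last node pulls every chunk through the whole chain
        pass
    return [b"".join(store) for store in stores]


copies = propagate([b"boot", b"root", b"data"], node_count=4)
print(len(copies), copies[0])
```

Because chunks are forwarded as they arrive, the original source sends the image only once instead of once per node, which is one plausible reason a scheme like this could shorten deployment for large clusters.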
We tested something like 1,000 images in 10 minutes on the
Magellan Cloud. In scientific clouds, the prevalent model with
infrastructure-as-a-service is to allow users service-on-demand requests.
Conventionally, you have to significantly overprovision your cluster. You get
like 10 percent utilization, or if you have much higher utilization, you have
to reject a large proportion of those requests because somebody else is already
That's not a very good trade-off, so we said, "All right,
how about if while nobody is asking for on-demand resources, you just deploy a
default virtual machine on those resources?" And that default virtual machine
could join, for example, a Condor pool, or some sort of SETI@home-style pool, or be
used by some infrastructure that is very failure-tolerant.
Condor is a system for high-throughput computing that is
used to working in environments such as desktops. The owner of the desktop can
come any time and interrupt it, and some of the computations are effectively
lost and have to be rerun.
We ran some studies on that, and you can have a 100 percent-utilized
cloud. That's coming out in 2.7, which we hope to release on Thanksgiving,
because most of those features have been contributed by our wonderful open
So we have lots to be thankful for. And then later on, we're
going to be releasing capabilities that we've worked on for several years now,
and in particular, the capability to scale elastically. The ALICE project was
our first venture into management of elastic scaling.
They had a queue of jobs that they run on a global test bed,
and the jobs are managed by a group scheduler. And they said, well, we would
like to monitor that queue, and if it's very large, we would like to spin up some
virtual machines on a cloud to pick up those jobs. And if the queue gets
smaller, we'll kill those virtual machines, because we won't need those
In other words, it was elastic scaling to extend their test
bed. We did that about three years ago, and we have done several projects in
that vein since then. More recently, we've been working with the Ocean
Observatory Initiative to provide a highly available elastic scaling service
like that. We're hoping to release that early next spring.
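The queue-driven scaling described above comes down to a small decision rule, sketched here in Python. The function name, the `jobs_per_vm` ratio, and the `max_vms` cap are illustrative assumptions and are not taken from Nimbus or from the project's scheduler.

```python
import math


def vms_to_adjust(queued_jobs, running_vms, jobs_per_vm=4, max_vms=20):
    """Return how many VMs to start (positive) or terminate (negative).

    jobs_per_vm and max_vms are hypothetical tuning knobs for this sketch.
    """
    # size the pool to the queue, but never beyond the allowed maximum
    wanted = min(max_vms, math.ceil(queued_jobs / jobs_per_vm))
    return wanted - running_vms


print(vms_to_adjust(queued_jobs=10, running_vms=1))  # queue grew: start more VMs
print(vms_to_adjust(queued_jobs=0, running_vms=3))   # queue drained: shut VMs down
```

A monitoring loop would call a rule like this periodically and ask the cloud to launch or terminate the difference. A production service would also need to handle failures and avoid thrashing when the queue length oscillates.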
Robert: Those are
all of the prepared questions I have. Is there anything else you'd like to talk
about or something that you would like an opportunity to address to our Windows
Kate: I see open
source as an extremely important vehicle of progress, in particular in cloud
computing. And from my vantage point, watching the community for quite a few
years, many of the breakthroughs that have led to the development of cloud
computing can be traced back to open source.
For example, we had VMware for many years, and it worked
great. But at the beginning, when I was trying to convince users to use virtual
machines with distributed computing, they would say, "Why would I buy a
virtual machine if I could buy a real one for the same amount of money?"
Because licenses were quite hefty.
And then when open source virtualization came out, people could
improve it and adapt it to what they needed to do, and it was very fast. It
really created a breakthrough, and eventually it led to the development of
services like EC2, and in some ways to the whole cloud revolution.
When there is a paradigm change, it's very important to
invest in open source software. I'm not saying that all software should be open
source, and I don't think anybody would advocate that quite completely. But
it's important that there is software that people can experiment with, extend,
and contribute to.
It's extremely important to generate progress and allow people
to experiment and do new things. I think that is a very important aspect of
what we're going to be seeing in the future.
Robert: Thank you
very much, Kate.
Kate: Thank you.