Todd Papaioannou is currently at Yahoo!, in the role of vice
president of cloud architecture for the cloud platform group. Before taking on
that role, Todd was responsible for new product architecture and strategy at Teradata,
including driving the entire Cloud computing program. Before that, he was the
CTO of Teradata's client software group. Prior to joining Teradata, he was
chief architect at Greenplum/Metapa. Dr Papaioannou holds a PhD in artificial
intelligence and distributed systems.
Robert: Todd, could you introduce yourself and describe your background and your
current role at Yahoo!?
Todd: I'm the chief architect for cloud computing here at Yahoo!. My official title is
VP of Cloud Architecture, and I'm responsible for technology, architecture, and strategy
across all of the cloud computing initiatives here at Yahoo!.
The Yahoo! cloud is the underlying engine on which we run
the business, and we like to think of it as one of the world's largest private clouds.
So my responsibilities span edge, caching, content distribution, multiple structured
and unstructured storage mechanisms, serving containers and the underlying
cloud fabric we're focused on rolling out that makes it all possible.
I'm also responsible for Hadoop
and the cloud-serving container architecture, as well as all of the data
capture and data collection across the whole of the Yahoo network. We dedicate
a lot of energy to pulling together a very wide range of technologies as an
internal platform as a service.
Robert: I imagine
that making the move from Teradata to Yahoo! was significant for you personally.
Given that your career has been focused on cloud computing for some time now,
has the move to Yahoo made a difference in terms of what kinds of projects,
initiatives, and solutions you can personally lead or develop in the cloud?
Todd: At Teradata, I was responsible for driving the cloud computing program from a
blank piece of paper through launch and delivery of multiple products, so I have
been focused on the cloud for a number of years as well as all of the other big
data initiatives and future-facing stuff.
I played the role of looking to the future and helping to
drive product strategy and product architecture across the Teradata portfolio.
But I became very involved with all of the cloud stuff as I drove that program
and saw that this was a very compelling and very interesting part of the market.
So coming to Yahoo! was an opportunity to help drive one of
the largest private clouds in the world. There are probably only two or three
other companies in the world that deal with these issues at the scale that Yahoo!
does, so it's a fantastic opportunity. It also allows me to work closer to the
Robert: You said in
a presentation that you developed while at Teradata that "virtualization
is the megatrend of the next decade." Do you still feel that's the case?
And what do you think has the potential to supplant it, either in this decade
or the next?
Todd: I still think it's turning out that way. Virtualization is a megatrend that's
going on in data centers around the world right now, and virtualization is
actually just one component of cloud computing. There's a lot more that goes
into cloud computing as a layer above virtualization, and I think the self-service
and elasticity aspects are particularly interesting.
As I look to what is going to change stuff in the future,
ubiquity of devices is clearly another megatrend, as is the explosion of data. When
you take massive data, massive sets of devices, and cloud computing together, you
start to see a slightly different vision of how software needs to be built,
abstracted, and developed to support both the enterprise and the consumer,
Robert: Let's talk a little bit about Hadoop. In a quick set of back-and-forth
tweets with Barton George, you
clarified who has the largest Hadoop clusters, with Yahoo, Facebook, and eBay
being the largest, in that order. Who else is in the top 10, in your
estimation, and are there any particularly interesting implementations that
fall outside of the top 10 that you think bear watching?
Todd: The top 10
is probably made up of West Coast, Bay Area web companies that are generating
huge amounts of data, particularly social graph data, and are finding that
traditional tools are not that great for analyzing it. Outside of the ones you
mentioned, it's Twitter, Facebook, LinkedIn, Netflix, and those types of folks.
We're also starting to see penetration into the financial
industry, where they have huge amounts of data to process as well. US government
agencies like the CIA and NSA are using Hadoop now, and they use the Yahoo!
distribution of Hadoop to do their processing. They won't tell me what they're
doing, but I'd love to know. [laughs]
Robert: You mentioned when you were part of a panel discussion at Structure 2010 that cloud
computing enables business
users to be separated from infrastructure cycles, so each can move at a
different pace. Can you unpack that statement a little bit and tell me what
benefits you think cloud computing provides to both smaller and larger
Todd: To take a simple view of the IT business, there's infrastructure you need to
purchase, put in place, and manage. Infrastructure buying cycles tend to be fairly
long, because it's a big investment and you want to make sure that you're doing
the right thing.
That can be a challenge for a small business, a business
unit, or someone that's close to the customer, because they need to move at a
much faster pace in today's business climate. Cloud computing, in my mind,
allows you to decouple the business logic from the underlying infrastructure
and allow those two things to move at separate paces.
As an analogy, consider the fact that building a road, which
is infrastructure, takes quite a long time, but small businesses can spring up or
shut down along that road, and people can build houses much more quickly. In
the same way, cloud computing enables the business to iterate much more
quickly, because they don't have to worry about purchasing infrastructure.
Robert: You tweeted that you see an enormous amount of
innovation ongoing today with Hadoop. What excites you the most about the
future of that project as a whole?
Todd: I think
we're at an inflection point for Hadoop. Obviously, we at Yahoo! are extremely
proud that we have created and open sourced Hadoop. Over the last four or five
years, we've continued to invest in that environment, and right now we have around
40,000 machines running Hadoop at Yahoo!, which is clearly a huge number, and
growing all the time.
There is another set of folks now who are starting to use Hadoop at a smaller scale, and
the exciting thing, I think, is that there is now an ecosystem springing up.
There are vendors coming into the ecosystem with new tools and new products, and
people starting to innovate around the Hadoop core that we built.
Robert: Could you
comment on how important the innovation around the core software for the cloud
is, in terms of everything that has to happen around running the operations at
Todd: If you
think about the entire business, the data center is the lowest level of
infrastructure, and then you have the cloud running above that, and then in our
case, the business of web properties running above that. There's a huge amount
of innovation that has to happen on a vertical basis.
We've been driving a lot of innovation in how we design our
data centers. We recently opened a new data center that got some awards for its
design. It's designed [laughs] like a chicken coop, basically, so it's self-cooling
in some respects. That was a great, novel approach to some of the
problems when you roll out a cloud, you're basically trying to build
I want to be able to shunt workloads around from data center
to data center depending on changing conditions. For example, we need to
respond if a data center is getting too hot or we are getting a lot of
surge traffic because the U.S. is waking up, and that sort of thing.
That can't really be done by humans in front of a keyboard.
What you really need to be thinking about is trying to automate everything. One
of the big initiatives I've been pushing is to automate everything in the cloud
so we can have more of a high-level thought process around control, rather
than a low-level, tactical one that focuses on shunting around specific
workloads on an as-needed basis.
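As a rough illustration of what policy-level control could look like, here is a minimal Python sketch. It is not Yahoo!'s system: the `DataCenter` fields, the thresholds, and the `plan_moves` function are all hypothetical names chosen for the example. The point is only that an operator sets the limits once, and the decision about which site sheds load is computed rather than typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    temperature_c: float  # current temperature reading (hypothetical metric)
    load: float           # fraction of capacity in use, 0.0 to 1.0


# Hypothetical policy thresholds an operator would set once.
MAX_TEMPERATURE_C = 35.0
MAX_LOAD = 0.80


def is_stressed(dc: DataCenter) -> bool:
    """A site is stressed if it is too hot or too heavily loaded."""
    return dc.temperature_c > MAX_TEMPERATURE_C or dc.load > MAX_LOAD


def plan_moves(data_centers: list[DataCenter]) -> list[tuple[str, str]]:
    """Pair each stressed site with the least-loaded healthy site.

    Returns (source, target) name pairs; an empty list means either
    nothing is stressed or there is nowhere healthy to move work to.
    """
    healthy = [dc for dc in data_centers if not is_stressed(dc)]
    if not healthy:
        return []
    target = min(healthy, key=lambda dc: dc.load)
    return [(dc.name, target.name) for dc in data_centers if is_stressed(dc)]
```

A real controller would also account for how much capacity the target has left and how much work to move; the sketch only picks a direction.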
Robert: How can
organizations that want to move to the private cloud benefit from the lessons
learned by big companies like Microsoft and Yahoo! that have gone before them?
Todd: If it's not your business to be running data centers, don't do it. You need
to make it Someone Else's Problem. Yahoo!, Microsoft, and a few others out
there are in the business of running data centers, and smaller companies should
take advantage of that availability.
Companies that do need to run their own data centers, for
whatever reason, can benefit from the fact that we have open sourced our
infrastructure code. One of our stated goals is to open source all the
underlying software that sits in our cloud, and we've done that so far
with Hadoop, which is our big data processing and analytic environment, and
also more recently with Traffic Server, which is our caching and content
distribution network software.
And we do that for a particular reason, which is that if
you're building software internally, the minute you deploy it and no one else
externally is using it, it's already on a path to legacy. You can continue to
invest in that software, but you're continuing to invest in a one-off solution.
We like the open source world, because if we can build a
community around a piece of our software and drive it to be a de facto standard,
we can build a measure of future-proofing into our software. If people are
already working with it outside of the company, we can also hire people who
have previous experience with the software.
For the first time, we recently acquired a company that built
its product on top of Hadoop, which helps validate our belief that open
sourcing our infrastructure software benefits not only us, but the rest of the
world as well.
Robert: Given the role of Linux in the enterprise on servers, do you see an analogous
software package developing for the cloud?
Todd: I don't
think that has quite resolved itself yet. There's a lot of competition among
the big players like Microsoft, Amazon, and Rackspace. Amazon clearly has a
lead, but it's not insurmountable. And then there's obviously the open source
world, which includes Eucalyptus, OpenStack, Deltacloud, and others.
It's an exciting time to be working in this landscape, and that's
one of the reasons I came to Yahoo!. There's a huge amount of innovation going
on at every level of the stack, from way down at the hardware level, all the
way up to the cloud service level.
Virtualization, a massive expansion in server computing
power, and low prices have really acted as catalysts. I really see the cloud as
an abstraction layer above a set of underlying compute, storage, bandwidth, and
memory resources. That abstraction allows you to get access to those resources
Because of that, one of the big initiatives I'm driving here
at Yahoo! is to think of cloud computing resources as a utility just like
electricity or cell phone minutes. You should just be paying for the utility
when you need it, as you need it.
Robert: During a
panel discussion on big data, you mentioned that Yahoo is analyzing
more than 45 billion events per day from various sources to help direct
users to the right content and resources on the web. From the user perspective,
how does an emphasis on cloud computing technologies enhance their experience
with Yahoo as a portal?
Todd: First, just
to correct the number there: either I said the wrong number, or I was just
talking about audience data. We actually deal with 100 billion events a day.
That covers audience data, advertising data, and a bunch of other events that
happen across the Yahoo! network.
Our goal at Yahoo! is basically to offer the most compelling
and personally relevant experience to our end users. To do that, we need to
understand stuff about you, such as whether you're into sports, travel,
finance, or other topics. And we need to do that as you span across our
At Yahoo!, we have hundreds of different web properties, each
with a different focus and context. So even if you were interested in sports,
it may not be so relevant for us to show you a piece of sports content when
you're on Yahoo! Finance.
Because of that, we use all of the events that we collect,
and we use Hadoop to do all the processing, so we drive better user
understanding, and we're able to do better content targeting and ultimately,
better behavior targeting from an advertising standpoint.
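To make the processing step a little more concrete, here is a small Python sketch of the map-and-reduce pattern that Hadoop is built around, applied to event data. It runs in plain Python on a single machine and does not use any Hadoop API. The event fields (`user`, `topic`) and the function names are assumptions made up for the example, not Yahoo!'s actual schema or pipeline.

```python
from collections import defaultdict


def map_event(event):
    """Map step: emit one ((user, topic), 1) pair per event."""
    yield (event["user"], event["topic"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values for each (user, topic) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def top_interest(events):
    """Return each user's most frequent topic across all events."""
    pairs = (pair for event in events for pair in map_event(event))
    counts = reduce_counts(pairs)
    best = {}
    for (user, topic), n in counts.items():
        # Keep the topic with the highest count seen so far for this user.
        if user not in best or n > best[user][1]:
            best[user] = (topic, n)
    return {user: topic for user, (topic, _) in best.items()}
```

In a real cluster the map and reduce steps would run in parallel across many machines, with the framework grouping pairs by key between the two steps; the logic of each step stays about this simple.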
Our ultimate goal is to understand you across all of our
properties, and depending upon what context you're in, to understand the
content you'll be interested in. Based on that, we want to be able to put a
contextually relevant advert close to that content to better drive engagement
for our advertising customers.
Robert: On that same panel, focused on big data, there was a portion of the discussion about
the data problem that the Fortune 1000 are having. To quote you for a
moment, you said, "They all have the same problem, but they haven't
figured out how much they're going to pay to solve it." Can you expand on
that a little bit, and how you think cloud computing technologies can help the
Fortune 1000, both in the short and long term?
Todd: For any
business, there's a spectrum of data that is vitally important right now,
whether it's investment management, supply chain management, or user
registration. Businesses are willing to pay a certain dollar value for that
data, whether it's available or active so they can access it immediately,
whether or not it's online.
There is a set of data that you don't know the dollar value
of yet, because you haven't discovered what it may teach you. But you know that
somewhere within that data, there's value to be found, whether it's better user
understanding or better insight into how to run your business.
I think the question was, "How do we know if you have
big data?" And my response was, "Everybody has big data. They just
don't know how much they want to pay for that big data." And by that I
mean, whichever business you go to, you can say you have a whole bunch of data
that you can really gain insight out of around your business, which you are
currently just dropping on the floor.
On the other hand, you probably believe you should pay a lot
less for that data as a business than you would for a more traditional
enterprise data warehouse or data mart like you might get from Teradata or
Still, the insights you can get from that data are huge. So
what you want to do is find the platform that matches your dollar cost profile
and that allows you to work on that data, discover insight, and then start to
promote it up into a more fully featured platform that ultimately ends up
costing you more.
You can stick a bunch of data in a public cloud, and it's
going to cost you a lot less to store than if you're buying a whole bunch of
filers or disks locally, for most people. There's also a set of technologies
like Hadoop that allow you to discover value in that data at a much lower cost
than you would pay a traditional vendor.
Because of that, the cloud is a great place for people to
process big data or unstructured data that they don't know the value of and are
looking for insights into their business.
Robert: That concludes
the prepared questions I had for you. Is there anything else you would like to
address for our Windows Azure community?
Todd: We touched a little bit on the public/private cloud question, but we didn't
really get into that too much. In fact, I think the future is going to be dominated
by the hybrid cloud. Companies are going to have a menu of options presented to
Say you're the CIO of some Global 5000 company, or even a
small 10-person business. You've got to look at this menu and say, "Given
the business service that I want to run, what are my criteria?" For example, sensitive
data or high security demands are likely to push me toward a private cloud.
On the other hand, if I have huge amounts of data that I
don't need to be highly available, that's the sort of thing that I would put
into the public cloud. That would prevent my having to make a large infrastructure
investment, and as we talked about earlier, it also lets me move quickly.
I really think the future for all businesses is to look at
this hybrid model. So, what's my service, what's my data, where do I want to
put it, how much do I want to pay and why do I want to pay that? And rather
than one menu, you'll have a set of rate cards from vendors that you can go and
Robert: The Windows Azure platform appliance announcement concerned the ability to take the
services we offer in the public cloud and offer them on premises, while still
keeping it very much as fundamentally a service.
Todd: I think
that makes a lot of sense when you look at what I'm going to be worrying about
as a CIO. In the life cycle of an application, I may even move it up and down between
the layers of the cloud. I may start off in a public cloud and then bring it
I think one of the areas for innovation and investment that the
industry needs to make is in enabling that. I do not want to be locked into a
single place where I can't move my application and I'm stuck with a single source
Being able to move my workload from vendor to vendor,
private to public, to me is an important element of what will make a successful
Robert: You guys are a quintessential example of the public cloud. What are you looking
at with regard to public customers?
Todd: We actually
don't offer a public cloud like Amazon or Google App Engine. In many ways,
though, we are the cloud. People
don't think of Yahoo that way, but we're the personal cloud. In terms of where
people's resources such as emails, photos, fantasy sports teams, and financial
portfolios, among other things, are to be found, Yahoo is a personal cloud
service for hundreds of millions of people. It's just that they don't think of us that way.
Considering whether we would move our workloads out into the
public cloud, we have finally come to the conclusion that probably, we would not. At
the scale we deal with, it doesn't make sense.
There's a certain scale, I think, where it makes sense for
you to make it someone else's problem until it becomes a critical part of your
business. For us at Yahoo!, running technology and trying to scale technology
with 600 million registered users around the world, that is our problem, and it
has to be. It's the only way that we can successfully execute on that.
You see this with other folks as well. At the Structure
Conference, Jonathan from Facebook was saying they have actually come to that
same conclusion and that now they're actually creating their own data center,
pouring their own concrete and building up.
And they did that because they realized that, you know what,
it was their problem. And they needed to have the level of control and the
level of efficiency that you can derive by owning your own infrastructure.
So for folks of our size, it's unlikely we're going to move
our workload to Amazon or Azure. It just wouldn't make sense for us.
Robert: Thanks for your time.
Todd: Thank you. It
was great to talk to you.