Introduction

Now that we have introduced you to the concept of Power Usage Effectiveness (PUE) and provided a mechanism to calculate this value, I wanted to talk about an idea that may be considered extreme. But just as the extreme nature of Formula One technology eventually ends up in a Ford Focus (and, as I discovered while commuting to campus the other day, so does the driver!), maybe what we are discussing could one day be a feature of a data center near you.

The Challenge—Achieving the Power of One

The central premise of this post is a simple question: what would a data center look like with a PUE of 1.0? A PUE of 1.0 essentially means that all the power consumed in the data center goes to compute operations, with no additional cooling or power overhead. OK, so we would have to turn off the techies’ coffee machine (or maybe attach a USB-powered one to a customer’s server; only joking, some people would call that gaming the metric!).
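For readers who want to see the arithmetic behind that statement, here is a minimal sketch of the PUE calculation, using made-up numbers purely for illustration (they are not measurements from any of our facilities):

```python
# PUE = total facility power / IT equipment power.
# The figures below are illustrative assumptions, not real measurements.

it_power_kw = 1000.0        # servers, storage, and network gear
cooling_power_kw = 350.0    # chillers, CRAC units, fans
power_overhead_kw = 150.0   # UPS losses, transformers, lighting

total_facility_kw = it_power_kw + cooling_power_kw + power_overhead_kw
pue = total_facility_kw / it_power_kw
print(f"PUE = {pue:.2f}")   # 1.50 here; a PUE of 1.0 means zero cooling and power overhead
```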

Thinking Out of the Box

While this goal is impossible with current data center designs, we thought it might be a good way to encourage “out of the box” innovation. So last year we came up with an idea: create a temporary “data center” that used uncontrolled outside air for cooling and fill it with servers that were cheap enough that we didn’t care if they failed, as long as those failures were manageable.

The Benefits of Failure

There is a possible benefit in letting servers fail: failure forces obsolescence and ensures timely decommissioning. In a future blog entry, I will lay out the details of this concept. For now, the key message is that we needed to develop servers that can run in expanded environmental ranges. However, manufacturers will need to be convinced.

One of my ongoing missions (cue Star Trek music) is to drive the industry to broader operating temperature environments to enable huge efficiency opportunities. My last salvo on this front was to attend an ASHRAE event to push this opportunity. Unfortunately, the meeting was dominated by vendors rather than end users and the vendors were all saying that they didn’t have any failure data to justify this aggressive change. Yet I find it interesting that if you check their specifications online, they all support much broader ranges, with most supporting 95F and 80% relative humidity. Nowhere on their Web sites do these vendors even mention ASHRAE standards. So why the mixed messages from the vendors?

In a nutshell, server vendors are unwilling to expand their operating environments because there is no solid reliability data. And the reason there is no reliability data is that no one is collecting it, hence my interest in “Tent City”. In the meantime, my mission to update the ASHRAE guidelines continues…

Thinking Even Further Outside the Box

As a former server designer, I know that server vendors “sandbag” their hardware. Sandbagging refers to the practice of knowing you can do more but holding back to hedge your risks; in reality, I believe that manufacturers can take greater risks in their operating environments and still achieve the same reliability levels. Knowing about the chronic sandbagging in the industry, I thought that if I could run some servers in the Building 2 garage, or somewhere where the equipment is at least protected from the rain, we could show the world the idea is not that crazy and is worth exploring.

Going Outside the Box Completely

I discussed my ideas with my co-conspirator, Sean James, at that time the Facility Program Manager for one of our data centers. Sean was one of the first guys I met at Microsoft, and I absolutely loved his can-do attitude. When I explained my thoughts, his response was simple and brief: “Let’s do it!” Luckily, Sean had some spare decommissioned servers we could put in a rack. So, like good Boy Scouts, we installed this rack under a large metal-framed tent behind his data center in the fuel yard.

Here are some pictures of our setup. First, a couple of shots of our tent, nestled in the corner of the fuel yard. Sean coined the phrase “Tent City” to refer to our efforts.

[Photo: the tent in the corner of the fuel yard]

Inside the tent, we had five HP DL585s running Sandra from November 2007 to June 2008, and we had ZERO failures, which is to say 100% uptime.

Along the way, there were a few anecdotal incidents:

  • Water dripped from the tent onto the rack. The server continued to run without incident.
  • A windstorm blew a section of the fence onto the rack. Again, the servers continued to run.
  • An itinerant leaf was sucked onto the server fascia. The server still ran without incident (see picture).

[Photo: a leaf stuck to the front of one of the servers]

You should bear in mind that we used servers that had already performed enough processor cycles to give a reasonable probability of discovering intelligent life in the universe.

While I am not suggesting that this is what the data center of the future should look like (although I think the Marmot Halo Data Center has something of a ring to it), I think this experiment illustrates the opportunities that a less conservative approach to environmental standards might generate. If we could achieve a PUE that is closer to the magic 1.0, we could substantially reduce the cost associated with running data centers. Just imagine never having to buy a chiller….

Investigating the Next Steps

As we mentioned in previous posts, our goal is to drive PUE as close to 1 as possible, with a target of 1.125 in 2012. So what steps would one have to take to get there?

  1. Make aggressive use of outside air through a process called air economization. We are already using outside air for cooling in our Ireland data center for most of the year, as the temperature very rarely goes above 80F (if you’ve been to Ireland, you’ll know why—it’s called the Emerald Isle because the grass is so green from so much rain). However, we still need mechanical cooling for those rare days when the mercury climbs above 80F. If servers had a wider environmental range, we could eliminate mechanical cooling completely, and we could also use outside air in countries a little closer to the equator. Vendors’ specifications say servers can operate at 95F, so why are we running air under the floor at 60F or below? (The reason is bad data center design.) We should be running our data centers at temperatures that maximize efficiency. End users should push vendors to design their equipment to run at even higher temperatures (at 120F, you could run the data center in a nomadic tent in Saudi Arabia). The point is that there are many opportunities to improve power efficiency: data centers are for computers, not for people. (A rough sketch of this economizer decision logic follows the list.)
  2. Use offline UPS technologies. Most UPSs are online, in that they rectify alternating current (AC) to battery-level direct current (DC) and then invert the battery-level DC back to 120 or 240 volts AC. Although this process provides better power smoothing, the double conversion is inefficient. Offline UPSs eliminate this double conversion, thus providing greater UPS efficiency. While there are issues associated with putting unfiltered utility power to servers, there are mechanisms (such as larger capacitors in the power supplies) that can mitigate the effects of unsmoothed power. If we could replace online UPSs with offline models, the power savings could be enormous. (A back-of-the-envelope calculation follows the list.)
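To make the first point a little more concrete, here is a minimal sketch of the kind of economizer decision logic involved. The thresholds and function name are my own illustrative assumptions, not a description of any real control system:

```python
# Illustrative air-side economizer decision (assumed thresholds, not from any real facility).
# If outside air is cool and dry enough, use it directly; otherwise fall back to
# mechanical cooling. Widening the allowable server inlet range raises these limits
# and increases the number of "free cooling" hours per year.

MAX_INLET_TEMP_F = 95.0        # the broad inlet temperature most vendor specs already allow
MAX_RELATIVE_HUMIDITY = 80.0   # the relative humidity limit most vendor specs already allow

def use_outside_air(outside_temp_f: float, outside_rh_percent: float) -> bool:
    """Return True if uncontrolled outside air alone can cool the servers."""
    return outside_temp_f <= MAX_INLET_TEMP_F and outside_rh_percent <= MAX_RELATIVE_HUMIDITY

# Example: a mild, damp Dublin day versus a hot desert afternoon.
print(use_outside_air(62.0, 75.0))   # True  -> run on outside air, chillers off
print(use_outside_air(104.0, 20.0))  # False -> mechanical cooling still needed
```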
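And for the second point, a back-of-the-envelope comparison of online and offline UPS losses. The efficiency figures are rough assumptions for illustration only, not vendor numbers:

```python
# Rough comparison of online (double-conversion) vs. offline UPS losses.
# Efficiency figures are illustrative assumptions, not vendor specifications.

it_load_kw = 1000.0            # critical IT load fed through the UPS
online_efficiency = 0.92       # assumed double-conversion efficiency
offline_efficiency = 0.98      # assumed efficiency when the conversion is bypassed

online_loss_kw = it_load_kw / online_efficiency - it_load_kw
offline_loss_kw = it_load_kw / offline_efficiency - it_load_kw
savings_kwh_per_year = (online_loss_kw - offline_loss_kw) * 24 * 365

print(f"Online UPS loss:  {online_loss_kw:.1f} kW")
print(f"Offline UPS loss: {offline_loss_kw:.1f} kW")
print(f"Roughly {savings_kwh_per_year:,.0f} kWh saved per year for every MW of IT load")
```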

Summary

In summary, my point is not that we should rush out and build all our data centers under canvas (although I would love to watch Mike Manos’s face if someone did propose putting all the Microsoft data centers into tents). However, I think we should continue to research how we can deploy ultra-low-cost, power-efficient infrastructure in situations where it makes sense. It would be particularly appropriate where applications are more resilient to failure and organizations can tolerate minor outages.

We aren’t the only ones working on this approach. You might want to take a look at the work that Intel started at the same time as us, here: http://weblog.infoworld.com/sustainableit/archives/2008/09/intel_air_side.html.

I leave it to you to think about where this option could make sense. Our team has already committed to investigating these possibilities.

Christian Belady, P.E., Principal Power and Cooling Architect