Intense Computing or In Tents Computing?

Introduction

Now that we have introduced you to the concept of Power Usage Effectiveness (PUE) and provided a mechanism to calculate this value, I wanted to talk about an idea that may be considered extreme. But just as the extreme nature of Formula One technology eventually ends up in a Ford Focus (and, as I discovered while commuting to campus the other day, so does the driver!), maybe what we are discussing could one day be a feature of a data center near you.

The Challenge—Achieving the Power of One

The central question of this post is: what would a data center look like with a PUE of 1.0? A PUE of 1.0 essentially means that all the power consumed in the data center generates compute operations and there is no additional cooling or power overhead. OK, so we would have to turn off the techies’ coffee machine (or maybe attach a USB-powered one to a customer’s server; only joking, some people would call that gaming the metric!).

Thinking Out of the Box

While this goal is impossible with current data center designs, we thought this might be a good way to encourage “out of the box” innovation. So last year we came up with the idea of creating a temporary “data center” that used uncontrolled outside air for cooling and was filled with servers cheap enough that we didn’t care if they failed, as long as these failures were manageable.

The Benefits of Failure

There is a possible benefit in having servers fail, in that this failure forces obsolescence and ensures timely decommissioning of servers. In a future blog entry, I will lay out the details of this concept. For now the key message is that we needed to develop servers that can run in expanded environmental ranges. However, manufacturers will need to be convinced.

One of my ongoing missions (cue Star Trek music) is to drive the industry to broader operating temperature environments to enable huge efficiency opportunities. My last salvo on this front was to attend an ASHRAE event to push this opportunity. Unfortunately, the meeting was dominated by vendors rather than end users and the vendors were all saying that they didn’t have any failure data to justify this aggressive change. Yet I find it interesting that if you check their specifications online, they all support much broader ranges, with most supporting 95F and 80% relative humidity. Nowhere on their Web sites do these vendors even mention ASHRAE standards. So why the mixed messages from the vendors?

In a nutshell, server vendors are unwilling to expand their operating environments because there is no reliable data on reliability. And the reason there is no reliability data is that no-one is collecting this information, hence my interest in “Tent City”. In the meantime, my mission to update the ASHRAE guidelines continues…

Thinking Even Further Outside the Box

As a former server designer, I know that server vendors “sandbag” their hardware. Sandbagging refers to the practice of knowing you can do more but holding back to hedge your risks; in reality, I believe that manufacturers can take greater risks in their operating environments and still achieve the same reliability levels. Knowing about the chronic sandbagging in the industry, I thought that if I could run some servers in the Building 2 garage or somewhere where the equipment is at least protected from the rain, we could show the world the idea is not that crazy and is worth exploring.

Going Outside the Box Completely

I discussed my ideas with my co-conspirator, Sean James, at that time the Facility Program Manager for one of our data centers. Sean was one of the first guys I met here at Microsoft and I absolutely loved his can-do attitude. When I explained my thoughts, his response was simple and brief: “Let’s do it!” Luckily, Sean had some spare decommissioned servers we could put in a rack. So, like good Boy Scouts, we installed this rack under a large metal-framed tent behind his data center in the fuel yard.

Here are some pictures of our setup. First, a couple of shots of our tent, nestling in the corner of the fuel yard. Sean coined the phrase “Tent City” to refer to our efforts.

clip_image002

Inside the tent, we had five HP DL585s running Sandra from November 2007 to June 2008, and we had ZERO failures, or 100% uptime.

In the meantime, there have been a few anecdotal incidents:

  • Water dripped from the tent onto the rack. The server continued to run without incident.
  • A windstorm blew a section of the fence onto the rack. Again, the servers continued to run.
  • An itinerant leaf was sucked onto the server fascia. The server still ran without incident (see picture).

clip_image002[9]

You should bear in mind that we used servers that have already performed enough processor cycles to give a reasonable probability of discovering intelligent life in the universe.

While I am not suggesting that this is what the data center of the future should look like (although I think the Marmot Halo Data Center has something of a ring to it), I think this experiment illustrates the opportunities that a less conservative approach to environmental standards might generate. If we could achieve a PUE that is closer to the magic 1.0, we can substantially reduce the cost associated with running data centers. Just imagine never having to buy a chiller….

Investigating the Next Steps

As we mentioned in previous posts, our goal is to drive PUE as close to 1 as possible, with 1.125 as our goal in 2012. So what are the steps one would have to take to get there?

  1. Make aggressive use of outside air through a process called air economization. We are already using outside air for cooling in our Ireland data center for most of the year, as the temperature very rarely goes above 80F (if you’ve been to Ireland, you’ll know why: it’s called the Emerald Isle because the grass is so green from so much rain). However, we still need mechanical cooling for those rare days when the mercury climbs above 80F. If servers had a wider environmental range, we could eliminate the use of mechanical cooling completely and we could also use outside air in countries a little closer to the equator. Vendors’ specifications say servers can operate at 95F, so why are we running air under the floor at 60F or below? (The reason is bad data center design.) We should be running our data centers at temperatures that maximize efficiency. End users should push vendors to design their equipment to run at even higher temperatures (at 120F, you could run the data center in a nomadic tent in Saudi Arabia). The point is that there are many opportunities to improve power efficiency: data centers are for computers and not for people.
  2. Use offline UPS technologies. Most UPSs are online, in that they rectify alternating current (AC) to battery-level direct current (DC) and then invert the battery-level DC to 120 or 240 volts AC. Although this process provides better power smoothing, the double conversion is inefficient. Offline UPSs eliminate this double conversion, thus providing greater UPS efficiency. While there are issues associated with feeding unfiltered utility power to servers, there are mechanisms (such as larger capacitors in the power supplies) that can mitigate the effects of unsmoothed power. If we can replace online UPSs with offline models, the power saving could be enormous.

Summary

In summary, my point is not that we should rush out and build all our data centers under canvas (although I would love to watch Mike Manos’s face if someone did propose putting all the Microsoft data centers into tents). However, I think we should continue to research how we can deploy ultra low cost, power-efficient infrastructure in situations where it makes sense. Areas that would be particularly appropriate are where applications are more resilient to failure and organizations can tolerate minor outages.

We aren’t the only ones working on this approach. You might want to take a look at the work that Intel started at the same time as us, here: http://weblog.infoworld.com/sustainableit/archives/2008/09/intel_air_side.html.

I leave it to you to think about where this option could make sense. Our team has already committed itself to investigate these possibilities.

Christian Belady, P.E., Principal Power and Cooling Architect

Building a Green Windows Home Server

And now for something a little different. So far we have focused on the energy costs of datacenters, but since patterns & practices is a development team, we obviously have a lot of team members that have, shall we say, a little hardware at home as well.

Ade Miller, our development manager, has an interesting series of posts on his blog describing his foray into building an energy efficient Windows Home Server to back up his other PCs, serve music, and act as a print server. You can read the series starting at http://www.ademiller.com/blogs/tech/2008/09/building-a-windows-home-server-choosing-the-hardware/.

Now that he’s conquered this project maybe he can get started on the TARDIS he keeps promising me.

RoAnn Corbisier

Charging Customers for Power Usage in Microsoft Data Centers

Introduction

Christian’s previous post talks about altering user behavior by changing chargeback models. I would like to thank Christian for his efforts to raise awareness about this new approach to charging for data center services. I believe that implementing chargeback models based on power usage will encourage customers to consider power efficiency more seriously and reduce our overall impact on the environment.

I would like to illustrate this point by using the analogy of getting a first car for a teenager. By the way, if you currently have one of these creatures at home, then you have my commiserations.

Comparing Chargeback Models

Let’s say you are investigating the conditions under which you will allow your teenager to have a car. Consider the following two options:

  1. The teenager buys the car, but you pay for the gas.
  2. You buy the car and the teenager pays for the gas.

For each option, what sort of car do you think your teenager will want?

With option 1, you may find that they come back with a clapped out 351 cubic inch monster with the thirst of a Jentil, maybe something like the “striped tomato” from Starsky and Hutch. And who cares? You’re paying for the gas, right?

But there’s no point in having the coolest car in the neighborhood if you can’t even afford to take a date to a drive-in movie. If you select option 2, you may find your teenager develops a more healthy interest in how many miles per gallon (MPG) something more sensible can manage (like the new Prius hybrids at the Redmond Campus), rather than whether they have enough torque to leave tire slicks longer than a 747 landing at Princess Juliana International airport.

Applying Chargeback Models to Data Centers

I’ve kept this analogy simple to reinforce the message, as driving user behavior through chargeback models in data centers involves more factors than equipping a teenager with a suitable set of wheels. With cars, you have the main mileage-related input from the cost of gas. The service you receive is the number of miles that a set amount of gas will take you. (I’m assuming typical Seattle traffic here, so the fact that you could run a more powerful car with the potential to go faster isn’t a factor, as you’ll still be sat in the same queue over the floating bridge.) If you increase the MPG, you get more output for the same input.

With servers, the situation is more complex; the main issue that we struggle with is the fact that there is no direct equivalent to miles per gallon. How do you measure application output? It isn’t just about CPU utilization. What about an application that makes repeated calls to hard disk but doesn’t use much in the way of processor resources? If you fit more memory to the server, it may be able to cache the disk access calls, but you’re then consuming more power in the additional memory module.

All the standard performance monitoring areas, such as processor, memory, hard disk, network, and cache make varying contributions to power consumption. What we needed at Microsoft was a chargeback model that is easy to understand, straightforward to administer, and allocates data center costs to customers proportionately.

Investigating Chargeback Models

For the Microsoft data centers, the effort to change our chargeback model was not a simple conversion, as it took us one and a half years to move from our previous model of charging for floor space based upon rack utilization to the new model. Not surprisingly, it was not the tooling or process modifications that posed the biggest hurdle but the cultural and political changes that were required. Even today, I frequently have to remind customers that ‘DC space is power.’

For the implementation, we reviewed several methodologies for the chargeback model, ranging in complexity and ease of application. We rapidly discovered that power monitoring at the server is too expensive and complex, even with the newer server motherboards that enable you to measure power usage directly.

Our Customers

You may be wondering as to the identity of our customers. Our customers include all the external Internet services that Microsoft provides, many of which you may already use. These services include:

  • Windows Live Hotmail
  • Windows Live Messenger
  • Live Search
  • Microsoft Online Services (MSN)
  • Microsoft Passport
  • Microsoft Web sites, including microsoft.com
  • Microsoft IT (internal services)

Internal services, such as Outlook Web Access, mobile e-mail, and SharePoint publishing, are supported by the regional Microsoft IT teams.

Defining the Chargeback Model

The final model that we now use has two basic components:

  • Floor Space. This component is billed per kilowatt (kW) of usage and includes all the floor space costs.
  • Power and Cooling. This component is billed per kilowatt hour (kWh) of usage and includes the cost of electricity as billed by our energy suppliers.

I’m not using real figures here, but you should be able to see the basis of how we implemented charging based on power consumption.

Measuring Data Center Consumption

The first figure we can work from is the total power consumption of a data center. That’s fairly easy to discover, as our energy suppliers have a strong vested interest in charging us accurately for the power we use.

Where possible, we select power tariffs that include a proportion of energy generated from renewable resources. The proportion and type of renewable energy varies according to geography. For example, data centers in mountainous regions such as Canada can draw on hydroelectric resources but tend not to do so well with solar energy. Some European countries with a history of windmills (think mice in clogs) are generating significant amounts of electricity from offshore wind farms. Middle Eastern locations look to solar power as a stable and increasingly cheap power source.

Calculating Power Usage Effectiveness

We start by calculating Power Usage Effectiveness (PUE).

PUE = Total Utility Load / Total IT Equipment Power

Note that for PUE, a lower figure is better. There is also the reciprocal of PUE, Data Center infrastructure Efficiency (DCiE), which is expressed as a percentage and defined as:

DCiE = (1 / PUE) x 100 = (Total IT Equipment Power / Total Utility Load) x 100

At 100% utilization, a typical data center consumes 10 megawatts (MW). A more typical utilization figure is 70%, or 7 MW, of which 3 to 4 MW is IT equipment power. These figures give a PUE rating of between 1.75 and 2.33.
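
To make the arithmetic concrete, here is a minimal Python sketch of the two ratios using the 7 MW example above. The function names `pue` and `dcie` are my own for illustration and not part of any Microsoft tool.

```python
def pue(total_utility_load_mw, it_equipment_power_mw):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_equipment_power_mw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_utility_load_mw / it_equipment_power_mw


def dcie(total_utility_load_mw, it_equipment_power_mw):
    """Data Center infrastructure Efficiency as a percentage (100 / PUE)."""
    if total_utility_load_mw <= 0:
        raise ValueError("Total utility load must be positive")
    return 100.0 * it_equipment_power_mw / total_utility_load_mw


# The example from the text: 7 MW total load, of which 3 to 4 MW is IT power.
total_load_mw = 7.0
for it_load_mw in (4.0, 3.0):
    print(
        f"IT load {it_load_mw} MW: "
        f"PUE = {pue(total_load_mw, it_load_mw):.2f}, "
        f"DCiE = {dcie(total_load_mw, it_load_mw):.1f}%"
    )
```

The two IT loads bracket the range quoted above: 4 MW gives the lower PUE and 3 MW the higher one.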

The difference between the Total Utility Load and the Total IT Equipment Power comes from items such as losses in uninterruptible power supplies (between 3 and 7%), lighting, cooling, air conditioning, and fire alarms. We should also not forget the most essential item in any data center: the coffee machine. Obviously, we have to ensure that our charge out rate covers these additional overhead costs.

Rating Devices

To determine the amount of floor space that each customer is using, we rate each model of device currently in our data centers. Not surprisingly, our customers frequently ask, ‘Why are you billing me based on the manufacturer ratings, which are higher than the actual power we use?’ The answer is that we do not utilize the manufacturer ratings directly but carry out our own extensive testing at different utilization levels. This testing enables us to rate each device according to its average expected utilization when online and under load.

An interesting effect of moving our chargeback model driver from racks to kW was that many customers with older servers saw their bills go down. This change occurred because newer servers use more power per rack than older models and are typically less power efficient. However, I should emphasize that this lower efficiency energy usage pattern is because server manufacturers currently concentrate on obtaining higher compute densities, rather than maximizing power efficiency. I would expect newer generations of servers that have been designed for minimal power usage to be significantly better than both the current generation and any older server models.

Increasing Power Efficiency

Examples of power efficiencies include using higher capacity memory modules, installing lower Thermal Design Power (TDP) processors, and powering down redundant components, such as network interface cards and power supplies. I will be discussing these areas in greater depth in later posts.

From our device rating, we then apply a chargeback model similar to the following one:

  • Floor Space: $100 per kW
  • Power and Cooling (Electricity): 10¢ per kWh

If you are a customer with a rack of ten devices rated at 300W, then we would charge you monthly as follows:

  • Floor Space: 10 x 300W = 3kW; 3kW x $100 = $300
  • Power and Cooling: 3kW x 730 hours x $0.10 = $219

Total per month: $519
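
If you would like to try the same calculation with your own numbers, the sketch below reproduces the example bill. The rates are the illustrative ones from this post (not our real figures), and `monthly_charge` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the example


def monthly_charge(device_count, device_rating_watts,
                   floor_space_rate_per_kw=100.0,
                   power_rate_per_kwh=0.10):
    """Return (floor_space, power_and_cooling, total) in dollars per month."""
    rated_kw = device_count * device_rating_watts / 1000.0
    # Floor space is billed on rated kW, not on physical area.
    floor_space = rated_kw * floor_space_rate_per_kw
    # Power and cooling is billed on kWh consumed over the month.
    power_and_cooling = rated_kw * HOURS_PER_MONTH * power_rate_per_kwh
    return floor_space, power_and_cooling, floor_space + power_and_cooling


floor, power, total = monthly_charge(device_count=10, device_rating_watts=300)
print(f"Floor Space:       ${floor:,.2f}")
print(f"Power and Cooling: ${power:,.2f}")
print(f"Total per month:   ${total:,.2f}")
```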

By the way, if you are one of our customers, I did say that these aren’t the real figures that we use. So, no, please don’t contact me for a refund!

Determining Billing Rates

The simplest way to describe how we determine rates is to say that we divide the total costs by utilization. In the case of Floor Space billing, we look at all the operational costs of our facilities, such as lease, depreciation, electrical and mechanical equipment maintenance and support, and so on and compare these costs to the number of kW we expect to use. This approach includes our overhead costs in the chargeback model.

Incorporating Power Usage Effectiveness

We then need to include the cost of electricity to the data center. This leads to another question that customers regularly ask, which is “Why isn’t my rate for electricity the same as our power tariffs from the supplier?” For example, the power rates for Grant County Public Utility District are published at http://www.gcpud.org/customerService/billing/ratesFees.htm. For our Quincy, Washington site, the rate is $.022 per kWh. To answer this question, we explain the PUE associated with each site, and how this factors into the billing rate.

Imagine the same customer with 3kW of servers installed. In one month, using an average of 730 hrs per month, those servers will consume 2,190 kWh. Using a PUE of 2.0 (not our actual PUE), results in a total utility consumption of 4,380kWh, which, at the rate of $.022 per kWh, creates a monthly cost to Microsoft for this customer of $96.36. Incorporating the PUE ratio of 2.0 means that the effective rate that we charge the customer for the 3kW they use is $.044 per kWh.
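
The same reasoning can be written as a short Python sketch. It assumes the simplified model described above, where the customer's share of utility consumption is just their IT load multiplied by the site PUE; the function names are hypothetical.

```python
def monthly_utility_cost(it_load_kw, pue, utility_rate_per_kwh, hours=730):
    """Return (it_kwh, total_kwh, cost) for one month of operation."""
    it_kwh = it_load_kw * hours   # energy drawn by the servers themselves
    total_kwh = it_kwh * pue      # servers plus power and cooling overhead
    return it_kwh, total_kwh, total_kwh * utility_rate_per_kwh


def effective_rate_per_kwh(utility_rate_per_kwh, pue):
    """Rate per kWh of IT load once the PUE overhead is included."""
    return utility_rate_per_kwh * pue


it_kwh, total_kwh, cost = monthly_utility_cost(
    it_load_kw=3.0, pue=2.0, utility_rate_per_kwh=0.022
)
print(f"IT consumption:      {it_kwh:,.0f} kWh")
print(f"Utility consumption: {total_kwh:,.0f} kWh")
print(f"Monthly cost:        ${cost:,.2f}")
print(f"Effective rate:      ${effective_rate_per_kwh(0.022, 2.0):.3f} per kWh")
```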

Improving Power Usage Effectiveness

You can easily see that lowering the PUE immediately reduces the charging rates on customers. This correlation of energy efficiency to the size of their bills enables us to interest customers in the possibility of reducing the energy overhead at the site, for example, by replacing servers with more power efficient designs. Eventually, this new charging model will lead to long-term benefits, not just in terms of costs but also in terms of overall environmental impact.

Replacing Less Power Efficient Servers

Going back to our original example, if you can replace each of those ten devices with servers that provide similar application throughput but only use 120W, then your total monthly bill will shrink to $208, a saving of $311 per month or $3,732 each year. If those new servers cost $1,000 each, your expenditure is paid for in just over two and a half years. If power costs go up, then this payback time will shorten considerably.
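
Here is a rough sketch of that payback calculation, again using the illustrative rates from this post rather than real ones. The helper `monthly_bill` is hypothetical, and the unrounded results differ by a few cents from the rounded figures quoted above.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used in the example


def monthly_bill(device_count, watts_per_device,
                 floor_rate_per_kw=100.0, power_rate_per_kwh=0.10):
    """Total monthly charge: floor space (per kW) plus power (per kWh)."""
    kw = device_count * watts_per_device / 1000.0
    return kw * floor_rate_per_kw + kw * HOURS_PER_MONTH * power_rate_per_kwh


old_bill = monthly_bill(10, 300)   # the original ten 300W devices
new_bill = monthly_bill(10, 120)   # the 120W replacements
monthly_saving = old_bill - new_bill

replacement_cost = 10 * 1000.0     # ten new servers at $1,000 each
payback_months = replacement_cost / monthly_saving

print(f"Old bill: ${old_bill:,.2f}  New bill: ${new_bill:,.2f}")
print(f"Saving:   ${monthly_saving:,.2f} per month")
print(f"Payback:  {payback_months:.1f} months ({payback_months / 12:.2f} years)")
```

Raising `power_rate_per_kwh` in the call shows how quickly the payback period shrinks when energy prices rise.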

Summary

Changing our chargeback model to one that uses power as the basis for floor space makes sense, both for us and for our customers. As older equipment is retired and replaced, we expect to see greater emphasis on power efficiency rather than raw output. Reducing power consumption on individual servers results in a reduction in the total power consumption for the data center, helping to conserve our power bandwidth and minimize our impact on the environment.

Aio!

Amaya Souarez

P.S.

Calling all IT Professionals!

  • Do you run a server at home?
  • Does this server run on your old workstation or a former desktop PC?
  • Would you like to compete in our power saving server competition?

We’re looking to set up an exciting competition for anyone who wants to try their hand at building the most power efficient server possible. The details aren’t finalized yet, but expect a blog posting soon about the terms and conditions, and most importantly, the prizes!

Changing Data Center Behavior Based on Chargeback Metrics

On July 8, 2008, 150 attendees joined in at the Microsoft-hosted National Data Center Energy Efficiency Strategy Workshop. The sidebar opposite (Image 4.1) summarizes the overall aims of this workshop.

During the workshop, I delivered a presentation on “Incenting the Right Behaviors in the Data Center.” If you want to see my presentation, you can review the content at http://www.energetics.com/datacenters08/pdfs/Belady_Microsoft.pdf

And if you would like to see the response from industry observers, check out this link: http://www.networkworld.com/news/2008/070908-good-incentives-boost-data-center-energy.html?page=1.

The two main points in my presentation were:

  • Costs in the data center are proportional to power usage rather than space.
  • Power efficiency is more of a behavior problem than it is a technology problem.

I then went on to discuss the background to these claims.

Charging for Space

Historically, data centers have always charged for space. As a consequence, organizations charged for space in the data center did everything they could to increase server compute density, demanding that more processor cores, memory, and IO be packed into each U in the data center. Server manufacturers responded (as they occasionally have been known to do) by providing just that: servers that are space efficient when measuring processing power against rack space. The downside of these space efficient designs was that power consumption in the racks increased significantly, which made the racks much more difficult to cool.

Coping with Energy Cost Increases

In my Electronics Cooling article from 2007 (http://electronics-cooling.com/articles/2007/feb/a3/), I wrote about the fact that data center infrastructure and energy costs have increased substantially, to the point where they actually cost more than the IT they support. The article notes that in the last decade these costs were negligible and didn’t even show up on the radar screen relative to the IT costs. However, an inflection has occurred and they have now become the primary cost drivers in the data center. Since publishing the article, this effect has been compounded even further by significant increases in the cost of energy. Who could have predicted just a year ago that energy costs would increase like this (although I did, 7 years ago: http://www.greenm3.com/2008/03/christian-belad.html)? With oil spiking to $140 a barrel and electricity costs on the rise, the unpredictability in business costs now centers on energy and not space. So wouldn’t it make sense for businesses to incent their organizations for efficiency?

Data centers must charge customers in a way that more closely reflects the overall costs of running a data center.

Breaking Down Data Center Costs

A breakdown of US data center costs at Microsoft produces what Mike Manos calls the data center “PacMan” – a pie chart that bears a certain resemblance to a ghost gobbling game that I remember from my student days. You can see the “PacMan” chart on Page Four of my presentation.

What this chart shows is the following cost ratios:

Area                     Percentage
Land                     2%
Architectural            7%
Core and Shell Costs     9%
Mechanical/Electrical    82%

Analyzing the Figures

Analyzing this chart further, we see that over 80% of the costs for a data center scale with power consumption and less than 10% scale with space. So, on this basis, why the heck were we charging our customers for space? Our unambiguous conclusion was that our charging models were driving the wrong type of behavior. Basically, dense was dumb. We needed a charging model that reflected the costs that we experienced, and this charging model would then change the behavior of the users of IT in the data center.

Changing the Charging Model

In my presentation, I described how Microsoft now charges for data center services based on a function of kW used. If someone upgrades to a high-density blade server, they do not reduce their costs unless they also save power. This change created a significant shift in thinking among our customers, together with quite a bit of initial confusion, requiring us to answer the stock question “You’re charging for WHAT?” with “No, we’re charging for WATTS!”

Recording the Changes

From our perspective, our charging model is now more closely aligned with our costs. By getting our customers to consider the power that they use rather than space, power efficiency becomes their guiding light. This new charging model has already resulted in the following changes:

  • Optimizing the data center design
    • Implement best practices to increase power efficiency.
    • Adopt newer, more power efficient technologies.
    • Optimize code for reduced load on hard disks and processors.
    • Engineer the data center to reduce power consumption.
  • Sizing equipment correctly
    • Drive to eliminate stranded compute by:
      • Increasing utilization by using virtualization and power management technologies.
      • Selecting servers based on application throughput per watt.
      • Right sizing the number of processor cores and memory chips for the application needs.
    • Drive to eliminate stranded power and cooling: ensure that the total capacity of the data center is used. Another name for this is data center utilization, and it means that you had better be using all of your power capacity before you build your next data center. Otherwise, why did you have the extra power or cooling capacity in the first place? These are all costs you didn’t need.

I will be discussing the concepts of stranded compute, power, and cooling in greater detail in later posts.

Moving the Goalposts

I think it will take quite a bit of time for manufacturers to realize that the goalposts have moved. At present, it is quite difficult to get the answer to questions such as “What is the processing capacity of your servers per kilowatt of electricity used?” However, I do believe this change will come, which will drive rapid innovation along an entirely different vector, where system builders compete to create the most energy efficient designs. The benchmarking body, SPEC, has already started down this path with their SPECpower benchmark, but this needs to be done with applications.

Summarizing the Vision

I would like to end with a quote from my friend James Hamilton, who in his blog wrote about a forum he participated in around the time of the EPA Workshop.

“Our conclusion from the session was that power savings of nearly 4x were both possible and affordable using only current technology.”

James’s comment (about his session) is exactly correct: the technology is already there, and we just need to grab it. This supports my initial point that we need modified behaviors to do that. Today’s charging models that align costs with the volume a customer uses in the data center do not provide any motivation to save power. However, with the right level of incentives, power-based charging could drive a new and dazzling era of change in the computing industry. We should do what we can to help achieve this vision.

Author

Christian Belady, P.E., Principal Power and Cooling Architect

Part 3—"What's Your PUE Strategy?"

This third and final article describes how you can start adopting PUE in your datacenter, and how Microsoft has benefited from its long-term use of the PUE metric.

If you do not have anything to adjust that can change your PUE value, you will not be able to take action. However, it is surprising just what you can do to improve your PUE, as you will see when we look at some results Microsoft has achieved.

Implementing PUE Strategies

Examples of best practices you can use to improve PUE are listed in Microsoft’s Best Practices for Energy Efficiency in Microsoft Datacenter (see http://download.microsoft.com/download/a/7/b/a7b72ab1-ca17-4589-923a-83b0ff57be6d/Energy-Efficiency-Best-Practices-in-Microsoft-Data-Center-Operations-CeBIT.doc).

The following "top ten" best practices will help you to develop your own strategy to improve your PUE by looking at the big picture:

  1. Engineer the datacenter for cost and energy efficiency.
  2. Optimize the design to assess multiple factors.
  3. Optimize provisioning for maximum efficiency and productivity.
  4. Monitor and control datacenter performance in real time.
  5. Make datacenter operational excellence part of organizational culture.
  6. Measure power usage effectiveness (PUE).
  7. Use temperature control and airflow distribution.
  8. Eliminate the mixing of hot and cold air.
  9. Use effective air-side or water-side economizers.
  10. Share and learn from industry partners.

Three Stages of PUE Implementation

Stage 1 – Sneaker Net

Measuring PUE can start with a "sneaker net" approach (MBWA, or measuring by walking around with a clipboard) to collect the meter readings for the entire datacenter facilities, and the associated amount that can be accounted for by the IT load. It could be as simple as reading the output of your critical load UPS units. Finding your own PUE is the first step. It sounds so simple, but it is amazing how few employees know their PUE. When you do find your PUE, share the number. Identify the range of your PUE by collecting data regularly, and displaying the results for others to see.
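
As a rough illustration of this first stage, the Python sketch below turns a clipboard log into a PUE range. The readings are made-up sample values, and the assumption that UPS output stands in for the IT load follows the suggestion above; the function name `pue_range` is hypothetical.

```python
# Each entry: (date, facility meter reading in kW, critical-load UPS output in kW).
# These values are invented for illustration only.
readings = [
    ("2008-09-01", 1400.0, 700.0),
    ("2008-09-08", 1350.0, 710.0),
    ("2008-09-15", 1420.0, 690.0),
]


def pue_range(log):
    """Return (lowest, highest, mean) PUE across a list of manual readings."""
    values = [facility_kw / it_kw for _, facility_kw, it_kw in log]
    return min(values), max(values), sum(values) / len(values)


low, high, mean = pue_range(readings)
print(f"PUE range: {low:.2f} to {high:.2f} (mean {mean:.2f})")
```

Even a handful of weekly readings like this is enough to start sharing a number and watching how it moves.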

Stage 2 – Instrumented Data Acquisition, Some Sneaker Net

The next step is to implement automated meter reading, but don’t expect to have a solution for all devices before you begin this stage. Most power and cooling equipment has the capability to report meter readings, but getting data from all the equipment can be difficult at first until you discover the protocols for each device and network all of the devices. You can build your own meter reading solution, use one supplied by the equipment vendor, or purchase specialist third-party software. Many IT management tool companies have started to integrate power management capabilities into their tools. Alternatively, there are standalone products like Microsoft Dynamics or the OSIsoft PI System.

Stage 3 – Real Time Data, 100% Automated Meter Reading

The ideal stage is when you have real-time energy consumption data available for the whole datacenter, including environmental conditions. Then, once you are collecting all the numbers, you can build a dashboard such as that shown in Figure 1.

image

Figure 1 - Screenshot of Microsoft Data Center Dashboard

In the same way as a Network Operations Center (NOC) watches the datacenter operations, additional screens integrate monitoring and provide automated alerts for power and cooling system efficiency. This type of dashboard is useful in the previous stages, of course, but is most effective at this stage.

With constantly changing equipment deployments, load, and environmental conditions, PUE is a dynamic number that will move within a range. Advanced users have the right tools and information to measure continuously how far they are from the optimal PUE, and to detect out-of-range conditions automatically.
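A range check of this kind can be sketched very simply. The function name, the sample values, and the 1.7 to 2.0 band are assumptions for illustration; a real deployment would take its band from its own history.

```python
def out_of_range(samples, low, high):
    """Return the (timestamp, pue) samples that fall outside [low, high]."""
    return [(ts, value) for ts, value in samples if not (low <= value <= high)]


# Made-up samples, as if read from an automated metering feed.
samples = [("09:00", 1.82), ("09:05", 1.85), ("09:10", 2.31), ("09:15", 1.79)]

alerts = out_of_range(samples, low=1.7, high=2.0)
for ts, value in alerts:
    print(f"{ts}: PUE {value:.2f} is outside the expected range")
```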

PUE in Action

Most of the news around Microsoft’s datacenters is about its state-of-the-art facilities currently under construction (see http://download.microsoft.com/download/a/7/b/a7b72ab1-ca17-4589-923a-83b0ff57be6d/Energy%20Efficiency%20in%20Datacenters-022808-Med.wmv). Less well known is the effort Microsoft puts into improving datacenter energy efficiency in legacy facilities. Figure 2 shows the PUE value for a Microsoft facility from August 2004 to August 2007. After 2 years of energy improvements, the PUE for the site improved by 25%.

image

Figure 2 - Actual data from a Microsoft datacenter from Aug 2004 to Aug 2007

Examples of non-intuitive changes made to the site, which proved effective in reducing PUE, were cleaning the roof and painting it white, and repositioning concrete walls around the externally-mounted air conditioning units to improve air flow. Both of these changes were validated as effective by measuring the effect on PUE.

Experience at Microsoft has shown that using PUE as a common metric when designing new datacenters and evaluating new technologies can save energy. Figure 3 shows Microsoft’s goals in new datacenter construction, aiming to apply the equivalent of Moore’s Law by doubling the datacenter power and cooling infrastructure efficiency every two years.

image

Figure 3 - Annual Avg PUE Targets for New Datacenter Construction

Voluntary PUE Disclosure

Electricity costs are rising faster than any other costs in the datacenter. Electricity used by datacenters is one of the fastest growing segments of energy use. Combine the rising cost per watt with rising overall consumption and you can see why government officials are concerned about the energy consumption of datacenters. To collect more information about datacenter efficiency, the Environmental Protection Agency (EPA) has asked for a voluntary disclosure of PUE (see http://www1.eere.energy.gov/industry/saveenergynow/partnering_data_centers.html).

Call to Action

By now, you should be convinced that measuring and monitoring your PUE, and other factors associated with datacenter efficiency, is vital in today's economic climate. Here are some action points that you can use to drive your transformation into an environmentally friendly and reduced cost datacenter operation:

Don’t wait for the perfect time or the perfect tool
What is your PUE? What was your PUE? What should your PUE be? PUE is a simple number that all your datacenter and IT teams should know.

Start collecting data for PUE
Get a clipboard, read the meters, calculate your PUE.

Then measure other things
Once you know your PUE, start collecting data on power subsystems, cooling subsystems, IT equipment performance, and carbon emissions.

Provide organizational incentives
Ultimately, PUE provides the ability to measure efficiency of your datacenter. If you want to improve the efficiency of your datacenter, you need to think about how you integrate PUE into your organization’s metrics. Microsoft has made uptime and PUE its top metrics for datacenter managers to report on.

Base chargebacks on power efficiency
Instead of basing the chargeback costs of a datacenter on total power consumed, base them on a portion of power consumed that relates to PUE, so that managers have an incentive to improve efficiency.
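The article does not spell out a formula, so the following is only one possible reading: bill each business unit for its IT energy multiplied by the facility PUE, so that the power and cooling overhead shows up in the charge. The function name, the electricity rate, and the energy figures are all assumptions.

```python
def chargeback(it_kwh: float, facility_pue: float, rate_per_kwh: float) -> float:
    """Charge for IT energy plus its share of power and cooling overhead."""
    return it_kwh * facility_pue * rate_per_kwh


# Same IT energy, hypothetical rate of 0.10 per kWh, two PUE values.
before = chargeback(it_kwh=10_000, facility_pue=2.0, rate_per_kwh=0.10)
after = chargeback(it_kwh=10_000, facility_pue=1.5, rate_per_kwh=0.10)
print(f"charge at PUE 2.0: {before:.2f}")
print(f"charge at PUE 1.5: {after:.2f}")
```

Under this assumption, a manager who lowers the PUE lowers every charge that depends on it, which is the incentive described above.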

Consider datacenter manager bonuses
Finally, do not forget to reward energy savings and PUE improvements. Bonuses and recognition awards for energy savings make sense when you want to make energy efficiency part of everyday tasks.

This is the last of a short series of articles that describe how Microsoft uses Power Usage Effectiveness (PUE), an industry standard metric for the efficiency of a datacenter. Being able to measure and monitor the effective power consumption of a datacenter in terms of the computing power it contains provides a way to ensure that you make best use of resources while minimizing your environmental footprint.

Authors

  • Mike Manos, General Manager Data Center Services
  • Christian Belady, P.E., Principal Power and Cooling Architect

Part 2—"Why is Energy Efficiency Important?"

This second article explains why energy efficiency is vitally important in today's economic climate.

Figure 1 shows a graph of the annual amortized cost of a 1U server plotted against datacenter infrastructure costs and power costs. In 2001, the sum of infrastructure and energy costs was equal to the cost of a 1U server. In 2004, the infrastructure cost alone was equal to the cost of the server. In 2008, just the energy cost was equal to the cost of a server. These energy costs are numbers that all IT staff should be aware of when calculating TCO. Yet most companies do not have the ability to provide specifics for their company, because they are still living in the 1990s when infrastructure and energy costs were not really significant.

Image2-1

Figure 1 - Christian Belady Feb '07 Electronics Cooling Magazine
http://www.electronics-cooling.com/articles/2007/feb/a3/

Obstacles to the Adoption of PUE

Our observation is that companies are slow to adopt PUE. Why? As you improve the efficiency of your datacenter, it will actually be more difficult to identify opportunities to save energy. The interrelationships between systems make analysis more complex, and require you to look at the problem in a holistic way. This provokes the interesting question of how to approach the problem of energy efficiency in the datacenter. Can you see the holistic view, and tell whether your energy efficiency changes work?

From the experience Microsoft gained during presentations at industry events, the number of people using PUE has been increasing. However, it is still an amazingly low proportion for something that we believe every datacenter operator should use in order to understand the efficiency of their datacenter operations.

What are the obstacles? We believe they are:

Fear of the new
Are datacenter managers frightened of a new metric? How do they know whether the number that they calculate for their datacenter is good or bad? Maybe that is why the early adopters have been those people willing to take risks to discover how their datacenter is performing. IT can be a risk-averse culture that discourages change, but changes are required for efficiency. Change can be painful, and PUE tells you if the change is worth the effort in the big picture of datacenter efficiency.

It's nobody's job
In some organizations, measurement is a function that straddles organizations or departments. Therefore, IT cannot successfully measure performance, or may not have access to the required information to be able to carry out the calculation or interpret the result.

Wrong incentives
What are your metrics for your datacenter? Do you have any for energy efficiency? Can you negotiate between energy efficiency and SLAs? Do you base your datacenter manager bonuses on PUE improvement? Are chargebacks based on PUE improvement? When chargeback costs in the datacenter are based on a portion of power consumed, managers are more diligent about right-sizing for power and minimizing the power consumed. As a side note, much of the industry is still charging for space because density, rather than power efficiency, has been the driver in server design.

Don’t understand how simple it is
PUE is a simple metric that requires only two numbers as input: total facility power and IT equipment power. The result is the ratio of the first to the second, which shows the power overhead for each unit of IT load. A PUE value of 2.0 means that for every watt used by IT equipment, another watt is used in the overheads of delivering power and removing heat.
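Written out, with the symbols chosen here purely for illustration, the relationship between PUE and overhead looks like this:

```latex
\mathrm{PUE} = \frac{P_{\mathrm{total}}}{P_{\mathrm{IT}}},
\qquad
P_{\mathrm{overhead}} = P_{\mathrm{total}} - P_{\mathrm{IT}}
                      = (\mathrm{PUE} - 1)\, P_{\mathrm{IT}}

% Worked case: PUE = 2.0 and 1 W of IT load
P_{\mathrm{overhead}} = (2.0 - 1) \times 1\,\mathrm{W} = 1\,\mathrm{W}
```

By the same arithmetic, a PUE of 1.5 corresponds to half a watt of overhead per watt of IT load, and a PUE of 3.0 to two watts.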

Worried about perfection
We have seen some people hesitate to take up PUE because they over-analyze the issues and strive for perfection using real-time data collection. Some people worry about whether they are (or should be) measuring PUE, DCiE (the reciprocal of PUE), or something else. Do not fall into these traps. Just get the two numbers you need.
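Because DCiE is just the reciprocal of PUE, converting between the two is a one-line calculation, so there is little reason to agonize over which one to collect. The sketch below reports DCiE as a percentage, which is an assumption about presentation rather than something this article specifies.

```python
def dcie_percent(pue: float) -> float:
    """DCiE as a percentage: the reciprocal of PUE multiplied by 100."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 100.0 / pue


for value in (1.5, 2.0, 3.0):
    print(f"PUE {value:.1f} is a DCiE of {dcie_percent(value):.1f}%")
```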

What is a Good PUE?

When no data is available, a standard assumption is a PUE of 2.0. However, datacenters can achieve a PUE as low as 1.5 and it can be as high as 3.0. What is the difference? Look back at Figure 1 to see the cost ratios for a PUE of 2.0. Figure 2 shows the cost ratios for a PUE of 1.5, and Figure 3 shows them for a PUE of 3.0.

Image2-2 

Figure 2 - Effect of 1.5 PUE on infrastructure costs, Christian Belady, Chris Malone, “Metrics and an Infrastructure to Evaluate Data Center Efficiency” Proceedings of IPACK2007

Image2-3

Figure 3 - Effect of 3.0 PUE on infrastructure costs

Using the PUE value and datacenter costs for your infrastructure and IT Equipment, you can create your own graph. The authors believe this is one of the best ways to educate people, and show how important it is to get your team working together to promote awareness and provide an incentive to improve your PUE. Improving your PUE will not only improve your OPEX, but also improve your infrastructure CAPEX per critical load supported.
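A rough starting point for such a graph is to compare annual energy cost at different PUE values. This sketch covers energy only, not the amortized infrastructure and server costs shown in the figures, and the 1 kW load, the 0.10 per kWh rate, and the function name are assumptions to replace with your own numbers.

```python
HOURS_PER_YEAR = 8760


def annual_energy_cost(it_load_kw: float, pue: float, rate_per_kwh: float) -> float:
    """Annual energy cost for the whole facility at a constant IT load."""
    return it_load_kw * pue * HOURS_PER_YEAR * rate_per_kwh


# Crude text chart: one '#' per 100 currency units of annual cost.
for pue in (1.5, 2.0, 3.0):
    cost = annual_energy_cost(it_load_kw=1.0, pue=pue, rate_per_kwh=0.10)
    bar = "#" * round(cost / 100)
    print(f"PUE {pue:.1f}  {cost:8.2f}  {bar}")
```

Even this crude chart makes the point that the same IT load costs twice as much to power at a PUE of 3.0 as at a PUE of 1.5.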

As you measure PUE across all your datacenters, you will be able to perform side-by-side comparisons of different facilities. As one datacenter makes an improvement, the change can be validated at another datacenter to confirm it as a best practice before rollout to all facilities. This does not mean all datacenters are expected to run at the same PUE value, but differences should be understood and long-term trends tracked. For example, two identical datacenters in different locations may have different PUE values due to conditions such as weather and the specific IT equipment configurations.

In the next article in this series, you will see how you can start adopting PUE in your datacenter, and how Microsoft has benefited from its long-term use of the PUE metric.

Authors

  • Mike Manos, General Manager Data Center Services
  • Christian Belady, P.E., Principal Power and Cooling Architect

Microsoft’s PUE Experience—Years of Experience, Reams of Data

This short series of articles describes how Microsoft uses Power Usage Effectiveness (PUE), an industry standard metric for the efficiency of a datacenter. Being able to measure and monitor the effective power consumption of a datacenter in terms of the computing power it contains provides a way to ensure that you make best use of resources while minimizing your environmental footprint. This first article introduces PUE and looks at the issues that it can help you to resolve.

Part 1—"What Color is your Datacenter?"

Imagine if a child were to draw a picture of your datacenter. Does it look green, or is it a glowing orange or even as black as night? Look at the individual pieces of equipment in your datacenter—are any of them green?

If you want the picture of your datacenter to look greener (more energy efficient), you could try upgrading items to more energy-efficient equivalents, as if they were pieces of a puzzle that can simply be replaced. This upgrade method is what many companies are using as a way to convince themselves that they are reducing energy costs. The problem is that, unless you look at the big picture and understand how the pieces fit together, you could end up being disappointed with the outcome.

000693

Figure 1 - Seurat's Sunday Afternoon on the Island of La Grande Jatte
Georges-Pierre Seurat/The Bridgeman Art Library/Getty Images

Figure 1 shows an example of how the painter Seurat demonstrated a scientific approach to painting called pointillism, where the artist uses a combination of colored dots to create an image that is harmonious and effective, while minimizing the number of colors used. This approach is analogous to management telling their datacenter team, “I want a good looking picture where everything works together and uses as few resources as possible.”

A simple idea needs a simple metric to work. In Seurat's paintings, it is a visual test. For a datacenter, it is an efficiency value: "Tell me what the energy overhead is to run the IT equipment". Microsoft has used this approach for years, and when industry groups like The Green Grid started promoting a metric for datacenter efficiency, Microsoft was an early supporter of and contributor to the standard because it already had years of experience with its own datacenter efficiency metrics.

Getting the Picture Right

The last thing you want to do when measuring efficiency is create a picture that requires the viewer to cross their eyes and squint to see the magic hidden content. Seurat had the luxury of 2 years and over 60 sketches to support his technique for painting static views. However, datacenters are dynamic entities with thousands of interactions. When Microsoft took occupation of a legacy facility, the datacenter team spent 2 years evaluating hundreds of possibilities to drive a 25% improvement in the power and cooling infrastructure energy efficiency.

One of the most effective metrics Microsoft has used to get the correct view of datacenter efficiency is Power Usage Effectiveness (PUE). PUE is the total facility power consumption divided by the IT equipment power, providing a ratio of the power and cooling overhead required to support a unit of IT load.

image001

By using PUE, Microsoft now has a historical record of energy efficiency, and the facilities to measure the efficiency of power and cooling systems.

A datacenter is a complex system and it is hard to get an overall view of its operation. Microsoft uses PUE to step back and see the big picture, and yet still keep in focus how it all fits together. Many teams find it hard to zoom in on a detail and debate its merits without losing the context of how it affects related systems. One Microsoft datacenter engineer explains the benefits of using PUE compared to just measuring the total energy cost like this:

"I was constantly bombarded with vendors offering ways for my datacenter to run more efficiently. I also would hear many testimonials (bragging) from other facility managers who claimed to have squeezed more efficiency out of their DC. Since most savings on power can be turned into additional capacity, any efficiency gained could not be measured by simply looking at the power bill for energy savings."

Without a metric like PUE, the engineer could not measure the datacenter efficiency to see if it had improved.
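The engineer's point can be seen in a small worked case. In this sketch the facility draws the same total power before and after an efficiency project, because the saved overhead was reused as extra IT capacity. All figures and names are made up for illustration.

```python
def pue_from_kw(total_kw: float, it_kw: float) -> float:
    """PUE is total facility power divided by IT equipment power."""
    return total_kw / it_kw


# Hypothetical facility: the power bill is unchanged, but more of it reaches IT.
before = {"total_kw": 1000.0, "it_kw": 500.0}
after = {"total_kw": 1000.0, "it_kw": 625.0}

print(f"total power: {before['total_kw']:.0f} kW before, {after['total_kw']:.0f} kW after")
print(f"PUE: {pue_from_kw(**before):.2f} before, {pue_from_kw(**after):.2f} after")
```

The bill alone shows no change, while the PUE shows that the facility now supports more IT load for the same total power.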

So, how do you approach energy efficiency in the datacenter? Do you go for the low hanging fruit? Some options are the use of hot and cold aisles, raising room temperatures, and installation of more efficient cooling systems. Can you integrate these options into a strategy rather than a random approach? If you go after the low hanging fruit without measuring the overall effect, how do you know you are not playing a datacenter game of “whack a mole” (http://en.wikipedia.org/wiki/Whack-a-mole) by temporarily conquering energy inefficiency in one area until another problem pops up somewhere else?

Given the complexity of datacenter design and operation, energy efficiency changes must be closely monitored for efficacy and overall effect. As you play with the control knobs making changes to your datacenter, you need to see the results. PUE is your indicator of whether things actually got better or worse.

In addition, there are other less obvious benefits from measuring and monitoring your PUE. For example, it exposes an energy-based model where power and cooling infrastructure and energy costs are allocated accurately. You can use the numbers to provide accurate information for accounting department actions, such as chargebacks. Inaccurate chargebacks against IT costs will create unintended consequences as business units try to manage and minimize IT costs. Accurate reporting of energy use allows people to gauge the impact of their actions.

In the next article in this series, you will see in more detail why energy efficiency is vitally important in today's economic climate.

Authors

  • Mike Manos, General Manager Data Center Services
  • Christian Belady, P.E., Principal Power and Cooling Architect

Welcome

Welcome to The Power of Software blog, a new undertaking by the patterns & practices team. As you may know, our traditional focus has been on building guidance that helps software architects and developers successfully design and build applications.

This blog is a slight departure from that. We’re exploring ideas relating to Green IT and the ways we, as a company, can use energy more efficiently. Some currently planned subjects include ways to save energy through the use of software and ways to optimize datacenters. All posts will be written or reviewed by subject matter experts, just like other patterns & practices projects.

We hope this starts a dialog with the community—please let us know the topics that interest you.

RoAnn Corbisier
Editor
