-
Who in their right mind agrees to fly all morning from Seattle to Minneapolis, drive to a local software testing special interest group (SIG), talk for 90 minutes, and fly back home that same night? For a number of reasons, the most important of which was being home in the morning to help get my kids ready for school, I did, and it was a very good experience. This is the story of my experience along with a few reflections on traveling for the Benchmark QA forum and TwinSPIN SIG on Nov. 5, 2009.
About 6:30 in the morning I went into my 8-year-old son’s room to wake him up for school. He rubbed his eyes for a moment and then bolted upright in bed. “Dad, you’re back already,” he exclaimed.
“No, my trip is today, not last night,” I replied.
The kids quickly got ready for school and as my flight didn’t leave until after 10am, I was able to drop them off at their classroom door. Seattle had on her typical October gray cap of clouds. Even after sixteen years of living in the Northwest I wasn’t sure if it was just going to be a cloudy day or if we would see some actual rain. It wasn’t as if the weather would have any real significance on my travel plans but it is always fun to escape bad weather when traveling as opposed to flying into it. The weather stayed dry as we boarded the plane but just as we were taxiing for takeoff the clouds finally let loose a light shower, just enough to darken the tarmac and run rivulets down the little portal windows of the plane. Seattle was therefore rainy and the only question was whether or not Minneapolis would be better.
On the flight I went through the presentation one more time and shuffled a few slides to suit my mood. Once that was complete, I filed a bunch of emails into folders in Outlook, drafted some notes on feature ideas for Office 15 (yes we will ship Office 14 in early 2010 so it’s time to start thinking of what’s next), and then began writing this blog post and a few others. I finally found time to start writing my blog series on Testing in Production (TiP).
Cool Crisp day in Minneapolis
The flight out was a bit turbulent but we arrived early. So early, in fact, that there was no one at the gate to let us de-board the plane. There is nothing worse than being in the back of a plane, watching it pull into the terminal, seeing the fasten your seatbelt light go off, standing up and pulling your bag from the overhead compartment, and then just waiting. In this case we waited an extra twenty minutes before someone finally arrived to open the door.
Larry Decklever, the founder and President of Benchmark QA, met me just outside the airport. On the drive to the event we talked a little about his company, golf, and oh yes some points about testing. As we drove, I took the chance to appreciate the clear blue skies and the wonderful crisp cool fall air. For this particular day at least Minneapolis weather beat Seattle.
At the Benchmark QA Offices I was introduced to Molly and Cindy who together helped me find some food, hook my laptop up to the projector, and settle in. We had some 90+ attendees registered for the event and they began showing up right on time for the pre-event networking hour.
Show time!
My presentation, “Testing in the Cloud,” is not about testing on any specific cloud infrastructure. It doesn’t focus on Microsoft’s Azure, Amazon’s AWS, or Google’s App Engine. It encompasses a series of concepts and shows how most of them apply equally to testing a web service whether that service is built out on bare metal hardware or runs as a virtual machine (VM) in a cloud, or in a hybrid mode like my favorite example from SmugMug.com. I also love to share a YouTube audio clip of Larry Ellison, CEO of Oracle, and what I consider to be the best rant on cloud computing ever, “What the hell is cloud computing?”
In the talk I cover many concepts best summarized in James Hamilton’s paper, “On Designing and Deploying Internet-Scale Services,” and the multiple presentations he has given on the topic. My presentation, though, introduces a little bit on Cloud Computing, a little on designing services correctly, and then spends the next 45 minutes discussing why Testing in Production (TiP) is vital as we move into the Cloud era.
Given that TwinSPINers seem to have much more experience testing software and embedded devices than I will likely ever amass, I expected several challenges on the topics. Instead I was pleased to see many nodding heads.
When I got to the last slide, the hands shot up and we had probably 30 minutes of Q&A. Interestingly the majority of the questions were about how Microsoft approaches this test problem or that test problem. I could have replied, just go buy “How We Test Software at Microsoft,” as all those questions are answered in the book, but I didn’t. I discussed how we use Static Code Analysis, that we have almost as many full time testers as we do developers at Microsoft, and how we heavily emphasize test automation.
For all those TwinSPINers that came up afterwards with questions and praise, thank you. Within ten minutes of finishing I was packed up and jetting back to the airport. Within an hour I was through security and quickly choking down a Philly Cheesesteak sandwich in the food court. Yes, what was I thinking, eating a Philly Cheesesteak made in a Minneapolis airport?
It was just past midnight when I got home. My wife left the lamp on my nightstand turned on so that I wouldn’t have to stumble in the dark. My six year old daughter had written me a little book welcoming me home and telling me how much she loved her daddy. My son, who is both a bit of an engineer and a bit of a writer, built me a little box filled with little two inch square pieces of paper so that I could write my own book about my trip. I think however I will leave this blog as my trip report and use the little papers to write him a story of how much he is loved by his Daddy.
· The slides for this presentation can be found here.
· Also the slides from my presentation last month at STARWest can be found here.
· Lastly it looks like I’ll be at the Better Software Conference this coming June. I’ll post upcoming talks to my personal website here.
Thanks to Larry, Molly, Cindy, and the TwinSPINers for a fun whirl-wind trip.
-
All three authors of "How We Test Software at Microsoft", Bj Rollison, Alan Page, and Ken Johnston actually made it for this recording. Warning, I had a bit too much fun with my green screen. The discussion covers how different teams within Microsoft use Agile, Waterfall, Spiral and Feature Crew approaches for their engineering life cycles. New music, “Works on My Box,” by Art Leonard.
-
Recently I ran into a friend who was reading chapter 14 of How We Test Software at Microsoft. He commented on the picture in the book of the Rackable Ice Cube container full of servers. This was pretty cool to think about purchasing servers pre-racked in a cargo container. We joked that for our next purchase of machines for the lights out automation lab we could just order one of these, park it on the fourth floor of the parking garage near the big fan that pulls air up and out of the garage, and then run a few extension cords and Ethernet cables down to it. Done.
We'd have our new machines online in no time and we'd lose just a couple parking spaces. That was better than giving up office space for lab machines.

Of course it wouldn't really be that easy but it is fun to think outside of the box, or cargo container, at times.
Cargo Container Data Centers are Real and growing
The funny thing was my friend thought that this whole Cargo Container idea was new and cool. He got excited all over again when on June 29th Microsoft announced that it would be bringing two new data centers online. The one in Chicago is a huge facility designed to be a container based data center. Even though I wrote about this in HWTSaM, I only did enough research to present what it was. I did not take the time to research the origins of container based data center modules nor really research where we might be headed. For this post, I decided to spend a few hours digging through old blog posts and updating my knowledge. I found a few very interesting posts and links that are well worth sharing with everyone.
First off, Google was awarded a patent on container based modular data centers. The patent was originally filed back in 2003. Recently Google announced it has been using containers in its data centers since 2005. Below I have a link to a video on the subject.
So, clearly this idea has been around a few years and has even been used in production for several years. The new Microsoft data center looks to be very advanced in terms of power management and cooling. We can see that the innovation isn't just about racking machines in a big metal box; a lot of innovation is happening in this space, including in the actual design of the data center. Check out this cool animation Microsoft produced a while back on a concept container based data center. I love the thought that the data center might not even need to have a roof on it.
No Roof and No Floor, let's take the data center off-shore
Back in 2004 I was working on the new SDET career stage profiles with several members of the Microsoft Test Leadership Team (MS TLT) when we wandered off topic. Electrical power costs were spiking again and our lab budgets were getting squeezed. One of us, I'm pretty sure it was Darrin, chimed in and said, "We should just take all the lab machines, put them on a boat and park it in whatever harbor has the cheapest power and be done with it. If they raise prices we'll just unplug the machines and drive the ship to a new location." Everyone thought it was a great idea, if a bit impractical with current technology.
We probably should have explored the idea some more. I found another Google patent application and this one is for a water-based data center. They have some really good ideas on heat management and using the ocean to generate some of the electricity needed to run the data center. The more I think about a water-based data center the more I see a tie in for an action movie. I could see a serious James Bond or Mission Impossible sequence here where the agent has to sneak aboard a floating data center, find the right server, and copy the hard drive to a portable device without being detected.
Seriously though, there are some very interesting financial and legal advantages to this idea as well as good old-fashioned positive environmental impacts. On the financial side you can avoid real estate taxes and avoid dealing with local permitting processes when you don't use land. If the servers are offshore, do local laws around gambling or sales tax still apply for transactions that happen on these servers? If the data center is in international waters, what laws apply and how do you assess the impact to GNP for a country? Also the ocean is a much better vehicle for releasing the heat generated by servers so a floating data center would certainly need much less air conditioning to cool the equipment. This would have a very positive impact on the data center's overall Power Usage Effectiveness.
Most of the links below can also be found by tracking James Hamilton's blog, Perspectives. I have him in my RSS feed and recommend his blog to everyone tracking innovations in high scale computing.
Cargo Container Data Centers
Google's Cargo Container DC Links
Water Based Data Center Links
-
This BLOG post is a thank you to all the individuals who have read “How We Test Software at Microsoft,” and posted something on the web about it (that Alan, BJ and I have been able to find, that is).
We are thrilled by the very positive response readers of HWTSaM have had. There are several five star, four star, and even a ten out of ten review of the book. If you’ve read the introduction to the book you know the lead author, Alan Page, didn’t originally feel the world needed yet another book on software testing but after a bit of reflection he saw how a book about Microsoft testing would be worth writing.
We didn’t write HWTSaM to be the ultimate book on software testing but rather to be a good companion to the other excellent books that have come before it. Regardless of the intended design, all three of us want the book to do well and for the first few months after the release we have been hovering over the Amazon rankings and scouring the Internet for comments and reviews. Through all this activity we’ve learned a few things about promoting a book that make us a bit more educated but not experts.
1. It pays to know your niche market and the popular bloggers
2. Claiming our book and making an Author Page on Amazon.com is cool but doesn’t drive sales.
3. Climbing to #1 on the New York Times Best Seller list or Amazon.com is not going to happen (today’s book was “Common Sense”)
4. It is important to say thank you when someone praises your work
We have said thank you directly to most bloggers and reviewers but I wanted to collect a bunch of them together and say thanks in a more public forum such as this.
In the book I wrote or researched many of the sidebar stories so I thought it might be a good way to start this thank you blog with a personal story.
I call this one the “Perfect and Unexpected Plug.”
More than one hundred senior engineering managers and architects along with a few executives gathered this week to discuss some of the evolving ideas around Office 15 for a one day offsite focused on improving the Office engineering system. One of the key elements of the meeting was to bring in new thinking from outside of the Office organization. The first guest presenter was Mike Kelly who had been part of the Office organization before joining the Microsoft core engineering strategy team Engineering Excellence. In his presentation he shared some really amazing prototypes of new collaboration and build system tools as well as some industry analysis. The second presenter was Craig Fleischman, also a former Office manager who was now working on Windows. Craig presented some of the plans the Windows team has for Windows 8.
Right about now most readers are probably hoping I’ll spill some information about Windows 8 and Office 15 but sorry, I can’t do that. All I can say is that while neither Windows 7 nor Office 14 has shipped, we are hard at work developing plans for the next versions.
Craig finally got to a slide about Windows 8 and services. As this was an event focused on engineering systems rather than services and features he announced he would skip most of this slide. It was a long slide that slowly built with three different sections and more than fifteen bullets. Craig paused on the last bullet and simply said, “I don’t have time to go over this whole slide so I’ll just cover the last bullet. Read chapter 14.”
I hadn’t actually been paying much attention. That’s the problem with laptops and WiFi: it’s too easy to respond to the inbox. Speaking of focus, I should re-read Alan Page’s recent blog posts on productivity and distractions. Somehow I did hear “skip the slide on services” and “Read chapter 14.”
This caught my attention. I paused and looked up. All across the room multi-taskers were lifting their heads up from laptop screens and looking directly at Craig. The whole room was actually engaged, wondering what Craig meant by “Read chapter 14.”
A nervous energy shot up my spine and a wide smile broke out across my face. I suspected what this mysterious chapter 14 might be but I wasn’t certain enough to say anything.
Craig attempted to move on to his next slide but of course he was interrupted.
“Chapter 14 of what,” someone yelled out.
This seemed to catch Craig off guard. Of course everyone in Office knew about and must have read chapter 14 by now. “You know the services chapter,” he said. Blank eyes stared back at Craig. “Chapter 14 from Ken Johnston’s book,” he said.
A little “yes,” escaped my lips. I had been plugged!
Certainly I could not have asked for a better plug for one of my chapters in, “How We Test Software at Microsoft.” Here we were in a gathering of the most senior engineers from across Microsoft Office, one of the most successful businesses in the history of Software. Most of the attendees were not even testers, and now they were left wanting to know more about this mysterious chapter 14; the chapter 14 that Craig said must be read. I don’t think one could plan a better hook than that!
Craig was able to continue to his last slide but the room still wanted to know more about this book and Chapter 14. Fortunately I’d remembered one lesson from John Kremer’s book, “1001 Ways to Market Your Books”: an author should always have a copy of their book on hand. I had three and made them available in short order.
During the break I came up to my friend and colleague Craig and jokingly asked who I should make the check out to. Craig commented that he thought we needed a core set of reading for everyone who moves from software into services and this should be one of those. Again this was high praise and I was thankful.
The end of the “Perfect and Unexpected Plug.”
This experience reminded me to be thankful for all the blogs and reviews HWTSaM has received so far. I know Alan, Bj and I read every one and we always try to comment or contact the author to say thank you.
What follows are excerpts from several blog postings and reviews of How We Test Software at Microsoft. Most are linked to the official book site at http://www.hwtsam.com.
BLOGs and Book Reviews and Links:
· Microsoft Press has been a great partner to work with. Here is a portion of the Model Based Testing chapter with Graphics
· Six Reviews on Amazon.com. Here’s just a couple of quotes (the nice ones):
o Sally Foster “History buff” gave HWTSaM 5 out of 5 stars, “The writers are drawing from experience, they understand testing software, and more importantly, they understand how to position a tester, and a test team, for success. This book goes far beyond Kaner's "Testing Computer Software", and is a must for any software tester who is passionate about shipping quality products.”
o Manfred Dietz gave us 4 out of 5 Stars, “So, why not 5 stars? Because you guys did not mention anything about metrics and its influence on our work and the results.”
· Barnes & Noble has two five star reviews. Boulderdash wrote “A best practice book, it is loaded with real life experience of the authors…”
· Asaf – Boulderdash Blog “Alan, Ken and Bj have divided the chapters authoring among them. Each has his own way of writing, although different in style, the final result is excellent. I highly recommend the book for all those who are into software testing”
· Michael J. F. (SQA Blogs) “The excellent explanations of Equivalence Class Partitioning and Boundary Value Analysis are among the best I have ever read.”
· The Evil Tester posted a review on Compendium Development “The first 2 chapters present Microsoft as a great company to work for, one that really values the testing staff and reads as the best recruitment literature for any company I've ever read.”
· James Whittaker was one of the first to comment on HWTSaM back in January. James has a new book coming out soon that we look forward to reviewing. “…it will also be the year that I expose more insider details about testing culture and practice at Microsoft…Although, much of the thunder has already been aired by my colleagues in their new book How We Test Software at Microsoft. That book's a good read and a high bar for me to match when my own book comes out in a few months.”
· Linda Wilkinson, Practical QA “This book grabbed me right away; it was a glimpse into the culture of a vast, complex, and interesting company with some challenges that are unique in the field. And after reading this book, I’m STILL fascinated”.
o William Echlin commented, “..this is a book that on the face of it, I would not have attempted to read in a million years. Yet based on what you've said this in now somehwere near the top of my must read books.”
· iTWire Book review by David M Williams “All in all, this is an impressive work with a great deal of wisdom and principles – underpinned by sound theory – that would be of interest to any company that produces software of reasonable complexity.”
· Debra Martinez review on StickyMinds “This book has made its rounds in my testing department. There is not a day that goes by when I am not asked if I still have the book. I feel this book is great addition to any testing department”
· Michael Hunter the Braidy Tester “HWTSAM is chockablock full of details regarding fundamental testing techniques, strategies, and processes which I believe every tester should be familiar with (even if you disagree with the utility of some of them).”
· Kawal Banga on BCS Book Reviews scored 10 out of 10, “More than a million test cases were written for Microsoft Office 2007 and the automated tests for many Microsoft products have more lines of code than the products they test… All in all, this is an excellent book, and should be on every tester's bookshelf.”
· Phil Kirkham on Expected Results “I found it to be an excellent book, lots of tales from the trenches, explanations of the problems MS faces, how they try to overcome them - all intermingled with general testing theory.”
· Javier Andres Caceres Alvis and his Windows Mobile, Testing & Multi-core programming group used HWTSaM for several group discussions.
Thank you to everyone who has commented on HWTSaM, whether positive or less so. For those that were less positive I’m sorry I didn’t include direct links to your comments but I know Alan, Bj and I have read and reflected upon them. There are a few foreign language postings and a video podcast that I didn’t include in this article and I’m certain I missed some review somewhere. My apologies.
Thank you all,
KJ
-
Alert: this post has nothing to do with S+S or operations or testing but it’s a small slice of life that I just had to share.
“P” words such as penultimate, pontificate, plethora, plebian, and polyglot have failed me more than once.
Have you ever found yourself reaching for that perfect word? Sometimes it’s when you are writing but other times it might be in the middle of a conversation or, heaven forbid, a debate around the conference table or marker board where you don’t have the luxury of looking up the definition to make sure you are picking the right word. There you are in real time reaching back for that perfect word and one comes to you and out it goes. Everyone pauses, turns and looks at you and you realize that is not the word you meant to use. So much for my credibility.
In my life that has happened mostly with a set of multi-syllabic “P” words. I’ve used penultimate instead of ultimate to mean the pinnacle or zenith, polyglot instead of plethora, and pontificate thinking I was just being smart. I’ve worked through my issues with all those words except for penultimate.
penultimate - second to last: second to last in a series or sequence, “the penultimate chapter”
Definition from Encarta.msn.com
It’s just not one of those words used much in American English. We are so focused on the winner or the next big thing or sometimes even the underdog that we rarely consider what or who came next to last. It doesn’t matter to us whether it was an elite group to be a part of in the first place. If you didn’t finish first or second, it doesn’t matter.
Some notable penultimates of the past few months of 2009:
· Kentucky Derby – Friesan Fire
· Indianapolis 500 – Driver Ryan Hunter-Reay
· Car and Driver 10 Best Cars – 2009 Porsche Boxster and Cayman
· Overall American League Standings (as of this post) – Baltimore Orioles with a .431 winning %
· My son’s finish in the sack races at Field Day this year
I have decided, however, that it is time for me to find a way to work the word penultimate into my life, and in order to do that I have emphasized it with my children. We unofficially created a new iHoliday last year at the end of school. We call it “The Penultimate Day of School!”
I’ve been ruminating on this idea for about a year now. Last year, my kids were all excited that the next day would be their last day of school for the year and they were upset that it wasn’t here yet. I then seized upon the opportunity to explain to them that this day was very important. It was the penultimate day of the school. When I explained what penultimate means, they got so excited that they started announcing it to everyone they saw. “I’m brilliant!” I thought to myself. But actually, by their enthusiasm, they created a day for which Hallmark should make cards.
Let’s face it, penultimate is a very funny sounding word. Most adults that hear it will give you quite the odd look. But once my kids learned the definition, the penultimate day of school was born. We all agreed to go to school that morning and share this new and titillating word by greeting everyone we met with “Happy penultimate day of school.”
The children in their classrooms all giggled at this greeting, but had no clue what it meant. The teachers were all taken aback to hear a word they didn’t know from such a small child using it with significant confidence.
At the end of the day there were a dozen or so individuals who now knew what the word penultimate meant and they now had a good place to use it, at least once a year.
Now there are many other good places to use the word penultimate, the penultimate lap of the Indy 500, or the penultimate game of the season, or the penultimate batch of salmonella tainted produce are all good examples. If I were to write an article on the rankings of search engines I would love to write, “Microsoft’s new Bing service is climbing but ask.com is still the penultimate search engine.”
Still it feels awkward to use the word penultimate and it is such a good word. The world needs a time and a place to use the word penultimate and I have decided to try and help this great word along. I am creating and promoting a new holiday, one born solely of the internet and social networking. This new holiday will be called the “Penultimate Day of School” and will be celebrated at every school in every part of the world and will be celebrated on the next to last day of school.
The rules of the Penultimate Day of School are quite simple. Students are encouraged to hail their fellow students and faculty with, “Happy penultimate day of school.” They are also encouraged to use as many big words as they can in conversation. They can even carry a thesaurus with them and whenever possible substitute a multi-syllabic word for a more common word. Penultimate Day of School is a day to revel in the use of really, really big words.
Since everybody is so excited for the start of summer, the best place to use the word penultimate is to describe the next to last day of a school year.
For 2009 I have a modest goal to simply double the involvement of parents and students in Penultimate Day of School. I have launched a Facebook page to promote the event and soon should have a SharePoint site where I hope students may post Penultimate Day of School thank you notes to teachers and faculty. Happy Penultimate Day of School, everyone!
I promise that the next post will get back to S+S testing. For more on that subject read chapter 14 of “How We Test Software at Microsoft.”
-
Today I hosted four hours of interactive learning on S+S testing with table topics such as "Testing in Production, How far can we go?" and "Release Cadence in an S+S era." Every time I get together with smart engineers, new and better ideas are generated.
One interesting example that came up in the afternoon session was the impact background or maintenance tasks can have on a data center's infrastructure. In this particular example a rather large Microsoft service was getting ready for a big launch and needed to upgrade thousands of servers in a data center to the latest version of the service. Well the deployment of this service is fully automated (I'll write a post ranting about the importance of deployment in the near future) and so with the push of a single button the deployment was off. The "bug" if you will occurred because all the machines became very busy pulling down bits, conducting reads and writes to disk, and actually hit a higher average CPU utilization than they would during normal production use. This massive load actually caused power failures in the data center. So where is the bug?
Should we fix this in software or rely on a new policy of never upgrade thousands of machines at the exact same time ever again? This is a very interesting edge case and I don't have the answer. I offer it up as an example of how much there is to consider and learn as we move into S+S and Web 2.0 worlds with cloud computing and multiple devices.
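One way to picture the "fix it in software" option is a rollout that caps how many machines upgrade at once. This is only a minimal sketch under my own assumptions: the names `batches`, `staged_rollout`, and the `upgrade` callback are hypothetical and do not describe the actual deployment system from the example above.

```python
from typing import Callable, Iterator, List


def batches(servers: List[str], max_concurrent: int) -> Iterator[List[str]]:
    """Yield successive groups of at most max_concurrent servers."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    for start in range(0, len(servers), max_concurrent):
        yield servers[start:start + max_concurrent]


def staged_rollout(servers: List[str], max_concurrent: int,
                   upgrade: Callable[[str], None]) -> int:
    """Upgrade servers one batch at a time and return the number of batches.

    `upgrade` is a hypothetical per-server upgrade step. This sketch calls it
    sequentially; a real system would run each batch in parallel and wait for
    it to finish (and for load to settle) before starting the next batch.
    """
    batch_count = 0
    for group in batches(servers, max_concurrent):
        for server in group:
            upgrade(server)
        batch_count += 1
    return batch_count
```

The batch size is the knob that bounds peak disk, network, and CPU load, and therefore peak power draw. Choosing it well still requires knowing what the data center can actually sustain, which is part of why this remains an open question rather than a solved one.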
A few more ideas that I gathered today: measure not just time to deploy but also time to roll back in case the deployment is flawed; decide whether to target the 75th or the 95th percentile when measuring and signing off on Page Load Times (PLT); and include post-RTW measurements in the Release Criteria before you really sign off. All of these are great ideas that are at least a new twist on an important topic if not completely new. The great thing is they all came up during the training session today and I was lucky enough to hear them. Though most of the content is not public I will dive deeper into some of the hot topics next week.
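To make the 75th versus 95th percentile question concrete, here is a small sketch using the nearest-rank definition of a percentile. The `percentile` helper and the sample load times are made up for illustration; they are not measurements from any real service.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up page load times in seconds, for illustration only.
plt_seconds = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.4, 6.0]

p75 = percentile(plt_seconds, 75)  # 1.8
p95 = percentile(plt_seconds, 95)  # 6.0
```

With this sample the 75th percentile looks healthy while the 95th exposes the slow outlier, which is the trade-off behind the sign-off question: the higher percentile holds you accountable for the unluckiest users.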
The other experience I had today was delivering a webinar to a SIG in Bogota over Live Meeting. This session was in support of the book I helped write with my colleagues Alan Page and B.J. Rollison titled “How We Test Software at Microsoft.” For more information on the book visit www.hwtsam.com. For this session I was to deliver some content on Chapter 14, which focuses on S+S Testing, and then answer some questions.
Doing a Webinar with Q&A can be challenging. Add to it the translation piece and no video of the audience to register their reaction and it becomes very challenging.
In the Webinar I introduced the topic of Testing in Production (TiP) for services. It is a growing field of thought within Microsoft and from what I can tell a process used very heavily by some of our competitors. The notion that one would ship something into production and then test it seems anathema to software testers. Needless to say this became the major topic of Q&A.
The real way to look at TiP is to ask what can safely and effectively be tested in production. The next question is to ask how to make testing in production a fast turnaround process that is cheaper than testing in a lab. When price and speed of production testing are lower than labs, and we are getting there with cloud computing, then you really should move all the testing that you can out of the lab and into production.
I have an article under way just on TiP and hope to publish it in the near future.
Thank you to Javier Andres Caceres Alvis for the opportunity to discuss S+S and Services testing with your group.
-
Think about Services shifting the Testing into Production
The topic of Testing in Production (TiP) is a growing area of debate in SaaS and S+S testing groups. While I don’t personally really believe you should ship your service and then start testing it, I often introduce the topic this way. It is very challenging to get testers to consider shipping anything that hasn’t been thoroughly tested. The notion that a tester could build out an adequate test environment or set of environments and find every single major bug in their service is just as flawed as ship then test.
I use this “Ship first and then Test,” approach to jar audiences of testers and to move the conversation from what must be done in a lab to what can and would be best tested in production. I will try to make a much longer post on the subject in the future but for now simply think about these questions.
How often have you shipped a service and found that you missed a bug because the test environment wasn’t enough like production? Now ask yourself: as the scale and complexity of services increase, is it reasonable to think you can really make test enough like production that you can catch every bug?
The process of thinking about what can be tested in production can be an exhilarating exercise. You quickly get to a discussion around what testability features need to ship with the service to make this possible. What impact can TiP have on revenue and customer experience, how do we isolate this impact, and what tests are best conducted in labs are other very important questions.
Think about it and let me know what opinions you have and questions you would like answered.
-
1+1 redundancy for production services is a flawed design approach.
1+1 redundancy is like the kind of logic my wife uses with me when I go on an overnight business trip. She will insist that I take at least two pairs of socks in my bag even though I plan to be home the very next evening. The logic is that I might accidentally step in a puddle and would need a clean pair of socks. Yes, she also wants me to take an extra pair of shoes but let’s just stick with the socks for now. Certainly there is a chance that I might step in a puddle and need an extra pair of dry socks, but I find that when such an incident does occur and I need my extra pair of socks, something else tends to happen, like my flight getting canceled and me being forced to stay over an extra night.
To be clear, I am very pleased that my wife insisted on the extra pair of socks and I wish I’d listened to her about the shoe thing too. The real problem is that, given a chance, more than one thing will eventually go wrong. Then you find yourself in a long line in the airport waiting to go standby on a flight that only gets you halfway to your destination and you wonder where that awful smell is coming from. Then you realize you are the one wearing the dirty socks from the day before and you are stinking up the place.
When I talk about 1+1 redundancy, it is usually in terms of two servers performing the same role like a pair of SQL Servers doing log shipping to keep each one up to date or a pair of routers with redundant routing tables. There is a great paper on flatter network architecture from Microsoft Research on the Monsoon project that you can find here. Whatever device or service you want to imagine is just fine. The key point is that they are a pair doing the same job and they are designed in such a way that if one should fail the other will pick up and charge forward. Unfortunately that just isn’t good enough.

Figure 1: The Monsoon project (see paper here) flattens the network architecture and moves networking to commodity hardware. This reduces cost and spreads risk.
In services 1+1 redundancy does not equal 2
The thing about 1+1 redundancy that people often forget is that when the 1 goes down you have lost your safety net and now you must react quickly in case the +1 should also fail. If this was the only exposure we had, then maybe we could live with it, but the reality is that ongoing maintenance and patching take our 1+1 down to just the +1 far more often than failures alone would. If you add up all the maintenance windows for a service, you will probably find something on the order of 0.005% of a year is spent in maintenance. In other words, we are at 1+1 redundancy just 99.995% of a year.
Even the best of our services struggle to maintain four 9’s of availability. It is therefore reasonable to expect that over the course of several years both units in a 1+1 configuration will experience coinciding down times. In my reviews of scores of critical outage summary reports, I have seen this pattern of cascading failures in 1+1 redundant topologies time and again.
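To put rough numbers on that exposure, here is a back-of-the-envelope sketch in Python. It assumes the 0.005% maintenance figure applies to the pair as a whole, that each unit on its own achieves four 9’s, and that individual outages do not overlap. The names are illustrative only and the model is deliberately simple.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def minutes_per_year(fraction):
    """Convert a fraction of a year into minutes."""
    return fraction * MINUTES_PER_YEAR


# Figures from the text; applying four 9's to each unit is an assumption.
maintenance_fraction = 0.005 / 100  # 0.005% of a year spent in maintenance
unit_availability = 0.9999          # four 9's per unit

# Time the pair runs on a single unit: planned maintenance plus the
# unplanned downtime of either unit (assumes those outages never overlap).
unplanned_fraction = 2 * (1 - unit_availability)
exposed_fraction = maintenance_fraction + unplanned_fraction

print(f"maintenance:   {minutes_per_year(maintenance_fraction):.1f} min/year")
print(f"unplanned:     {minutes_per_year(unplanned_fraction):.1f} min/year")
print(f"no safety net: {minutes_per_year(exposed_fraction):.1f} min/year")
```

Under these assumptions the pair spends a little over two hours a year with no safety net at all, and every one of those minutes is a window in which a second failure becomes an outage.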
1+1 is hard on operations and adds to the COGS (Cost of Goods Sold) for a production service.
We’ve established that due to maintenance windows and cascading failures 1+1 is at an above average risk of having both units fail at the same time. To date we have little automation for repair and failback in these situations, so most services make up for this by increasing staffing levels in operations. Of course, this increases COGS.
Since I have managed operations teams, I can say confidently that this architecture bears a heavy burden for the on-call Operations and Product Engineers. The added cost is not when both fail, but when just one device fails. Every engineer immediately knows that the safety net is now gone and the risk of a second failure is looming. It becomes an urgent rush to get back to 1+1 redundancy as quickly as possible. This is true even in maintenance.
We hire smart engineers to be good at quickly performing repetitive manual tasks because the risk of a customer-impacting outage is high. That is not the most efficient use of our engineering talent.
The solution to the problems of 1+1 is 1+N.
For me, 1+N is the right solution as long as N is >= 3 and all of the units are fully active. I have had debates with individuals that had 1+1 topologies with a pair of warm spares. They would insist that the warm spares should count. In another posting I’ll go deeper into the problems with non-active backup solutions.
If everyone agrees that 1+N is the right way to go and most services launch with some portions of their service in a 1+N configuration, then why do we have so many places where we truncate down to 1+1? The answer is as simple as dollars and cents. The answer is a bad assessment of COGS and risk. The answer is a poor application of the phrase “high availability”.
Everyone knows to avoid any single point of failure in a service so we design to eliminate them. The mistake in going with 1+1 usually occurs around the very expensive devices and the decisions folks make there to “save” money.
Big routers are very expensive. Big load balancers are very expensive. SQL Servers with multi-terabytes of unique user data are very expensive. We can’t have a single point of failure but we can’t afford more than one extra of any device in the system or the COGS model will be too high. That leads to the 1+1 mistake. Perceived cost and a need for some kind of redundancy cause teams to make this mistake time and again.

Figure 2: Portion of a physical topology diagram stacked by rack. Note that the SQL Servers are spread across 2 racks to reduce the risk of a power failure hitting both; however, they are still being launched as pairs.
Operations and Test need to drive out 1+1 during design reviews
In my role I do a lot of service topology reviews. I love the big Visio diagrams printed on a large plotter with all the little pictures of servers, machine names and IP addresses listed all over the place, and lines for logical network connections. Don’t get me started. One of the things I do in these reviews is look for places in the diagram where there are two instances of something. In some cases, say an edge server that copies bits from corpnet and is not really part of production, having two of that machine “role” makes economic sense, but that is rare in these reviews. Here are some questions I like to ask when going through one of these topology reviews.
1. How can you say we are blocked by a technology limitation when we picked the technology? Can we pick something new or write it ourselves?
2. We could run this on lower end hardware couldn’t we? That would give us more instances of the same device wouldn’t it? Would this get us at least to 1+3?
3. Test to see if we can combine this machine role with another. If we can combine them then we can flatten the architecture.
4. Software can automate many processes. Can we automate the replication of data so we have more than two instances?
5. Can we at least look at how to break the data store down into smaller stores with a hashing algorithm? Can we then get our 1+1 exposure down to less than 5% of our user base per pair?
6. Microsoft engineers, operations and product team are on the hook if we have a production outage. If one of these devices goes down in the middle of the night, will you simply roll over and go back to sleep or will you immediately begin to troubleshoot? If not, why not? Are you willing to be on call for the next 365 days to respond to any outage? Fine. You wear the pager and you can keep your design.
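The hashing question in the list above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular service partitions its data: the function names, the user ID format and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def pairs_needed(max_percent):
    """Fewest pairs so that each holds strictly less than max_percent
    of the user base, assuming users spread evenly across pairs."""
    return 100 // max_percent + 1


def pair_for(user_id, num_pairs):
    """Map a user ID to a pair index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pairs


num_pairs = pairs_needed(5)  # "less than 5% of our user base per pair"

# Synthetic user IDs, purely for illustration.
users = [f"user-{i}" for i in range(100_000)]
counts = Counter(pair_for(u, num_pairs) for u in users)
largest_share = max(counts.values()) / len(users)

print(f"{num_pairs} pairs, largest share of users on one pair: {largest_share:.2%}")
```

Twenty pairs would put exactly 5% of users on each, so staying under 5% takes at least 21. A real hash will not spread users perfectly evenly, so it is worth measuring the largest share rather than assuming the average.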
When I do these reviews I don’t like answers such as “the technology won’t let us do it that way” or “team X does it this way so we should too” or, worst of all, “we just don’t have time to do it right.” Those answers ring hollow. 1+1 just isn’t redundant enough and everyone should stop defending it and get on to good design.
As a parting thought, consider the reliability of the US Space Shuttle program where they have five computers involved in making critical decisions for takeoff and landing.
Four identical machines, running identical software, pull information from thousands of sensors, make hundreds of milli-second decisions, vote on every decision, check with each other 250 times a second. A fifth computer, with different software, stands by to take control should the other four malfunction.
Charles Fishman, “They Write the Right Stuff,” Dec 18 2007
Despite 1+4 redundancy, and with billions of dollars in hardware, software and fuel as well as human lives at stake, the Space Shuttle program has had 2 major disasters. One occurred during takeoff in 1986 and the other upon re-entry in 2003. If you count that as 2 failures in 132 flights, the program has a 98.5% reliability rating. Please don’t think that I’m denigrating the space program, as I have been a fan since my father worked with the shuttle astronauts at the Houston space center decades ago.
The takeaway here is that virtually all of our services aim for 99.75% or higher availability but NASA with 1+4 redundancy has not been able to achieve that level. Yes space flight is much more risky than services but my point is that critical systems need more than one level of redundancy and this is just one of many examples from multiple industries we can cite.
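For anyone who wants to check the arithmetic, here is a small Python sketch using only the figures quoted above. Treating each flight as one pass/fail trial is a simplification, and the variable names are illustrative.

```python
failures = 2
flights = 132
target = 0.9975  # the 99.75% availability most services aim for

reliability = 1 - failures / flights
# How many failures a 99.75% target would allow over the same number of flights.
allowed_failures = flights * (1 - target)

print(f"shuttle reliability: {reliability:.1%}")
print(f"meets 99.75% target: {reliability >= target}")
print(f"failures allowed at target: {allowed_failures:.2f}")
```

At 99.75%, 132 flights leave room for only a third of a failure, so a single loss would already have put the program below the bar most services set for themselves.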
I fully and firmly believe 1+1 redundancy will not produce high availability and will be higher cost than a 1+N solution.
In the end, it really is like the dirty sock analogy. It’s nice to have an extra pair, but if you travel enough, you will probably run into a situation where you will have both wet feet and a canceled flight. In the case of socks, it’s just about a bit of discomfort and rude odors. In the case of services, though, we have customer impact and wear and tear on our staff as they struggle to keep a flawed design functioning. When our customers experience poor service due to bad design, they are left smelling our foul stench.
This is the first of what I hope to be a regular set of blog postings. The focus of this blog will be on topics that intersect service design, testing and operations. I have many ideas for future posts but I’d like to hear from you. If you’ve read this post and found it useful and would like information on another topic please send your suggestion directly to me. Thank you in advance for any comments you may have on this post (even disagreements) and suggestions you send my way.
For more content on testing Software Plus Services see chapter 14 of “How We Test Software at Microsoft.”