Roving Services

 

When I set out to post this, I realized I haven’t posted to my blog for a couple of months. My only excuse is that I took more vacation this summer than I ever have before. I spent a lot of time in airports and on airplanes, so I thought I would get a lot of writing done and catch up on my reading backlog. As you might expect, instead of reading any of the pile of books on my desk, I went out and bought new ones. The one I’d like to talk about is Roving Mars by Steve Squyres. In July I did five presentations at an internal Microsoft training conference and Dr. Squyres gave one of the keynotes. He is the Principal Investigator on the Mars Rover project and he gave a fascinating talk about some of the things they learned. I’ve been fascinated by space exploration since Sputnik, so this was a book I had to read.

 

In addition to the science and engineering details, there were a lot of lessons about managing a large project to a tight schedule (because of celestial mechanics, if they missed their date the Rovers would end up in the basement of the Smithsonian). Two things that I think are useful for software architects are decision making and process. They made major decisions by getting everyone who had anything to contribute together, listening closely to what each of them had to say, making the decision (by consensus if possible, by decree if not), and then never revisiting it unless some new circumstances made the decision invalid. When I think back on the time wasted on some of the projects I have worked on by revisiting the same decision over and over, this seems like a great thing to try to implement. Process was the other interesting thing he talked about. For every screw that had to be tightened, they had a process that was followed and someone watching to make sure the process was done correctly. While this much process would bring a software project to its knees, you can see why it’s necessary in an environment where a loose screw can turn millions of dollars of technology into an inert pile of metal on a planet where tightening the screw is not an option. I guess this means that the amount of process required in a project is proportional to the impact of a failure. The software processes used to build a recipe filing application and an air-traffic control system are vastly different, and part of architecting a system is defining how much process is required.

 

Once they got the Rovers built and landed on Mars, they had to drive them around the planet to do the actual research. This was complicated by the distances involved, and for the most part they only talked to each Rover once or twice a day. This meant they had to decide what the Rover was supposed to do, send up the commands, and then wait until tomorrow to see what happened (this wasn’t always the case, but communicating more often was difficult and it makes the story better if we assume it works that way). I’m sure you all thought, as I did, “this is just like a Service Broker service!” OK, maybe not, but I think there are some things we can learn about asynchronous service design from the way they solved this problem.

 

First, the commands they sent up had to be pretty large-grained. If the commands were “drive ahead 3 meters”, “turn right 10 degrees”, “back up 2 meters”, etc., it would have taken weeks to look at one rock. The commands they send specify a whole day’s worth of activity. In many asynchronous environments, especially if the network is unreliable or has a lot of latency, large-grained services can improve performance and network utilization. While coarse-grained service requests have a lot of benefits, they come at a cost. For example, if you want to look at a rock 20 meters away, you could send a command that specifies the number of wheel turns required for 20 meters and tells the Rover to turn its wheels that many times. This would work if you were monitoring the Rover while it was doing this, but with the loosely coupled interface required on Mars, the wheels might slip, a rock might get in the way, there may be a chasm along the way that you couldn’t see from where the Rover started, etc. This means the Rover needs to have enough intelligence to do what’s necessary to get to where it is supposed to go and avoid hazards along the way. Similarly, a Service Broker service should be able to recognize and, if possible, deal with error conditions or faulty commands, and retain enough information about what it did to tell the caller what happened and what went wrong. This means the service must understand how to do a possibly complex series of actions and, more importantly, understand what success looks like and what to do if success isn’t achieved. This may even mean implementing a service as a workflow. Also, be sure to validate the commands. The Rovers always checked the commands for validity and completeness so that they wouldn’t do something stupid because they misunderstood the command. Service Broker will ensure that the message hasn’t been altered or corrupted en route, but it’s still a good idea to check it to make sure it’s valid before your service starts executing it. It may come from a malicious user (or just a careless one).
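
To make the "validate first, then report what happened" idea a little more concrete, here is a minimal sketch in Python rather than in Service Broker's own T-SQL. Everything in it is an assumption made up for illustration: the Step type, the action names, the 50-meter limit, and the shape of the result. It is not how the Rovers or Service Broker actually work, just one way a coarse-grained service might check a whole plan before starting and then tell the caller how far it got.

```python
from dataclasses import dataclass, field

# Hypothetical command vocabulary, invented for this example only.
ALLOWED_ACTIONS = {"drive_to", "take_image", "grind_rock"}


@dataclass
class Step:
    action: str
    params: dict = field(default_factory=dict)


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan looks valid."""
    problems = []
    if not plan:
        problems.append("plan is empty")
    for i, step in enumerate(plan):
        if step.action not in ALLOWED_ACTIONS:
            problems.append(f"step {i}: unknown action {step.action!r}")
        elif step.action == "drive_to":
            distance = step.params.get("distance_m")
            # The 50 m ceiling is an arbitrary illustrative limit.
            if not isinstance(distance, (int, float)) or not 0 < distance <= 50:
                problems.append(f"step {i}: distance_m must be in (0, 50]")
    return problems


def run_plan(plan, execute_step):
    """Check the whole plan before doing anything, then run it step by step.

    The returned dict tells the caller what happened and, on failure,
    which step went wrong and how many steps completed before it.
    """
    problems = validate_plan(plan)
    if problems:
        return {"status": "rejected", "problems": problems, "completed": 0}
    for i, step in enumerate(plan):
        try:
            execute_step(step)
        except Exception as exc:
            return {"status": "failed", "failed_step": i,
                    "error": str(exc), "completed": i}
    return {"status": "ok", "completed": len(plan)}
```

The point of rejecting the whole plan up front is that a day's worth of work is not started on the strength of a half-understood message, and the caller gets back enough detail to fix the request on the next pass.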

 

The other lesson learned is that to be truly reliable, your service has to be able to handle failures gracefully. If the Rover can’t talk to mission control, then it is a hunk of junk. For this reason, it was designed with multiple communications interfaces with failure modes and diagnostic abilities. When the Rover thought it was having communications problems, it started listening at a lower data rate until communications were restored. This made it possible to analyze failures and maybe program around them when communications were flaky. One of the Rovers experienced a file system corruption problem that caused it to reboot every 15 minutes. Fortunately, the software architect had provided a command to allow it to boot without the file system so they could repair it. Reliability requires redundancy and software written to operate in a degraded mode if necessary.
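
The "drop to a lower data rate" trick translates fairly directly into service code. Here is a small Python sketch of the idea, with the caveat that the send callable, the use of ConnectionError, and the specific rates are all hypothetical stand-ins I made up for illustration, not anything from the Rover software.

```python
def send_with_fallback(send, message, rates_bps=(128_000, 32_000, 8_000, 1_000)):
    """Try each data rate from fastest to slowest until one works.

    `send(message, rate_bps)` is a hypothetical transport callable that
    raises ConnectionError when the link fails at that rate.
    Returns (rate_that_worked, errors); rate is None if every rate failed.
    The errors list is kept so the failures can be diagnosed later.
    """
    errors = []
    for rate in rates_bps:
        try:
            send(message, rate)
            return rate, errors
        except ConnectionError as exc:
            # Remember what went wrong, then degrade to the next slower rate.
            errors.append((rate, str(exc)))
    return None, errors
```

Returning the list of failed attempts along with the rate that finally worked is the part I would not skip: running degraded is only half of the lesson, and the other half is keeping enough information to figure out why.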

 

Flexibility is also a great thing to provide for when designing a service. Once the Rovers had proven that they really could drive by themselves, the team was able to program in a more aggressive driving style that let them travel much farther in a day and allowed the Rovers to travel many times farther than they had been designed to travel. The Rovers are now many months into their 90-day mission and still going strong because they were built with enough flexibility and extra capacity to far exceed their minimum requirements.

 

Well, I had better stop before I carry the analogy too far.  It’s a great book.  Read it if you get a chance.  http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?z=y&EAN=9781401308513&itm=2