Microsoft offers (for Premier support customers only) a Microsoft Office SharePoint Server Risk Assessment product… basically a way to analyze your SharePoint farm (including WSS, product naming aside) for 450+ known potential issues, along with real, road-tested solutions to those problems. A portion of this is some investigation (we call it a survey) to find out about all the things that may not sit on your computer… business process, operational type stuff… and the topic of disaster recovery is addressed… and the conversation frequently goes something like this:
I am being a bit flippant in the example… but this isn’t far from conversations I have actually had with customers, and it demonstrates that we need to understand the differences between a backup strategy, a disaster recovery plan, and fault-tolerant system design.
Hopefully it is clear that although these terms get mixed together frequently, they are NOT synonymous. Each of them is an attempt to solve a different kind of problem. For example, it is perfectly acceptable to have excellent disaster recovery plans on systems that are NOT fault tolerant… as long as you accept that you may experience service unavailability should a server fail or need maintenance. It is also perfectly acceptable to have a fault tolerant system that has no true disaster recovery strategy… though that would be a little like buying car insurance that was only valid while your car was in the garage.
So… what are your options? That’s mostly been covered in other papers (e.g., here, and here, and here)… but here’s a quick (and possibly incomplete) chart:
Notice anything? NOTHING in the above list offers “Disaster Recovery” or “High Availability”. That is because DR and HA are strategic objectives that require planning, documentation, coordination, and yes, tools. Backup tools are valuable only in the context of a well-designed Disaster Recovery strategy. Fault tolerance is most valuable when deployed in alignment with a High Availability strategy and an HA goal, objective, or target. Yes, having backup and fault tolerance can be helpful even if you haven’t put together a complete strategy, but in the absence of such a strategy you don’t truly know what you’re protecting yourself against.
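To make the idea of an HA “target” a little more concrete, here is a minimal sketch (in Python, purely for illustration) of the arithmetic behind an availability percentage. The function name and the sample targets are my own assumptions, not part of any Microsoft tool or of the Risk Assessment… the only point is that a target turns into a downtime budget you can actually plan against.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime a given availability target permits per period.

    availability_pct is a percentage such as 99.9; period_days defaults to one
    (non-leap) year. Both names are hypothetical, chosen for this example only.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The downtime budget is the fraction of the period NOT covered by the target.
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Sample targets are illustrative assumptions, not recommendations.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Whether that budget has to cover planned maintenance as well as failures is exactly the kind of decision that belongs in the strategy, not in the tool.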
One more thing… a note on virtualization, “snapshots”, and SAN capabilities…
First, virtualization snapshots. These may provide a possible back-out strategy for changes in SharePoint… but only in one hotly debated scenario… and even then, Microsoft still does not recommend the use of snapshots for SharePoint products. While we specify Hyper-V in the linked article, the same reasons would apply to any technology that performs snapshots of active machines, including non-Microsoft virtualization products and SAN products. Because these methods are clearly not recommended, and Microsoft's support for them is questionable at best, they should absolutely not be included in any DR or HA strategy and should generally be avoided.
Also, this doesn’t address 3rd party solutions, but the fundamental requirements would be the same… no technology solution should exist that doesn’t directly support and meet the needs of an underlying strategic objective. Knowing what your 3rd party solution will or won’t protect you from should be critical to your decision-making process. (With SharePoint, the best solutions will integrate directly with the Windows Volume Shadow Copy Service (VSS) and the SharePoint VSS writer… ask your vendor!!)
Post a comment if you have any questions, want more detailed information specific to your needs, or even disagree with me… technical debates are fun and informative! :)
I was just in one of your SP Developers Class and did not expect on Day 3 of my New Job MOSS would crash and no one knew how to recover it. So this post is very timely. I was able to troubleshoot and bring the system back up (VM Networking problem). So now my focus is making sure that I never have to do this again. Thanks