A big interest of mine is designing SharePoint farms to be highly available through good architecture & solid design, something I’ve posted about quite a bit on this blog over time. This article summarises the high-availability strategies available for SharePoint and then touches on other common areas that cause SharePoint farms to fail, as a sort of grand wrap-up to the whole HA-SP series so far.
In reality, the question of “how can we avoid downtime” comes down to how much you care about users having no or limited access or, put more simply, how much money are you willing to spend to make sure things run smoothly?
Has your SharePoint farm gone down before? This is your guide to making sure it doesn’t happen again. If your farm has gone down once, it’s probably because, in the nicest possible way, your organisation or company just didn’t invest enough in keeping it online, and there was almost certainly something that could’ve been done to stop a failure from becoming a fatal blow to SharePoint running normally.
In short, keeping SharePoint online means designing a fault-tolerant architecture, coding customisations & apps in a well designed and tested manner, and implementing good SharePoint governance. First though, the architecture…
So as previously mentioned in my blog, some of the tools & tricks in the high-availability SharePoint toolbox are:
The principal point of doing all of these is to make your SharePoint farm able to handle the failure of any one dependent service or resource. SQL failures, AD outages, even SharePoint server failures can be worked around automatically if the architecture is well designed, which is a key goal of any high-availability design.
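To put rough numbers on why removing single points of failure matters, here is a minimal arithmetic sketch. It assumes node failures are independent (rarely quite true in practice) and that a tier stays up as long as one of its nodes is up. The function names and the 99% figure are illustrative assumptions of mine, not measurements from any real farm.

```python
def tier_availability(node_availability: float, nodes: int) -> float:
    """Availability of a redundant tier that survives while at least one node is up.

    Assumes independent node failures, which is a simplification.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("a tier needs at least one node")
    # The tier is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


def farm_availability(tier_availabilities) -> float:
    """The farm needs every tier (web, SQL, AD, ...), so the tiers multiply."""
    result = 1.0
    for availability in tier_availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24
```

Under these assumptions, a single server that is up 99% of the time is down roughly 87.6 hours a year, while `tier_availability(0.99, 2)` comes out at about 99.99%, or under an hour a year. Because `farm_availability` multiplies the tiers together, one non-redundant tier drags the whole farm back down, which is why every dependent service needs the same treatment.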
Sometimes performance in a SharePoint application can be so slow that it’s no different from a complete system outage, at which point we often get a call to help out. Here’s how you can avoid that awkward situation:
Performance troubleshooting is a highly complex game and rarely ever highlights any single cause for slowness. Proactive planning is your strongest ally when it comes to making sure your SharePoint installation will cope.
Aside from the above risks, other farm failures are often caused by customisations to the out-of-the-box product. SharePoint supports customisations, albeit with disclaimers about performance, because SharePoint very commonly gets the blame for someone else’s bad coding over which we have no control. But that aside, application performance targets are a complex issue and rely on lots of factors and considerations, including:
I’ll say it again because it’s important: most customer performance issues we deal with at Microsoft have 3rd-party code involved in the problem at least, if it isn’t directly at fault. Custom code is a big risk if not planned well.
The other thing that can unfortunately kill a SharePoint farm is the monthly patches for Windows, .NET and SharePoint alike. Very occasionally a patch of some kind has a negative impact on a farm, so patches really should be run through a staging/testing environment before being deployed to production, with some testing done on the application too. If you use a patch control system such as Windows Server Update Services or System Center, this is fairly trivial to manage.
Good patching practice is a subject all on its own but in short, make sure you’re not updating production, even with Windows updates, until you’ve applied them to a test environment first and checked that the site still loads, searches still work, users can still update content, and so on.
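Those post-patch checks are easy to script so they get run the same way every time. What follows is a minimal sketch rather than a finished tool: the URL paths in `build_checks` are hypothetical placeholders, and authentication is left out entirely, so a real farm using Windows or claims authentication would need its own HTTP client in place of the anonymous `http_ok` helper.

```python
import urllib.request
from typing import Callable, Dict


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status (anonymous request)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, timeouts and refused connections;
        # ValueError covers malformed URLs.
        return False


def build_checks(base_url: str) -> Dict[str, Callable[[], bool]]:
    """Hypothetical checks for a staging site; replace the paths with your own."""
    return {
        "site loads": lambda: http_ok(base_url + "/"),
        "search works": lambda: http_ok(base_url + "/search?k=test"),
        # A "users can update" check needs an authenticated write,
        # so it is left for the caller to add as another callable.
    }


def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def safe_to_promote(results: Dict[str, bool]) -> bool:
    """Promote the patch to production only if checks ran and all of them passed."""
    return bool(results) and all(results.values())
```

After patching the test environment you would call something like `safe_to_promote(run_smoke_tests(build_checks("https://staging.example.local")))`, where the host name is made up, and only schedule the production update when it returns `True`. Keeping the checks as plain callables means an authenticated "edit a list item" test can be added later without touching the runner.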
Bad patches are rare but they occasionally slip through the cracks, as they do for any other vendor; it’s ultimately your responsibility though to make sure your systems don’t break when they’re installed. Of course, even riskier is allowing your SharePoint farm to go unpatched but I digress.
The other thing worth mentioning is how well organised the operations of the farm are, specifically how well patching, archiving, application deployment, and general administration are planned. People with no processes in place tend to be the ones that go offline a lot; SharePoint, like any system of any complexity, needs regular care & attention to keep it well-oiled and running smoothly, so make sure the roles & responsibilities for maintaining a healthy farm are clearly defined.
We take quality control very seriously at Microsoft but the SharePoint software stack is very complicated indeed; from the Windows platform/IIS core to ASP.NET, to browser compatibility, everything has to work flawlessly together, so patching one element of that huge tech sandwich can occasionally have unforeseen consequences. A basic testing strategy mitigates this risk almost entirely and really should be in place if you’re serious about keeping the farm online.
That’s it! If you follow all of these rules you should never have any problems running your farm and rolling out applications. Running a SharePoint farm with applications on top is a tough job; there’s lots to remember, plan for, and think about, but hopefully this has given a good quick overview at least. If there’s any interest in looking deeper at any particular area, please let me know in the comments.
// Sam Betts