(In the UK now hanging out with Kimberly and Tony Rogerson before teaching a Masterclass tomorrow in Reading. Then it's off to Copenhagen for SQL Server Open World, with a little R&R in London beforehand and Copenhagen afterwards, before we fly back to the US on Sunday. The weather here is actually better than in Seattle!)
I've had a bunch of feedback from the survey I sent out (still need more before posting any statistics though) and various things have jumped out at me. The most worrying is that many people either don't know what their SLAs are or have no idea whether they can meet them.
Here are some questions around SLAs - if you can't answer YES! to all of them, then you may be in trouble.
Do you know what an SLA is?
SLA = Service Level Agreement. SLAs are agreements between you and your customers. If you're a DBA, then your customer is typically the company for whom you work. Examples of SLAs are:
Usually, it's a combination of SLAs such as those above.
Do you know why SLAs are important?
Here's the catch - an SLA is really more than just an agreement between you and your customers - it's more like a contract that you're obligated to meet. This means that if you're a DBA with zero-downtime and zero-data-loss SLAs, you need to make sure that in the event of a corruption you can actually meet those SLAs. The obvious consequence is that if the SLAs cannot be met, the business will suffer downtime and data loss. The not-so-obvious consequence is that if you're the one who agreed to the SLAs in the first place and, when disaster strikes, the capabilities of the system turn out to be far below what the SLAs require, then you could lose your job - resume/CV time - I've heard of it happening...
Do you know your SLAs?
You have to know what your SLAs are so you can make sure the system can meet them. Several DBAs I discussed this with don't know what their business' SLAs are, even though they are responsible for making sure they are met. I find this astounding - how can you sign up for meeting an SLA when you don't know what the SLA is? Especially if failing to meet the SLA could lead to resume/CV time...
Do you think you can meet your SLAs?
The other reason to know your SLAs, of course, is so that you can correctly architect your system to meet them. There are a bunch of technologies you can use and strategies you can employ to work towards meeting your SLAs (well beyond the scope of this blog post, but they will be covered through the year). If you find that you can't meet your SLAs, you need to push back on your management - otherwise you're setting yourself up for trouble when a disaster occurs and you can't meet the SLAs - you'll be held responsible.
Do you know you can meet your SLAs?
Your disaster recovery plan looks great on paper - but have you actually tried it? I know of one company that has a 15-minute downtime SLA for a 300+GB database, but the DBA is relying on clusters to provide that for him. That won't work if the database is corrupt (remember a failover cluster has a single point of failure in its shared-nothing configuration - the disks) and needs to be restored from the last full backup... Another company I know of relies on database mirroring to fail over in the event of a disaster but has never tried it to see if their application fails over gracefully... You have to make sure you've practiced recovering from a disaster before the first real disaster happens - you'll be amazed at the little things that are discovered (e.g. if the on-site backups are bad, how long will it take to get the copies brought in-house from the off-site location 100 miles away? Can you still meet your 15-minute downtime SLA in that case?)
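To put rough numbers on the 300GB example, here's a minimal back-of-the-envelope sketch in Python. The function names, the 200 MB/s sustained restore throughput, and the two-hour off-site retrieval time are all hypothetical values picked for illustration - measure your own by actually doing a restore. It also ignores log restores and crash recovery, so real numbers will be worse.

```python
def restore_minutes(db_size_gb, throughput_mb_per_s, overhead_minutes=0.0):
    """Estimate minutes to restore a full backup of db_size_gb.

    throughput_mb_per_s is an assumed sustained restore rate;
    overhead_minutes covers fixed delays such as fetching off-site media.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = db_size_gb * 1024 / throughput_mb_per_s
    return seconds / 60 + overhead_minutes


def meets_sla(db_size_gb, throughput_mb_per_s, sla_minutes, overhead_minutes=0.0):
    """True if the estimated restore time fits inside the downtime SLA."""
    estimate = restore_minutes(db_size_gb, throughput_mb_per_s, overhead_minutes)
    return estimate <= sla_minutes


if __name__ == "__main__":
    # Hypothetical inputs: 300GB database, 200 MB/s restore rate, 15-minute SLA.
    on_site = restore_minutes(300, 200)
    off_site = restore_minutes(300, 200, overhead_minutes=120)
    print(f"On-site restore:  {on_site:.1f} minutes")
    print(f"Off-site restore: {off_site:.1f} minutes")
    print("Meets 15-minute SLA:", meets_sla(300, 200, 15))
```

With those assumed numbers, the restore alone takes about 25.6 minutes, so the 15-minute SLA is blown before any off-site retrieval is even considered. An estimate like this is only a starting point for the real test, which is timing an actual restore.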
As you can see from my short list of questions and answers above, it's vital that you understand your SLAs and know that you can meet them - your business (and job!) may depend on it. If you're having trouble, drop me a line (firstname.lastname@example.org) and I'll see what I can do to help.
PS Don't forget to check out the .NET Rocks! show tomorrow!
Nitpicking, but shouldn't the first example of an SLA read "In the event of a corruption, or other disaster, the MAXIMUM amount of data loss is the last 15 minutes of transactions"?
Oops - I knew what I meant. Thanks. :-)