SQL Server Storage Engine

More on Service Level Agreements...

My recent post on SLAs prompted some interest and comments from readers so this is a follow-up to that post.

What most people wanted was a list of some SLAs applicable to SQL Server - easier-said-than-done because a lot of SLAs depend on the application being serviced by the database. I had a poke about the web and have pulled together some examples to get you thinking about SLAs you may want to define and strive to meet, listed by category - this is definitely a non-exhaustive list!

Hours of Operation

  1. Hours that the partition/table/database is available to users. The SLAs may be different for various parts of the database, depending on what applications need access to them. Why is it important to differentiate at various granularities? For example, depending on how the unit of data is used, it may require different maintenance than other data and so knowing the availability SLA allows maintenance downtime to be planned.
  2. Hours reserved for planned downtime. Again, this may differ at various granularities of data.
  3. Amount of advance notice for extended downtime or other changes that affect users. For instance, when my bank upgraded its computer system last year, they gave a series of warnings over the preceeding few months so that people weren't surprised.

Service Availability

  1. Percentage of time SQL Server service is running and able to service connections.
  2. Percentage of time a particular partition/table/database is available for use (i.e. not exclusively locked for maintenance or restore).

System Performance

  1. Number of concurrent users the system supports.
  2. Number of transactions supported per unit of time.
  3. Acceptable level of performance, such as latency experienced by users for a variety of operations.
  4. Minimum time for an update to be replicated to various remote sites.

Disaster Recovery

  1. Time allowed for recovery from each type of failure (e.g. accidental data deletion, database corruption, SQL Server crash, OS crash, server failure, site failure).
  2. Time it takes to bring critical data online (e.g. the read/write partitions of a sales database) such that operations can continue and less critical data can be recovered later.
  3. Time taken to recover data to the point of failure.
  4. Maximum acceptable data/transaction/work loss for various kinds of failures.
  5. Maximum time for application failover to a remote server/site.

Support

  1. Methods available for application users to get help.
  2. Maximum response/resolution time from a DBA to respond to various types of problems.

Other

  1. Maximum amount of space for user tables/databases.
  2. Amount of users in specific roles.

Some other things to consider are how you define an SLA - for example, in transactions per second or in commit latency that users experience - and the interplay between SLAs - for example, the commit latency SLA may be affected if the acceptable data loss SLA is zero and a solution such as synchronous database mirroring or remote SAN mirroring is used.

Bottom line is that although it can be simple to quickly define and announce a set of SLAs for a given application, its very difficult to make sure that each is palatable to all involved, guarantee that each can be met, and allow easy diagnosis of the system to work out which component is failing when an SLA is not met. SLAs really need to be defined while a sytem is being designed as retro-fitting SLAs after-the-fact can be very time-consuming and costly.

I welcome any comments or observations on this topic - I'll post on this once I get some more feedback.

Published Monday, March 12, 2007 5:35 PM by Paul Randal - MSFT

Comments

 

.NET a 2.860 m de altura said:

No sé cuantos de nosotros estamos familiarizados con el término SLA (Service Level Agreement),

March 13, 2007 12:19 AM
Anonymous comments are disabled

About Paul Randal - MSFT

Paul started in the industry in 1994 working for DEC on the VMS file system and check/repair tools. In 1999 he moved to Microsoft to work on SQL Server, specifically on DBCC. For SQL Server 2000, he concentrated on index fragmentation (writing DBCC INDEXDEFRAG and DBCC SHOWCONTIG) plus various algorithms in DBCC CHECKDB. During SQL Server 2005 development Paul was the lead developer/manager of one the core dev teams in the Storage Engine, responsible for data access and storage (DBCC, allocation, indexes & heaps, pages/records, text/LOB storage, snapshot isolation, etc). He also spent several years rewriting DBCC CHECKDB and repair. For SQL Server 2008, Paul managed the Program Management team for the core Storage Engine to become more focused on customer/partner engagement and feature set definition. In 2007, after 8.5 years on the SQL Server team, Paul left Microsoft to join his wife, Kimberly Tripp, running SQLskills.com and pursuing his passion for presenting and consulting. Paul regularly presents at conferences and user groups around the world on high-availability, disaster recovery and Storage Engine internals. His popular blog is at http://www.sqlskills.com/blogs/paul/.

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker