This is the first of a few blogs about designing for availability and resilience…
WHAT FAILURES MIGHT OCCUR AND HOW DO WE CHOOSE THE RIGHT DESIGN TO PROTECT US?
In the very early stages of a messaging design, and particularly once discussions about availability and resilience begin, it is very useful to understand the types of issue that support teams are likely to face and how your proposed design stacks up against them.
So first I need an example design. For the purposes of this blog I am using a pretty standard Exchange 2007 design based on CCR\SCR across 2 data centres. The design is best described on TechNet in ‘Site Resilience Configurations’ (see the section ‘Production (Non-Dedicated) with One Active Directory Site’: “This solution deploys redundant servers in a single Active Directory site that spans both datacenters.”).
I’m also using DPM for VSS-based backups to disk, with long-term backups to tape media, and there is a requirement to journal all messages to satisfy compliance regulations.
WHAT MIGHT GO WRONG?
The scenarios I’m going to base this on are as follows:
Data Centre Failure: The loss of an entire data centre
Server Hardware Failure: Component failure, e.g. motherboard
Storage Failure: Loss of access to all or part of a volume\LUN, not including single disk failure
Mailbox Database Corruption (Physical): Most likely as a result of hardware failure
Mailbox Database Corruption (Logical): Data corruption, possibly as a result of a faulting application or virus
Mailbox Deletion within Deleted Mailbox Retention period (<30 days): A result of an administrative or procedural error
Mailbox Deletion beyond Deleted Mailbox Retention period (>30 days): A result of an administrative or procedural error, or a returning employee
Email or Item Deletion (<14 days): User mistakenly deleted an item; administrator intervention required only if the item was hard deleted
Email or Item Deletion (>14 days): User mistakenly deleted an item; administrator intervention required
Identify if and when a particular email was sent\received (<30 days): Only message route required
Identify if and when a particular email was sent\received (>30 days): Only message route required
Identify if and when a particular email was sent\received (<14 days): Entire message required
Identify if and when a particular email was sent\received (>14 days): Entire message required
HOW DOES MY PROPOSED DESIGN PROTECT ME?
The following table takes the above scenarios and identifies where in your design the protection against each one lies. This first pass should help us to understand what might fail, what protection the design provides, the likelihood of the scenario occurring and the impact of that event.
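[Table: scenario, protection provided by the design, likelihood, impact. Only the final row label survives: Identify if and when a particular email was sent\received (>14 days)]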
* Whilst it is estimated that invoking the SCR target might take less than 2 hours, the loss of an entire data centre might mean that restoring the complete service (including the redirection of Outlook clients, the internet connection, and the recovery of all ancillary services, such as an archive solution) takes more than 2 hours.
** The alternative to failing over the entire server to the CCR replica is to restore a single database from disk using DPM. This increases the impact for the users with mailboxes on the affected database but means no loss of service for users on the rest of the server.
*** The default Deleted Mailbox Retention period is 30 days, which is configurable.
**** The default Deleted Item Retention period is 14 days, which is configurable.
***** Message Tracking Logs are kept for 30 days by default. This is a configurable setting.
****** Currently it is assumed that all email is journaled and archived, and retained for a period according to compliance requirements.
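The default retention windows in the footnotes above can be sketched as a simple lookup. This is a minimal illustration in Python, not anything Exchange itself exposes: the scenario keys, the `RETENTION_DAYS` table and the `within_retention` helper are hypothetical names, and treating each window as strictly "fewer than N days" is an assumption taken from the "<30 days" and "<14 days" wording of the scenarios.

```python
# Default windows quoted in the footnotes; all three are configurable in Exchange.
# The dictionary keys are hypothetical labels chosen for this sketch.
RETENTION_DAYS = {
    "mailbox_deletion": 30,  # Deleted Mailbox Retention period
    "item_deletion": 14,     # Deleted Item Retention period
    "message_route": 30,     # Message Tracking Logs
}


def within_retention(scenario: str, age_days: int) -> bool:
    """Return True if the event is still inside the default retention window.

    Assumption: the window is strictly "fewer than N days".
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return age_days < RETENTION_DAYS[scenario]


# Walk through a few example requests and report which side of the window they fall on.
for scenario, age in [("item_deletion", 10), ("item_deletion", 20), ("message_route", 45)]:
    state = "inside" if within_retention(scenario, age) else "outside"
    print(f"{scenario} at {age} days: {state} the default window")
```

Anything reported as outside the window falls back to the slower recovery paths in the design, such as a restore from backup, so it is worth checking these values against what your own organisation has actually configured.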
So, to use an example from the table above: if an administrator were asked by a customer to identify an email that was sent or received over 30 days ago (not to provide the message itself, but to identify when it was sent and received), they would have to identify the databases where the sender and recipient mailboxes were located at the time the message was delivered, restore them and try to find that message. This is a long and laborious task which might take up to 2 days. In my example I have assumed that the likelihood of this occurring is low-moderate. This exercise should highlight the areas where your proposed design doesn’t provide the protection that your specific company requires of it.
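One way to make this first pass repeatable is to record each scenario with its likelihood and recovery time and then sort by a rough exposure score. The sketch below is only an illustration: the `Scenario` class, the numeric `LIKELIHOOD` scale and the `exposure` formula are assumptions made for this example. The first row uses the figures quoted above (low-moderate, up to 2 days, taken here as 48 hours); the likelihood in the second row is a placeholder, not a value from the table.

```python
from dataclasses import dataclass

# Ordinal scale used only for sorting; the numeric weights are an assumption.
LIKELIHOOD = {"low": 1, "low-moderate": 2, "moderate": 3, "high": 4}


@dataclass
class Scenario:
    name: str
    protection: str
    likelihood: str        # one of the LIKELIHOOD keys
    recovery_hours: float  # upper estimate of time to recover


def exposure(s: Scenario) -> float:
    """Rough score: likelihood weight multiplied by recovery time in hours."""
    return LIKELIHOOD[s.likelihood] * s.recovery_hours


ROWS = [
    # Figures from the worked example in the text ("up to 2 days" taken as 48 hours).
    Scenario("Identify email sent/received (>30 days), route only",
             "Restore databases from backup", "low-moderate", 48),
    # Recovery time from the first footnote; the likelihood is a hypothetical placeholder.
    Scenario("Data Centre Failure", "SCR target", "low", 2),
]

# Highest exposure first: these are the gaps to examine before anything else.
ranked = sorted(ROWS, key=exposure, reverse=True)
for s in ranked:
    print(f"{exposure(s):6.1f}  {s.name}  ({s.protection})")
```

The scoring is deliberately crude. Its only purpose is to push the scenarios that combine a meaningful likelihood with a long recovery time to the top of the list, which is where the design most needs a second look.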
The next blog in this series is called ‘Recovery Scenarios for E2K7…..II’ and looks at each component of the design to determine which of them brings the most value at the smallest cost so that we can make a more informed decision as to which to choose to deploy…
Quick question/scenario: what RTO would you expect for an Exchange environment with around 1500 users after, say, a SAN failure? The company has a SAN controller fail with no redundant controller, and possible corruption of the Exchange stores. Should/would standard recovery take over 2 days? 1 day?