Exchange Hosted Archive - A True Testament of Scalability

Hi everyone, this is Shankar Pal. I am a Principal Program Manager on the SQL Data Services (SDS) team. I spend my time working on the backend design for large, enterprise applications.

 

I want to share some of my experience with the scalability of SQL Data Services and how it is best exemplified by one of our online services, Microsoft Exchange Hosted Archive (EHA). EHA is a rich service for e-mail archiving, e-discovery and regulatory compliance for corporate customers and large organizations. The next-generation EHA uses the same relational database service infrastructure as SDS. I will focus on the scale aspects of the workload and discuss how the relational database service addresses the scale requirements of EHA.

 

First, a brief introduction to the characteristics of the workload. Archived messages accumulate in the system and are governed by the retention policies of the customers. The message lifecycle runs from archival of messages in the system, through retention according to those policies (e.g. 3 years), to purging at the end of the retention period. Inserted messages are full-text indexed on the header, subject line, message body and a variety of common business attachments such as Word documents. E-discovery consists of structured and full-text queries over the messages. Examples are searches based on properties such as the send time or the sender, and full-text searches of the message body.
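
To make the lifecycle concrete, here is a minimal Python sketch of the retention and purge step. It is only an illustration and not the EHA implementation: the message layout (a dict with an `archived_at` field), the function names, and expressing the retention period in days are all assumptions made for this example.

```python
from datetime import datetime, timedelta

def is_expired(archived_at: datetime, retention_days: int, now: datetime) -> bool:
    """Return True once the message has reached the end of its retention period."""
    return now >= archived_at + timedelta(days=retention_days)

def purge_expired(messages: list, retention_days: int, now: datetime) -> list:
    """Return only the messages that are still inside the retention period."""
    return [m for m in messages
            if not is_expired(m["archived_at"], retention_days, now)]

# Hypothetical archive with a 3-year policy (approximated here as 3 * 365 days).
archive = [
    {"id": 1, "archived_at": datetime(2005, 1, 10)},
    {"id": 2, "archived_at": datetime(2008, 11, 3)},
]
kept = purge_expired(archive, retention_days=3 * 365, now=datetime(2009, 6, 8))
```

With these sample dates, only the message archived in 2008 remains after the purge.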

 

EHA looked for a long-term solution in a relational database service which would scale to a much higher archiving limit per customer than the current system, be easy to administer, provide the required availability and keep pace with the rapid growth of the service. The result is the next-generation EHA, which is powered by the SDS relational database service platform. The service allows the seat limit per customer to grow manyfold; this is achieved by distributing the archived emails from each customer across a large number of servers rather than storing them on one specific server. Performance improves for message insertions as well as for structured and full-text queries across the system. For more information about the backend architecture, you can view a presentation given at last year’s PDC by Gopal Kakivaya, a Distinguished Engineer on the SDS team. The video can be found at http://mschnlnine.vo.llnwd.net/d1/pdc08/WMV-HQ/BB03.wmv.

 

Email is a very prolific form of communication. The high volume of incoming data is automatically replicated within the database service to provide fault tolerance against various types of failures. The platform provides high availability whether storing gigabytes, terabytes, or petabytes of data. Each cluster of machines has the capacity to store hundreds of terabytes of archived email. Once replication and backup requirements are included, the total capacity of the EHA cluster reaches petabytes of data, a testament to the scalability of the SDS relational database service platform.

 

The enormous scale is achieved with surprisingly simple design principles. Mail messages can be partitioned in several ways, the most obvious being by customer or user. Such segments can grow quite large, so for more parallelism, each customer's or user’s messages can be partitioned further, most notably by send time. A variation of this partitioning scheme is used for the EHA application. The partitions for each customer are scattered over many servers. This increases the throughput of the system for message insertion by distributing the write operations over a large number of physical servers. The net result is a much higher insertion rate compared to the current EHA system.
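
The partitioning idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual EHA scheme (which the post only describes as "a variation"): the monthly send-time bucket, the hash-based placement, the server count, and all function names are hypothetical.

```python
import hashlib
from datetime import datetime

NUM_SERVERS = 16  # hypothetical cluster size, chosen only for this example

def partition_key(customer_id: str, sent_at: datetime) -> str:
    """Combine the customer with a send-time bucket (here: calendar month)."""
    return f"{customer_id}:{sent_at:%Y-%m}"

def server_for(key: str, num_servers: int = NUM_SERVERS) -> int:
    """Map a partition key to a server index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def route(customer_id: str, sent_at: datetime) -> int:
    """Pick the server that should receive a message for insertion."""
    return server_for(partition_key(customer_id, sent_at))

# Three years of one customer's mail ends up on many different servers,
# so inserts for that customer are not funneled through a single machine.
servers_used = {route("contoso", datetime(year, month, 1))
                for year in (2007, 2008, 2009) for month in range(1, 13)}
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the placement the same across runs, which matters because later queries must find the partition again.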

 

Queries benefit from the physical distribution of the data as well, because they execute on multiple partitions scattered over many server machines. The process of running queries in parallel and aggregating the responses, sometimes referred to as fan-out, pays greater rewards the more complex the query and the bigger and more distributed the overall data set. Customers, especially in heavily regulated industries, frequently perform full-text searches using date ranges and other qualifiers. The more structured the query, the more relevant the results. Our measurements with real customer messages show that queries with a high degree of fan-out often execute an order of magnitude faster than the same queries on a single server instance.
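
Fan-out and aggregation can be illustrated with a small Python sketch. It is a toy model under assumptions: each partition is an in-memory list of message dicts standing in for a remote server, the substring match stands in for a real full-text index, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

def search_partition(partition, term, start, end):
    """Search one partition: date-range filter plus a naive text match."""
    return [m for m in partition
            if start <= m["sent_at"] < end and term.lower() in m["body"].lower()]

def fan_out_search(partitions, term, start, end):
    """Run the same query on every partition in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        per_partition = list(pool.map(
            lambda p: search_partition(p, term, start, end), partitions))
    merged = [m for hits in per_partition for m in hits]
    return sorted(merged, key=lambda m: m["sent_at"])  # aggregate by send time

# Hypothetical data: two partitions that would live on different servers.
partitions = [
    [{"sent_at": datetime(2009, 3, 2), "body": "Quarterly audit report attached"}],
    [{"sent_at": datetime(2009, 1, 15), "body": "Audit schedule for January"},
     {"sent_at": datetime(2008, 6, 1), "body": "Old audit notes"}],
]
hits = fan_out_search(partitions, "audit", datetime(2009, 1, 1), datetime(2010, 1, 1))
```

The date range keeps the 2008 message out of the result, and the merge step returns the two remaining hits in send-time order regardless of which partition answered first.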

 

The new-generation archive will be available later this summer. It is very exciting to build a system which uses physical distribution of data to meet the scale and performance requirements of a large enterprise application. The self-managing system simplifies a host of administrative functions and makes them more reliable.

Published Monday, June 08, 2009 8:50 AM by davidrob
