Scale-First Approach to Database Design with Federations: Part 1 – Picking Federations and Picking the Federation Key

Scale-First Approach to Database Design with Federations: Part 1 – Picking Federations and Picking the Federation Key

Rate This
  • Comments 1

Lets defined Scale-First approach first: Many of the systems we built can take on various concerns into account; you could take different approaches to achieve lowest latencies, highest performance, high consistency or easy extensibility in your data model and database design. For example, with well known techniques like normalization, you get great consistency. However you can apply additional compromises and constraints to your design to gain much better scalability... This is exactly the focus of this post. Scale-first database design refers to these constraints to consider for best scalability of your system. Said another way, Scale-first database design considers ‘scalability’ as the top concern of your database design and applies these additional constraints to your data model and database design. Scale-first approach help you craft the model that will allow you to use federations in Windows Azure SQL Database and can get you both great scale combined with great economics.

Scale-first approach does not require you to relearn database design from scratch. It is just an additional set of steps you apply while generating your data model. These additional steps help you configure your model for best scale characteristics. By picking your federations and federations key, you work in the scale-first principals into your data and database design. In other words, by picking federations and federation keys, you get to annotate the data model and your database objects with scale-out intent.

In a series of posts, I’ll walk through the process of designing, coding and deploying applications with federations. If you’d like to design a database with scalability in mind using sharding the technique, this post will also help you get there as well...

In case you missed earlier posts here is a quick definition of federation and federation key;

Federation is an object defined to scale out parts of your schema. Every database can have many federations. Federations use federation members which are regular sql azure databases to scale out your data in one or many tables and all associated rich programming properties like views, indexes, triggers, stored procs. Each federation has a name and a federation key which is also called a distribution scheme. Federation key or federation distribution scheme defines 3 properties;

  • A federation key label, help self document the meaning of the federation key like tenant_id or product_id etc. ,
  • A data domain to define the distribution surface for your data. In v1, data domains supported are INT, BIGINT, UNIQUEIDENTIFIER (guid) and VarBinary – up to 900 bytes.
  • A distribution style, to define how the data is distributed to the data domain. At this point distribution style can only be RANGE.

The scale First approach is an add-on set of steps to how you model your data so lets pick up from the point where you modeled your basic entities and objects in your database. Assume you are designed the AdventureWorks (AW) database. You have the entities at play like Stores, Customers, Orders, Products, SalesPeople etc and now you are thinking about how to scale it out…

Picking your Federations:

Picking your federations and federations keys is much like other steps of data modeling and database design processes like normalization. With the scale-first approach and federation concepts, you can come up with multiple alternatives that compromise to optimize for various parts of your workload. Here are these steps to help you come up with those alternatives and help you choose between these alternatives:

Step #1: Identify entities you want to scale out: You first identify the entities (collection of tables) in your database that is going to be the busiest parts of your traffic, that will be popular targets that will be hit by largest portion of your workload… That will be under pressure due to excessive resource requirements and needs scale out. These entities are your candidate federations in your design.

Step #2: Identify table-group that make up these entities (identify federated tables): After identifying these popular entities you want to scale out, you identify main tables table-groups by identifying main tables that make up the entity by traversing the relationships, access patterns and properties to these set of entities. I’ll refer to these groups as table-groups.

Some may be lucky to have one table group but in typical cases you find multiple table-groups in your database model such as customer, order and product. These table-groups make up your entire data model and may have relationships such as one-to-many, one-to-one or many-to-many. For example: orders and customers are entities in SalesDB that consist of table-groups. Order table-group are made up of orders and orderdetails tables. customer table-group contain  customers and customeraddress tables. Custom has many Orders, thus these 2 entities are said to have a one-to-many relationships.

       

image

 

Picking your Federation Key:

Step #3: Identify the Federation Key: Federation key is the key used for distributing the data in the table-groups. Federation key is used in routing the query to the right partition thus is critical to have in all latency and scale sensitive parts of your workload. Federation key also define the boundary of atomic units. Atomic units are rows across all tables in the table-group. Atomic Units are rows across all the scaled out tables that share the same federation key value. For example across the order table-group, if the federation key happen to be customerID, all rows that have customerID=55 in orders and orderdetails tables make up the atomic unit.

Ideal federation keys have the following properties;

  1. Atomic Units is the target of most query traffic & transaction boundaries. This is desired because business logic within a atomic unit boundary is coded with all the richness of the SQL Language – stored procedures, triggers, CTEs, etc.
  2. Distributes workload equally to all members equally – decentralize load to many atomic units as opposed to concentrating the load. This is important because it promotes the application take advantage of the full computational capacity available to the application at all members or partitions of data. 
  3. Atomic units cannot be split, so largest atomic unit does not exceed the boundaries of a single federation member. Picking small-footprint atomic units are important because it avoids hitting the walls of the system.

Step #4: Picking Reference Tables: Now that you have picked your federations, table groups in them and federations keys for distributing the data in these federations, you have a bunch of your objects you scaled out with federations and other set of objects you don’t scale out and simply leave in the root database. Federations provide a 3rd type of table called a reference table that optimize your query-ability. Reference tables typically contain popular lookup info in the transactions and workload that hit the federated tables but are tables that do not align with the federation key. One example can be the Zipcodes table in CustomerOrders federation with CustomerID federation key. Reference tables like Zipcodes is an optimization to take advantage of local joins to improve latency and query-ability. Reference tables are created in every member and may or may not contain the same data across all member. Having some tables local and repeated in every member however changes how to converse with a reference table and what guarantees and behavior you expect from it. Simple put: reference tables operate in eventual consistency as opposed to strict consistency.

Local updates to the reference table within a member are allowed. However these local updates are not automatically replicated by the rest of the copies in other members at this point. In future this may change. However in future the replication of reference table data, even if it is done by the system need to happen asynchronously for the best scale characteristics. Distributed transactions simply does not scale to large number of transaction managers. With eventual consistency your app interact with this data can no longer assume to find the latest updates in all copies meaning; updating the zipcode table in one member in a transaction may mean that you may may still find its missing if you switch to another member.

This is it! 4 Steps!

Lets apply these rules on a example to see how they help pick federations and fed keys.

Walking through AdventureWorks

The database design for AW is something I hope you are already familiar with. You can fid details here on the schema for SQL Azure; http://msftdbprodsamples.codeplex.com/releases/view/37304. For this app we want to be able to handle Mil customers, 100 Mil orders, and Mil products. These are the largest entities in our database. I’ll add a few more details on the workload to help guide our design; here are our most popular transactions and queries;

‘place an order’, ‘track/update orders’, ‘register/update a customer’, ‘get customer orders’,’get top ordering customers’, ‘register/update products’, ’get top selling products’

and here are some key transactions and consistent query requirements;

‘place an order’, ‘import/export orders for a customer and stores’, ‘monthly bill for customers’

Step #1: We have the classic sales database setup with customer, order and product in AW. In this example, we expect orders to be the parts that will be most active, the tables will be the target of most of our workload. We expect many customers and also handle cases where there are large product catalogs.

Step #2: In AW, Store and customer tables are used to identify the customer entity. SalesTerritory, SalesOrdersHeader and SalesOrderDetail tables contain properties of orders. Customer and Order entities have one-to-many relationship. On the other hand, Product entity has a many-to-many back to Order and to Customer entity. When scaling out, you can align one-to-one and one-to-many relationships together but not many-to-many relationships. Thus we can only group Customers and Orders together but not products.

Step #3: Given the Customer (store and customer tables) and Order (SalesOrderHeader, SalesrderDetail, SalesTerritory) we can think of a few setups here for the federation key.

- StoreID as the Federation Key:

  • Well that would work for all transactions so #1 of ideal federation key principal taken care of! That is great.
  • However stores may have varying size and may not distribute the load well if there the customer and order traffic variance between stores is too wide. Not so great on #2 principal of ideal federation keys.
  • StoreID as a federation key will mean all customers and all their orders in that store will be a single atomic unit (AU). If you have stores that could get large enough to challenge the computational capacity of a single federation member, you will hit the ‘split the atom’ case and get stuck because you cannot.

So StoreID may be too coarse a granule to equally distribute load and may be too large an AU if a store gets ‘big’. By the way, TerritoryID is a similar alternative and has a very similar set of issues so same argument applies to that as well.

- OrderID as the Federation Key:

  • OrderID certainly satisfy #2 and #3 of the ideal federation key principals but has an issue with #1 so lets focus on that.
  • That could work as well but is too fine a granule for queries that are orders per customer in the system. It also won’t align with transactional requirements of import/export of customer and store orders. Another important note; with this setup, we will need a separate federation for the Customer entity. It means that queries that may be common like ‘get all orders of a customer’ or ‘get order of customer dated X/Y/Z’ will need to hit all or at least multiple members. Also with this setup we lose ability to transact multiple order from a customer. We may do that when we are importing or exporting a customers orders.

Fan-out is not necessarily bad. It promotes parallel executions and can provide great efficiencies. However efficiencies are lost when we hit all members and can’t fast-eliminate members that don’t have any data to return and when cost of parallelization overwhelms the processing of the query. With OrderID as the federation key, queries like ‘get orders of a customer’ or ‘get top products per customer’ will have to hit all members.

- OrderDetailID  as the Federation Key:

  • The case of OrderDetailID has the same issues as OrderID case above but amplified on principal #1. With this setup, we will lose transactional boundary to place a single order in a transaction. In this case, there will be more queries that will need to be fanned-out like ‘get all customer orders’ or ‘get an order’… Makes assembling order a full fan-out query which can get quite expensive.

- CustomerID as the Federation Key:

  • With CustomerID #2 and #3 will not likely be an issue. The only case is where a customer gets so large that it overwhelms a member and computational capacity of AU. For most cases, CustomerID can be a great way to decentralize the load getting around issues StoreID or TerritoryID would create.
  • This setup also satisfy almost all of #1 as well except 2 cases; one is ‘get top selling products across all customers’. However that case isn’t satisfied in any of the other alternatives either. This setup does focus the DML (INSERT/UPDATE/DELETE) transactions and satisfy both multiple order and single order placement transaction to work seamlessly. So looks like a good choice for #1 standpoint. Second is import/export at store boundary; for example import of stores all customers and orders will not be possible with this setup in a transaction. Some people may be able to live with consistency at the customer level and be ok relaxing consistency at the store level. you need to ask yourself; Can you work with eventual consistency without transactions at the store level by depending on things like datetime stamps or some other store level sequence generator. if you can this is the right choice.

How about Product Entity?

We have not touched on how to place products in this case. To remind you the issue; there is a many-to-many relationship in customers vs orders thus federation aligned with orders and/or customers  Well, you have 3 choices when it comes to products.

- Products as a Central Table: In this case, you’d leave Product entity in the root. That would risk making root and products a bottleneck. Especially if you have a fast updating catalog of products and you don’t build caching facilities to minimize hitting the root for product information for popular queries and transaction in the system. The advantage to this setup is that product entity can be maintained in a single place with full consistency.

- Product as a Reference Table: In this case, you would place a copy of the products entity in each federation member. This would mean you need to pay more for storing this redundant information. This would mean updating the product catalog will have to be done across many copies of the data and will mean you need to live with eventual consistency on product entity across members. That is, at any moment in time, copies of products in member 1 & 2 may not be identical. Upside is this would give you good performance like local joins.

- Product as a separate Federation: In this case, you have a separate federation with a key like productID that holds product entity = all the tables associated with that. You would set up products in a fully consistent setup so there would not be redundancy and you would be setting up products with great scale characteristics. You can independently decide how many nodes to engage and choose to expand if you run out of computational capacity for processing product queries. Downside compared to the reference table option is that you no longer enjoy local joins.

To tie all this together, given the constraints most people will choose customerID as the fed key and place customers and orders in the same federation and given the requirement to handle large catalogs, most people will choose a separate federation for products.

Here is what the actual statements would look like;

CREATE FEDERATION CustomerFederation (customerID uniqueidentifier RANGE)

GO

CREATE FEDERATION ProductsFederation (productID uniqueidentifier RANGE)

GO

image

Just like this case, for sophisticated schemas with variety of relationships, there pros and cons to evaluate much like other data and database design exercises. Designing with federations puts ‘scale’ as the top concern and that pushes you to evaluate various compromises on transactions and query processing.

In part 2, I’ll cover the schema annotations that you need to make with federations.

Later!

  • Hi Cihan,

     I liked all your posts on SQL Azure Federation. However I have a question:

    I have an SQL Azure Database named: [SomeDB] and I have 1 federation member for this database which is named : [Member1].

    I have created tables which are specific to the tenants in [Member1] database which is my federation member database. I have also created master tables in [SomeDB] which is my federation root database. Now I need to add foreign key relation between a table in [Member1] with a table in [SomeDB]. How would i do that?

Page 1 of 1 (1 items)