MDM Master Data Management Hub Architecture – Versions and Hierarchies

 

In my last post I talked about the options for implementing an MDM hub.  In this post, I will drill into that a bit by discussing versions and hierarchies – two features that are key to an MDM hub implementation.  I’ll talk a little about why they’re important and the present a few implementation options.

 

Link Tables

 

In the implementation options for both of these features I’ll refer to link tables frequently so I thought I would explain what I mean when I say link table.  If you’re already a linked table expert, feel free to skip to the next section.  

 

One of the fundamental concepts in relational databases is using a foreign key to define a relationship between related rows.  This is done by storing they key of the related row in a column of the other row.  For example, if I have a table of customers and another table of addresses, I can specify the shipping address for a customer by placing the primary key of the address table in a column named “shipping-address” in the customer table.  When you want to find the shipping address for a customer you use the value in the shipping-address column for that customer to look up the address.  Many customers can use the same address by using the same key in their shipping-address column but there’s no good way to model a single customer with many shipping addresses.  In reality, many customers can have the same address and one customer can have many addresses.  This is called a many to many relationship and the easiest way to model this is with a link table.  A link table looks something like this:

Simple Link Table

 

Another useful property of link tables is that columns in the link table can be used to represent properties of the relationship.  For example, a relationship between customers and addresses might represent a shipping address for a customer or a billing address for a customer.  You could represent this by having two different link tables – one for shipping addresses and one for billing addresses – or by having a single link table to link customers and addresses with a link type column that is used to differentiate between shipping address links and billing address links:

Typed Link 

 

Notice that all the information about the relationship is included in the link table.  Nether of the tables that are tied together have any information about the link.  This means you can create a new relationship between tables that are part of applications that can’t be changed.  For example, you can create a relationship between a customer record in the CRM application and a territory record in the Sales Force Automation application without changing either database.

 

Versioning

 

Data governance and regulatory compliance are much easier with a complete version history of all changes to the master data.  It is often not enough to know what a customer’s credit limit is today; you need to know what his credit limit was three months ago when he was charged a high interest rate for exceeding his limit.  While this is a simple example, there are many cases where knowledge of past values for master data attributes may be required.  This leads to versioning as a key feature for master data management systems.  Versions are also required to support data stewardship and governance activities on master data.  When master data is modified, business rules are applied to the modifications to determine if they meet the rules developed by the data governance organization.  Data stewards also use version information to monitor the results of the updates and if necessary restore the original values.

 

When most developers think of versioning, they picture source-code control systems that have full branching and merging capabilities.  If your MDM hub needs this type of versioning, the versions are generally implemented with link tables that link rows in a version table with particular version of the MDM record.  A simplified diagram of the links might look something like this:

 

Versions

 

Notice that John Smith changed in version 1.1 so there are two different rows for John Smith but Sam Spade did not change so both versions point to the same row.  In this schema, adding a new branch involves adding a row to the version table and creating rows in the VersionLink table for every customer.  As customers are updated, a new row is inserted for each modified customer row and the link is changed to point to the new row.  While this method offers a lot of flexibility, millions of customers and hundreds of branches produce huge link tables so managing the volume of data can be an issue.  Also, even fairly simple queries like “select all customers with a past due invoice” involve multiple joins to obtain the right version of the customer records.  In my opinion, most MDM systems don’t require this level of versioning flexibility and trading reduced flexibility  for simplicity and performance is a good option.

 

One of the simplest versioning schemes is to add an “EffectiveDate” column to each master data row.  When a master data item is modified, a new copy of the row is inserted with the “EffectiveDate” column set to the date and time that the change was made (OK maybe it should be EffectiveDateTime).  When you want to query the latest version of all customers, you look for the MAX(EffectiveDate).  If you want to know what a customer record looked like on a particular date, you look for the row with the maximum EffectiveDate where the EffectiveDate is less than the date you are looking for.  This versioning scheme is another place where link tables are a handy tool.  If you model a customer shipping address as a foreign key column in the customer row, changing the address requires adding a row to the address table for the new address and also adding a new row to the customer table that contains the address key.  With a link table, you just add the address table row and a corresponding row in the link table.  Whether this is a good thing or not depends on whether you consider a change to the shipping address to be a change to the customer.  If you want to track the address change as a new version of the customer record then the foreign key relationship is a better way to model the customer record because each change to the address produces a new version of the customer row. .

 

One of the down sides of maintaining a version history of all your master data entities is that even simple queries have to deal with versions so they retrieve the correct version of the data.  One way to simplify this is to create a view that exposes the latest version of all objects so that users who only care about the latest version can write simple queries and only users who need a particular version need to deal with the versioning complexity.  Another alternative solution that also may reduce the management overhead of the hub database is that instead of inserting a new row into the master data table when a record is modified, actually modify the master record in place and put the old version into a history table.  This can make your master data tables orders of magnitude smaller in addition to making non-version queries simpler to write.  Since the history data is accessed less often than the latest version, it can be stored on slower-cheaper disks reducing the overall cost of the system.   Another problem the history table approach solves is what happens when the master data schema changes.  For example when you add columns to the customer table, what value do you put into the new rows for old versions of the customer record that didn’t include the columns or more importantly, if you drop a column, what happens to the information stored in older versions?  With history tables, each schema version can be stored in a separate history table with the schema that was in use at the time the rows were created.  This obviously makes queries against historical data more complex because you will need to know which table contains the versions you want but it provides a more accurate representation of history – another tradeoff to consider.

 

Hierarchies

 

Hierarchy management for purposes of this article is the ability to define and store relationships between master data records in the MDM hub.  Relationships are a critical part of the master data - products are sold by salesmen, employees work for managers, customer companies have subsidiaries, sales territories contain customers, products are made from parts.  All these relationships make your master data more useful. 

 

Many relationships exist in your current systems.  For example your HR system may track who works for who or which organization pays your salary.  Other relationships may only be possible to define because the MDM hub integrates the data from multiple systems.  For example linking a customer in the CRM system to a service contract in the customer service system may be difficult to do if the systems aren’t aware of each other but if both the customers and service contracts are stored in the MDM hub, a link table can be defined to track this relationship.

 

Some hierarchies are special purpose or temporary.  For example, if your development teams are organized in a matrix structure, expenses and salaries may be rolled-up to a management structure for budgeting and to a project structure for time and expense reporting.

 

MDM hierarchies should be named, discoverable, versioned, governed, and shared.  For example, if I want to know how expenses for the XYZ project are rolled up or who reports to John Smith, I should be able to select the appropriate hierarchy from a list and know whether it is authoritative and when it took effect.  This means that everybody who looks at project expenses will use the same structure instead of everyone using whatever spreadsheet they happen to find.  This also means if an auditor wants to know who was working on the project on November 2, 2004, there is a single authoritative place to find the answer.  CEO’s love this stuff because it tends to keep them out of jail.

 

In order to support relationships between entities without requiring changes to the entities, most hierarchies are implemented as link tables.  If the data already contains relationships imported from the source systems, it generally makes sense to leave those relationships alone to maintain the fidelity between the MDM hub and the source system but you may decide to convert them to hierarchies implemented as link tables to take advantage of the hierarchy management features of the hub and to provide a standard format for hierarchies.

 

Here’s a simplified view of what a hierarchy management data model might look like:

 

Hierarchies

 

In reality there would be quite a bit more meta data about the hierarchy and probably more properties on the link table rows.  Whether you implement all hierarchies in the same table or create a table for each hierarchy will depend on how uniform your hierarchies are and how big they are.  One hierarchy per table is the correct way to model it from a relational theory standpoint but if you have hundreds of fairly small hierarchies, combining them may simplify database maintenance.  There are a number of intermediate options also.  For example, you may groups all the hierarchies that use the same pair of keys into a single table or group them by use – accounting in one table, HR in another, and CRM in a third.