MDM Master Data Management Hub Architecture – Population and Synchronization

 

If you have been following this series you should now have a good understanding of the architectural issues around deciding what your MDM hub database looks like and what kind of data is kept in it.  In this post we’ll discuss how to populate the hub with good clean data and how to ensure that the data stays clean and consistent.  This involves populating the hub database with data from the source systems initially and (with the exception of a pure repository model hub) keeping the source systems synchronized with the hub database as the source systems make changes to the data. 

 

Batch Loading – ETL

 

The initial population of an MDM hub is very similar to populating the dimension tables in a relational data warehouse.  In many cases, the same ETL (Extract, Transform and Load) tools used for data warehouse loading can be used to populate the MDM hub.  Many MDM implementations either use standard ETL tools or use tools derived from ETL tools.  A typical load process involves the following steps:

Extract the data from the source system – this should probably be done one subject area at a time to make things easier.  This is the part of the process that may require either buying or building an adapter that understands the data source.  Again, the same adapters that are used to extract dimension data for data warehouses should work here unless you are using a tool that isn’t compatible with standard adapters.  This is basically a batch operation, so many tools will do the extract into a flat file while others will extract directly into the ETL pipeline.

Transform to the hub data model – as part of the hub design process, a data model was defined along with a mapping from each source to the common hub model.  This step in the process makes the necessary changes to transform the master data entity from the application data model to the MDM hub data model.  This again is standard ETL stuff that might include changing column names, changing field sizes, changing formats of things like phone numbers and addresses to match the standard formats for the MDM hub, combining multiple columns into a single column, or parsing a single column value into multiple columns.
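To make the mapping step concrete, here is a minimal Python sketch of a source-to-hub transformation.  The column names, the hub’s phone format, and the name-splitting rule are all hypothetical, invented for illustration rather than taken from any particular system:

```python
import re

def transform_to_hub(source_record: dict) -> dict:
    """Map a hypothetical CRM customer record to a hub-model record:
    rename columns, normalize the phone format, and split a combined
    name column into separate fields."""
    first, _, last = source_record["FullName"].partition(" ")
    digits = re.sub(r"\D", "", source_record["Phone"])  # keep digits only
    return {
        "first_name": first,
        "last_name": last or None,
        # assumed hub standard phone format: NNN-NNN-NNNN
        "phone": (f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
                  if len(digits) == 10 else None),
        "postal_code": source_record["Zip"].strip(),
    }

print(transform_to_hub({"FullName": "John Smith",
                        "Phone": "(425) 555-0123",
                        "Zip": " 98052 "}))
```

A real transformation layer would be table-driven from the source-to-hub mapping produced during hub design rather than hand-coded per source.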

Check for duplicates – this process is the “secret sauce” of most MDM systems.  It is both the hardest and most important part of populating the MDM hub.  If you want a single view of your customer or product data, records describing the same business entity must be combined into a single record for each unique entity, but if your MDM system is too aggressive in finding duplicates, entities might disappear when they are incorrectly determined to be already in the system.  For example, your duplicate detection algorithm might decide that George W Bush and George H W Bush are the same person, so information about one of them might be lost.  This is one of the reasons why both versions of the record should be stored in the version history, so this kind of error can be corrected if necessary.

Some duplicate-checking algorithms are fairly simple and check for things like alternate spellings and missing words – for example, John Smith, Mr. John Smith, J. T. Smith, etc.  While these are adequate for reasonably small databases, the potential for false matches is high.  More sophisticated algorithms might check for people at the same address or with the same phone numbers.  Other systems might use external data like phone directory data or Dun & Bradstreet listings to find matches.  Many tools specialize in certain kinds of data – medical patient data, consumer goods, or auto parts, for example.  If there is a tool available for the kind of data you work with, these specialized tools can provide very accurate matching.  Other tools are more generic and often allow you to specify your own matching rules to improve the matching for your specific data.  Almost all of the matching tools provide a “degree of confidence” number for each match they detect, and your loading process should specify what confidence level is required for a match.  For example, you may decide that a 95% confidence level is enough to automatically match an entity, confidence levels between 80% and 95% should be marked for manual processing, and levels below 80% are not considered matches.  What values you choose will depend on the consequences of a false match.  If the result of a mistake is sending two marketing brochures when one would have been adequate, then the confidence level doesn’t have to be high, but if a mistake results in someone getting arrested for tax evasion, it’s good to be very sure.
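The confidence-based routing described above takes only a few lines to express.  The 95% and 80% thresholds here are just the illustrative values from the example; real values come from the business consequences of a false match:

```python
# Hypothetical thresholds from the example above; tune them to the
# consequences of a false match in your data.
AUTO_MATCH = 0.95   # at or above: merge automatically
REVIEW = 0.80       # between REVIEW and AUTO_MATCH: manual processing

def route_match(confidence: float) -> str:
    """Route a candidate duplicate based on the matcher's confidence score."""
    if confidence >= AUTO_MATCH:
        return "auto-merge"
    if confidence >= REVIEW:
        return "steward-queue"
    return "no-match"

print(route_match(0.85))
```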

Load the MDM hub database – if the new record is not already in the hub database, then this is just a matter of inserting the data into the correct tables, but if it is a duplicate, then the load process must check the business rules for this entity to decide what data to update with the incoming record.  For example, if there is no shipping address in the current record and the incoming record includes a shipping address, then the address is added.  If there is already a shipping address and the incoming record also has one, then there must be a rule specified to decide which one to keep or whether both should be kept.  If the business rules can’t resolve the conflict, then the incoming record should be put on a queue for manual processing.  If the MDM hub is a registry or hybrid model, then even if none of the data from the incoming record is used, the key of the record should be added to the database to record the connection from the hub record to the source record.  This may be used by queries to find the source record or by the hub to publish hub updates to the source systems – more on this in the next section.
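A minimal sketch of this kind of survivorship logic, assuming one deliberately simple rule set – fill in attributes the hub record lacks, keep the hub value on a conflict, and queue the conflict for manual resolution.  The field names are hypothetical:

```python
def merge_record(hub: dict, incoming: dict, manual_queue: list) -> dict:
    """Apply simple survivorship rules when a duplicate is found:
    fill attributes the hub record is missing; on a conflicting value,
    keep the hub value and queue the pair for a data steward."""
    merged = dict(hub)
    for field, value in incoming.items():
        if value is None:
            continue                    # incoming has nothing to offer
        if merged.get(field) is None:
            merged[field] = value       # hub had no value: take it
        elif merged[field] != value:
            # conflict the rules can't resolve: record it for manual review
            manual_queue.append((field, merged[field], value))
    return merged

queue = []
print(merge_record({"name": "Acme", "ship_to": None},
                   {"name": "Acme Corp", "ship_to": "1 Main St"}, queue))
print(queue)
```

Real rule sets are usually per-attribute (e.g. “most recent wins for phone, most authoritative source wins for address”), but the shape of the code is the same.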

Update the source systems – if loading a new record changes the hub database, the change may need to be propagated to one or more of the source systems.  For example, if a new, authoritative shipping address is added to a customer record, other applications which stored information about that customer may want to use the new address (I say may because there are cases where an application needs to continue with the old address and ignore the new address).  I will cover this process in more detail in the synchronization discussion but I just wanted to mention it here for completeness.  As I said at the beginning of this section, if your MDM hub uses the repository model then it will replace the databases in the source systems so this step is unnecessary.

 

The process of loading the data from a source application into the MDM hub can take a long time if there is a lot of data and if a significant amount of manual processing is required to resolve data quality issues.  In many cases it is wise to load a source application into the hub and then run for a few days or weeks to ensure everything is working correctly before loading the next application.  The load process works best if the most authoritative and complete data sources are loaded first so that subsequent loads make relatively few changes to the existing hub data but primarily record duplicates and synchronize the application data with the hub data.  Loading the most critical databases first also leads to earlier time to value which can be important in justifying the MDM investment.

 

Synchronization

 

Now that the MDM hub is populated with a single authoritative version of your master data, you need to develop a process to keep it clean and authoritative.  This means implementing a method for changes to existing data and new master data items to be transferred to the MDM hub while maintaining the same level of data cleanliness that you achieved while loading the hub from the source applications. 

 

One way of maintaining the MDM hub database is to keep any of the source applications from making changes to the master data entities, and thus force all additions and updates to the master data to be done to the hub database. This is the easiest technique to implement and manage because only one database is updated and all updates can be closely monitored and controlled to ensure conformance to business rules.  The primary difficulty with implementing this technique for maintaining the master data is that it requires that none of the source applications make updates to the master data.  For example, nobody can add a customer to the CRM system and nobody can change a product definition in the ERP system.  All changes must go through the new MDM system.  In many organizations the retraining and operational changes required to make this work are unpalatable.  On the other hand, if this MDM project is part of an SOA initiative, implementing new services to manage the master data can be incorporated into the overall SOA project.  I won’t spend a lot of time on how to build this service because it is generally a pretty basic data maintenance service.  If you have access to the source systems, you might want to use a modified version of the best master data maintenance procedures you currently have or at least use the business rules and validation logic from the source systems.  The one thing to remember here is that having a single master database does not mean you don’t have to worry about duplicates.  It’s still possible for a user to create a new entity rather than modifying an existing one (and in some systems it is actually easier to create a new entry than to modify an existing one), so the MDM hub service must still check for duplicate entries.

 

If moving all master data maintenance to the MDM hub is technically or organizationally impossible, you can consider a synchronization process that transfers changed master data records from the source application that made the change to the MDM hub.  The MDM hub then processes the change using much the same logic that was used to populate the hub originally.  This introduces the possibility of conflicting updates and inserts from multiple systems and introduces some latency between the time a change is made and when it shows up in the MDM hub database, so the business must understand the limitations of this system.  In most systems, the rate of change to a given master data entity is fairly low, so update conflicts should be pretty rare and thus reasonable to resolve either manually or with simple business rules.  This is especially true for data attributes that represent real-world entities.  For example, the chances of two conflicting changes to a customer’s phone number or address happening on the same day are pretty remote.  To further reduce the chances of update conflicts, you might introduce the concept of a preferred source for data.  For example, if it’s not feasible to change the product information maintenance process to use a new service for maintaining product data, it may still be possible to limit the maintenance of any given product to a single system.  This eliminates update conflicts without requiring a total revamping of the product maintenance process.
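The preferred-source idea reduces to a simple policy check at the point where the hub accepts an update.  The entity types and system names below are made up for illustration:

```python
# Hypothetical governance policy: which system is allowed to originate
# changes for each entity type.
PREFERRED_SOURCE = {"product": "ERP", "customer": "CRM"}

def accept_update(entity_type: str, source_system: str) -> bool:
    """Only the preferred source may update a given entity type, which
    eliminates update conflicts without revamping every process."""
    return PREFERRED_SOURCE.get(entity_type) == source_system

print(accept_update("product", "ERP"))
print(accept_update("product", "CRM"))
```

In practice the policy might be finer-grained – per product line or per region rather than per entity type – but the lookup-and-compare shape stays the same.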

 

The most significant technical challenge in transferring master data changes from the source applications to the MDM hub is detecting changes in the source system.  If you have access to the source system’s source code, you may be able to add a little logic to send each master data change to the MDM hub as it is made to the application database.  Another option is to use database triggers to detect changes, if you have enough understanding of and control over the application database to do this.  Replication might also be a good alternative if the entities are simple enough that you can determine what the entity change was from the replicated data.  Unfortunately, you may find that none of these options work in your situation, so you might have to resort to periodically querying the application for changes or even parsing audit logs to find changes.  Once you have detected a change in the source system, it should be sent to the MDM hub as quickly as possible to reduce the update latency.  I generally recommend reliable messaging for this task to ensure that changes aren’t lost in network or system failures.  BizTalk and Service Broker are probably the best alternatives for this on the Microsoft platform, but since the source applications can be running on a variety of platforms, other alternatives may be appropriate.
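As one illustration of the periodic-query fallback, here is a Python sketch that polls a table for rows modified since a remembered watermark.  SQLite stands in for the application database, the table layout is invented, and in a real system the messages would go onto a reliable queue (Service Broker, BizTalk, etc.) rather than being returned in a list:

```python
import json
import sqlite3

def poll_changes(conn, last_seen):
    """Query for rows modified since the last poll and package each one
    as a message.  Returns the messages and the new watermark so the
    poller can resume where it left off after a restart."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM customer "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,)).fetchall()
    messages = [json.dumps({"id": r[0], "name": r[1]}) for r in rows]
    return messages, (rows[-1][2] if rows else last_seen)

# Demo against an in-memory database standing in for the source application.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, modified_at TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Acme", "2024-01-01"), (2, "Globex", "2024-01-02")])
msgs, watermark = poll_changes(conn, "2024-01-01")
print(msgs, watermark)
```

Note the weakness the text alludes to: polling only sees the net state at poll time, so several changes between polls collapse into one, and deletes need separate handling.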

 

Once the change has been transferred to the MDM hub, it must be applied to the hub database.  The following diagram illustrates a typical process.  If you have trouble reading the diagram, it is available here.

 

[Diagram: Create Entity Service]

 

 

This diagram shows a CRM application adding a customer to the MDM hub by calling the CreateEntity service. Follow along with the flow while I walk through the processing steps:

·      The incoming data is mapped to the MDM data model using the same transformation used in the ETL process described earlier.  This makes the duplicate check easier and puts the record into a common format that can be used throughout the rest of the process.

·      The hub looks up the entity in the hub database to see if it is already there.  This isn’t a simple SQL query; it does all the fuzzy matching that the duplicate elimination logic did when creating the hub database.  For this reason, it’s good to choose a tool that can find duplicates in batch mode and also do the lookup one entity at a time.  As I explained in the ETL section, there are three possible outcomes of the search – duplicate entry found, no entry found, and don’t know.  If the answer is don’t know, the entity is put on a queue for the data steward to resolve (stewardship will be covered in a later post).

·      If a duplicate is found, then another application has already added this entity, so this insert will be changed to an update.  The entity in the hub is checked to see if there is already an entry from the CRM application for this entity.  If there is, then this entry is a duplicate in the CRM database, so the entity already in the CRM database is updated with the new data and the entity that the CRM application is trying to add will be deleted to eliminate the duplication.  On the other hand, if the entity in the MDM hub doesn’t currently have a key for the CRM application, the key for the incoming entity is added to the hub entity and the incoming entity is passed on as an update to the approval workflow.

·      If no entry was found in the MDM hub for the incoming entity, it is passed to the approval workflow as an insert.  At this point, the three streams converge again and an automated workflow checks the data update or insert to verify that it meets all the business rules for the MDM hub as defined by the data governance activity of the enterprise.  Some examples of business rules might be which fields require values, allowable value ranges, addresses verified with an external vendor, a D&B number valid for this business, and prices in the correct currency.  If at any point there isn’t enough information to determine whether a rule is satisfied, or a rule determines that manual approval is needed, the entity will be placed on the stewardship queue for manual approval.

·      If the entity passes the approval workflow process then it is inserted or updated in the hub database as required.  In the same transaction, the version information for this entity is updated with the previous values.  If the entity contains information that can be used to derive any of the managed hierarchies for this record, the required entries are inserted in the hierarchy database.  For example, the customer entity may be linked to a support contract entity and a sales territory entity based on contract-id and territory-id fields in the customer entity.

·      When the entity has been added to the hub database, the changes are published out to the other source systems.  In some MDM systems this means publishing every change to every system but in most cases only a subset of the source systems are interested in changes to a given entity.  For example, if your ecommerce system has hundreds of millions of customers it probably doesn’t make sense to push them all into the CRM system.  A set of business rules can be used to determine which applications receive new master data updates based on entity type or source.  Another alternative is to only send updates to data if the source application has a key in the MDM hub.  In that way, an entity has to be added to the application before it is managed by the MDM system.   The various methods of publishing updates back to the source applications are described in the next section.
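The routing decision in the steps above can be condensed into a sketch.  The thresholds, key names, and return values are illustrative assumptions rather than part of any particular product, and the approval workflow and publishing steps are left out:

```python
def create_entity(record, hub, match, source_system):
    """Condensed sketch of the CreateEntity routing.  `record` is already
    mapped to the hub model; `match` is a (hub_key, confidence) pair from
    the fuzzy lookup, or None if nothing similar was found."""
    if match is None:
        return ("insert", record)            # no entry found: new entity
    hub_key, confidence = match
    if confidence < 0.80:
        return ("insert", record)            # too weak to call a match
    if confidence < 0.95:
        return ("steward-queue", record)     # don't know: manual resolution
    entity = hub[hub_key]
    if source_system in entity["source_keys"]:
        # the source system itself already contributed this entity, so the
        # incoming record is a duplicate within that source
        return ("source-duplicate", hub_key)
    # record the hub-to-source linkage, then treat the insert as an update
    entity["source_keys"][source_system] = record["source_id"]
    return ("update", hub_key)

hub = {"C1": {"source_keys": {"ERP": "e-9"}}}
print(create_entity({"source_id": "crm-1"}, hub, ("C1", 0.97), "CRM"))
print(hub["C1"]["source_keys"])
```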

 

Publishing updates

The first architectural decision you must make about publishing updates is whether you need to do it at all.  Some MDM systems are used to provide a single source of master data for enterprise reporting or performance management and don’t require all the source applications to use the new master data.  In many organizations, the political ramifications of an MDM system directly updating one of the key enterprise applications will prevent automatic propagation of updates to some systems.  On the other hand, creating a clean source of master data is a significant effort, so it seems like a waste of resources not to propagate the cleaned-up data to all the source applications.

If you determine that you need to publish master data updates, the next decision to make is whether to push updates out to the source applications or let the source applications pull the changes from the hub.  Pull is generally easier to implement and manage, but push reduces the time between when the hub is updated and when the updates are available in the source applications.  Pull is also generally easier to implement between heterogeneous systems.  If your MDM hub runs on SQL Server and one of the source systems is IMS on a mainframe, it will probably be much easier to have the mainframe read a change file than to write an application to push changes from Windows into the mainframe application.  This is the classic tradeoff of capability against complexity, and the deciding factors are usually the requirement for up-to-date master data weighed against the difficulty of doing the integration.

The push option looks like replication on the surface, and in some cases replication may be the best way to push the changes.  This works if the source application data model is pretty close to the MDM hub data model and there is a replication connection available.  If the two data models are significantly different, if replication isn’t available between the databases, or if directly updating the source application’s database isn’t allowed, an application integration solution like BizTalk is probably the best choice.  If necessary, this can include complex data transformations and even an orchestration to do the update in multiple steps.  Orchestration can also be used to selectively publish updates to only the applications that require them.  For example, only CRM systems that contain a record for a customer would receive updates for that customer.  If you are going SQL Server to SQL Server, Service Broker is also a good choice for a reliable asynchronous connection and transactional application of the required changes.

If the effort and complexity of implementing and maintaining a push solution are excessive, you may have to implement a pull solution.  The simplest pull solution is to allow the application to query the MDM hub database (or preferably read-only views of the database) directly to obtain the required data.  If the amount of master data is pretty small, the application can periodically refresh its master data completely, but in most cases the application will want to refresh only what has changed.  Timestamp columns are the most common approach to this issue.  Each application keeps track of the last timestamp it has read and only retrieves data with timestamps greater than its remembered value.  The downside of pulling data directly from the database is that if it is done frequently by a large number of applications, it can cause significant performance degradation.

A pull alternative that makes it easy for applications to apply changes and reduces the load on the MDM hub database is to write changes into a journal or log.  This can be either a database table or a flat file.  If updates are sequentially numbered, an application can track which updates it has processed already.  If the number of applications pulling data is relatively small, it might make sense to generate a separate journal for each application.  This can be a lot of extra IO, but it makes things easier to manage if each application can maintain its own journal – deleting the records it has processed, for example.  On the other hand, you may want to maintain a journal of changes for auditing purposes anyway, so this journal can do double duty.  In a pull architecture, the application might pull updates itself, or an external tool that’s either custom written or implemented with an ETL tool can periodically read the changes and apply them to the source application.
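A sequentially numbered journal and the consumer-side bookkeeping take only a few lines to sketch.  An in-memory list stands in for the journal table or flat file, and the class and method names are invented for the example:

```python
class ChangeJournal:
    """Sequentially numbered journal of hub changes.  Each consuming
    application remembers the last sequence number it has applied and
    asks only for entries after it."""

    def __init__(self):
        self.seq = 0
        self.entries = []        # (seq, change) pairs; a table or file in practice

    def append(self, change):
        self.seq += 1
        self.entries.append((self.seq, change))

    def read_after(self, last_seq):
        return [e for e in self.entries if e[0] > last_seq]

journal = ChangeJournal()
journal.append({"customer": 1, "phone": "425-555-0123"})
journal.append({"customer": 2, "ship_to": "1 Main St"})
print(journal.read_after(1))   # what a consumer at watermark 1 still needs
```

Because each consumer owns only its watermark, a crashed consumer simply resumes from its last committed sequence number, and the journal doubles as an audit trail.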

It’s also possible to do both push and pull if your application requires it.  For example, one of the destinations you push updates to might be a service that writes a journal to support pull applications.

Data integrity and reliability

Something you must keep in mind when designing an MDM infrastructure is that all the complex processing that happens in the background to support master data synchronization must be extremely reliable.  Losing updates will reduce the accuracy of the master data and can cause users to lose confidence in it.  This means that as you design the code that will handle data movement and manipulation, you must ensure that every action that changes or moves data is transactional and recoverable.  Messaging should always be transactional to ensure messages don’t get lost or duplicated, and all the asynchronous processing involved in the MDM workflows should be managed by transactional queues so that it can be restarted from the previous state if the system shuts down in the middle of processing.  Obviously, if I were designing the hub, it would use Service Broker to control its operations, but I’m sure BizTalk would provide the same reliability and would make the workflow and business rules much easier to implement.  There are also transactional messaging environments available on non-Microsoft platforms.  The point is not which tool you use but rather making sure that as you go through every step of the process design, you consider what happens if the system shuts down unexpectedly or a disk drive fails.
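The dequeue-process-commit pattern behind a transactional queue can be illustrated with a small sketch.  SQLite stands in for the queue and hub tables (a real system would use Service Broker, BizTalk, or another transactional messaging product), and the table layout is invented for the example:

```python
import sqlite3

def process_one(conn):
    """Dequeue one message, apply it, and mark it done in a single
    transaction: if the process dies partway through, nothing commits
    and the message is still on the queue after restart."""
    with conn:  # sqlite3 commits on clean exit, rolls back on exception
        row = conn.execute(
            "SELECT id, payload FROM queue WHERE done = 0 "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None                      # queue drained
        msg_id, payload = row
        # "apply the change" and "consume the message" commit atomically
        conn.execute("INSERT INTO hub_log (payload) VALUES (?)", (payload,))
        conn.execute("UPDATE queue SET done = 1 WHERE id = ?", (msg_id,))
        return payload

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT, "
             "done INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE hub_log (payload TEXT)")
conn.execute("INSERT INTO queue (payload) VALUES ('customer-update-1')")
print(process_one(conn))
```

The single-transaction scope is the whole point: because the apply and the mark-done commit or roll back together, a message is never lost and never applied twice after a crash.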