MDM Master Data Management Hub Architecture
I’m doing a presentation on Master Data Management (MDM) in February so I’m going to do another series of architecture articles like I did for Service Broker but this time for MDM. I will start with the architecture of the database that stores the master data for the system. If you need a basic introduction to what MDM is, you can find one here.
There are three basic styles or architecture used for MDM hubs – the registry, the repository, and the hybrid approach. The hybrid approach is really a continuum of approaches between the two extremes of registry and repository so I’ll spend more time on the two extremes.
In the repository approach the complete collection of master data for an enterprise is stored in a single database. The repository data model must include all the attributes required by all the applications that use the master data. The applications that consume create or maintain master data are all modified to use the master data in the hub instead of the master data previously maintained in the application database. For example, the Order Entry and CRM applications would be modified to use the same set of customer tables in the master data hub. The advantages of this approach a pretty obvious. There aren’t any issues with keeping multiple versions of the same customer record in multiple applications synchronized because all the applications use the same record. There is less chance of duplicate records because there is only one set of data so duplicates are relatively easy to detect (they obviously aren’t impossible however because things like alternate spelling, nicknames, multiple locations for the same company, typos, etc. are still possible and the MDM hub must be designed to deal with them).
While the repository approach has significant advantages for maintaining a continuously consistent source of master data, there are major issues which must be considered when designing a repository based MDM hub:
· The most obvious issue is that it’s not always easy or even possible to change your existing applications to use the new master data. If you don’t own the source for the application you may not be able to modify it to use the new master data hub. If the application’s data model is pretty close to the MDM hub’s data model you may be able to use views and linked servers to make your application think it is talking to its own data when in fact it is talking to the MDM hub. I have also seen some systems that reduce the number of changes required in the applications by creating a stand-alone application that does some of the maintenance of the master data so that not all of the application functionality needs to be ported to use the hub data. This approach is generally hard to implement in a way that users accept however. Adding customers in a different application than the one used for updates is probably unacceptably complex. On the other hand, one of the more common reasons for implementing and MDM hub is to provide clean, consistent data for a SOA implementation. If you are rewriting and wrapping your applications as services, it might not be unreasonable to create new services to manage the master data.
· Another issue that must be resolved when implementing a repository style MDM hub is coming up with a data model that includes all the necessary data without being so large that it’s impossible to use. Because the hub database is used by all applications in the repository model, it has to include all the information required for all the applications. The simple answer to this is to make the hub database a superset of all the application data models. For example a hub customer record would include all the attributes of the customer records of all the applications using the MDM hub. This is not practical because it ignores many of the problems you need an MDM solution to solve. For example, if there are 5 formats for addresses, 8 formats for phone numbers and 6 different customer id’s, making all of these columns in the customer MDM database would make the MDM hub almost unusable. Every query would have to decide which address, phone number and customer number to use and in many records only one or two formats would be populated. The obvious solution to this is to settle on an enterprise-wide standard for each of the data elements in the MDM hub and modify the applications to consume and produce the standard formats. This is not only al lot of work for the IT department but determining whose format should become the standard format is often a major political problem. Every application owner thinks that their data formats are the right ones – not necessarily because the formats are any better but because the application owner doesn’t want to make the changes required to use a different format. It’s not unusual for the meetings to settle on a data model to take as much time as the actual implementation of the project. If there are data elements that are only used by one application, the data modeling effort might decide to eliminate them. This might require significant changes to the application. Another significant data modeling issue is what to do with data elements that are not used by all applications. For example, a customer added by an order entry application would likely have significantly fewer attributes than a customer added by the CRM application or a product added by marketing might have very different attributes than a product added by engineering. In some cases it might make sense to assign default values to unpopulated attributes and in other cases you might decide to modify the application to populate the extra attributes. In an SOA implementation you may decide to populate all the attributes with the service implementation. In general there will be cases where it isn’t desirable or possible to populate all the attributes from all the applications. A typical example is the Product Information Management (PIM) part of an MDM system where it may not make sense to maintain the same attributes for a product that is purchased for resale as for a product that is manufactured in-house.
The registry approach is the opposite of the repository approach because none of the master data records are stored in the MDM hub. The master data is maintained in the application databases and the MDM hub contains lists of keys that can be used to find all the related records for a particular master data item. For example, if there are records for a particular customer in the CRM, Order Entry, and Customer Service databases, the MDM hub would contain a mapping of the keys for these three records to a common key. Because each application maintains its own data, the changes to application code to implement this model are usually minimal and current application users generally don’t need to be aware of the MDM system. The down side of this model is that every query against MDM data is a distributed query across all the entries for the desired data in all the application databases. If the query is going against a particular customer this is probably not an unreasonable query but if you want a list of all customers who have ordered a particular product in the last six months you may need to do a distributed join across tables from five or even ten databases. Doing this kind of large, distributed query efficiently is pretty difficult. This is the realm of EII (Enterprise Information Integration) so unless your requirements are relatively simple, you may want to look at distributed query tools to implement query processing in a registry model MDM hub.
There are basically two styles of repository databases used for MDM. The first has one row in a table for each master data entity and columns for the keys of the application systems. This is the most straightforward to implement and the most efficient in operation because all of the distributed queries for a given MDM record can start from the same database row. A NULL value for a particular key means that the corresponding database doesn’t contain a record for the given MDM entity. There are two significant issues with this scheme however: first, adding an application to the MDM hub means adding columns to the key matching table which isn’t a big issue but it may also mean changing queries to include the new source of information. The second and more significant issue is that this style assumes that a given database has only one record for a given MDM entity. While this would be ideal, it is rare to find this in a real application. One obvious solution to this is to first clean up the application databases so there IS only one record for each master data item. This should be one of the goals of any MDM project but it’s not always possible to make the database cleanup a prerequisite for including and application in the MDM hub. If it is impractical to clean up the application database before integrating it into the MDM hub, the repository can be designed with one row for each mapping from the MDM entity to an application record. For example, if Ford has 20 records in the CRM database the MDM hub would have 20 rows mapping the Ford MDM identity to each of the different CRM customer numbers. This styles makes for much more complex queries and also raises issues like how to deal with 10 different addresses for the same customer but it might be a necessary step in the evolution of your MDM solution. Knowing that there are 20 CRM r3ecords for Ford is a necessary first step in consolidating them into a single record.
NOTE: On the off chance that Joe Celko is reading this, yes I do know that a row in a relational table isn’t a record. When I refer to record I mean the information that an application maintains about a particular business entity. For example, a customer record may consist of one or more rows in the CustomerMaster, Address, Contact, and CreditHistory tables. I am using entity to define an enterprise wide view of a particular master data item that the enterprise uses. Examples of entities are customer, product, employee, and vendor.
The hybrid model as the name implies includes features of both the Repository and Registry models. It recognizes that in most cases it is not practical (at least in the short term) to modify all applications to use a single version of the master data and also that making every MDM hub query a distributed query is very complex and probably won’t provide acceptable performance. The hybrid model leaves the master data records in the application databases and maintains keys in the MDM hub as the registry model does but it also replicates the most important attributes for each master entity in the MDM hub so that a significant number of MDM queries can be satisfied directly from the hub database and only queries that reference less common attributes have to reference the application database. While this at first seems like the hybrid model has the advantages of both of the other models, it is important to note that it has issues that neither of the other models has. Only the hybrid model includes replicated data (other than keys) so only the hybrid model must deal with update conflicts and replication latency issues.
The following is a brief discussion on some of the architectural issues that must be considered in the design on an MDM hub database:
Data Model – In all three models the design process must include a common data model for the hub database. In the Repository model, the MDM data model becomes the hub database data model. The model includes mapping from the application data models to the MDM data model but these mappings are only used to create the hub database and define the application changes required to modify the application to use the hub database as the source of their master data. The other two hub models also require an MDM data model and mappings from the current applications but they are used differently. In the registry model, the data model is used to define queries and the mapping is used to do the necessary transformations to map the application data to the MDM data model in each query. In the hybrid model, the common attributes are replicated to the hub database and the non-common attributes are transformed as part of queries so both kinds of mapping are used. Almost by definition, there will be alternate mappings for some attributes and rules must be defined for which mapping to use. For example, a customer address is generally defined in several databases so rules must be defined to control which address to use first and which alternate to use if the preferred address isn’t there. These business rules can get to be pretty complex if many databases are integrated in the MDM hub so I will talk about business rules in a future posting. The data models and business rules are documented in the MDM meta-data and should be used as required to implement data-driven processing for populating. Maintaining and querying the MDM hub data.
MDM Hub Model – we have covered the three hub database models so let’s discuss how to decide which model to use. The repository model is the most attractive because it provides a real source of master data that is always current and consistent. The other choices involve data replication so there is usually some latency between data updates and hub update. Master data is generally fairly static so a little latency isn’t necessarily a bad thing. The non-repository approaches also maintain multiple copies of some data so consistency – keeping the copies the same - is an issue that these approaches must deal with. The down side of the repository model is that it can be extremely expensive and take a long time because it requires changes to the applications that maintain and consume the master data. The repository model makes sense if the number of applications involved in the MDM project is limited; you have enough control over the applications to make the required modifications; and the availability of authoritative and consistent master data provides enough business value to justify the time and cost required to build a repository model MDM hub.A registry model MDM hub is appropriate when only a limited number of non-performance-critical queries involve access to a significant number of the application databases integrated with the MDM hub. Registry model hubs are cheaper and quicker to implement and they can be implemented one data source at a time so they are good for incremental implementation and early return on investment. Registries are not good when queries routinely return attributes from many application databases or when there’s enough duplication of data so that determining which of several alternate sources of an attribute to return is a complex decision. In these cases, the pre-integrated and cleansed data provided by a hybrid model MDM hub provide a more efficient and consistent source of master data. It’s important to note that the hybrid model is not a single model but rather a whole continuum of options that start at the registry model and continue through to the repository model. For this reason, you may decide to start with a solution close to the registry model and then gradually expand the number of attributes integrated into the MDM hub gradually until you have an MDM repository implemented. Because MDM projects can be very large in a large enterprise with many applications, it’s good to have a strategy that allows you to implement incrementally both by gradually increasing the number of attributes stored in the hub and by adding applications to the hub. This allows you to show an early return on investment from the MDM project with a clear path to a long-term enterprise-wide solution.
This post covered the basics of how an MDM hub stores master data. In future posts I plan to cover:
· Versioning – data governance and regulatory performance are much easier with a complete version history of all changes to the master data.
· Hierarchy management – one of the more critical aspects of MDM is the creation, storage and management of hierarchical relationships among the master data records. For example, customers related to sales territories and sales territories related to financial reporting entities. Hierarchies often change more often than data entitles and may span entities managed by different applications so the ability to define and manage hierarchies independently is a useful part of an MDM system.
· Data loading, transformation and cleansing – this is pretty much the normal ETL issue and a lot of tools are available for use. Data from the application databases must be cleansed, transformed to the MDM data model and loaded into the hub database.
· Rules engines – adding records to the hub, retrieving consistent data, and maintaining data cleanliness all involve a series of decisions that are often best implemented with a rules engine.
· Synchronization – in the registry and hybrid models data additions and updates must be replicated to the MDM hub and changes to the MDM hub must be accurately and efficient replicated out to other applications that have copies of the updated item. Data models and formats may be different so simple replication often isn’t adequate for the task.
· Stewardship and governance – while business rules and automated systems can enforce the basic data requirements, ultimately the people who know the data best must resolve issues that can’t be resolved by the system.
· Meta-data – the data model of the master data and how it maps to the source system must be available both for automated processing and data governance.