MDM Master Data Management Hub Architecture – Maintenance and operation
If you have been following this series, you have learned how to design an MDM hub database, how to load it with master data and how to keep the data synchronized with your operational systems. In this post I’ll discuss the tools and processes you will need to design to keep it running smoothly.
Metadata as I sure you have heard often is data about data. In MDM systems – as in any data integration system – metadata is critical to success. As a minimum for every column of every table in the MDM hub there must be accurate data about where the value came from, what transformations were performed to get it into the hub data model format, what business rules were applied to it, and which applications receive updates to it. As with the master data itself, metadata is only useful if it is current and accurate. For compliance reasons you generally have to be able to prove that the metadata accurately describes the master data. The easiest way to furnish this proof is to show that the processes that handle the master data are derived directly from the metadata or vice-versa. For example, the mappings and transformations done be the ETL tools might be driven directly from the metadata or the ETL tools might populate the metadata repository whenever a transformation pipeline is created. Business rules should be taken directly from the metadata whenever possible. This is especially important in MDM because the business rules often determine which of several alternate values is used to populate a data element.
Reading business rules and transformations directly from the metadata repository whenever they are used can be a performance issue so either the information is cached or the actual code is generated from the metadata. The MDM metadata may also include descriptions of processing done outside the MDM hub. For example, the transformations required to map the hub data model back to the source systems that subscribe to it will probably execute on the source system but must be defined in MDM metadata.
Since I have used the term several times I should probably talk about metadata repositories. As the name implies, a metadata repository stores and manages metadata. Most enterprises that have a Data Administration group already have a metadata repository loaded up with information about all the databases and systems in the enterprise. This can be an invaluable resource for understanding the data models of the master data source applications and which applications maintain which data elements in the database. A careful analysis of this metadata will go a long way toward determining what the MDM hub data model should look like. If you have a repository then the MDM metadata should be stored there also and the relationship facilities of the repository should be used to record the mapping between each of the source data stores and the MDM hub data model. This information is required both to implement the loading and synchronization processes and to document the derivation of each data element of the master data.
If you don’t have a repository and you don’t have the time or budget to buy and populate one for the MDM project, you can consider storing the metadata in normal database tables. A useful starting point for what your metadata should look like are the standard Information Schema description provided in many database systems. For example, look at the INFORMATION_SCHEMA views in the SQL Server database. You can tie the source data definitions to the hub definitions with link tables as I described here. These links can be tied to the transformations and business rules used to map the source to the hub schema. The following diagram is a much simplified version of what the data model for a metadata repository would look like:
The mappings tables are link tables that tie the source data definitions to the hub data definitions. I didn’t include any attributes for the entities because I wanted to keep things simple. Most of the modern ETL and Business rules tools store their configuration and logic as XML files so many of the attributes will be XML columns. Building a metadata repository can be a daunting task so start simple and add features as you need them. In a previous life I spent five or six years building and using metadata repositories so I’ll probably do a separate article on repositories after this current series is complete.
Stewardship and Governance
I hesitate to talk about stewardship and governance together because they are different things that often get confused with each other but there’s a close relationship between them so I’ll try to explain the differences.
In my simplified view, data governance is the process of defining the rules that data has to follow and data stewardship makes sure that the data follows those rules. You might compare this to the legislature that makes the laws and the police who enforce them but don’t carry the analogy too far. Because poor quality master data can impact many core applications of a business, governance and stewardship are important functions in an MDM process.
The increasingly strict regulatory compliance environment of the post-Enron world has made data governance a necessity for many companies. It’s not enough to have the numbers; you have to be able to show that the numbers are based on accurate and verifiable data. This means both a governance function to demonstrate that the right controls are in place and a stewardship function to ensure that the controls are enforced.
Governance is a huge topic so I’ll just give a few examples of things governance controls. The governance rules should include who can read, create, update, and delete data, what validity checks are required for data, which application is the preferred source for data items, how long data is retained, what privacy policies are enforced, how confidential data is protected, and what disaster recovery provisions are required – to name a few. The data governance function should include leaders of both the IT and business groups of a company. This used to be a hard thing to do but with CEO’s going to jail and data on stolen laptops making front-page news, management is paying a lot more attention.
The role of a data steward is to ensure that the master data is clean, consistent and accurate. In some of the earlier posts of this series I already talked about cases where data quality couldn’t be determined with automated rules and workflows so manual intervention was required. The people doing this manual intervention are the data stewards. There have been long-running debates over whether a data steward should come from the business or IT side of the house. I personally think the right person to be a steward for a particular collection of master data is the person who understands the data the best. In many cases, only someone who understands the business can make the tough calls about which value is correct for a particular data element but in most companies, getting a business expert to do what she sees as IT’s job is difficult at best.
The technical aspects of data stewardship involve a set of tools that help a steward find, analyze and fix data quality issues. These tools are generally integrated into a “stewardship console” that incorporates the data profiling, data analysis, and data modification tools into a single UI. If your data stewards are business people, the stewardship console should be simple and highly automated. In organizations with complex governance rules and approval processes, workflow can be a useful part of the stewardship console. Basically, master data updates that can’t be approved automatically are placed on a queue for the appropriate steward to resolve. An automation supported human workflow can handle the routing and approval processes for these changes.
Data profiling tools can scan data for violations of business rules, missing values, incorrect values, duplicate records, and other data quality issues. Profiling the data in your source systems is a good place to start an MDM project so you can find out how much trouble you are in. Profiling can help you choose the authoritative source for data and design the ETL logic required to clean up and load the data into the MDM hub. Profiling should also be done periodically after the MDM system is in place to find data quality issues that the MDM system isn’t fixing.
Once you have a clean, accurate source for master data, you will need to be able to export it to other systems that need it. For example you may need to periodically export your product master data to be used in a data pool or a marketing campaign. This isn’t rocket science and most databases include tools for exporting data in a given format or XML schema so I just mention it here because your project should include a plan for doing exports.
This includes reports of the master data itself – customer lists, product information, org charts. etc. – but reports on the health of the MDM hub itself. Things like the number of rules violations detected, the number of manual interventions required, and the average latency of master data updates will help the IT organization discover issues early enough to prevent major problems. A solid reporting system that produces “canned” reports and allows user to design their own reports is an important part of the MDM hub.
Workflow and Business rules
I think I have already covered all the places I can think of where workflow automation and a good business rules engine are required. The point I want to make here is that a good rules engine is critical to success in an MDM hub. In many cases the rules are established by relatively unsophisticated data stewards to a simple wizard or UI for developing rules may be required. The rules will often involve retrieving and manipulating database data so a rules engine that has good database integration would also be useful.
Both automated workflow and people workflow are required so a workflow engine that handles both would be useful. In a Microsoft environment the BizTalk workflow and Business Rules Engine (BRE) are probably the closest fit. You might also consider WF and SharePoint workflow but their rules support is not as complete as BRE.
An MDM project will probably need a data modeling tools for recording the data models for the source applications and the MDM hub. If you have a repository, the modeling tool should integrate with the repository. Other than that, most quality modeling tools will work – the requirements are pretty minimal. ETL tools for loading the hub data, data quality tools, and application integration tools are required for loading and synchronization. If your hub uses a registry or hybrid model, a distributed query tool may be needed for queries against the master entities when some parts of the data are stored in the source systems. A tool for defining and maintaining hierarchies is required for hierarchy management. I have already talked about the need for a simple, user-friendly stewardship console. Be sure it has a good way to browse versions and hierarchies in addition to the entity management capabilities.
I hope this series of four posts helped you understand some of the architectural issues around master data management. I didn’t come close to answering the questions but I firmly believe that architecture is less about having all the answer and more about know what questions to ask. While most of the examples of specific products I used with Microsoft, I tried to make this as vendor-neutral as possible. I plan to do an article on MDM on the Microsoft platform later but that will take some research.
If you learned nothing else from this series I hope you understand that MDM is both a pretty straightforward application of technologies that have been around a while and a fairly massive undertaking in most organizations. If your organization already has a data administration function or at least some reasonably competent data modelers and data architects, you probably have the skills you need to succeed. Most large organizations have massive amounts of master data so getting it all under control will take a while. Start with something that has limited scope but also something that will have a lot of value to the organization. Early success is important not only to get management to continue the project but for the project team to gain the satisfaction and confidence that deploying something that has significant impact brings. Once you have one part of the problem solved, learn from the experience and repeat the process as many times as necessary to complete the MDM process.
As I said at the beginning of this series, this material is part of my preparation for an MDM presentation next month so I would REALLY appreciate any feedback you have on it. If there are thing missing or wrong or unclear, I would appreciate knowing about them. If you don’t want to post a comment, just send me an email. Thanks for taking the time to read this.