Master Data is not Metadata


I regularly get emails from people asking about the Microsoft Metadata story. I assume that’s mainly confusion over the MDM acronym, which could reasonably be interpreted as Metadata Management as well as Master Data Management, but there’s also enough overlap between Master Data and Metadata to lead to confusion. It turns out I’m uniquely qualified to comment on Master Data vs. Metadata. I spent a few years in the early 1990s building one of the early metadata repositories. I worked with four other repositories after that, and now I’m working on a Master Data product.


Metadata, as I’m sure you’re aware, is data about data. It describes data but it isn’t generally considered business data. For example, customer metadata would describe the attributes of a customer entity, the data type and size of each attribute, which programs produce the data, which programs consume it, what business rules are enforced on the attributes, etc. In a BI environment, derivations, transformations, source systems, and last load times are also important. The key thing to understand is that you can know the complete metadata for customers without knowing who a single customer is. Master Data, on the other hand, is the real business data. Customer master data is an authoritative list of customers.
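To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the attribute names, the program names such as `crm_onboarding`, and the `rule_violations` helper are not taken from any product. The point is that the metadata fully describes a customer without naming one, while the master data is the list of customers itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AttributeMetadata:
    """Describes one attribute of an entity; holds no business data."""
    name: str
    data_type: type
    max_length: Optional[int]
    produced_by: str
    consumed_by: Tuple[str, ...]


# Metadata: a description of the customer entity (hypothetical values).
# Nothing here tells you who any customer actually is.
CUSTOMER_METADATA = (
    AttributeMetadata("customer_id", int, None, "crm_onboarding", ("billing",)),
    AttributeMetadata("name", str, 100, "crm_onboarding", ("billing", "warehouse_load")),
)

# Master data: the authoritative list of customers themselves.
CUSTOMER_MASTER = [
    {"customer_id": 1, "name": "Example Corp"},
    {"customer_id": 2, "name": "Sample Ltd"},
]


def rule_violations(record, metadata):
    """Return descriptions of the ways a record breaks its metadata."""
    problems = []
    for attr in metadata:
        value = record.get(attr.name)
        if not isinstance(value, attr.data_type):
            problems.append(f"{attr.name}: expected {attr.data_type.__name__}")
        elif attr.max_length is not None and len(value) > attr.max_length:
            problems.append(f"{attr.name}: longer than {attr.max_length}")
    return problems
```

You could hand `CUSTOMER_METADATA` to someone and they would know exactly what a customer record looks like, yet they still could not name a single customer. The two meet only when the metadata is used to check the business data, as `rule_violations` does.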


While Master Data and Metadata are two distinct things, managing master data generally requires working with metadata. The Microsoft Master Data hub is metadata-driven. The data model used to store the Master Data instances is defined with metadata stored in the hub. Data stewardship and data governance depend on metadata to understand where the data comes from, what each attribute means, what transformations are done when the data is loaded, what business rules are satisfied or violated, and who modified the data. The Microsoft MDM hub stores most of this metadata, and the types of metadata stored will be expanded before we release the product. The metadata required to manage master data is probably one source of the confusion between the two types of data. While the MDM hub stores significant amounts of metadata, all of it relates directly to the master data, so you can’t really describe a master data hub as a general-purpose metadata repository.
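As a rough illustration of what "metadata-driven" means in general, the sketch below stores a model description as rows in a table and then builds the storage for master data from that description. This is not how the Microsoft hub is implemented. The `hub_metadata` table, the `MODEL_METADATA` dictionary, and the `create_hub` function are all assumptions made up for this example, and it uses Python's built-in `sqlite3` module only because it is easy to run.

```python
import sqlite3

# Hypothetical model metadata: entity name -> list of (attribute, SQL type).
MODEL_METADATA = {
    "customer": [("customer_id", "INTEGER PRIMARY KEY"), ("name", "TEXT NOT NULL")],
    "product": [("product_id", "INTEGER PRIMARY KEY"), ("description", "TEXT")],
}


def create_hub(conn, model):
    """Store the model as metadata, then build one table per entity from it."""
    conn.execute(
        "CREATE TABLE hub_metadata (entity TEXT, attribute TEXT, sql_type TEXT)"
    )
    for entity, attributes in model.items():
        conn.executemany(
            "INSERT INTO hub_metadata VALUES (?, ?, ?)",
            [(entity, name, sql_type) for name, sql_type in attributes],
        )
        columns = ", ".join(f"{name} {sql_type}" for name, sql_type in attributes)
        # Identifiers come from trusted metadata in this sketch, not user input.
        conn.execute(f"CREATE TABLE {entity} ({columns})")


conn = sqlite3.connect(":memory:")
create_hub(conn, MODEL_METADATA)

# The master data instances go into tables whose shape the metadata defined.
conn.execute(
    "INSERT INTO customer (customer_id, name) VALUES (?, ?)", (1, "Example Corp")
)
```

In a design like this, adding a new entity means adding metadata rather than writing new code, and every row of metadata describes something in the hub itself. That is why such a hub holds a lot of metadata without being a general-purpose repository.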


Now that we’ve talked about the difference between master data and metadata, we’ll dig a little deeper into metadata because I think there are some parallels with Master Data Management that are useful to understand. Metadata is usually managed in a repository. The original metadata repositories modeled the metadata as entities, attributes, and relationships. Some of the more recent repositories use objects and properties, but the models are pretty similar. Metadata repositories evolved from data dictionaries that were used to manage the schemas for databases. A repository can model the whole IT environment to provide a unified picture that goes well beyond database schemas. An accurate enterprise model can provide valuable insight for impact analysis and drive data analysis and data integration projects. In the current compliance and auditing environment that enterprises operate in, tracking where data is produced and what processing and transformations are done to it is almost as important as the data itself. A metadata repository populated with accurate, current metadata is a great resource for compliance, auditing, and integration projects.


The problem with metadata repositories is that they require constant maintenance to ensure the metadata is current and correctly represents the data it describes. In many cases metadata repositories were laboriously populated with data descriptions and documentation, but without adequate tools and procedures to keep the metadata current, it gradually became inaccurate and the users stopped using it because they couldn’t rely on it. The basic issue is that there really aren’t good tools to capture all the metadata that a typical enterprise needs to track. There are many different sources: database schemas, ETL logs, source code, copybooks, design documents, policies, documentation, etc. Some of this data can be extracted using automated tools and some of it must be entered by people who understand the systems involved. The resulting system can be very complex and require a fair amount of manual data entry. The more effort required to keep the metadata current, the less likely it is to stay current. If these sound like the same kinds of issues that a master data management system runs into, you’re right. A master data hub that isn’t surrounded by the tools, processes, and policies required to maintain the quality of the data will gradually lose value and become unusable.
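For the part of the metadata that can be extracted automatically, a staleness check is simple to sketch. The example below compares what a hypothetical repository has recorded for a table against the live schema, using SQLite's `PRAGMA table_info` through Python's `sqlite3` module. The `recorded` dictionary and the `live_columns` and `metadata_drift` functions are invented for illustration. A check like this covers only schema metadata and says nothing about design documents, policies, or anything else that people have to enter by hand.

```python
import sqlite3


def live_columns(conn, table):
    """Read the actual column names and declared types from SQLite."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}


def metadata_drift(recorded, live):
    """Compare recorded metadata with the live schema and report differences."""
    shared = set(recorded) & set(live)
    return {
        "undocumented": sorted(set(live) - set(recorded)),
        "missing": sorted(set(recorded) - set(live)),
        "type_changed": sorted(n for n in shared if recorded[n] != live[n]),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT, region TEXT)")

# What a hypothetical repository believes the table looks like.
recorded = {"customer_id": "INTEGER", "name": "VARCHAR(100)", "fax": "TEXT"}

drift = metadata_drift(recorded, live_columns(conn, "customer"))
```

Here `drift` reports a column the repository never heard about, a column that no longer exists, and a column whose type has changed. Run regularly, a check of this kind is one way to notice that the repository has stopped matching the data it describes.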


I may be stretching the limits of causality, but I believe that some of the problems that master data management is trying to solve were actually caused by flawed metadata solutions. About twenty years ago the state of the art in metadata management was the data dictionary, carefully maintained by a data administration (DA) organization. The DA organization maintained control over data quality and consistency by requiring all changes to the database schema to be fully documented and approved before they were implemented. While this ensured high-quality database schemas and accurate metadata, the heavyweight process was very frustrating to developers trying to keep up with their users’ demands. Resourceful developers solved this problem by installing a departmental server with SQL Server or some other easy-to-use database, copying the data they needed out of the corporate system, and making the schema changes they needed without DA approval. In a few years there were many variations on the corporate schema, with data that was not necessarily synchronized with the corporate databases. Eventually these quick-hit applications became the core enterprise systems, replacing the corporate databases, and in many cases the DA organization, with loosely coupled data chaos. Several years later, MDM was invented to get a handle on the many disparate data sources.