MDM - Scaling Out Matching

 

It’s been a long time since my last post but I’ve been pretty busy figuring out how IT works.  As you might expect, one of the hardest things has been learning all the new acronyms.  Not only that but some of the acronyms I thought I knew mean different things in IT than they did in the product groups.  For example, I thought OBA meant “Office Business Application” and found out it means “On Boarding Application” only after several weeks of confusion.  My current challenge is to figure out how to scale MDM matching out to meet our future requirements -  matching hundreds of millions of incoming records a week against a billion or more customer records in our MDM database. So far we haven’t found an engine that can do this with a single instance so we are trying to figure out the best way to scale matching out over a large number of matching servers.

 

There are basically two types of matching engines that I’m aware of:

 

1)    Database engines that build an index in the same database as the MDM data.  This has the obvious advantage that as long as the index is built in the same transaction as the MDM data is inserted or updated, the index is always completely up to date so you don’t have to worry about two incoming records for the same customer not matching because the first record isn’t indexed when the second record hits the matching engine.  This engine also uses the database to do the searching for matches so it scales as well as the database scales.  The down side of this type of matching engine is that scaling the database can be pretty expensive which limits the total scalability of a database solution.  We currently are using SQL replication to scale out by using separate copies of the database for searching and other matching applications that don’t update the database.  This not only speeds up these read-only matching applications but reduces the load on the MDM hub itself because it only does merging of new and modified records. 

We’re currently doing some prototyping to see if this type of engine will scale to matching hundreds of millions of incoming records against a billion or more master records.  My current thinking is that we will have to partition the data in order to scale to the number of master records we require.  The obvious issue is what attributes to partition the data on that will produce an adequate number of partitions and allow us to reliably determine which partition the record is in.  I will talk more about partitioning later because I think partitioning is the key to making scaleout work.
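The point above about building the index in the same transaction as the data can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the hub database. The table names and the `match_key` function are hypothetical, and the exact-key lookup is far simpler than what a real matching engine does.

```python
import sqlite3

def match_key(name: str, postal_code: str) -> str:
    """Hypothetical, greatly simplified match key: normalized name plus postal code."""
    return "".join(name.lower().split()) + "|" + postal_code.strip()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, postal_code TEXT)")
conn.execute("CREATE TABLE match_index (match_key TEXT, customer_id INTEGER)")
conn.execute("CREATE INDEX ix_match_key ON match_index (match_key)")

def find_matches(name, postal_code):
    """Look up existing customers that share the incoming record's match key."""
    rows = conn.execute(
        "SELECT customer_id FROM match_index WHERE match_key = ?",
        (match_key(name, postal_code),),
    ).fetchall()
    return [row[0] for row in rows]

def insert_customer(name, postal_code):
    # "with conn" commits both inserts together or rolls both back,
    # so the index never lags behind the master data.
    with conn:
        cur = conn.execute(
            "INSERT INTO customer (name, postal_code) VALUES (?, ?)",
            (name, postal_code),
        )
        conn.execute(
            "INSERT INTO match_index (match_key, customer_id) VALUES (?, ?)",
            (match_key(name, postal_code), cur.lastrowid),
        )
        return cur.lastrowid

first_id = insert_customer("Roger Wolter", "98052")
# A second incoming record for the same customer finds the first one immediately.
duplicates = find_matches("roger  wolter", "98052 ")
```

Because the index row is written in the same transaction as the customer row, the second lookup cannot miss the first record, which is the property the database engine gives us for free.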

2)    Memory based engines that build an index either completely or partially in memory.  These engines use the database as a source of data for the index but the actual matching is done completely outside the database.  This makes scaling out the matching pretty straightforward because we can make multiple copies of the index without replicating the database itself.  The hardest problem with this engine is keeping the indexes up to date.  The choices for updating the index are either rebuild the index periodically or update the index as each record is added.  The rebuild option means that the index is pretty much always somewhat out of date.  We can alleviate this problem somewhat by resolving any duplicates in input batches before we input the batch but this doesn’t help with real-time inserts into the master data.  One option would be to use both update techniques – rebuild the index before a big batch run and dynamically maintain the index for real-time input.

Because this type of matching engine gets much of its performance from the index being in memory, it’s pretty much a sure bet that we will have to come up with a good way to partition the data so the indexes don’t get too large to fit in memory.  In addition, I anticipate the volume will be large enough that we will have to replicate the indexes to handle the load.  This means that we will try a combination of partitioning and replication.
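The two index update options described above (periodic rebuild and per-record update), plus resolving duplicates inside a batch before it is loaded, can be sketched as follows. This is only an illustration under simplifying assumptions: the `InMemoryIndex` class, the `match_key` function and the sample company names are all hypothetical, and the index does exact-key lookups rather than real fuzzy matching.

```python
from collections import defaultdict

def match_key(record: dict) -> str:
    """Hypothetical match key; a real engine would do fuzzy matching."""
    return "".join(record["name"].lower().split())

class InMemoryIndex:
    def __init__(self):
        self._by_key = defaultdict(list)

    def rebuild(self, master_records):
        """Periodic option: discard the index and reload it from the database."""
        self._by_key = defaultdict(list)
        for record in master_records:
            self.add(record)

    def add(self, record):
        """Dynamic option: index a single record as it is inserted."""
        self._by_key[match_key(record)].append(record["id"])

    def find(self, record):
        """Return the ids of indexed records with the same match key."""
        return list(self._by_key.get(match_key(record), []))

def dedupe_batch(batch):
    """Resolve duplicates inside an input batch, keeping the first of each key."""
    seen = {}
    for record in batch:
        seen.setdefault(match_key(record), record)
    return list(seen.values())

index = InMemoryIndex()
index.rebuild([{"id": 1, "name": "Contoso Ltd"}])    # before a big batch run
batch = dedupe_batch([{"id": 2, "name": "Fabrikam"}, {"id": 3, "name": "fabrikam"}])
index.add({"id": 4, "name": "Contoso  LTD"})         # real-time insert
```

Combining both techniques, as suggested above, would mean calling `rebuild` before a batch run and `add` for each real-time insert.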

 

Partitioning

 

If we have to partition the indexes to get the scalability we need, the partitioning scheme will be key to performance and flexibility.  If possible, it would be great to partition the indexes on an attribute that we always know for every incoming record.  That way, we can know which index to use for each incoming record.  An ideal attribute would be the language of the incoming record.  One of the things most matching engines need to know is the language they are matching against because the way that the text is divided into words, what an address looks like, what synonyms and nicknames are, etc. vary from language to language.  Some matching engines require that the indexes are built on a single language so partitioning by language may be required anyway.  The issue with this is that there’s not a reliable way to determine what language a name and address is in.  There are algorithms for guessing the language but they’re not always right and even when they guess correctly, the information might be misleading.  For example, is Toyota English or Japanese?  So language is a functional partitioning attribute but not necessarily a useful one.

 

Country might also be a good attribute for partitioning because it’s often present in the address and if it isn’t, address standardization will generally figure out the right country.  The disadvantage of country is that while it might be OK for individuals, it won’t work well for multi-national companies that exist in many countries or individuals who move to a different country.

 

One way to create partitioned indexes or databases without worrying about missing a match because we’re looking in the wrong partition is to partition the index on something arbitrary like a hash of the primary key and then do the matching by matching on all partitions in parallel and combining the results to come up with the best matches in all the partitions.  This obviously uses a lot more resources than doing the match in a single partition but it eliminates the problem of missed matches and the total time to find the match is roughly the same as matching in a single partition.  This type of partitioning probably makes sense for the memory-based matching system where adding new copies of the index is relatively cheap but it’s probably too expensive both from actual cost and added replication overhead to work in a database matching system.  The parallel matching partition scheme also makes a lot of sense if the underlying hub database is partitioned.  Microsoft IT uses partitioned databases for many very large customer data systems so we’re looking into the possibility of partitioning our MDM data.  If we do partition the MDM hub, then partitioning the matching indexes using the same partition scheme makes a lot of sense.
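The hash-partitioned, match-everywhere scheme described above can be sketched roughly as follows. Everything here is an assumption made for illustration: threads stand in for separate matching servers, `difflib.SequenceMatcher` stands in for a real matching engine, and the partition count, record layout and names are made up.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

NUM_PARTITIONS = 4  # arbitrary for this sketch

def partition_of(primary_key: str) -> int:
    """Stable hash of the primary key; unrelated to any matching attribute."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def build_partitions(master_records):
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for record in master_records:
        partitions[partition_of(record["id"])].append(record)
    return partitions

def score(a: str, b: str) -> float:
    """Stand-in similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_in_partition(partition, incoming_name, top_n):
    """Best candidates within one partition, as (score, id) pairs."""
    scored = [(score(incoming_name, r["name"]), r["id"]) for r in partition]
    return sorted(scored, reverse=True)[:top_n]

def match_all_partitions(partitions, incoming_name, top_n=3):
    """Query every partition in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = pool.map(
            lambda p: match_in_partition(p, incoming_name, top_n), partitions
        )
        candidates = [hit for hits in per_partition for hit in hits]
    return sorted(candidates, reverse=True)[:top_n]

masters = [
    {"id": "C-1", "name": "Roger Wolter"},
    {"id": "C-2", "name": "Tyler Graham"},
    {"id": "C-3", "name": "Rodger Walter"},
    {"id": "C-4", "name": "Contoso Ltd"},
]
partitions = build_partitions(masters)
best = match_all_partitions(partitions, "Roger Wolter")
```

Each partition only has to return its own top candidates, so the combining step stays small no matter how large the partitions are. The cost is that every incoming record consumes resources on every partition.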

 

Matching Algorithms

 

Matching algorithms are generally orthogonal to matching scaleout but the type of algorithm our matching system uses might influence our choice of partitioning scheme.  In my greatly simplified view of matching there are two kinds of algorithms – rules-based algorithms that use a set of standard and user-defined rules to determine matches and mathematical algorithms that use a variety of mathematical formulae to give a similarity score to potential matches.  In my opinion, a perfectly tuned rules-based system will give more match accuracy than a mathematical match engine but I’ve never really run into a perfectly tuned rules-based match engine.  Rules change over time and are different for different locales and different languages so maintaining matching rules can be a hugely complex and expensive task.  Since rules generally must be tuned differently for different languages and different countries, partitioning by language or country may be more compelling if you use a rules-based matching engine.
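To make the two kinds of algorithm concrete, here is a toy sketch of each. The nickname table is a hypothetical stand-in for a real rule set, `difflib.SequenceMatcher` is just one convenient similarity measure from the Python standard library, and the 0.85 threshold is an arbitrary choice for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical rule data; real rule sets are large and locale specific.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def canonical(name: str) -> str:
    """Apply the nickname rules word by word."""
    parts = name.lower().split()
    return " ".join(NICKNAMES.get(part, part) for part in parts)

def rules_match(a: str, b: str) -> bool:
    """Rules-based: names match if they are equal after applying the rules."""
    return canonical(a) == canonical(b)

def similarity(a: str, b: str) -> float:
    """Mathematical: a similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mathematical_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as a match when the score reaches the threshold."""
    return similarity(a, b) >= threshold
```

The rules-based function matches "Bob Smith" to "Robert Smith" only because someone wrote that nickname rule, which is where the maintenance cost comes from. The similarity score needs no such rule to match "Jon Smith" to "John Smith", but it knows nothing about nicknames.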

 

Conclusions

 

This is the place where I recommend which scaleout method you should use.  Unfortunately, at this point I really don’t know which one we will use.  Scaling out the database engine would probably be easiest for us because that’s the kind of engine we’re using now but we don’t have enough experience with very large data volumes to know whether our current system will scale to the required volumes.  The memory-based indexes theoretically look like they will scale very well but we still have to solve the update problem before we know if they really scale well in the face of a massive number of updates to the underlying data.  We plan to do more prototyping over the next few months so I’ll let you know what we find.  If you have experience or advice, I would like to hear from you.

Changes

 

This isn’t one of my more typical technical posts so if you’re looking for profound technical insights you can skip the rest of this.  This summer I went through a career change that will probably impact what I blog about so I thought I should share it with my readers.  As many of you know, I was one of the original members of the Microsoft MDM product team.  Early this summer I was (in the words of my favorite euphemism) made redundant on the MDM product team.  After a number of panicky weeks I found a new job as an architect on Microsoft IT’s internal MDM project.  This means I went from being an MDM vendor to an MDM practitioner.

 

I’m really excited about having the chance to work on what is becoming one of the biggest MDM systems I’ve ever heard of.  We’re in the process of moving billions of records into our MDM hub which will grow to terabytes of data.  This means I’m going from writing about how to build a partitioned, scaleout solution in SQL Server to implementing one.  It also means I’m working on interesting problems like how do you scale matching algorithms to huge data volumes while maintaining near real-time response time and what kind of algorithm do you use to achieve high match quality with Asian languages and address formats.  I also hope to find out how Service Broker can be used to provide an asynchronous, scalable execution environment for MDM services.

 

What this all means to my blog is that I won’t be writing much about Microsoft’s MDM products anymore but I hope I’ll have some interesting things to say about implementing a real-time operational MDM hub at huge scale.  It also means I probably won’t be doing as many presentations at conferences.  I’ll leave it to you to decide whether that’s a good thing or a bad thing.  Bottom line, I’m embarking on a new adventure so I wanted to share it with you.

New Master Data Management White Paper Series

 

My friend Tyler Graham is writing a series of white papers on the practical aspects of implementing an MDM system.  The first three are available here: http://technet.microsoft.com/en-us/library/cc505992(TechNet.10).aspx  Tyler came to Microsoft as part of the Stratature acquisition so he has a lot of experience in implementing MDM systems at Stratature customers.  That means these papers are full of practical advice instead of the high-level theory often found in MDM articles.  Thanks for doing this Tyler and we’re looking forward to the rest of the series.

Choosing MDM Hub styles

 

A couple weeks ago, someone asked me how to choose which MDM hub style would work best for an application.  I thought I had covered this in one of my white papers but I couldn’t find a good reference to give him so I thought I would write up something here.  To review what I’ve covered elsewhere, there are basically three types of MDM hubs:

 

·         Registry – the hub doesn’t contain the actual master data.  It contains links to where the master data exists in the source systems.  In most cases, the link takes the form of the primary key and system name of the source system.

·         Repository - the master data is actually moved from the source systems to the MDM hub and the source systems are rewritten to get their master data from the MDM hub instead of from their local database.  Mapping to the source systems isn’t required because the master data isn’t stored in the source system.  This style is often called Transactional.

·         Hybrid - as the name implies, hybrid is a combination of the other two styles.  The hub contains references to the master data entities in the source systems but also contains the shared portion of the master data.  This means it can supply links to source records when required and also serve as the master data source for new applications.
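One way to picture the difference between the three styles is by what a single hub entry contains. The sketch below is only an illustration; the class and field names are hypothetical and do not come from any MDM product.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SourceLink:
    """Pointer to a master record in a source system."""
    system_name: str
    primary_key: str

@dataclass
class RegistryEntry:
    """Registry style: links only, no master data attributes."""
    hub_id: int
    links: list[SourceLink] = field(default_factory=list)

@dataclass
class HybridEntry(RegistryEntry):
    """Hybrid style: the same links plus the shared part of the master data."""
    shared_attributes: dict[str, str] = field(default_factory=dict)

@dataclass
class RepositoryEntry:
    """Repository style: the hub holds the master data; no source links needed."""
    hub_id: int
    attributes: dict[str, str] = field(default_factory=dict)

entry = HybridEntry(
    hub_id=1,
    links=[SourceLink("ERP", "10045"), SourceLink("CRM", "A-77")],
    shared_attributes={"name": "Roger Wolter"},
)
```

Modeling `HybridEntry` as an extension of `RegistryEntry` is a deliberate choice in this sketch: it mirrors the idea that a hybrid hub is a registry hub with shared attributes added.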

 

So which style should you use? 

 

Repository

 

The repository style seems like the best option.  There are no synchronization or latency issues with updates getting propagated to multiple copies of the master data.  There are no update conflicts caused by updates to more than one copy of the master data.  In general, a single copy of the master data is significantly easier to manage and will generally be of a higher quality than multiple copies with all the potential synchronization and mapping issues.  On the other hand, if we look at what is required to get a repository style hub up and running, you may see why this style isn’t very common:

 

1.    Decide on a common data model for all applications – this will be a difficult task both politically and technically. 

2.    Transform and load all the current databases into the hub, removing duplicates in the process.

3.    Change all your applications to use the new master data tables and database.  This can be a huge effort.  If your current applications use a variety of databases you will need to deal with multi-database distributed transactions.  If you use purchased applications, you may not have the source to change the application to use the new data source and even if you do, you are likely to run into support issues.

4.    Figure out how to handle history – you are changing your databases to use a new key for all your master data so you have to deal with many years of history that was created using different keys for the master data.  In many cases you will need to create the same kind of key mapping that the other two MDM styles require to be able to access history records.

 

In many cases, this process is too difficult or too expensive to provide a significant return on investment and even if it is justified, it can take many years to make the transition so the Repository style of MDM hub may not be suitable for many projects.

 

Registry

 

The Registry style hub is attractive because it’s generally fairly quick to implement and avoids some of the political issues around a common data model.  Because only pointers to records are stored, there is no need to agree on a common data model.  There is also less need for a data quality program because the data is left in the source systems.  To be clear, it’s probably not possible to create a pure registry style MDM hub.  One of the main things this hub is used for is mapping duplicate records in the source systems to a single record in the hub.  In order to do this, each record must be matched on a set of attributes to determine if it is a duplicate of a record already in the hub.  For example, customers would probably be matched on name and address and products might be matched on descriptions and dimensions.  If you want to avoid searching every database in every source system when a new record is added to the hub, you will need to keep the matching attributes for each master record in the hub so you can tell whether an incoming record is a duplicate of one of the hub records.  This matching won’t work reliably unless the attributes stored in the hub are accurate and high quality so you will probably have to do a significant amount of data quality work to ensure the address is right and in a common format and maybe even enrich the attributes with data from an external source like Dun & Bradstreet.  Once you have all this established, you’re a significant way down the road toward creating a hybrid hub so you can consider a registry hub to be a hybrid hub that’s not done yet.

 

The biggest disadvantage of the registry style MDM hub is that while it helps you find all the duplicate and inconsistent copies of your master data, it doesn’t give you much help in cleaning them up.  If Roger Wolter has 3 records in the ERP database, 6 records in the CRM database and 2 records in the customer support database, and among the 11 copies there are 4 phone numbers, a registry hub will tell you where all the records are but won’t help you get them to agree on a phone number.
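The example above can be written out as data to show what a registry hub does and does not know. The keys and phone numbers below are made up, and the way the four phone numbers are spread across the eleven records is an arbitrary assumption.

```python
from collections import Counter

# (source system, primary key, phone) for one customer; all values are made up.
source_records = (
    [("ERP", f"E{i}", "555-0100") for i in range(3)]
    + [("CRM", f"C{i}", "555-0101") for i in range(4)]
    + [("CRM", f"C{i}", "555-0102") for i in range(4, 6)]
    + [("Support", f"S{i}", "555-0103") for i in range(2)]
)

# What a registry hub knows: where the copies are.
registry_links = [(system, key) for system, key, _ in source_records]
copies_per_system = Counter(system for system, _ in registry_links)

# What it cannot decide: which of the conflicting values is right.
distinct_phones = {phone for _, _, phone in source_records}
```

The hub can answer "where are the 11 copies?" from `registry_links` alone, but nothing in it says which of the four values in `distinct_phones` should win.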

 

Hybrid

 

The Hybrid style of MDM hub has some of the attributes of both the Registry and Repository styles.  Like the Registry style, the Hybrid style maintains links to the copies of a master data record in the source system so you won’t have to replace the master data access parts of all your applications.  Like the Repository style, the Hybrid style maintains the shared part of the master data in the MDM hub so that you can improve its quality and enrich its content in a single place.  Thus the advantage of the Hybrid approach is that it provides a single, authoritative source for shared master data without the necessity of changing all your applications to use it.

 

The most significant disadvantage of the Hybrid style is that keeping the MDM hub copy of the data synchronized with all the source systems can be a complex process.  If you allow all the source systems to change master data, you will have a continuous data integration problem caused by incompatible changes coming from different systems.  You can reduce this problem by requiring changes to the master data to be made only to the copy in the MDM hub but this may be difficult to implement and enforce.  Also, keep in mind that MDM synchronization is more complex than data replication because the data may have to be transformed both when it is loaded into the hub and when it is sent from the MDM hub back to the source systems because the data models of the source systems may all be different.
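The two-way transformation mentioned above can be sketched as a pair of mapping functions per source system. The field names and mappings are hypothetical, and real transformations usually involve much more than renaming fields (format changes, code lookups, splitting and merging attributes).

```python
# Hypothetical field mappings: source field name -> hub field name.
FIELD_MAPS = {
    "ERP": {"CUST_NM": "name", "TEL": "phone"},
    "CRM": {"fullName": "name", "phoneNumber": "phone"},
}

def to_hub(system: str, source_record: dict) -> dict:
    """Transform a source system record into the hub's data model."""
    mapping = FIELD_MAPS[system]
    return {hub: source_record[src] for src, hub in mapping.items() if src in source_record}

def from_hub(system: str, hub_record: dict) -> dict:
    """Transform a hub record back into one source system's data model."""
    mapping = FIELD_MAPS[system]
    return {src: hub_record[hub] for src, hub in mapping.items() if hub in hub_record}

hub_record = to_hub("ERP", {"CUST_NM": "Roger Wolter", "TEL": "555-0100"})
crm_record = from_hub("CRM", hub_record)
```

A change that arrives from the ERP system passes through `to_hub` once and then through `from_hub` once for every other source system, which is why synchronization is more work than plain replication.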

 

Conclusion

 

So what’s the best choice for you?  As in everything – it depends.  Moving from Registry to Hybrid to Repository style increases cost and complexity but also increases usefulness and data quality so you have to pick the solution that provides the data quality you need in a timeframe and a budget that you can afford.  My recommendation is usually the Hybrid approach.  The Registry approach is relatively simple and quick to implement but few users will be satisfied with the data quality it provides over the long run.  The Repository style is generally too hard to do and too expensive for most companies even though it provides the best data quality.  Hybrid implementations can evolve over time.  You might start with a minimum number of attributes for each entity stored in the MDM hub so it is pretty close to being a Registry Style hub and then over time, as your needs change and your MDM data management and stewardship capabilities improve you gradually add attributes until the MDM hub is a complete source of master data.  At this point, new applications can start using the MDM hub directly for their master data so the hub evolves gradually toward the Repository style.  While not too many people will be able to move completely to the Repository style, eventually it may become the predominant approach for applications as the old apps are replaced.

One of my pet peeves is that after we ship something we normally have a Post Mortem meeting to discuss what we should learn from the experience.  I'm not against the meeting.  I think they're great and we learn a lot.  Sometimes we even have pizza!  What bugs me is the name Post Mortem.  This suggests something just died and we're getting together to figure out why.  Come on!  We just shipped a great product that we spent years of our life developing.  Nothing died - something was born!  After our next release, I'm scheduling a Postpartum review!

Master Data is not Metadata

 

I regularly get emails from people asking about the Microsoft Metadata story.  While I assume that’s mainly confusion over the MDM acronym which could reasonably be interpreted as Metadata Management as well as Master Data Management, there’s also enough overlap between Master Data and Metadata to lead to confusion.  It turns out I’m uniquely qualified to comment on Master Data vs. Metadata. I spent a few years in the early 90’s building one of the early metadata repositories.  I worked with four other repositories after that and now I’m working on a Master Data product.

 

Metadata as I’m sure you’re aware is data about data.  It describes data but it isn’t generally considered business data.  For example, customer metadata would describe the attributes of a customer entity, the datatype and size of the attributes, which programs produce the data, which programs consume the data, what business rules are enforced on the attributes, etc.  In a BI environment, derivation, transformations, source system, and last load time are also important.  The key thing to understand is that you can know the complete metadata for customers without knowing who a single customer is.  Master Data, on the other hand, is the real business data.  Customer master data is an authoritative list of customers.
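The distinction is easy to see when the two are written side by side. In this small sketch the attribute names, datatypes, sizes and phone numbers are invented for illustration.

```python
# Metadata: describes the customer entity without naming a single customer.
customer_metadata = {
    "entity": "Customer",
    "attributes": {
        "name": {"datatype": "nvarchar", "size": 100},
        "phone": {"datatype": "varchar", "size": 20},
    },
    "source_system": "CRM",
}

# Master data: the customers themselves.
customer_master_data = [
    {"name": "Roger Wolter", "phone": "555-0100"},
    {"name": "Tyler Graham", "phone": "555-0101"},
]

def conforms(record: dict, metadata: dict) -> bool:
    """Check a master data record against the metadata that describes it."""
    attrs = metadata["attributes"]
    return set(record) == set(attrs) and all(
        len(str(record[a])) <= attrs[a]["size"] for a in attrs
    )
```

The `conforms` check hints at why managing master data still depends on metadata: the hub needs the description in order to validate the records.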

 

While Master Data and Metadata are two distinct things, managing master data generally requires working with metadata.  The Microsoft Master Data hub is metadata driven.  The data model used to store the Master Data instances is defined with metadata stored in the hub.  Data stewardship and data governance depend on metadata to understand where the data comes from, what each attribute means, what transformations are done when the data is loaded, what business rules are satisfied or violated, and who modified the data.  The Microsoft MDM hub stores most of this metadata and the types of metadata stored will be expanded before we release the product.  The metadata required to manage master data is probably one source of the confusion between the two types of data.  While the MDM hub stores significant amounts of metadata, all the data is related directly to the master data so you can’t really describe a master data hub as a general purpose metadata repository.

 

Now that we’ve talked about the difference between master data and metadata, we’ll dig a little deeper into metadata because I think there are some parallels with Master Data Management that are useful to understand.  Metadata is usually managed in a repository.  The original metadata repositories modeled the metadata as entities, attributes, and relationships.  Some of the more recent repositories use objects and properties but the models are pretty similar.  Metadata repositories evolved from data dictionaries that were used to manage the schema for databases.  A repository can model the whole IT environment to provide a unified picture that goes well beyond database schema.  An accurate enterprise model can provide valuable insight into impact analysis and drive data analysis and data integration projects.  With the current compliance and auditing environment that enterprises operate in, tracking where data is produced and what processing and transformations are done to it is almost as important as the data itself.  A metadata repository populated with accurate, current metadata is a great resource for compliance, auditing and integration projects.

 

The problem with metadata repositories is that they require constant maintenance to ensure the metadata is current and correctly represents the data it describes.  In many cases metadata repositories were laboriously populated with data descriptions and documentation but without adequate tools and procedures to keep the metadata current, the metadata gradually became inaccurate and the users stopped using it because they couldn’t rely on it.  The basic issue is that there really aren’t good tools to capture all the metadata that a typical enterprise needs to track.  There are many different sources – database schema, ETL logs, source code, copy books, design documents, policies, documentation, etc.  Some of this data can be extracted using automated tools and some of it must be entered by people who understand the systems involved.  The resulting system can be very complex and require a fair amount of manual data entry.  The more effort required to keep the metadata current, the less likely it is to stay current.  If this sounds like the same kind of issues that a master data management system runs into, you’re right.  A master data hub that isn’t surrounded by the tools, processes and policies required to maintain the quality of the data will gradually lose value and become unusable.

 

I may be stretching the limits of causality but I believe that some of the problems that master data management is trying to solve were actually caused by flawed metadata solutions.  About twenty years ago the state of the art in metadata management was the data dictionary that was carefully maintained by a data administration organization.  The DA organization maintained control over data quality and consistency by requiring all changes to the database schema to be fully documented and approved before they were implemented.  While this ensured high-quality database schema and accurate metadata, the heavyweight process was very frustrating to developers trying to keep up with their users’ demands.  Resourceful developers solved this problem by installing a departmental server with SQL Server or some other easy-to-use database, copying the data they needed out of the corporate system and making the schema changes they needed without DA approval.  In a few years there were many variations on the corporate schema with data that was not necessarily synchronized with the corporate databases.  Eventually these quick-hit applications became the core enterprise systems, replacing the corporate databases (and in many cases the DA organization) with loosely coupled data chaos.  Several years later, MDM was invented to get a handle on the many disparate data sources.

Master Data Management Philosophy

 

Last week I did a presentation that included a slide on my philosophy for MDM so I decided to expand that slide into a blog post.  While this is my philosophy, I think it comes pretty close to the way the rest of the MDM product team looks at MDM.  As always, I welcome any comments and feedback.

 

      Multi-domain hub – while there are definite advantages to specialized MDM applications that handle data quality, match-merge, and standardization for a particular type of data, once you have cleaned up your incoming data the processes to maintain the data are common across all domains.  There are definite advantages to a single point of management and single set of tools and processes for managing master data.  Some vendors approach cross domain master data management by implementing relationships that span domains stored in different repositories but this often means that different data must be managed in different ways by data stewards. I think there’s a real advantage to maintaining all master data in the same hub because the same processes and techniques can be applied to all types of master data.  This means a hub with enough flexibility in the data model to allow any master data domain to be modeled and managed.

      Open interfaces – there are so many kinds of master data that no single vendor can provide a toolset that spans all domains.  That’s why it’s important for an MDM hub to have open interfaces for domain specific tools to plug into.  If your hub vendor provides the data management and data stewardship facilities you’re looking for but doesn’t provide data import and data quality tools that specialize in the kind of data you want to store, you can find best-of-breed tools (or write your own) that interface to the MDM hub through the open interfaces provided.  In these days of SOA, web services interfaces are probably the most useful.

      Incremental implementation - while a single source for all your organization’s master data is the goal of Master Data Management, not many organizations have the resources and patience to consolidate all their data in a single project.  For most companies, starting with a single domain and a subset of data sources to learn how MDM works and demonstrate early success is a much better approach than a “big science” MDM project.

      Partner for domain specific solutions – as I said earlier when discussing open interfaces, it’s not realistic for a general-purpose MDM product to handle all the possible variations of Master Data domains.  For this reason, we plan to cultivate a rich partner ecosystem to provide the expertise in the specialized domains we don’t have the resources to develop ourselves.

      Use existing integration capabilities – before we started the MDM product team, I spent about a year researching how to build MDM systems with existing Microsoft technologies.  What I found was that Microsoft has a wealth of data integration technologies – SSIS, BizTalk, FRx, etc. so when we looked for an MDM product to buy we looked first for a product that excelled at managing data, data stewardship, business rules, workflow, etc. with the assumption that our current data integration capabilities would handle the rest.  By using data integration, transformation, standardization, orchestration and profiling capabilities that other teams continue to develop we can have world-class capabilities in this area while we concentrate our resources on core Master Data Management capabilities.

      Tight integration with Microsoft Products – when we started talking about MDM around Microsoft, we found that quite a few Microsoft products had requirements that an MDM product could meet.  Developing tight integrations with Microsoft products will not only be a great benefit to our customers but will give us a chance to “dogfood” the integration interfaces we plan to ship with the product.

      Hierarchy Management a critical capability – the structure of master data is often as important as the data itself.  While this is obvious if you are using MDM to manage your chart of accounts, organization structures and product hierarchies are also critically important data.  Stratature has some unique capabilities in hierarchy management that are proving very useful to MDM users.

      Data Stewardship is a key success factor – MDM systems make data quality and accuracy more important than ever because a mistake in master data can cause issues in all the systems that consume the data.  While automated match-merge, standardization and data quality tools are becoming more capable all the time, at some point real human beings who are passionate about the data are required to make decisions that tools can’t make and monitor the processes to ensure that business rules and data standards are being enforced correctly.  This is just one more example of the people aspects of MDM being as important as the technology.  While processes, policies, governance, and standards are the real success factors, a good MDM hub can provide tools and capabilities that make a data steward’s job easier.  Some of the more useful capabilities are business rule enforcement, workflow, versioning, searching, auditing, and eventing.  The combination of the Microsoft MDM system and SharePoint supplies all of these capabilities and more.

      Analytical and Operational MDM just two uses for the same data – I did a whole blog post on this a couple months ago so I won’t spend a lot of time on it here but I wanted to point out that all the data management, stewardship, and quality capabilities I have been talking about so far apply equally whether you are using your master data to build cubes in a warehouse or to provide clean data to your operational systems.  I don’t think an MDM system that doesn’t support both operational and analytical uses for master data is a good investment.  Many companies start by using their master data for analytical purposes because it is generally easier to implement and shows positive results faster but investing the time, money and effort to create a clean source of master data without eventually using it to improve your operational systems can be a significant missed opportunity.  Doing an analytical project to learn the technology and develop the policies and processes necessary to manage master data followed by another project to use the high quality master data obtained to improve operations is often a winning approach to MDM.

Check out the new Microsoft MDM web site:  http://www.microsoft.com/sharepoint/mdm/default.mspx 

Gartner MDM Conference

 

I just got back from the Gartner MDM Conference.   I learned a lot and had the chance to talk to a lot of people about MDM and what they are doing in their organizations.  Maybe it was because of the sample I happened to talk to but it seems like a lot of people are interested in MDM but not a lot of them have projects in place.  I assume that’s an indication of the state of the MDM industry – while some people have been doing MDM for several years, the mainstream is just now feeling the need to learn about MDM.

 

One thing I learned that explains some confused conversations I’ve had with several people is that Gartner is using some of the same terms I have been using to describe MDM but using them to mean very different things.  I don’t think I disagree with what they’re saying but I think some translation between their terms and the terms I have used might help.  The main point of confusion is in what Gartner calls the styles of MDM.  The way I describe the MDM Hub contents uses different terms but I think in general we’re talking about the same things:

 

Each of Gartner’s terms is listed below, followed by my term for the same thing.

Gartner’s term: Registry Style – store references to master data.  Data continues to reside at source system.  Data quality controlled at source.  Bidirectional data flow (into the hub and from the hub out into the source systems).
My term: Registry Style – pretty much the same concept and obviously the same term.

Gartner’s term: Coexistence Style – Data as well as references to source systems stored in the master data repository.  Data quality controlled at both the source and the MDM hub.  Bidirectional data flow (into the hub and from the hub out into the source systems).
My term: Hybrid Style – again, pretty much the same definition but with a different label.  We agree that there’s often a natural progression from the registry style to the hybrid or coexistence style.

Gartner’s term: Transaction Style – Master Data resides solely at the MDM repository.  All applications use MDM as their source of master data directly.
My term: Repository Style – again, pretty much the same thing with a different name.  I think there is some disagreement on how practical this style is.

Gartner’s term: Consolidation Style – MDM repository just a destination for master data.  One directional flow into the MDM repository – data never goes back into the source systems.  This style used only for analysis.
My term: none.  I’ve never really talked about this style of MDM repository.  Quite a few MDM projects start out using MDM primarily for analytical data so I suppose this might make sense, but I look at this as a minor variation of the hybrid style so I haven’t talked about it as a separate style.

 

As you can see, I’m pretty much totally in agreement with the Gartner style classification but I have been using different terms.  I’m not sure what to do about this.  If I change to use the Gartner terms, I may confuse people who have been reading my articles for a while.  On the other hand, I think it’s pretty obvious why some people have been in violent disagreement with my stance that analytical MDM is really the same as transactional MDM.  I have been talking about two different uses of MDM data, while in the Gartner meaning of the terms they are two different kinds of MDM repository.  I think I am going to change my terms in this case to analytical and operational uses of master data because transactional has two different meanings.

 

Analytical and Transactional MDM

 

I was talking to someone about Analytical and Transactional MDM recently and we realized that while there are quite a few conceptual differences between the two, there’s a significant amount of overlap in the implementation details.  For my purposes, I’ll define Analytical MDM as the processes and tools to manage the dimensions in a data warehouse or OLAP cube and Transactional MDM as the processes and tools to manage the master data used in transactional systems.

 

With few exceptions, the data for the two styles of MDM looks the same.  Transactional MDM might have a few more attributes associated with a given entity because there are things the operational system cares about that aren’t required for analysis.  An Analytical MDM hub will probably store more hierarchies than a transactional hub because there are generally hierarchies that are interesting in analysis and reporting that the operational system may not care about.  These differences aren’t incompatible and it probably makes sense to use the same MDM hub to store both analytical and transactional master data because it will be much easier to manage in one place than if you had separate hubs for the two uses of the same entities.  This seems like a good argument for looking for an MDM hub solution that isn’t limited to only a single style of master data.

 

Another difference between the two styles of MDM might be in the way data is loaded and published.  Loading an analytical hub is usually done in batches, maybe once a day, while most transactional style systems are loaded an entity at a time as the entities are created or modified in the operational systems.  Other than this, the transformations, duplicate checks, business rules checks, etc. involved in loading master data into a hub are the same in either style.  This means that other than the batch size (one in the case of transactional and “N” in the case of analytical) there’s really not much difference in the load processes.
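
To make the batch-size point concrete, here is a minimal sketch in Python.  It is not taken from any real MDM hub: the Hub class, the load_batch function, and the name-based duplicate check are all assumptions made up for illustration.  The only thing it is meant to show is that one load routine can serve both styles, with the transactional case being a batch of one.

```python
from dataclasses import dataclass, field


@dataclass
class Hub:
    """Toy in-memory hub keyed by a normalized customer name (hypothetical)."""
    records: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)


def load_batch(hub, batch):
    """Run the same transform, rule check, and duplicate check on every record.

    Returns the number of records inserted as new entities.
    """
    inserted = 0
    for raw in batch:
        # Transformation: trim and normalize case (stand-in for real standardization).
        name = raw.get("name", "").strip().title()
        # Business rule check: a customer must have a non-empty name.
        if not name:
            hub.rejected.append(raw)
            continue
        key = name.lower()
        # Duplicate check: merge into the existing entity instead of inserting.
        if key in hub.records:
            hub.records[key].update({k: v for k, v in raw.items() if k != "name"})
            continue
        hub.records[key] = {**raw, "name": name}
        inserted += 1
    return inserted


hub = Hub()
# Analytical style: one nightly batch of N records.
nightly = [{"name": " acme corp "}, {"name": "ACME CORP", "city": "Redmond"}, {"name": ""}]
nightly_inserted = load_batch(hub, nightly)
# Transactional style: the same function called with a batch of one.
single_inserted = load_batch(hub, [{"name": "Contoso"}])
```

A real hub would use fuzzy matching rather than an exact key, but the shape of the loop would not need to change between the two styles.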

 

Publishing data is different in the two styles of MDM but not in an incompatible way.  Transactional MDM data is generally published in a “push” method where changes to the master data are pushed out to the operational systems, but there are many applications that either don’t expose the required interfaces or are run by groups that won’t allow data to be pushed into their system, so a “pull” style publication is required.  Analytical MDM data is generally pulled from the hub when the OLAP cube is built or when the data warehouse is updated.  This means that an MDM hub will usually have to support both push and pull style synchronization with operational systems and warehouses so again there’s not a significant difference between the requirements of transactional and analytical MDM.
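
One way to picture a hub that supports both styles is sketched below in Python.  Everything here is hypothetical (the PublishingHub class, the sequence-numbered change log, and the watermark a puller keeps); it only illustrates that push and pull can be served from the same record of changes.

```python
class PublishingHub:
    """Toy hub that keeps an ordered change log and supports push and pull."""

    def __init__(self):
        self.change_log = []    # list of (sequence_number, entity_id, data)
        self.subscribers = []   # callables invoked on every change (push)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update(self, entity_id, data):
        seq = len(self.change_log) + 1
        self.change_log.append((seq, entity_id, data))
        # Push: notify each operational system that accepts pushed changes.
        for callback in self.subscribers:
            callback(seq, entity_id, data)
        return seq

    def changes_since(self, last_seen):
        # Pull: a consumer asks for everything after the last change it processed.
        return [change for change in self.change_log if change[0] > last_seen]


hub = PublishingHub()
crm_inbox = []
hub.subscribe(lambda seq, entity_id, data: crm_inbox.append((entity_id, data)))

hub.update("cust-1", {"name": "Acme Corp"})
hub.update("cust-2", {"name": "Contoso"})

# A warehouse load (or a system that refuses pushes) pulls on its own schedule
# and remembers how far it got.
warehouse_watermark = 0
pulled = hub.changes_since(warehouse_watermark)
if pulled:
    warehouse_watermark = pulled[-1][0]
```

In a real deployment the push side would go through messaging rather than direct callbacks, but both paths would still read from one change history.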

 

 

I’ve heard from many people that a transactional MDM system needs higher performance and scalability than an analytical system but I’m not sure that’s necessarily true.  The data quantities are going to be identical in either case because if you add 10,000 customers a day to your operational systems, you will need to load 10,000 into the MDM hub whether the data is used in the operational system or the warehouse.  In fact, the loading is probably spread out over the business day for transactional MDM while analytical MDM loading probably has to fit into a batch window at the end of the day, so the analytical hub may actually need more loading performance.  Publishing to several operational systems will take more processing than publishing to a data warehouse but the transformations, encoding, and messaging required can be easily offloaded to a separate server so this doesn’t affect MDM hub performance much.  In some architectures latency might be a bigger issue for transactional MDM than analytical MDM so shorter processing lengths and asynchronous business rule enforcement might be necessary.

 

So what does this all mean?  My take on it would be to look for an MDM solution that can support both transactional and analytical styles.  I think in most cases the logical progression would be to start with analytical MDM to master the data models, rules, technology and stewardship required to manage your master data in a less mission-critical environment.  Once you have achieved some successes in analytical MDM, you can use the same data, models and processes to manage the master data for your transactional systems by just adding the publishing logic to push the master data into the operational systems.

 

Stratature Misinformation

 

Do you remember the “telephone” game we used to play in school where you line a bunch of people up in a row and whisper something to the first person in line, who whispers it to the next one, etc. and the last person repeats what they heard?  This is usually hilarious – “return of the Jedi” comes out as “Jeni has pink-eye”.  As someone on the inside of the Stratature MDM acquisition, I often marvel at how our plans and commitments have gotten distorted as they made their way to print.  Some of this might be malicious but most of it is probably just the kind of miscommunication that happens as information is passed from person to person.

 

For example, one of the things that most impressed us about the Stratature product is that they do a better job than just about anybody we have seen at managing hierarchies.  When we talked to our internal IT people they said they were buying a copy of Stratature +EDM primarily for its hierarchy management capabilities because they found many people were spending a significant amount of time managing hierarchies in spreadsheets on their desktops, and this not only led to lots of duplication of effort but in some cases could be error-prone if the wrong spreadsheet was used.

 

This information led to quite a few statements that Stratature was only a hierarchy management system.  Stratature is a very fully-featured MDM hub and hierarchy management, while it’s cool, is only a small part of what it does.  Going back to the whispering analogy, this is like starting with a statement that I bought a pair of shoes because they had really cool laces and ending up with I bought a pair of shoe laces.

 

A similar example is our thinking that we can add a lot of value to the basic Stratature hub by integrating it with our BI tools being interpreted as meaning we only plan to do analytical MDM.  This is ignoring our rich set of tools – SSIS, BizTalk, WCF, WF, Service Broker, etc. – that make operational MDM a very attractive market for us.  Releasing an analytical MDM only product just doesn’t make sense for us.

 

Probably the biggest piece of misinformation comes from our statement that we’re temporarily taking the Stratature product off the market.  Once we start selling the Stratature product, it becomes a Microsoft product and as such it must adhere to a whole bunch of quality, security, and legal standards that a non-Microsoft product doesn’t have to deal with.  Until we jump through all the required hoops, we can’t release the Stratature product from Microsoft.  It doesn’t require a lot of thought to conclude that this means we can’t sell Stratature for a while until we get the required changes made.  This simple fact has been interpreted to mean we are planning to hack up the product and only keep the few pieces we need to do hierarchy management and analytical MDM (see above).  This interpretation is a little too bizarre to be a case of simple miscommunication so I assume there’s a deliberate attempt to spread FUD here.  Obviously some people hope that current Stratature customers will think we’re abandoning them so their only hope is to run out and buy something else.  Microsoft is often accused of being many things but is not often said to be stupid, and this would be just plain stupid.

 

Well, that’s the way I see things from the inside.  All I ask you to do is to watch what we do in the coming months and judge for yourself what the real story is.  I think you’ll be pleasantly surprised.

Master Data Management, Microsoft, and Stratature

 

If you’re a regular reader, you are aware that I have been blogging about MDM for about a year now.  I’m very excited by the news that Microsoft acquired Stratature last week for two reasons – I think it’s a great move for Microsoft and it means that I am now the second employee on the Microsoft MDM team.  We picked up a great team with many years of MDM experience in the Stratature team so we already have a solid MDM product team in place and we’re hiring as fast as we can get interviews done.

 

If you have read my blog and white papers on MDM you know that I think an MDM hub is the key to a successful MDM implementation and Stratature has one of the best MDM hubs out there.  It supports meta-data driven schema for entity management, sophisticated hierarchy management, versioning, business rules, and workflow.  When we combine this with the rest of the Microsoft platform – SSIS, BizTalk, WCF, WF, SharePoint, InfoPath, etc. – we will have one of the most complete MDM offerings available.

 

The Stratature team has many years of experience in BI and data management and a solid base of current MDM customers.  Most of their customers are large enterprises with very demanding requirements.  I’m not sure which ones are public but the web site lists McKesson, GlaxoSmithKline, and Tiger Brands (a very large South African company).  This injection of MDM experience will be a tremendous kick start for the Microsoft MDM efforts.

 

If you have questions about our MDM effort, we have established an email alias mdmvibe@microsoft.com that you can use.  As we get further into the process and have more information available, I’ll post it here.

Master Data Management at TechEd

 

I notice that no MDM sessions made it into the TechEd schedule this year.  If you’re a regular reader, you know that MDM is my current passion.  If you’re interested in MDM, I would love to talk to you at the Architecture Track lounge at TechEd.  I haven’t totally forgotten Service Broker so if you’re interested in Service Broker, I would love to talk also.

Windows WF on SQL Service Broker

 

From the time we first started working on the Service Broker programming models five or six years ago, it was obvious that most SSB programs end up looking a lot like a workflow.  Messages come in and the application processes them, often sending out more messages to other services.  When the application is waiting for a response from a message, the state of the application is stored so that it doesn’t have to be kept around if it takes a long time for the response to come back.  The Conversation Group ID is a great way to identify state so that when a message arrives it’s easy to find the right state for the message.  The original design for Service Broker even included a “Contract Language” to define the flow for messages within a contract.  This was dropped early on because we didn’t want to invent yet another workflow language, but some kind of workflow was something I have been talking about for years.  When the workflow guys first started talking about making workflow hostable, I thought hosting it on Service Broker was a natural move.  My friend Harry Pierson has been working on WF hosted on Service Broker for one of his projects so when I needed to do a demo for an MDM talk I was giving, I decided to write my own WF service hosted on Service Broker.

 

Harry has played with WF in a CLR stored procedure but for my MDM hub application, I wanted to run the service outside SQL Server which also simplified some of the hosting code.  Even though I thought going in that Service Broker was a good fit for workflow, I was surprised how naturally workflow integrated with the Service Broker programming model.  WF is really well designed so it wasn’t too hard to do what I needed.  I did just enough to support the demo I needed to do but I tried whenever possible to build a general purpose hosting application.  When I have a little more time, I’ll try to finish up the more general purpose code.  I think the whole WF environment can be a great way to develop Service Broker services.  It won’t be as efficient as a stored procedure for simple services but it makes writing complex Service Broker services much more approachable.  The workflow designer is hostable so it should be possible to build a whole SSB service development environment based on WF.  My MDM sample could be expanded into a toolkit for building MDM synchronization applications.  I’m also playing with some string pattern matching algorithms implemented as CLR stored procedures for possible use in duplicate detection workflows.

 

If you have ever written Service Broker code, you will recognize the main processing loop in the WF hosting code.  It’s just a loop that receives a message at the top and then drops into a switch statement based on the message type.  The EndDialog and DialogError messages are handled in the hosting code for now.  At some point it may make sense to pass these up into the workflow so you can build some custom event handling for them.  Normal message types are passed up into a workflow instance for handling.  The initial message in a workflow starts up a new workflow instance, assigns the Conversation Group ID as the Workflow Instance ID, and passes the message contents as parameters into the workflow.  Any subsequent messages are packaged as Workflow events and passed into the appropriate instance identified by the conversation group ID.  This algorithm obviously requires a way to tell the difference between a message that starts a new workflow and one that is passed to an existing workflow.  The current code has a hard-coded message type as the initial message but I think this can be generalized so that the first message in a dialog where this service is a target will always start a new instance.  I plan to try this later this week.
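
The shape of that loop can be sketched in a few lines of Python.  This is not the actual hosting code, which runs against SQL Server and WF.  The in-memory deque stands in for the Service Broker queue, and the Workflow class, the run_host function, and the //example/MDM/... message type names are assumptions made up for illustration.  The two system message type names are the ones Service Broker itself uses for end-of-dialog and error messages.

```python
from collections import deque

# Service Broker system message type names.
END_DIALOG = "http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog"
ERROR = "http://schemas.microsoft.com/SQL/ServiceBroker/Error"
# Hypothetical hard-coded application message type that starts a workflow.
INITIAL = "//example/MDM/NewEntity"


class Workflow:
    """Stand-in for a workflow instance; it just records what it is given."""

    def __init__(self, instance_id, parameters):
        self.instance_id = instance_id
        self.parameters = parameters
        self.events = []

    def raise_event(self, message_type, body):
        self.events.append((message_type, body))


def run_host(queue):
    """Drain the queue, dispatching on message type."""
    instances = {}   # conversation group ID -> workflow instance
    ended = []       # dialog handles closed by the host
    while queue:
        # Stand-in for a RECEIVE from the service queue.
        group_id, handle, message_type, body = queue.popleft()
        if message_type in (END_DIALOG, ERROR):
            # Handled in the host rather than passed up to the workflow.
            ended.append(handle)
        elif message_type == INITIAL:
            # New instance: the conversation group ID becomes the instance ID.
            instances[group_id] = Workflow(group_id, body)
        else:
            # Later messages become events on the instance for this group.
            # (No error handling: an unknown group ID raises KeyError.)
            instances[group_id].raise_event(message_type, body)
    return instances, ended


queue = deque([
    ("g1", "h1", INITIAL, {"customer": "Acme Corp"}),
    ("g1", "h2", "//example/MDM/MatchResult", {"duplicate": False}),
    ("g1", "h1", END_DIALOG, None),
])
instances, ended = run_host(queue)
```

The hard-coded INITIAL type mirrors the limitation described above; generalizing it would mean deciding "new instance or existing one" from the dialog rather than from the message type.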

 

Most of the logic I wrote was event handling code because pretty much all communication between the hosting code and the workflow is done through events.  Begin Dialog, End Dialog, and Send Message are events from the workflow to the hosting code and Message Received is an event from the host code into the workflow.  I also wrote several events to handle the MDM activities I needed for the demo.  All the database activity is done in the hosting code so I can use the same database connection for everything and make the database transactions work the way I wanted to for data integrity.  I made all the event handlers into custom activities so I can use them pretty simply from the designer surface because they are available in the toolbox.

 

WF comes with a persistence class that stores workflow state in SQL Server but this class manages the SQL connection and transactions.  I needed to store the workflow state using the same connection and transaction as the Service Broker messages and my MDM hub updates so I had to write my own persistence class.  Fortunately there’s a good sample so it didn’t take long to write.  There’s quite a bit of logic for locking the state to prevent simultaneous access but since the state is always locked by the Conversation Group lock, locks are not necessary for my persistence class.
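
The reason for sharing one connection and transaction can be shown with a small Python sketch.  It uses the standard library sqlite3 module purely as a stand-in for SQL Server, and the table names and the process_message function are hypothetical.  The point is only that removing the message, saving the workflow state, and updating the hub either all commit or all roll back.

```python
import json
import sqlite3


def process_message(conn, message_id, group_id, state, customer):
    """Dequeue a message, save workflow state, and update the hub atomically."""
    try:
        with conn:   # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM queue WHERE id = ?", (message_id,))
            conn.execute(
                "INSERT OR REPLACE INTO workflow_state (group_id, state) VALUES (?, ?)",
                (group_id, json.dumps(state)),
            )
            conn.execute("INSERT INTO hub_customer (name) VALUES (?)", (customer,))
        return True
    except sqlite3.IntegrityError:
        return False   # all three changes were rolled back together


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE queue (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE workflow_state (group_id TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE hub_customer (name TEXT NOT NULL UNIQUE);
    INSERT INTO queue (id, body) VALUES (1, 'first'), (2, 'second');
""")

ok_first = process_message(conn, 1, "g1", {"step": 1}, "Acme Corp")
# The duplicate name violates the UNIQUE constraint, so the whole unit of work
# rolls back: message 2 stays on the queue and the saved state stays at step 1.
ok_second = process_message(conn, 2, "g1", {"step": 2}, "Acme Corp")

remaining = conn.execute("SELECT id FROM queue").fetchall()
saved_state = json.loads(
    conn.execute("SELECT state FROM workflow_state WHERE group_id = 'g1'").fetchone()[0]
)
```

If the state were saved on a separate connection, the failed hub update would leave the workflow believing it had reached step 2 while the message that got it there was back on the queue.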

 

There are parts of WF that I haven’t looked into yet, like transaction scope and compensation, but so far I don’t need them.  I have only played with sequential workflows but I think a state machine workflow would also work well.  I use the dialog handles in messages as correlation ids to tie sends to receives so that I can send out multiple messages on multiple dialogs in parallel and still get the received message routed to the correct event handler.  So far it looks like my parallel activities execute serially but I’m hoping that’s just because I’m running in the debugger.

 

Like most demo software, this has been debugged to the extent that it can handle two messages in a row pretty consistently but not much more.  There really isn’t any error handling to speak of so it’s way too fragile for use in a real application.  As with all prototypes, my next step is to take what I learned, throw out the prototype and start over.

MDM Hub Architecture White Paper

 

I took my MDM Hub Architecture series of posts, cleaned them up and combined them into a white paper that was just published on MSDN:  http://msdn2.microsoft.com/en-us/architecture/bb410798.aspx

 
