MDM - Scaling Out Matching


It’s been a long time since my last post, but I’ve been pretty busy figuring out how IT works.  As you might expect, one of the hardest things has been learning all the new acronyms.  Not only that, but some of the acronyms I thought I knew mean different things in IT than they did in the product groups.  For example, I thought OBA meant “Office Business Application” and found out it means “On Boarding Application” only after several weeks of confusion.  My current challenge is to figure out how to scale MDM matching out to meet our future requirements: matching hundreds of millions of incoming records a week against a billion or more customer records in our MDM database.  So far we haven’t found an engine that can do this with a single instance, so we are trying to figure out the best way to scale matching out over a large number of matching servers.


There are basically two types of matching engines that I’m aware of:


1) Database engines that build an index in the same database as the MDM data.  The obvious advantage is that, as long as the index is built in the same transaction that inserts or updates the MDM data, the index is always completely up to date, so you don’t have to worry about two incoming records for the same customer failing to match because the first record isn’t indexed yet when the second record hits the matching engine (there’s a minimal sketch of this transactional approach after this list).  This kind of engine also uses the database to do the searching for matches, so it scales as well as the database scales.  The downside is that scaling the database can be pretty expensive, which limits the total scalability of a database solution.  We are currently using SQL replication to scale out by using separate copies of the database for searching and other matching applications that don’t update the database.  This not only speeds up these read-only matching applications but also reduces the load on the MDM hub itself, because the hub then only has to merge new and modified records. 

We’re currently doing some prototyping to see if this type of engine will scale to matching hundreds of millions of incoming records against a billion or more master records.  My current thinking is that we will have to partition the data in order to scale to the number of master records we require.  The obvious question is which attributes to partition on, so that we get an adequate number of partitions and can reliably determine which partition a record belongs in.  I will talk more about partitioning later because I think partitioning is the key to making scaleout work.

2) Memory-based engines that build an index either completely or partially in memory.  These engines use the database as the source of data for the index, but the actual matching is done completely outside the database.  This makes scaling out the matching pretty straightforward, because we can make multiple copies of the index without replicating the database itself.  The hardest problem with this kind of engine is keeping the indexes up to date.  The choices are either rebuilding the index periodically or updating it as each record is added.  The rebuild option means that the index is pretty much always somewhat out of date.  We can alleviate this problem somewhat by resolving any duplicates within an input batch before we load the batch, but this doesn’t help with real-time inserts into the master data.  One option would be to use both update techniques: rebuild the index before a big batch run and maintain it dynamically for real-time input (there’s a sketch of this combination after this list as well).

Because this type of matching engine gets much of its performance from the index being in memory, it’s pretty much a sure bet that we will have to come up with a good way to partition the data so the indexes don’t get too large to fit in memory.  In addition, I anticipate the volume will be large enough that we will have to replicate the indexes to handle the load.  This means that we will try a combination of partitioning and replication.
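To make the transactional-index idea in option 1 concrete, here’s a minimal sketch in Python, using sqlite3 purely as a stand-in for the real MDM database.  The schema, table names, and the match_keys() key generator are all hypothetical; a real engine derives far more sophisticated fuzzy keys and does all of this inside the hub database itself.

```python
import sqlite3

conn = sqlite3.connect("mdm_sketch.db")
with conn:  # hypothetical, simplified schema
    conn.execute("CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS match_key (key TEXT, customer_id INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS ix_match_key ON match_key (key)")

def match_keys(name, address):
    # Made-up key generator; real engines use phonetic codes, n-grams,
    # standardized addresses, and so on.
    return {name.strip().upper()[:4], address.strip().upper()[:6]}

def insert_customer(name, address):
    # The master record and its match keys commit in a single transaction,
    # so a second incoming record for the same customer always sees the first.
    with conn:
        cur = conn.execute(
            "INSERT INTO customer (name, address) VALUES (?, ?)", (name, address))
        cid = cur.lastrowid
        conn.executemany(
            "INSERT INTO match_key (key, customer_id) VALUES (?, ?)",
            [(k, cid) for k in match_keys(name, address)])
    return cid

def candidate_ids(name, address):
    # Searching also runs in the database, so it scales (and replicates to
    # read-only copies) the same way the database does.
    keys = list(match_keys(name, address))
    placeholders = ",".join("?" * len(keys))
    rows = conn.execute(
        f"SELECT DISTINCT customer_id FROM match_key WHERE key IN ({placeholders})", keys)
    return [r[0] for r in rows]
```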
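And here’s an equally minimal sketch of option 2, assuming a hypothetical in-memory index that supports both update styles: a full rebuild from the hub database before a big batch run, plus incremental adds for real-time inserts.  The record fields and key generation are again made up for illustration.

```python
from collections import defaultdict

class InMemoryMatchIndex:
    """Hypothetical memory-based index; matching happens outside the database."""

    def __init__(self):
        self.buckets = defaultdict(set)  # match key -> set of master record ids

    def _keys(self, record):
        # Made-up key generation; a real engine would use phonetic codes,
        # n-grams, standardized addresses, and so on.
        return {record["name"].upper()[:4], record["postal_code"][:3]}

    def add(self, record):
        # Incremental path: keeps the index current for real-time inserts.
        for key in self._keys(record):
            self.buckets[key].add(record["id"])

    def rebuild(self, fetch_master_records):
        # Periodic path: rebuild from the hub database (fetch_master_records
        # is any callable that streams the master records).
        self.buckets.clear()
        for record in fetch_master_records():
            self.add(record)

    def candidates(self, record):
        # Look up candidate ids by shared keys; scoring happens elsewhere.
        found = set()
        for key in self._keys(record):
            found |= self.buckets[key]
        return found
```

Because the whole structure lives in memory, extra copies are cheap to stand up, which is what makes the partitioning-plus-replication approach described below attractive.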


Partitioning


If we have to partition the indexes to get the scalability we need, the partitioning scheme will be key to performance and flexibility.  If possible, it would be great to partition the indexes on an attribute that we always know for every incoming record; that way, we know which index to use for each incoming record.  An ideal attribute would be the language of the incoming record.  One of the things most matching engines need to know is the language they are matching against, because the way text is divided into words, what an address looks like, what the synonyms and nicknames are, and so on all vary from language to language.  Some matching engines require that the indexes be built on a single language, so partitioning by language may be required anyway.  The issue is that there’s no reliable way to determine what language a name and address are in.  There are algorithms for guessing the language, but they’re not always right, and even when they guess correctly, the information can be misleading.  For example, is Toyota English or Japanese?  So language is a functional partitioning attribute but not necessarily a useful one.


Country might also be a good attribute for partitioning because it’s often present in the address, and if it isn’t, address standardization will generally figure out the right country.  The disadvantage of country is that while it might be OK for individuals, it won’t work well for multinational companies that exist in many countries or for individuals who move to a different country.


One way to create partitioned indexes or databases without worrying about missing a match because we’re looking in the wrong partition is to partition the index on something arbitrary, like a hash of the primary key, and then match against all partitions in parallel, combining the results to come up with the best matches across all the partitions (see the sketch below).  This obviously uses a lot more resources than doing the match in a single partition, but it eliminates the problem of missed matches, and the total time to find a match is roughly the same as matching in a single partition.  This type of partitioning probably makes sense for the memory-based matching system, where adding new copies of the index is relatively cheap, but it’s probably too expensive, both in actual cost and in added replication overhead, to work in a database matching system.  The parallel matching partition scheme also makes a lot of sense if the underlying hub database is partitioned.  Microsoft IT uses partitioned databases for many very large customer data systems, so we’re looking into the possibility of partitioning our MDM data.  If we do partition the MDM hub, then partitioning the matching indexes using the same partition scheme makes a lot of sense.
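Here’s a rough sketch of that parallel fan-out, assuming the master data is split into a fixed number of partitions by a hash of the primary key.  The partition objects and their best_match() method are hypothetical stand-ins for whatever the real index servers would expose; the point is only that every partition is searched concurrently and the best score wins.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_PARTITIONS = 8  # hypothetical; sized to whatever fits on one index server

def partition_of(primary_key: str) -> int:
    # Arbitrary hash partitioning decides where a master record lives.
    return crc32(primary_key.encode("utf-8")) % NUM_PARTITIONS

def match_everywhere(incoming_record, partitions):
    # Fan the search out to every partition in parallel and keep the best
    # candidate, so a match is never missed just because we guessed the
    # wrong partition; total latency stays close to a single-partition match.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(lambda p: p.best_match(incoming_record), partitions)
    candidates = [r for r in results if r is not None]
    return max(candidates, key=lambda r: r["score"], default=None)
```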


Matching Algorithms


Matching algorithms are generally orthogonal to matching scaleout, but the type of algorithm our matching system uses might influence our choice of partitioning scheme.  In my greatly simplified view of matching there are two kinds of algorithms: rules-based algorithms that use a set of standard and user-defined rules to determine matches, and mathematical algorithms that use a variety of mathematical formulae to give a similarity score to potential matches (the toy example below shows the difference).  In my opinion, a perfectly tuned rules-based system will give better match accuracy than a mathematical match engine, but I’ve never really run into a perfectly tuned rules-based match engine.  Rules change over time and are different for different locales and different languages, so maintaining matching rules can be a hugely complex and expensive task.  Since rules generally must be tuned differently for different languages and different countries, partitioning by language or country may be more compelling if you use a rules-based matching engine.
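A toy illustration of the two families, with made-up record fields: the rule is the kind of hand-maintained check a rules engine accumulates locale by locale, while difflib’s SequenceMatcher stands in for a real mathematical similarity formula.

```python
from difflib import SequenceMatcher

def rule_based_match(a, b):
    # Example hand-written rule: same postal code and same first four
    # characters of the name. Rules like this need constant per-locale tuning.
    return (a["postal_code"] == b["postal_code"]
            and a["name"].upper()[:4] == b["name"].upper()[:4])

def similarity_score(a, b):
    # Mathematical approach: a similarity score between 0 and 1; a real
    # engine combines many weighted field-level scores.
    return SequenceMatcher(None, a["name"].upper(), b["name"].upper()).ratio()

a = {"name": "Jon Smith",  "postal_code": "98052"}
b = {"name": "John Smith", "postal_code": "98052"}
print(rule_based_match(a, b))             # False -- the untuned rule misses the near-match
print(round(similarity_score(a, b), 2))   # 0.95  -- the score catches it
```

Which is exactly the tuning problem: the rule is easy to explain but has to be adjusted for every “Jon/John” style variation, locale by locale.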


Conclusions


This is the place where I would normally recommend which scaleout method you should use.  Unfortunately, at this point I really don’t know which one we will use.  Scaling out the database engine would probably be easiest for us because that’s the kind of engine we’re using now, but we don’t have enough experience with very large data volumes to know whether our current system will scale to the required volumes.  The memory-based indexes theoretically look like they will scale very well, but we still have to solve the update problem before we know whether they really hold up in the face of a massive number of updates to the underlying data.  We plan to do more prototyping over the next few months, so I’ll let you know what we find.  If you have experience or advice, I would like to hear from you.