Welcome to MSDN Blogs Sign in | Join | Help

Why duplicate detection is hard

In a past life (I'm fairly certain it was a past life because there are days when I'm sure I'm paying for it), I worked on a pair of very large-scale data-cleansing systems. They will go unnamed, except that I'll refer to them as System DBS and System ES. Both systems had a specific set of goals and in the case of DBS a very specific target problem domain. The goals were quite simple:

 

  • Acquire data from a number of related applications (where "related" is a very loose term meaning that the owning company for each application was effectively the same, but that otherwise they supported different business functions).
  • Define a common schema that captures the union of all the source data.
  • Define the set of candidate keys that lets you identify an item's source, its original key, its common key, and its goal key.
  • Construct from that data a common union of all interesting elements. That is, load it all up.
  • Apply a forward-chaining rule system over that data to coalesce duplicate records, identify "bad" data, any find missing data.
  • Push that nice clean data back out into the bad dirty world from which it came (but put it back nicely)

 

Given that DBS was specifically designed to run over data from a given problem domain (let's call it telecom data) one might assume that the problem was well-constrained. If one did assume that one would be very wrong. So, a small team of developers set out to generalize the problem space to cover different domains and designed ES as a result. ES followed the same path as above only in a very general way and without the overly complex rules engine (12 years later I think we might have been able to use that rules engine, but at the time our distaste for it was clear in the ES design). As an aside outside of parentheses this was called the U model mainly for the shape that the model took on while we drew it on the whiteboard.

 

So, what's the point of that history lesson and what does it have to do with duplicate detection. Well, remember that I mentioned that the primary domain was telecom. That means we covered concepts such as customers, addresses, telephone numbers, physical plant data (there are a lot of little pieces that go into getting telephone service over a land line), bills & invoices, and payments. In all there were some 30 different systems involved in sourcing this data to our engine. One the first things that needs to happen in the DBS problem space is that a set of candidate keys need to be identified that work across all systems, or at least across enough systems such that in the end all systems can be logically linked. In the telecom world that was the phone number.

 

In the U.S. (actually in the North American dialing area) telephone numbers are 10 digits long and always follow a very specific format. I won't go into the formal names for those various groups of digits or even why there are groups, but let's just say that each of the groups can become part of a key. Nearly every installation of DBS was in the U.S. so building a telephone number parser wasn't too terribly difficult. You're either looking for 7 or 10 digits (unless you run across a PBX in the data and then you have to start messing about with extensions). Well, the big DBS installation I worked on, and what drove much of the ES design, was not based in the U.S., but was instead in another country with very different telephone formats. Some places used 4 digits, some 5, some 7… you get the point. The plan was to configure and run DBS to first dedup that data so we could convert the whole country to 10 digit dialing. A secondary goal, once the customer realized what we could do, was to dedup the whole lot of data to see what we could find (did you know that many times the phone company doesn't know that your home already has a connected telephone line so they sent a technician out with new gear to hook it up?).

 

I can hear you now: "Get on with the discussion and tell us why duplicate detection is hard."

 

Remember that I mentioned the step about identifying a candidate key? Well, in the case of a phone number it's fairly simple (let's make some additional simplifying assumptions that phone numbers are never reused, each person has only one phone number, and numbers never move from person to person). Once you see a phone number in a normalized format you can then query over the set of existing data looking for other instances of that phone number. In a live RDBMS that can be done using a unique key constraint over the normalized phone number column that will throw back an error when an insert or update attempts to violate that key. In our simple world this works every time because when you get an error on an insert you know that you have either a duplicate record or an error in the key. For updates you know you have an error in one of the keys of the updated record or in at most one existing record in the database.

 

Now, let's extend this from our idealized phone number world to something that's more CRM-ish. A phone number is not a reasonable candidate key because it does change over time, two people can share it, and many people have many numbers. A solution to this problem is to identify a new candidate key. One approach is to construct a key from various bits of useful information. For example one might use some normalized elements of a contact's name, a phone number, possibly an email address, and their home address. Once we've extended the key to cover enough elements to guarantee uniqueness (which is not possible and is left as a proof to the reader) in our problem space we will invariably run into the case where that key isn't wide enough.

 

Let's see what happens when we insert a new record. First, we construct / synthesize a candidate key and put it into one of the columns in the INSERT statement. Then we fire the statement at the database and wait for an error. Let's assume we get a key violation back so we have a few options: we can change the key, we can ask the data supplier what to do, or we can punt. If we change the key automatically then we've simply ignored the duplicate detection problem and we might as well not have done any of this work. Same thing with punting except that it's overly harsh on the other side: the data doesn't go into the database.

 

We might ask the user what to do. Well, simply telling them that their just-entered data would create a duplicate record in the system and therefore must be in error wouldn't be particularly useful. How would they know what part of it is in error? How would they even know what to do with a duplicate. One thing that DBS and ES did was return the new record and the existing duplicate(s) in a nice bit of dynamic UI so that the user could see both records essentially side-by-side and make a judgment call. This worked for our solution because we specifically engineered it so that there was a headless service running but a staff of "Error Resolution and Correction Clerks". That is, people were sitting in the dark waiting for bad data to pop up on their screen, they'd make a call based on the original data, the duplicate data, and occasionally the data from the source system.

 

Let's say we do something like the first approach where we simply return the "offending" records and the new record and let the user decide. Then, the user decides that these two records are actually different from one another but that the data is 100% correct. In this case the user or the system could mar the record in such a way that it's no longer considered a duplicate and complete the write operation. What just happened here? Well, the candidate key for the new item will no longer raise an error when it's the cause of a duplication because that key is different. We could mar the data in a predictable way so that the key stays intact but so that there's a "larger" unique key over the data, but again that wouldn’t cause the insert to fail.

 

The next option is to use a unique key over the candidate key plus some invented data (invented in a predictable way that is) and query the data on the candidate key before attempting an insert. Now we're getting somewhere. We allow duplicate candidate keys but invent a wider key that guarantees uniqueness (see above for details on the widening proof). But we still haven't solved the problem because we don't have a reasonable way to verify that a duplicate is really a duplicate.

 

This all gets horribly complex when you're dealing with multiple record types or subclasses of types (think of the customer case in MS-CRM where "customer" might mean Account or Contact. This means you need a candidate key that crosses type boundaries and that you have a way to reconcile duplicates across types.

 

Anyway, that's why duplicate detection is hard. Note that I didn't say it was impossible, just hard.

 

Published Tuesday, February 07, 2006 3:00 PM by mikemill
Filed under:

Comments

# re: Why duplicate detection is hard

Ok.   Why not E-mail as a key, by definition 2 contacts sould not ever have the same e-mail address.  We all understand the phone number NPA-NXX schema and the fact that phone formating is not enforced at the UI only exaserbates the problem.  So even if you tried to match on any phone field your UI would revent reasonable accuracy because users can type (xxx)xxx-xxxx  or xxx-xxx-xxxx or xxxxxxxxx    whatever they want...   This is a dum design.  We all know what the phone formats are why is that not enforced at the data layer, or at least at the UI layer.  

This is allowing user to create dupes w/out notification.   Goldmine has a great feature that searches for Like e-mails and phone numbers during entrty.  It allows the user to choose at entry.   Sure it is not the 100% dup elimination system but the most intelligent de-duper is a real live person who can make the simple decision at entry, based on style and the understood business rules.  As far as I am concerned.   This is a major omission from the applicaiton.    
Wednesday, February 08, 2006 4:20 PM by Pierre Hulsebus

# re: Why duplicate detection is hard

Well, I purposefully stayed away from discussing candidate key identification and generation (mainly because it's a religious war and ultimately because there are a ton of patentable ideas in there). The problem with email address is that your assumption of "by definition" doesn't hold. What about the case where I share an email address with my spouse? I know this happens because that's how my parents deal with email. So, by your definition, even with data normalization (format clean-up) and adding in a telephone number, you'd end up with a duplicate. You're right though, the ultimate decision would be up to an intelligent user. But, what happens when there isn't a user (bulk import, API call, or data migration scenarios)? What happens when you run into false positives (like the sharing of email address) or false negatives (when security hides a piece of data from you)?

I'm not disagreeing that this is a missing component in the product as it stands. I was simply saying that really getting it right and usable is a lot harder than some people might expect.

As for phone number formating in the UI - that was something I actively fought for and against (different audiences and different requirements). The problem is that phone number formats are by convention only and don't have any geographic basis. It just happens that in NA we use the 3-3-4 pattern with some separators or grouping. In many countries there are different conventions for different types of numbers and for different areas within that country.

By the way, if you really hate something about the product then post a message over on one of the public newsgroups - the PM team would love to hear from you.
Thursday, February 09, 2006 12:24 AM by mikemill

# re: Why duplicate detection is hard

I'm  bit tired tonight  but: (again)

Load the data into seperate tables.  execute a couple of stored procedures to strip the excess garbage (i.e. trim and remove symbols)

Select Duplicates (i.e. NOT DISTINCT) in the upload table - present to user for edit/deletion

JOIN to existing tables - present to user for edit/deletion

Done -

Copy new data into existing database.

Applications are for users - Databases are for data.  Stop making this more complex than what is needs to be.  This is why applications today are a factor of a 1000 times larger than what they need to be and why Microsoft puts out more Hotfixes than useable code.

You can not anticipate every stupid thing a human will do.  You can not write enough code to have world peace.  Move on, BUT if you had given me a method to pump data into the CRM database - I would be happy.
Thursday, February 23, 2006 3:19 AM by Andrew chumney

# re: Why duplicate detection is hard

"The problem with email address is that your assumption of "by definition" doesn't hold. What about the case where I share an email address with my spouse? I know this happens because that's how my parents deal with email. "

Wow....  I see the problem now.  your parents need to get 2 e-mails and your problem would be solved.  

You guys need to get real.   In the Business world everyone has at least 1 e-mail.  If I have a CRM system that that lists every contact from every business interaction my company has.  When someone adds a new contact that dupes an existing e-mail address or Phone number it should flag the user saying that this exists allready.  Do you want to add a dupe yes/no.....  

It is really that simple.  

Forget designing it for every senerio.  This is prety basic stuff that all the other CRM apps have.  

I do love this app though.  Even with these types of problems.  I love how you guys have designed it.    Very slick stuff.   Keep up the great work.
Friday, February 24, 2006 12:15 PM by Pierre Hulsebus
New Comments to this post are disabled
 
Page view tracker