Why duplicate detection is hard
In a past life (I'm fairly certain it was a past life because there are days when I'm sure I'm paying for it), I worked on a pair of very large-scale data-cleansing systems. They will go unnamed, except that I'll refer to them as System DBS and System ES. Both systems had a specific set of goals and in the case of DBS a very specific target problem domain. The goals were quite simple:
- Acquire data from a number of related applications (where "related" is a very loose term meaning that the owning company for each application was effectively the same, but that otherwise they supported different business functions).
- Define a common schema that captures the union of all the source data.
- Define the set of candidate keys that lets you identify an item's source, its original key, its common key, and its goal key.
- Construct from that data a common union of all interesting elements. That is, load it all up.
- Apply a forward-chaining rule system over that data to coalesce duplicate records, identify "bad" data, any find missing data.
- Push that nice clean data back out into the bad dirty world from which it came (but put it back nicely)
Given that DBS was specifically designed to run over data from a given problem domain (let's call it telecom data) one might assume that the problem was well-constrained. If one did assume that one would be very wrong. So, a small team of developers set out to generalize the problem space to cover different domains and designed ES as a result. ES followed the same path as above only in a very general way and without the overly complex rules engine (12 years later I think we might have been able to use that rules engine, but at the time our distaste for it was clear in the ES design). As an aside outside of parentheses this was called the U model mainly for the shape that the model took on while we drew it on the whiteboard.
So, what's the point of that history lesson and what does it have to do with duplicate detection. Well, remember that I mentioned that the primary domain was telecom. That means we covered concepts such as customers, addresses, telephone numbers, physical plant data (there are a lot of little pieces that go into getting telephone service over a land line), bills & invoices, and payments. In all there were some 30 different systems involved in sourcing this data to our engine. One the first things that needs to happen in the DBS problem space is that a set of candidate keys need to be identified that work across all systems, or at least across enough systems such that in the end all systems can be logically linked. In the telecom world that was the phone number.
In the U.S. (actually in the North American dialing area) telephone numbers are 10 digits long and always follow a very specific format. I won't go into the formal names for those various groups of digits or even why there are groups, but let's just say that each of the groups can become part of a key. Nearly every installation of DBS was in the U.S. so building a telephone number parser wasn't too terribly difficult. You're either looking for 7 or 10 digits (unless you run across a PBX in the data and then you have to start messing about with extensions). Well, the big DBS installation I worked on, and what drove much of the ES design, was not based in the U.S., but was instead in another country with very different telephone formats. Some places used 4 digits, some 5, some 7… you get the point. The plan was to configure and run DBS to first dedup that data so we could convert the whole country to 10 digit dialing. A secondary goal, once the customer realized what we could do, was to dedup the whole lot of data to see what we could find (did you know that many times the phone company doesn't know that your home already has a connected telephone line so they sent a technician out with new gear to hook it up?).
I can hear you now: "Get on with the discussion and tell us why duplicate detection is hard."
Remember that I mentioned the step about identifying a candidate key? Well, in the case of a phone number it's fairly simple (let's make some additional simplifying assumptions that phone numbers are never reused, each person has only one phone number, and numbers never move from person to person). Once you see a phone number in a normalized format you can then query over the set of existing data looking for other instances of that phone number. In a live RDBMS that can be done using a unique key constraint over the normalized phone number column that will throw back an error when an insert or update attempts to violate that key. In our simple world this works every time because when you get an error on an insert you know that you have either a duplicate record or an error in the key. For updates you know you have an error in one of the keys of the updated record or in at most one existing record in the database.
Now, let's extend this from our idealized phone number world to something that's more CRM-ish. A phone number is not a reasonable candidate key because it does change over time, two people can share it, and many people have many numbers. A solution to this problem is to identify a new candidate key. One approach is to construct a key from various bits of useful information. For example one might use some normalized elements of a contact's name, a phone number, possibly an email address, and their home address. Once we've extended the key to cover enough elements to guarantee uniqueness (which is not possible and is left as a proof to the reader) in our problem space we will invariably run into the case where that key isn't wide enough.
Let's see what happens when we insert a new record. First, we construct / synthesize a candidate key and put it into one of the columns in the INSERT statement. Then we fire the statement at the database and wait for an error. Let's assume we get a key violation back so we have a few options: we can change the key, we can ask the data supplier what to do, or we can punt. If we change the key automatically then we've simply ignored the duplicate detection problem and we might as well not have done any of this work. Same thing with punting except that it's overly harsh on the other side: the data doesn't go into the database.
We might ask the user what to do. Well, simply telling them that their just-entered data would create a duplicate record in the system and therefore must be in error wouldn't be particularly useful. How would they know what part of it is in error? How would they even know what to do with a duplicate. One thing that DBS and ES did was return the new record and the existing duplicate(s) in a nice bit of dynamic UI so that the user could see both records essentially side-by-side and make a judgment call. This worked for our solution because we specifically engineered it so that there was a headless service running but a staff of "Error Resolution and Correction Clerks". That is, people were sitting in the dark waiting for bad data to pop up on their screen, they'd make a call based on the original data, the duplicate data, and occasionally the data from the source system.
Let's say we do something like the first approach where we simply return the "offending" records and the new record and let the user decide. Then, the user decides that these two records are actually different from one another but that the data is 100% correct. In this case the user or the system could mar the record in such a way that it's no longer considered a duplicate and complete the write operation. What just happened here? Well, the candidate key for the new item will no longer raise an error when it's the cause of a duplication because that key is different. We could mar the data in a predictable way so that the key stays intact but so that there's a "larger" unique key over the data, but again that wouldn’t cause the insert to fail.
The next option is to use a unique key over the candidate key plus some invented data (invented in a predictable way that is) and query the data on the candidate key before attempting an insert. Now we're getting somewhere. We allow duplicate candidate keys but invent a wider key that guarantees uniqueness (see above for details on the widening proof). But we still haven't solved the problem because we don't have a reasonable way to verify that a duplicate is really a duplicate.
This all gets horribly complex when you're dealing with multiple record types or subclasses of types (think of the customer case in MS-CRM where "customer" might mean Account or Contact. This means you need a candidate key that crosses type boundaries and that you have a way to reconcile duplicates across types.
Anyway, that's why duplicate detection is hard. Note that I didn't say it was impossible, just hard.