Data cleaning is often a big challenge when working with textual data. The Fuzzy Lookup Add-In for Excel is a new tool from Microsoft Research and BI Labs that helps with the problem of identifying and matching textually similar string data in Excel. It is robust to spelling mistakes, synonyms, missing or added words and a number of other data quality problems frequently encountered in the real world. It has support for most languages and works well across a wide variety of data domains. Common uses include cleaning up lists of names, addresses, products or other entity descriptions which contain fuzzy duplicates. It can also be used to fuzzy join two different tables together. For instance, you might clean and augment a table of dirty city, state data with a zip code by matching it against a clean reference table of city, state and zip codes. Give it a whirl and let us know how it works for your data!
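To make the city, state and zip code scenario above concrete, here is a minimal Python sketch of a fuzzy join. It is not the add-in's algorithm: it swaps in the standard library's difflib.SequenceMatcher as a stand-in similarity measure, and the function names, the 0.8 threshold and the sample rows are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_join(dirty_rows, reference_rows, threshold=0.8):
    """Attach the zip code of the best-matching reference row to each dirty row.

    Rows whose best score falls below `threshold` (an illustrative value)
    get None instead of a zip code.
    """
    results = []
    for city, state in dirty_rows:
        key = f"{city}, {state}"
        best_zip, best_score = None, 0.0
        for ref_city, ref_state, ref_zip in reference_rows:
            score = similarity(key, f"{ref_city}, {ref_state}")
            if score > best_score:
                best_zip, best_score = ref_zip, score
        matched_zip = best_zip if best_score >= threshold else None
        results.append((city, state, matched_zip))
    return results


# Hypothetical sample data: a clean reference table and a dirty input table.
reference = [
    ("Seattle", "WA", "98101"),
    ("Portland", "OR", "97201"),
    ("Redmond", "WA", "98052"),
]
dirty = [("Seatle", "WA"), ("Portlnd", "OR"), ("Springfield", "IL")]

joined = fuzzy_join(dirty, reference)
for row in joined:
    print(row)
```

The two misspelled rows pick up a zip code from the reference table, while the Springfield row stays unmatched because nothing in the reference table is similar enough. A real tool would likely use a token-aware similarity measure and an index rather than comparing every pair of rows.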
An updated version of the add-in has just been released. It fixes a bug that was causing the tool to fail on systems where the default numeric format for decimal values is configured to use something other than "." as the decimal separator.
Can you use the Microsoft.DataIntegration DLL in C#? Are there any examples?
Hi Doug, at present the functionality is only available through Excel. We are looking at ways to make it more widely accessible to developers.
Just downloaded the addin and still get a COM addin error and it won't run.
Hi W.P., make sure that you installed the add-in by running setup.exe and not just the .msi file; setup.exe installs a few prerequisites that your machine might require. Please send me an email at email@example.com and we can figure out what the problem is.
Hi Kris - Have you looked at FuzzyDupes by Kroll Software? I've used that in the past. Unfortunately, I'm only on Excel 2007 and my computer is locked down, which prevents me from installing .NET 4. Hope to eventually get a chance to try out your algorithms.
Hi Yilun, thanks for the pointer. For those interested, some papers highlighting the ongoing research and technical details behind fuzzy lookup and other data cleaning technologies can be found here: research.microsoft.com/.../datacleaning
I can only get it to work very sporadically. A couple of times I have successfully created lists which it has run on but, for the vast majority of the time, pressing the "Go" button has no effect. I cannot establish any pattern as to when it will work or not.
It doesn't even work on the Portfolio file supplied with the Add-in.
Hi Paul, someone else reported a similar problem where the GO button was consistently not doing anything. They had originally installed the add-in by launching the .msi instead of setup.exe. When they uninstalled the add-in and then re-installed via setup.exe it started working. You might give that a try. If you still are seeing problems, send me an email at firstname.lastname@example.org and we can debug offline.
I think this software will be very helpful for a current project. I do, however, have Excel 2007. Is there any way I can use the fuzzy lookup with Excel 2007?
Hi Nick, someone reported that they were able to get it to run with Excel 2007. You might give it a try. If you can let us know whether it worked, that would be great.
I'm successfully using Fuzzy Lookup on some hefty data sets at BT plc in the UK. It would be good to have some sort of visual progress indicator though - there's no way of knowing how far through the data set the analysis has progressed and/or whether it's crashed.
Otherwise, a very promising tool and good UI.
I have installed this today; I am having the same "Go button doesn't do anything" problem as noted below.
I have uninstalled and reinstalled using setup.exe but this doesn't seem to be having any effect.
Installed this plug-in in my 32-bit Excel in Windows 7 x64 environment and it's brilliant.
I've only had it on my PC for an afternoon and already it's saved me time and I can see this will save me hours every week. I am delighted.
I've also got a 64-bit Excel in a Windows 7 x64 environment that I use for my big number crunching work. It's separate because most of my plug-ins are only available in 32-bit, and I hope the add-in works in that too.