More info on my String Normalization algorithm

Published 13 May 03 05:07 PM | RWlodarczyk 

Starting last Friday I ended up working on a StringNormalizationList data structure that encapsulates the StringNormalization class I was working on before. Essentially, this data structure lets you keep adding strings to it, after which you must Clean the structure. The cleaning process builds "a set of sets": a list of all the sets of words that the algorithm deems to be similar (or rather, the same). This data structure now lets me go through a long list of strings and "bin" them into the appropriate buckets. Definitely pretty useful.
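To make the add-then-Clean idea concrete, here is a minimal Python sketch. It is not the actual StringNormalizationList code: the `similar` function, its 0.8 threshold, and the method names `add` and `clean` are all assumptions made for illustration, and the similarity test simply borrows `difflib.SequenceMatcher` from the standard library in place of the StringNormalization class.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity test standing in for StringNormalization."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class StringNormalizationList:
    """Collects strings, then groups them into buckets of similar strings."""

    def __init__(self, is_similar=similar):
        self._is_similar = is_similar
        self._pending = []   # strings added since the last clean()
        self._buckets = []   # each bucket is a list; item 0 is its representative

    def add(self, s: str) -> None:
        self._pending.append(s)

    def clean(self) -> list:
        """Bin every pending string and return the "set of sets"."""
        for s in self._pending:
            for bucket in self._buckets:
                # Compare against the bucket's representative only.
                if self._is_similar(s, bucket[0]):
                    bucket.append(s)
                    break
            else:
                # No existing bucket matched, so start a new one.
                self._buckets.append([s])
        self._pending.clear()
        return [set(b) for b in self._buckets]


names = StringNormalizationList()
for s in ["Microsoft", "Apple", "microsoft", "Microsft", "apple"]:
    names.add(s)
buckets = names.clean()
```

Comparing each string against one representative per bucket keeps the sketch short, but it still costs one comparison per bucket for every string, which is exactly the kind of work a good hash function could cut down.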

There are some optimizations that I'd like to make. In particular, I'd like to find some sort of hash function that hashes similar string values to similar hash keys. I'll pose the question here: does anyone know of such a hash function?
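One well-known candidate is Soundex, a phonetic code designed for English surnames. The sketch below is an illustration of American Soundex written for this post's question, not part of the StringNormalization code, and the function name `soundex` and the helper `bin_by_soundex` are hypothetical. Note that Soundex maps similar-sounding words to the same key rather than to merely nearby keys, and it says nothing about non-phonetic similarity such as transposed characters.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key, or "" if word has no ASCII letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


def bin_by_soundex(words):
    """Group words whose Soundex keys collide (hypothetical helper)."""
    bins = defaultdict(set)
    for w in words:
        bins[soundex(w)].add(w)
    return dict(bins)


bins = bin_by_soundex(["Robert", "Rupert", "Ashcraft", "Tymczak"])
```

Used as a dictionary key this way, the code turns binning into a single pass over the list, at the price of only catching strings that sound alike in English.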

Comments

# Scott Prugh said on May 13, 2003 3:16 PM:
How about SOUNDEX??
# SBC said on May 13, 2003 8:54 PM:
Look up Knuth's vol. 3.
# OK said on February 20, 2004 7:33 AM:
Soundex internally implements the same algo!?

About RWlodarczyk

Robert has been at Microsoft since August 2003. He has worked on WPF Imaging, Media, and Effects, and Windows Vista (in the form of the Windows Imaging Component). He is currently the test lead for the Windows Imaging Component.