A friend of mine asked me about common algorithms for determining string similarity. One of the most well-known string similarity algorithms is the Levenshtein edit distance algorithm, possibly because it's frequently used in computer science algorithm courses as a classic example of dynamic programming. Levenshtein edit distance computes the number of insertions, deletions, or substitutions it would take to go from one string to another. For example, the distance between "dog" and "dogs" is 1, because we can insert an 's' at the end of "dog" to get to "dogs". The distance between "puppy" and "lucky" is 3, because we can swap the 'l' for the first 'p', the 'c' for the second 'p', and the 'k' for the third 'p'. Etc. More information on the Levenshtein edit distance is available in wikipedia.

I wrote up an implementation of the algorithm in C# for comparing strings, but then I realized that with generics it would be very easy to extend it to support any IEnumerable<T> of equatable types. This makes it easy to determine, for example, the edit distance between two arrays of integers. For anyone interested, here's the resulting code:

/// <SUMMARY>Computes the Levenshtein Edit Distance between two enumerables.</SUMMARY> /// <TYPEPARAM name="T">The type of the items in the enumerables.</TYPEPARAM> /// <PARAM name="x">The first enumerable.</PARAM> /// <PARAM name="y">The second enumerable.</PARAM> /// <RETURNS>The edit distance.</RETURNS> public static int EditDistance<T>(IEnumerable<T> x, IEnumerable<T> y) where T : IEquatable<T> { // Validate parameters if (x == null) throw new ArgumentNullException("x"); if (y == null) throw new ArgumentNullException("y"); // Convert the parameters into IListinstances // in order to obtain indexing capabilities IList<T> first = x as IList<T> ?? new List<T>(x); IList<T> second = y as IList<T> ?? new List<T>(y); // Get the length of both. If either is 0, return // the length of the other, since that number of insertions // would be required. int n = first.Count, m = second.Count; if (n == 0) return m; if (m == 0) return n; // Rather than maintain an entire matrix (which would require O(n*m) space), // just store the current row and the next row, each of which has a length m+1, // so just O(m) space. Initialize the current row. int curRow = 0, nextRow = 1; int[][] rows = new int[][] { new int[m + 1], new int[m + 1] }; for (int j = 0; j <= m; ++j) rows[curRow][j] = j; // For each virtual row (since we only have physical storage for two) for (int i = 1; i <= n; ++i) { // Fill in the values in the row rows[nextRow][0] = i; for (int j = 1; j <= m; ++j) { int dist1 = rows[curRow][j] + 1; int dist2 = rows[nextRow][j - 1] + 1; int dist3 = rows[curRow][j - 1] + (first[i - 1].Equals(second[j - 1]) ? 0 : 1); rows[nextRow][j] = Math.Min(dist1, Math.Min(dist2, dist3)); } // Swap the current and next rows if (curRow == 0) { curRow = 1; nextRow = 0; } else { curRow = 0; nextRow = 1; } } // Return the computed edit distance return rows[curRow][m]; }

In addition to generics, this implementation also showcases a new feature of C# 2.0, the null coalescing operator (??). The ?? operator returns the left-hand operand if it is not null, otherwise it returns the right-hand operator. I use it in the above code in order to get IList<T> representations of the supplied IEnumerable<T> instances passed as parameters. For computing the edit distance, I needed the capability to index into each of the lists being compared, but I also wanted to support any enumerable type, such as System.String (which implements IEnumerable<char>). I can easily create two new List<T> instances from the supplied parameters, but that could be an expensive operation if the lists provided to the method are long, as building a List<T> from an IEnumerable<T> requires copying all of the data from the latter into the former (if the IEnumerable<T> is an ICollection<T>, this might result in a straightforward array copy, but if it doesn't, it results in enumerating through every element in the IEnumerable<T> and Add'ing each individual item to the new List<T>). Instead, I can save a lot of time and effort if the IEnumerable<T> instance already implements IList<T>. If it does, I can simply cast the IEnumerable<T> to IList<T>, and if it doesn't, I can take the expensive path of creating the new List<T>. The ?? operator makes this easy. Rather than having to write code like:

IList<T> first; if (x is IList<T>) { first = (IList<T>)x; } else { first = new List<T>(x); }or:

IList<T> first = (x is IList<T>) ? (IList<T>)x : new List<T>(x);or:

IList<T> first = x as IList<T>; if (first == null) first = new List<T>(x);I can simply write:

IList<T> first = x as IList<T> ?? new List<T>(x);Nice and neat.