A nasality talisman for the sultana analyst

A nasality talisman for the sultana analyst

Rate This
  • Comments 24

The other day my charming wife Leah and I were playing Scrabble Brand Crossword Game (a registered trademark of Hasbro and Mattel) as is our wont. I went first, drawing the Q and a bunch of vowels. Knowing that the Q is death to hold onto, I immediately opened with QI for 22 points. I silently thanked Miriam-Webster for adding QI, KI and ZA to the OSPD 4th edition.

BigSnit

My thankfulness was short-lived, as Leah thought for a few moments and then played ANALYST, for the fifty point "bingo" bonus, making QIS for good measure along the way. She then went on to thoroughly cream me. I am not very good at Scrabble.

We were wondering afterwards what all the possible seven and eight letter "bingo" words were that she could have made with the rack AALNST?. (A blank is conventionally written as "?".) I stared at it for a few moments and found SULTANA and SEALANT, but I was suspicious that there were a lot more. So I wrote a program to find out, which I shall share with you now.

(An interesting historical note: solving this problem efficiently was one of the questions I was given during my interviews when I first applied for full-time work at Microsoft; if you want to get a job here, you might need to know this!)

The core of the program is a method SearchDictionary which takes a rack string and returns a sequence containing every word in a dictionary file which can be formed using all the letters in that rack. In this case I wanted to know not just what all the words matching the given rack were, but what all the words matching the given rack that used an existing letter on the board were. In this case the only two letters on the board were Q and I, but let's assume that it could have been any letter on the board.

The main loop of my little console program looks like this:

public static void Main()
{
    while (true)
    {
        Console.Write("Enter rack (use '?' for blank): ");
        string rack = Console.ReadLine();
        if (rack == "")
            break;
        Console.WriteLine("{0} : {1}", rack, SearchDictionary(rack).Join());
        foreach (char c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
            Console.WriteLine("{0}+{1} : {2}", rack, c, SearchDictionary(rack + c).Join());
    }
}

Pretty straightforward so far. Of course, this is certainly not production quality code. This is a little utility that I hacked together for myself. A production-quality version would have a lot more error handling, for one thing.

The Join method is just a handy helper function that sticks a sequence of strings together:

private static string Join(this IEnumerable<string> strs)
{
    return string.Join(" ", strs.ToArray());
}

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms:

private static string Canonicalize(string s)
{
    char[] chars = s.ToUpperInvariant().ToCharArray();
    Array.Sort(chars);
    return new string(chars);
}

I had a performance goal for this project; I wanted to be able to use the ~2MB 2006 Tournament Word List, searching for racks with up to two blanks, in reasonable human scale time, but not necessarily appearing to be instantaneous. My first naive implementation did not meet this goal so I made a few tweaks to the algorithm until it did, and then I stopped. (It is interesting to think about how this could be made much faster, but that's a subject for another day.)

private static IEnumerable<string> SearchDictionary(string originalRack)
{
    const string dictionary = @"d:\twl06.txt";

    // Calculate all the possible distinct values for the rack.
    // As an optimization, stuff the resulting racks in an array so
    // that we do not recalculate them during the query.

    var racks = (from rack in ReplaceQuestionMarks(originalRack)
                 select Canonicalize(rack)).Distinct().ToArray();

    // Check every line in the dictionary to see if it matches
    // any possible rack.  As an optimization, do an early
    // out if the line length does not match the query length.

    return from line in FileLines(dictionary)
           where line.Length == originalRack.Length
           where racks.Contains(Canonicalize(line))
           select line;
}

LINQ queries are awesome. I love how the code reads like a description of what I'm trying to do, rather than how I'm doing it.

We need a way to turn a rack that might contain blanks into a sequence of the 26 or 26x26 possible racks without blanks. Here's a handy recursive method that does so:

private static IEnumerable<string> ReplaceQuestionMarks(string s)
{
    int index = s.IndexOf('?');
    if (index == -1)
    {
        yield return s;
        yield break;
    }
    foreach (char c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    {
        string s2 = s.Substring(0, index) + c.ToString() + s.Substring(index + 1);
        foreach (string result in ReplaceQuestionMarks(s2))
            yield return result;
    }
}

And of course, the code to extract the lines from the dictionary is our old friend:

private static IEnumerable<string> FileLines(string filename)
{
    using (var sr = File.OpenText(filename))
    {
        while (true)
        {
            string line = sr.ReadLine();
            if (line == null)
                yield break;
            yield return line;
        }
    }
}

And there you go. Six very brief little methods that tell you that Leah could have made the seven letter bingos ANALYST CANTALS LATINAS PLATANS SALTANT SALTPAN SEALANT and SULTANA or by going through the "I", the eight letter bingos ALATIONS ANNALIST FANTAILS LANITALS NASALITY PLATINAS SANTALIC STAMINAL TAILFANS TALISMAN and VALIANTS.

  • The general problem of writing a Scrabble playing program came up as a software project option in my final year at university (my group chose a different project because we thought Scrabble sounded too hard, we wrote a software simulation of the MC68000 CPU instead!).

    The trick turned out to be the data structure that holds the dictionary.  You can preprocess the dictionary into a Directed Acyclic Word Graph, start with a tree where each node holds a letter and each "edge" leads to another node that holds the next letter in each valid word that shares the prefix, then collapse all of the common suffixes so that they're shared wherever possible.

    Once you've got the DAWG, any way you navigate through the directed graph will produce a valid word, so you can generate all the words for a given rack by following the letter combinations, chasing down multiple paths if you have a blank to play with, adding in letters already on the board at certain positions in the word, etc, etc.

    The problem of making software play the game then comes down to what strategies can be applied to maximise the score (do you chase triple word scores? search for high value letters? look for word reuse/extension opportunities?).

    Neat problem, neat solution (might not be the best/most efficient way of finding anagrams though)!

  • This can be done more efficiently (in terms of computer time, not programmer's time) by building a histogram out of each word. Make an array of 26 buckets; entry 0 contains the number of A's, entry 1 the number of B's, etcetera. Then your comparisons are in O(1) instead of O(length_of_word), simply by comparing the histograms bucket-by-bucket. (Of course, this does not scale to languages with a larger alphabet. But I suppose that Chinese Scrabble doesn't exist anyway.)

    As a bonus, an arbitrary number of blanks can easily be implemented. While comparing the histogram of the rack with the histogram of a dictionary word, pretend the blanks are not there. If there are more of a certain letter on the rack than there are in the dictionary word, you know that you can't form that word with your whole rack. If there are more of the letter in the dictionary word than on the rack, use as many blanks as you need. If you run out of blanks, you also know that the word can't be formed.

    I estimate that this method will take less than 0.1 second to search a 2 MB dictionary.

    Do I pass the Microsoft interview now? ;)

  • @Thomas

    Just a note - although you're right that comparing the histograms is O(1), so is comparing the words really, since the word length for scrabble will never be longer than 7 or 8 letters in practice (occasionally you can continue to build on words and get longer, but he's not handling that case anyhow) so I'm not sure that comparing 26 buckets will really be faster than comparing the canonicalized strings.

    The real issue is that you're O(n) in the size of the dictionary.

    My first thought was to use tries (much like Alan describes) although it would be simpler to use a a hash table with the canonicalized strings mapped to the original. You couldn't just use a dictionary (since the same canonical form could be generated from several words), but it would be fairly simple to build. Building this would probably be bounded by the disk io of reading the word list, and finding the words would only require 26 lookups for one "?" and 676 for 2. It's though to imagine beating that by much.

  • T9 mode in cell phones works quite similarly. Although, there are no wild card searches as yet. You group words by length in a table. Then, cycle through them when the user presses next word key.

  • Thanks for this interesting little article which nicely showcases how useful Linq is just in general everyday programming. Plus, you introducde me to The Big Snit. Yea!

    I was just thinking about the number of programming examples I've seen where the author states "This is not production quality code". Hitler once said, if you repeat a lie often enough, people will believe it. My point is, we see lots of these 'quick and dirty' examples and after a while, it is easy to start believing that 'Production Quality Code' is a mythical thing. I know from experience that a lot of programmers do actually write very poor code and then put it into production. I don't really know where I'm going with this, except to say that, with the best of intentions, it is all too easy to set a bad example.

  • File.ReadAllLines(dictionary) is a one-liner replacement for FileLines().

    It's a pity that you can only use the join operator on equality comparisons, else we could have done this:

    from word in File.ReadAllLines(dictionary)
    let canonical = Canonicalize(word)
    join rack in racks on rack.Contains(canonical)
    orderby word
    select word;

    [[ ERIC: ReadAllLines is not a one-liner replacement. ReadAllLines reads the entire 2 meg dictionary into memory at once. My version reads one line at a time into memory. The next version of the framework will probably have a version of ReadAllLines that does it my way. ]] 

     

  • Actually, joining rack in racks on canonical equals rack would work just fine!

  • Is it cheating to pre-compute the canonical forms?  I mean, clearly the bottleneck is having to iterate through the whole dictionary, not computing the 676 combinations of blanks in the rack, so you just need an index.  Stuff the original word and canonical form into an MSDE database and you could do all those lookups, well, pretty much instantly.

    I know, it's not clever, but why fuss over DAWGs and histograms and data structures when somebody has already done most of the work for you?  Memory and disk space are cheap.  The clever part, as Eric already pointed out, is turning the word from a sequence into a set; how you do the transformation seems like micro-optimization to me.

    Would I fail the Microsoft interview for taking that approach? :-)

  • Of course it is not "cheating" because I haven't specified what the rules are. :-)

    This is not the first time I've implemented this program; my earlier JScript version worked by writing two programs, one which transformed the dictionary into an index file, and one which did lookups in the index file. That version was quite fast.

    That works great for the problem that I stated -- find all the words that exactly match a given rack. It might work much less well for harder problems, like "find the highest scoring word that can be made from any of the letters in this rack", or even harder still, like "find the highest scoring play for a given rack and board state".  For those problems, a simple index might not be good enough. I don't know; I haven't tried.

  • @Brad: You're right, a hashtable is a better solution (if the number of blanks is small). Neat.

  • Nice post, I've written an extension method that makes Enumerable character ranges easier to work with.

    So you can write:

    foreach (char c in 'A'.To('Z'))   
       Console.WriteLine(c);
    instead of:

    foreach (char c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")   
       Console.WriteLine(c);

    fe.

    [[ ERIC: A nice example of fluent programming -- however I find it a little bizarre that on the one hand you feel that its justified to use a crazy nano-optimization like using xor to swap two chars (which is on modern architectures no faster than just using a temp like a sensible person), and yet at the same time using iterators and lambdas and closures which have IMMENSE runtime costs compared to the nanosecond it takes to swap two chars.  I would write the code to be as clear as possible first, and then optimize it based on profile data later. ]]

  • Hi Eric,

    I've just played with your code and noted that the extension method Join won't compile. It must be declared public instead of private.

    Great exercise!

    All the best,

    Leniel Macaferi

    Or make the class that holds all the methods a static class, which is what I did. -- Eric

  • @Eric: you are quite right - the swap is better using a temp var, that will teach me for reusing over optimized array rotation code!

    private static void Swap(ref char a, ref char b)  
    {   
    temp = a;   
    a = b;  
    b = a;  
    }

    the previous version of the swap was more more effluent programming than fluent ;-)

  • Figment Engine:

    Um.. Check your code.

  • Interesante la lectura  acerca de la "codificación" en especial los dibujitos.

Page 1 of 2 (24 items) 12