The Stygian Depths Of Hacked-Together Scripts


Last time on FABULOUS ADVENTURES:

Eric Lippert: The second term is asymptotically insignificant compared to the first, so we'll throw it away, and declare that any sort on a list of n items has at least a worst case of O(n log n) comparisons. 

Leo McGarry: We must inform the President!

And now, the thrilling conclusion.

To extract and analyze the Google queries from my logs, I used a bunch of one-off scripts, egregious hacks, and programs written in a whole bunch of different languages. The one-off script has little in common with the ship-it-to-customers code I write every day.  One-off scripts do not need to be readable, testable, performant, robust, localizable, secure, extensible or even bug-free.  The one-off script does a job once, and then I might delete it or toss it in my "scratch" directory or whatever.  As long as it gets the job done, everything is fine.

First, let's pull down every referrer page and use a regular expression to parse out the query string.

' Scrape each page of the referrer log and echo the Google query strings.
Set re = New RegExp
re.Pattern = "a href=""http:\/\/[^>]*google[^>]*q=([^&""<]*)[&""<]"
re.Global = True
Set HTTP = CreateObject("Microsoft.XMLHTTP")
URL = "http://blogs.msdn.com/ericlippert/admin/Referrers.aspx?pg="
For page = 1 To 2193
    HTTP.Open "GET", URL & page, False
    HTTP.Send
    Set matches = re.Execute(HTTP.ResponseText)
    If Not matches Is Nothing Then
        For Each match In matches
            ' The submatch is the query portion of the URL.
            For Each submatch In match.SubMatches
                WScript.Echo submatch
            Next
        Next
    End If
Next

I used VBScript because I prefer VBScript's syntax for parsing out submatches to JScript's.  See the stuff in the parens in the expression?  That's the submatch that I want to extract -- the actual query portion of the URL.

Notice that I have the number of pages hard-coded.  I knew how many pages there were, and therefore why would I bother writing logic to deduce that?  This is a one-off.  Don't like it?  Fix it yourself, buddy.

I ran that program -- which took quite a while, because downloading 2193 web pages isn't cheap -- and dumped the text to a file.  I now had a file that looked like this:

dllmain
determine+character+encoding
how+to+use+loadpicture+in+asp
what+is+garbage+collector+exactly+works+in+.net
jscript+innertext
wscript+object+in+asp.net
vbscript+round+up
wshcontroller+activex
closures+jscript.net
c++++smart+pointers

etc.  I was finding all those plus-for-space characters hard to read, so I loaded that thing into my editor and did a little vi magic: first replace all instances of c++ with CPLUSPLUS, then turn every + into a space, and then turn the CPLUSPLUSes back to c++.  That gave me

dllmain
determine character encoding
how to use loadpicture in asp
what is garbage collector exactly works in .net
jscript innertext
wscript object in asp.net
vbscript round up
wshcontroller activex
closures jscript.net
c++  smart pointers
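
Incidentally, the same cleanup could be done in code rather than in the editor.  Here's a minimal C# sketch of the placeholder trick described above (the file names are made up for illustration, not what I actually had on disk):

// Protect "c++", turn the plus-for-space escapes into spaces, then restore "c++".
string text = File.ReadAllText(@"c:\queries.txt");   // hypothetical input file
text = text.Replace("c++", "CPLUSPLUS");             // protect the ++ in c++
text = text.Replace("+", " ");                       // + was standing in for space
text = text.Replace("CPLUSPLUS", "c++");             // put c++ back
File.WriteAllText(@"c:\queries-clean.txt", text);    // hypothetical output file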

Then I used my editor to search for the entries that started with "how", "what", and so on, because I figured that the questions would be the funny ones.

I was also interested in things like "what are the top ten most popular words in these queries?", just out of curiosity.  Mike wrote a blog entry about that a while back, where he deduced from first principles how to solve the "what's the most popular word in this text?" problem. 

Let's break that problem down into three subproblems:

(1)     Obtain a list of words.
(2)     Figure out the frequency of each word.
(3)     Output the words, sorted by frequency.

Suppose that there are n words in the list obtained from (1).  Recall that yesterday we came up with an O(n²) solution to part (2) -- for each word in the list, compare it to every word that comes after it in the list.  If we added a counter, we could determine how many collisions each word had.  To solve part (3), we could create a new list of (word, count) pairs, and then sort that list by frequency, somehow.
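
To make that concrete, here's a sketch of that quadratic approach (just an illustration of the idea, not code from Mike's entry or mine):

// A sketch of the naive quadratic count: compare each word to every word after it.
string[] words = { "the", "cat", "in", "the", "hat" };
for (int i = 0; i < words.Length; ++i)
{
    bool seenEarlier = false;
    int count = 1;
    for (int j = 0; j < words.Length; ++j)
    {
        if (j < i && words[j] == words[i])
            seenEarlier = true;   // already counted at its first occurrence
        if (j > i && words[j] == words[i])
            ++count;              // a "collision" later in the list
    }
    if (!seenEarlier)
        Console.WriteLine(words[i] + ": " + count);
}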

Mike came up with a much better solution than that for part (2).  His solution was to first sort the word list.  Suppose you have the word list <the, cat, in, the, hat>.  Sorting it takes O(n log n) steps, as we learned yesterday, giving <cat, hat, in, the, the>, and suddenly finding the frequency of each word is easy.  You no longer need to compare each word to ALL the words that come after it, just the first few.  The sort is expensive, but it makes finding and counting the repeated words very cheap. 

This is also a good solution because it uses off-the-shelf parts; it uses a built-in sort routine to do the heavy lifting.
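
Here's a sketch of the sort-then-count idea, using the built-in Array.Sort as the off-the-shelf part (again, an illustration rather than Mike's actual code):

// Sort first (O(n log n)), then count runs of equal adjacent words in one O(n) pass.
string[] words = { "the", "cat", "in", "the", "hat" };
Array.Sort(words);                       // <cat, hat, in, the, the>
int i = 0;
while (i < words.Length)
{
    int runStart = i;
    while (i < words.Length && words[i] == words[runStart])
        ++i;                             // skip to the end of this run of equal words
    Console.WriteLine(words[runStart] + ": " + (i - runStart));
}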

How then do you solve part (3)?  Mike again had the clever idea of re-using existing parts.  Dump the word and frequency data into a DataTable object, and use the DataTable's sorting methods to sort by frequency.  Another solution that he came up with later was to implement IComparable on an object that stores the (word, frequency) pairs and sorts on frequency.
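
The IComparable flavor might look something like this sketch (hypothetical type name; the DataTable version is not shown):

// A (word, frequency) pair that sorts by frequency, so the built-in sort can do the work.
class WordCount : IComparable
{
    public string Word;
    public int Count;
    public WordCount(string word, int count) { Word = word; Count = count; }
    public int CompareTo(object obj)
    {
        return Count.CompareTo(((WordCount)obj).Count);   // order by frequency
    }
}

Fill an array with these pairs and call Array.Sort, and the existing comparison sort orders them by frequency.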

Can we do better than O(n log n)?  Let's try.  And, as an added bonus, I'm going to use the new generic collection feature of C#, coming in Whidbey.  If you haven't seen this feature before, it's pretty cool.  To briefly summarize, you can now declare types like "a dictionary which maps integers onto lists of strings".  The syntax is straightforward; you'll pick it up from context pretty easily, I'm sure.
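
For instance, that "dictionary which maps integers onto lists of strings" is declared like so (the variable name is just for illustration):

// A generic dictionary mapping an integer key onto a list of strings.
Dictionary<int, List<string>> wordsByFrequency = new Dictionary<int, List<string>>();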

Step one, read in the log and tokenize it using the convenient split method.

StreamReader streamReader = new StreamReader(@"c:\google.log");
string source = streamReader.ReadToEnd();
char[] punctuation = new char[] {' ', '\n', '\r', '\t', '(', ')', '"'};
string[] tokens = source.ToLower().Split(punctuation, true);

The true means "don't include any empty items" -- if we were splitting a string with two spaces in a row, for example, there would be an "empty" string between the two delimiters.  (In the next version of the framework, this method will take an enumerated type to choose options rather than this rather opaque Boolean.)
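
If you're curious, with the enumerated type that call would look something like this (assuming the option ends up being named StringSplitOptions.RemoveEmptyEntries):

// The same tokenization, with the enumerated option instead of the bare Boolean.
string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);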

We've solved subproblem (1).  On to (2):

Dictionary<string, int> firstPass = new Dictionary<string, int>();
foreach(string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}

On my first pass through the list of tokens, I look in the firstPass table to determine whether I've seen this token before.  If I have, then I increase its frequency count by one; otherwise I add it to the table with a frequency count of one.  This loop is O(n) because lookups and insertions on dictionary objects are of constant cost no matter how many items are already in the dictionary.  We've solved subproblem (2). 

On to subproblem (3).  Now I'm going to take my mapping from words to frequency and reverse it -- turn it into a mapping from frequency to a list of words that have that frequency.  I'm also going to keep track of the highest frequency item I've seen so far, because that will be convenient later.

As you can see, this uses the same trick as before.  If the frequency is already in the table then add the current word to the word list, otherwise create a new word list.

int max = 1;
Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();
foreach(string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

Clearly the number of unique words in the firstPass table cannot be larger than n, so this loop is also O(n).  Inserting items into tables and onto the end of lists is of constant cost, so there's no additional asymptotic load there.

OK, great, but how does this help to solve subproblem (3)?  We need one more loop.

for (int i = max ; i >= 1 ; --i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach(string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

Again, this straightforward loop cannot possibly print out more than n items and all the list and table operations take constant time, so this is O(n). The algorithm spits out the list

3362
vbscript
2253
jscript
990
object
910
in
879
array
[...]
1
[...]
forward
xbox
honest

Three loops, none is ever worse than O(n). Therefore, the whole algorithm is O(n). As you can see, I've sorted the Google query keywords by frequency using a worst-case O(n) algorithm, thereby providing a counterexample to my proof from last time that such a sort algorithm is impossible.  Neat trick, proving a thing and then finding a counterexample!  How'd I manage to do that?

I've pulled a bait-and-switch.  What I proved yesterday was that any sort on a list where the only way to deduce the ordering requires data comparison is an O(n log n) problem.  But nowhere in this algorithm do I make so much as a single data comparison!  The internals of the table lookups do data comparisons to determine equality, but not ordering, and the comparisons to determine max and iterate the last loop are not comparisons between two pieces of data, so they don't count. 

This sort algorithm doesn't follow our generalized sort schema at all, so clearly the proof doesn't apply.

I can get away with not making any comparisons because I know additional facts about the data -- facts which I did not assume in my proof.  In this case, the relevant fact is that the thing which I am sorting -- the list of frequencies -- is a list of n integers all of which are between 1 and n.  If you can assume that then as we've just seen, you don't have to use comparisons at all.  This algorithm is called CountingSort (for fairly obvious reasons). Another algorithm that has similar properties is RadixSort, to which a commenter alluded yesterday. 
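
To make the trick concrete, here's a minimal counting-sort sketch, assuming a list of n integers each between 1 and n (an illustration, not the code I actually ran):

// Counting sort: tally how many times each value occurs, then write the values back out in order.
int[] items = { 3, 1, 4, 1, 5, 2, 1, 3 };   // every value assumed to be between 1 and items.Length
int[] counts = new int[items.Length + 1];
foreach (int item in items)
    ++counts[item];                          // no comparisons between items, just a tally
int index = 0;
for (int value = 1; value <= items.Length; ++value)
    for (int k = 0; k < counts[value]; ++k)
        items[index++] = value;              // emit each value as many times as it was seen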

If you're sorting lists of arbitrary strings or arbitrary floating point numbers where you cannot make any assumptions about the data then you'll end up using some version of the generalized comparison sort from yesterday.

The moral of the story (aside from "watch out for bait-and-switch schemes") is that hash tables, particularly generic hash tables, are incredibly useful off-the-shelf parts for a huge number of applications.  We use them all over the place in our code because they're easy to use and extremely efficient even when you throw lots of data at them.

But wait, something should be nagging at you.  There is something missing from this analysis.  Saying that this is an O(n) algorithm perhaps isn't quite the whole story.  What did I leave out?

  • Well, the biggest thing left out is the amount of memory/temporaries used.

    :)

    I believe there are algorithms that can sort faster than quicksort, but quicksort uses (IIRC) O(ln(N)) memory whereas they tend to use O(N)

    In a production program, an algorithm using O(n) memory can kill the whole thing's performance despite being fast in isolation.
  • Good point. For a list which mostly consists of unique words, this solution makes two entire in-memory copies of most of the data in the list. If the data are large and not copied by reference, that could be quite expensive.

    Quicksort on the other hand only consumes memory in the form of stack frames, and each stack frame is of a constant size, and in the typical case there will be O(log n) stack frames max in use at any time. (The standard quicksort algorithm suffers from an n-squared worst case in rare cases, but getting into those details would take us far afield.)

    There are other aspects that I've neglected in this analysis. (Hint: think about what I said above; what if the data are large?)
  • You're also cheating and assuming that hash table operations are O(1). In reality, hash table complexity is... well... complex. Here's a web page that contains various hash "attacks":

    http://www.cs.rice.edu/~scrosby/hash/
  • Jeez, very cool.

    Know something weird? AFAICT, the boolean parameter on Split() isn't documented:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemstringclasssplittopic.asp

    'less I'm looking at the wrong Split().
    The code below degrades to O(N^2) -- degrades quickly even with a good hash algorithm if your hash table is small (that is, size N or less). I seem to remember that you need a hash table size of 2N (or was it N^2?) for the hash table to be able to assure performance... but I don't remember for sure.

    In any case the problem is that secondPass.ContainsKey(i) walks a linked list containing up to N items with that hash value.

    for (int i = max ; i >= 1 ; --i)
    {
        if (secondPass.ContainsKey(i))
        {
            Console.WriteLine(i);
            foreach(string s in secondPass[i])
            {
                Console.WriteLine(s);
            }
        }
    }

    Now that I think about it, all the loops degrade this way, this one is just clearest because it has the nested loops.

    In actuality, this code is going to be very slow:
    foreach(string key in firstPass.Keys)
    {
        int count = firstPass[key];
        if (count > max)
            max = count;
        if (!secondPass.ContainsKey(count))
            secondPass[count] = new List<string>();
        secondPass[count].Add(key);
    }
    Because most of your words are going to appear once, the hash bucket for 1 is going to be huge and slow. This section will degrade to O(Z^2), where Z is the number of words that appear once.



  • > The code below degrades to O(N^2)

    You're assuming a very naive hash table algorithm. Sure, if you use a really terrible hash table, things go bad real fast.

    As Raymond pointed out, it's complicated. But for many situations, you can use extensible hashing and get pretty close to O(1) searches and inserts.
  • > Because most of your words are going to appear once, the hash bucket for 1 is going to be huge and slow

    I think you're misunderstanding the code. The hash bucket for one only has one thing in it, and that's a linked list. Adding to the end of a linked list is O(1).
  • Also, I think you're misunderstanding the order of the nested loop. Yes, some of those lists could be a large fraction of n, but the TOTAL length of EVERY list added together is n or less.

    Is that clear? I mean, suppose you've got "the cat in the hat", and you have the hash table

    1 cat, in, hat
    2 the

    the fact that the first list is long means that the second will be short. The total order of the nested loop can't be more than O(n)
  • You've not told the dictionary how many items to expect, so it's going to start with whatever the default minimum size is. When you try to add more than this, it has to create a new underlying array and copy all the items to it. This will be repeated if the new array still isn't enough to hold all the items.

    If Dictionary is implemented like Hashtable, it's based on skipping forward to another bucket if the bucket selected by the hash is already in use. It will also rehash the current table once the number of hash collisions reaches a significant amount (load factor multiplied by the number of buckets in the table). I did a bit of digging with Reflector - I don't have VS2005 CTP here, so I can't confirm what Dictionary<> does.

    When the underlying bucket array is expanded, the code searches for the smallest prime which is greater than the size you asked for. It has a small table of 70 primes (I assume starting around 100ish with a decent step between them); if you ask for a capacity bigger than that, it starts searching prime numbers by dividing your candidate by every odd number up to Sqrt(candidate). When it expands automatically, the minimum size is twice the current size plus 1.

    Yes, best case is O(1), but if you get hash collisions or the table needs to rehash, it suddenly gets a lot slower.

    I'm trying to think of the rough capacities required. The two obvious cases I can think of are where all the words are different - the first Dictionary needs a capacity of N, but the second only 1 - and where the words are all the same - both dictionaries only need capacity of 1. I'm not too good at this O() stuff!
  • Of course the real question is why were those "c++" terms written as "c++" and not "c%2B%2B"? Something fishy is going on there!
  • I missed the part where you do "ContainsKey" in less than O(log n). Can you explain?
  • Um, er, I’m going to “say something stupid™”.

    Why care about performance for a one-off task? It’s a one-off task, you have a specific and not very large n, and you don’t actually care if it takes O(n), O(n log n), or O(n^2).

    And, while theoretically O(n) is better than O(n log n), in practice O(n) might be SomethingBig * n and O(n log n) might be n log n and n might be 3, and for n = 3 bubble sort suddenly becomes the optimal algorithm :)
  • I don't accept the O(1) hashtable operations claim either. Some (admittedly not very well designed) quick tests on my machine doing something very similar with integer keys and adding or incrementing a counter value every time a key is hit suggest something else:

    size=32768, time/op=5.66E-06
    size=131072, time/op=6.57E-06
    size=524288, time/op=9.53E-06
    (any more and I start swapping :)

    I concede that hashtables can be setup to do lookups in O(1) time, but in the case of the .NET 1.1 hashtable at least, they probably aren't (cursory inspection using reflector would appear to confirm this, but I may be missing something).
  • "I think you're misunderstanding the code. The hash bucket for one only has one thing in it, and that's a linked list. Adding to the end of a linked list is O(1)."

    No I do understand it... I'm not talking about the insert part I'm talking about the checking part... you hash the index and then run down the *whole list* to check if an item already exists. If most of the items are only seen once then this degrades the hash from O(1) to O(N)

    There are other problems with this algorithm but this is the most obvious one.

    As an aside this algorithm (with some modification) looks like it might work quite well with multi-processor systems.