The Stygian Depths Of Hacked-Together Scripts


Last time on FABULOUS ADVENTURES:

Eric Lippert: The second term is asymptotically insignificant compared to the first, so we'll throw it away and declare that any sort on a list of n items has a worst case of at least O(n log n) comparisons. 

Leo McGarry: We must inform the President!

And now, the thrilling conclusion.

To extract and analyze the Google queries from my logs, I used a bunch of one-off scripts, egregious hacks, and programs written in a whole bunch of different languages. The one-off script has little in common with the ship-it-to-customers code I write every day.  One-off scripts do not need to be readable, testable, performant, robust, localizable, secure, extensible or even bug-free.  The one-off script does a job once, and then I might delete it or toss it in my "scratch" directory or whatever.  As long as it gets the job done, everything is fine.

First, let's pull down every referrer page and use a regular expression to parse out the query string.

Set re = New RegExp
re.Pattern = "a href=""http:\/\/[^>]*google[^>]*q=([^&""<]*)[&""<]"
re.Global = True
Set HTTP = CreateObject("Microsoft.XMLHTTP")
URL = "http://blogs.msdn.com/ericlippert/admin/Referrers.aspx?pg="
For page = 1 to 2193
    HTTP.Open "GET", URL & page, False
    HTTP.Send
    Set matches = Re.Execute(HTTP.ResponseText)
    If Not Matches Is Nothing Then
        For Each Match in Matches
            For Each SubMatch In Match.SubMatches
                WScript.Echo SubMatch
            Next
        Next
    End If
Next  

I used VBScript because I prefer VBScript's syntax for parsing out submatches to JScript's.  See the stuff in the parens in the expression?  That's the submatch that I want to extract -- the actual query portion of the URL.

Notice that I have the number of pages hard-coded.  I knew how many pages there were, so why would I bother writing logic to deduce that?  This is a one-off.  Don't like it?  Fix it yourself, buddy.

I ran that program -- which took quite a while, because downloading 2193 web pages isn't cheap -- and dumped the text to a file.  I now had a file that looked like this:

dllmain
determine+character+encoding
how+to+use+loadpicture+in+asp
what+is+garbage+collector+exactly+works+in+.net
jscript+innertext
wscript+object+in+asp.net
vbscript+round+up
wshcontroller+activex
closures+jscript.net
c++++smart+pointers

etc.  I was finding all those plus-for-space characters hard to read, so I loaded that thing into my editor and did a little vi magic: first I replaced all instances of c++ with CPLUSPLUS, then turned every + into a space, and then turned the CPLUSPLUSes back to c++.  That gave me:

dllmain
determine character encoding
how to use loadpicture in asp
what is garbage collector exactly works in .net
jscript innertext
wscript object in asp.net
vbscript round up
wshcontroller activex
closures jscript.net
c++  smart pointers
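
The same cleanup could also be done programmatically.  Here is a minimal C# sketch of the equivalent substitutions (illustrative only -- the actual cleanup really was done with vi substitutions):

// Illustrative only: the same three substitutions described above.
static string CleanQuery(string line)
{
    return line
        .Replace("c++", "CPLUSPLUS")   // protect the literal plus signs in "c++"
        .Replace("+", " ")             // turn the plus-for-space characters into spaces
        .Replace("CPLUSPLUS", "c++");  // and put "c++" back
}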

Then I used my editor to search for the entries that started with "how", "what", and so on, because I figured that the questions would be the funny ones.

I was also interested in things like "what are the top ten most popular words in these queries?", just out of curiosity.  Mike wrote a blog entry about that a while back, where he deduced from first principles how to solve the "what's the most popular word in this text?" problem. 

Let's break that problem down into three subproblems:

(1)     Obtain a list of words.
(2)     Figure out the frequency of each word.
(3)     Output the words, sorted by frequency.

Suppose that there are n words in the list obtained from (1).  Recall that yesterday we came up with an O(n^2) solution to part (2) -- for each word in the list, compare it to every word that comes after it in the list.  If we added a counter, we could determine how many matches each word had.  To solve part (3), we could create a new list of (word, count) pairs, and then sort that list by frequency, somehow.
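
For concreteness, a minimal sketch of that quadratic counting step might look like this (an illustration, not code from yesterday's post):

static void NaiveFrequencies(string[] words)
{
    // O(n^2): for each word not yet counted, scan every word after it for duplicates.
    bool[] counted = new bool[words.Length];
    for (int i = 0; i < words.Length; ++i)
    {
        if (counted[i])
            continue;
        int count = 1;
        for (int j = i + 1; j < words.Length; ++j)
        {
            if (words[j] == words[i])
            {
                ++count;
                counted[j] = true;   // don't report this duplicate again later
            }
        }
        Console.WriteLine("{0} {1}", words[i], count);
    }
}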

Mike came up with a much better solution than that for part (2).  Mike's solution was to first sort the word list.  Suppose you have the word list <the, cat, in, the, hat>.  Sorting it takes O(n log n) steps, as we learned yesterday, giving <cat, hat, in, the, the>, and suddenly finding the frequency of each word is easy.  You no longer need to compare each word to ALL the words that come after it, just the first few.  The sort is expensive, but it makes finding and counting the repeated words very cheap. 

This is also a good solution because it uses off-the-shelf parts; it uses a built-in sort routine to do the heavy lifting.
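
A sketch of that sort-first approach (a reconstruction for illustration, not Mike's actual code):

static void SortedFrequencies(string[] words)
{
    // The off-the-shelf sort does the heavy lifting: O(n log n).
    Array.Sort(words);
    // One O(n) scan then counts each run of equal adjacent words.
    int i = 0;
    while (i < words.Length)
    {
        int j = i;
        while (j < words.Length && words[j] == words[i])
            ++j;
        Console.WriteLine("{0} {1}", words[i], j - i);
        i = j;
    }
}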

How then do you solve part (3)?  Mike again had the clever idea of re-using existing parts.  Dump the word and frequency data into a DataTable object, and use the DataTable's sorting methods to sort by frequency.  Another solution that he came up with later was to implement IComparable on an object that stores the (word, frequency) pairs and sorts on frequency.
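
As a rough sketch of that second idea (again a reconstruction, not the original code), the pair type might look like this:

// A (word, frequency) pair that sorts on frequency, highest first, so that
// Array.Sort or List<T>.Sort puts the most common words at the front.
class WordCount : IComparable
{
    public string Word;
    public int Count;

    public WordCount(string word, int count)
    {
        Word = word;
        Count = count;
    }

    public int CompareTo(object obj)
    {
        WordCount other = (WordCount)obj;
        return other.Count.CompareTo(Count);   // descending by frequency
    }
}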

Can we do better than O(n log n)?  Let's try.  And, as an added bonus, I'm going to use the new generic collection feature of C#, coming in Whidbey.  If you haven't seen this feature before, it's pretty cool.  To briefly summarize, you can now declare types like "a dictionary which maps integers onto lists of strings".  The syntax is straightforward; you'll pick it up from context pretty easily, I'm sure.

Step one: read in the log and tokenize it using the convenient Split method.

StreamReader streamReader = new StreamReader(@"c:\google.log");
string source = streamReader.ReadToEnd();
char[] punctuation = new char[] {' ', '\n', '\r', '\t', '(', ')', '"'};
string[] tokens = source.ToLower().Split(punctuation, true);

The true argument means "don't include any empty items" -- if we were splitting a string with two spaces in a row, for example, there would be an "empty" string between the two delimiters.  (In the next version of the framework, this method will take an enumerated type to choose options rather than this somewhat opaque Boolean.)
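
For reference, the enumerated type that eventually shipped is StringSplitOptions, so against the released framework the equivalent call looks like this:

// Equivalent call using the enumerated option in the shipped framework:
string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);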

We've solved subproblem (1).  On to (2):

Dictionary<string, int> firstPass = new Dictionary<string, int>();
foreach(string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}

On my first pass through the list of tokens, I look in the firstPass table to determine whether I've seen this token before.  If I have, then I increase its frequency count by one; otherwise I add it to the table with a frequency count of one.  This loop is O(n) because lookups and insertions on dictionary objects are of constant cost no matter how many items are already in the dictionary.  We've solved subproblem (2). 

On to subproblem (3).  Now I'm going to take my mapping from words to frequency and reverse it -- turn it into a mapping from frequency to a list of words that have that frequency.  I'm also going to keep track of the highest frequency item I've seen so far, because that will be convenient later.

As you can see, this uses the same trick as before.  If the frequency is already in the table then add the current word to the word list, otherwise create a new word list.

int max = 1;
Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();
foreach(string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

Clearly the number of unique words in the firstPass table cannot be larger than n, so this loop is also O(n).  Inserting items into tables and onto the end of lists is of constant cost, so there's no additional asymptotic load there.

OK, great, but how does this help to solve subproblem (3)?  We need one more loop.

for (int i = max ; i >= 1 ; --i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach(string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

Again, this straightforward loop cannot possibly print out more than n items and all the list and table operations take constant time, so this is O(n). The algorithm spits out the list

3362
vbscript
2253
jscript
990
object
910
in
879
array
[...]
1
[...]
forward
xbox
honest

Three loops, none of which is ever worse than O(n).  Therefore, the whole algorithm is O(n).  As you can see, I've sorted the Google query keywords by frequency using a worst-case O(n) algorithm, thereby providing a counterexample to my proof from last time that such a sort algorithm is impossible.  Neat trick, proving a thing and then finding a counterexample!  How'd I manage to do that?

I've pulled a bait-and-switch.  What I proved yesterday was that any sort on a list where the only way to deduce the ordering is by comparing data is an O(n log n) problem.  But nowhere in this algorithm do I make so much as a single data comparison!  The internals of the table lookups do data comparisons to determine equality, but not ordering, and the comparisons that determine max and drive the last loop are not comparisons between two pieces of data, so they don't count. 

This sort algorithm doesn't follow our generalized sort schema at all, so clearly the proof doesn't apply.

I can get away with not making any comparisons because I know additional facts about the data -- facts which I did not assume in my proof.  In this case, the relevant fact is that the thing which I am sorting -- the list of frequencies -- is a list of n integers, all of which are between 1 and n.  If you can assume that, then, as we've just seen, you don't have to use comparisons at all.  This algorithm is called CountingSort (for fairly obvious reasons).  Another algorithm with similar properties is RadixSort, to which a commenter alluded yesterday. 
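
As a standalone illustration of the idea (not part of the query-counting program above), a counting sort over integers known to lie between 1 and some maximum looks like this:

// CountingSort: sorts integers known to lie in [1, max] in O(n + max) time,
// making no comparisons between the data elements themselves.
static int[] CountingSort(int[] items, int max)
{
    int[] counts = new int[max + 1];
    foreach (int item in items)
        ++counts[item];                  // tally how many times each value occurs
    int[] sorted = new int[items.Length];
    int index = 0;
    for (int value = 1; value <= max; ++value)
        for (int k = 0; k < counts[value]; ++k)
            sorted[index++] = value;     // emit each value as often as it occurred
    return sorted;
}

In the scenario above, max is at most n, so the whole thing is O(n).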

If you're sorting lists of arbitrary strings or arbitrary floating point numbers where you cannot make any assumptions about the data then you'll end up using some version of the generalized comparison sort from yesterday.

The moral of the story (aside from "watch out for bait-and-switch schemes") is that hash tables, particularly generic hash tables, are incredibly useful off-the-shelf parts for a huge number of applications.  We use them all over the place in our code because they're easy to use and extremely efficient even when you throw lots of data at them.

But wait, something should be nagging at you.  There is something missing from this analysis.  Saying that this is an O(n) algorithm perhaps isn't quite the whole story.  What did I leave out?

  • "Also, I think you're misunderstanding the order of the nested loop. Yes, some of those lists could be a large fraction of n, but the TOTAL length of EVERY list added together is n or less.

    Is that clear? I mean, suppose you've got "the cat in the hat", and you have the hash table

    1 cat, in, hat
    2 the

    the fact that the first list is long means that the second will be short. The total order of the nested loop can't be more than O(n) "

    No, this is not true, as I said above.

    Because you walk the linked list for each item inserted, this becomes N^2.

    To follow your example:
    The cat in the hat is fun.

    When you insert "is" it will take 4 steps to see if "is" exists. When you insert "fun" it will take 5 steps to see if "fun" exists, and so on... thus up to N steps N times, or N^2.
  • > you hash the index and then run down the *whole list* to check if an item already exists

    No, I don't. It's a hash table FROM numbers TO lists of strings. Here's where I hash the number:

    if (!secondPass.ContainsKey(count))

    and here's where I insert the key into the list:

    secondPass[count].Add(key);

    But nowhere do I check to see if the key is already in the list. If I did, yes, that would be expensive. But I don't need to because we can prove logically that there will NEVER be a collision on this list.

    "Add" just adds the item to the end of a linked list, and that's cheap.
  • > When you insert "is" it will take 4 steps to see if "is" exists

    No, it won't. At no time do I search a list. I search a hash table, yes, but not a list.
  • > I missed the part where you do "ContainsKey" in less than O(log n). Can you explain

    Consider a hash table with chaining that rehashes whenever the average chain length exceeds, say, four. On average, such a hash table requires one hash and a maximum of four comparisons per lookup, no matter how big the table gets.

    As Mike points out above, rehashing the table might get expensive. But don't forget that the amortized cost of the rehashing is going to be spread over all the inserts, most of which will be very cheap. The cost of re-hashing is going to be O(n), but you only rehash about O(1/n) of the time, so it works out to O(1) per insert.
  • > Why care about performance for a one-off task?

    I don't. And clearly the time taken to fetch the 2193 web pages was way, way longer than the time taken to move the strings around in memory!

    However, two things:

    First, note that this is a VERY short algorithm that I put together very quickly. One of my aims in writing this blog is to give scripters new techniques that they might not have been exposed to before. Lots of people do not understand the power of tables to make small, fast programs.

    Second, this is a somewhat contrived example that I'm using as an educational opportunity. There have been a number of threads on JOS and elsewhere lately on the value or lack thereof of a mathematical approach to computing. I'm trying to illustrate the intersection between theoretical computer science and practical down-n-dirty programming.

    That's why I care what the order is.
  • > And, while theoretically O(n) is better than O(n log n), in practice O(n) might be SomethingBig * n

    Indeed, in practice you cannot disregard the constant factor. For example, when I was a foolish intern I suggested to Tim Paterson that we rewrite the InStr method to use a "better" string search algorithm. InStr uses an O(n m) algorithm where n and m are the lengths of the search string and the pattern. There exist O(n + m) algorithms, which are obviously better for large n and m.

    But the naive algorithm InStr uses is BLAZINGLY fast for common cases. Most of the work optimizes down to a single machine instruction. The algorithmically better methods do lots of expensive string preprocessing up front and end up being slower in typical cases.
  • "But nowhere do I check to see if the key is already in the list. If I did, yes, that would be expensive. But I don't need to because we can prove logically that there will NEVER be a collision on this list. "

    You do walk the list when you perform the secondPass.ContainsKey(count) -- that is how hash tables work... if there is a collision you walk a linked list of items with that hash. AND YES there will be collisions... all words that occur once will hash to the same location, because the key is the number of times a word occurs.

    For a good example of how this affects performance see Raymond Chen's link above.

    -----

    It is interesting that your last post was about typical performance for InStr; you have the same problem here... for the best input data this will run at O(N), for the typical case (e.g. lots of words that appear only once) it will run at O(N^2), and for almost all cases it will run at O(N log N) or worse.

  • I'm sorry to have to continue to be so contradictory, Hogan.

    Yes, checking for the collision might walk a hash chain. (Using "list" to mean "hash chain" is confusing, particularly when I have a hash table of lists.)

    But as I said above, if the hash table is written using an extensible-hashing-with-chaining algorithm then the average chain length is bounded to some small number.

    In the JScript engine we use this algorithm in the name tables and thereby guarantee that the average number of comparisons is less than 2.5 no matter how big the table gets. Thus, it's O(1) for collision detection, not O(n).

    Raymond's link is about a different topic -- how to put data into a hash table such that the data is chosen to deliberately produce bad performance. That's a separate topic that maybe I'll cover later -- writing a hash table to have good performance even when attacked is a hard problem. For my purposes, I'm assuming that the hash table has good performance on typical, non-hostile data.
  • Let me run through a small example to see if I can convince you that the many-unique-words case is O(n). Consider a list of three words, A, B and C.

    In the first pass I produce this hash table:

    A -> 1
    B -> 1
    C -> 1

    You agree that each insert into this hash table is O(1) and hence building the hash table is O(n), right?

    In the second pass, I build the "second pass" hash table like this. It starts empty. I extract (A, 1) from the first hash table and look up "1" in the second hash table. I don't find it, so I insert

    1 -> < >

    That's an O(1) operation.

    Then I say "fetch me the list" and get back < >. That's an O(1) operation.

    Then I insert A into the list. That's an O(1) operation, and now I have

    1 -> < A >

    OK, we've made it through the loop once. The second time through I extract (B, 1) from the first hash table. I look up "1" in the second hash table, which is an O(1) operation. But I don't INSERT anything into the second hash table. I insert something into the LIST which I have allocated. I fetch a reference to the list and get back < A > , and I insert B onto the end of that list to produce <A, B>. I don't change the hash table at all, I change the contents of the thing that the hash table refers to. So now we have

    1 -> <A, B>

    And we've made it through the second loop doing only O(1) operations.

    Third time, same thing. Extract (C, 1), look up 1, get the list, insert C onto the end of the list.

    Does that make sense?


  • Obviously Eric's example requires that your list contains a pointer to the _end_ of the list as well as the head. Most non-academic linked list code I've used does: it simply makes your linked list more flexible at the cost of one pointer. It's not quite a necessity for a doubly-linked list to have a pointer to the end, but it makes things more orthogonal (particularly if you use the pointer-to-pointer technique).

    I agree that if you only had a pointer to the head of the list, it would be O(n), because you have to find the end if you want to add to it. But if you knew that, you'd add the new item to the head of the list rather than the end: we're not guaranteeing which order the words with the same frequency appear in.
  • Correct -- a double-linked list (or a single-linked list with a pointer to the end) is required if you want to do O(1) inserts on the end.

    In fact, the List<> implementation is not a linked list at all. It's a double-when-full array list. The amortized cost of insertions at the end of a DWF is also O(1). The down side is that the cost of insertions at the beginning is O(n).
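
    A minimal sketch of the double-when-full idea (illustrative only, not the actual List<> implementation):

    // Double-when-full append: the occasional O(n) copy is paid for by the n cheap
    // appends that preceded it, so the amortized cost per Add is O(1).
    class GrowableList
    {
        private string[] items = new string[4];
        private int count = 0;

        public void Add(string item)
        {
            if (count == items.Length)
            {
                string[] bigger = new string[items.Length * 2];
                Array.Copy(items, bigger, items.Length);
                items = bigger;
            }
            items[count++] = item;
        }
    }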
  • I've always thought the naming of ArrayList (and obviously the new List<>) was weird. They're not lists, they're arrays. For a while I thought the name of ArrayList implied a deque<> (from STL), but digging around with Reflector cleared it up: the nearest STL equivalent is a vector<>.

    I still think in C++ ;-)

    I assume that cache locality is a distinct factor in this kind of thing. If your usage is read-mostly and you append new items rather than insert them, then an array typically has better locality of reference than a linked list. Unless you do what MFC's collections do and allocate whole blocks of nodes in one go, keeping freed nodes on a free list (and even then you get a lot of indirection through pointers).
