Breadth is sometimes better than depth


A while back the Scripting Guys blogged about using recursion to list all the files in a directory and all its subdirectories. Something that you'll notice about this very standard solution to this problem is that it is "depth first". That is, it lists the files like this:

c:\xyz.txt
c:\foo\bbb.txt
c:\foo\abc\def.txt
c:\foo\bar\baz.txt
c:\foo\bar\blah.txt
c:\qaz\zaq.txt
c:\qaz\lll\ggg.txt

It picks a branch and follows it down the tree as far as it can, listing the files in each directory as it goes, retreats back up as little as possible, picks another branch to go down, goes as far down it as possible, and so on, until the entire tree has been explored. You can see why this is called "depth first".
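
For reference, the recursive version looks something like this -- a sketch of the idea, not the Scripting Guys' exact script:

var FSO = new ActiveXObject("Scripting.FileSystemObject");

function listFolder(folder)
{
  // List the files in this folder...
  for (var files = new Enumerator(folder.Files) ; !files.atEnd(); files.moveNext())
    WScript.Echo(files.item().Path);
  // ...then dive into each subfolder, going as deep as possible before
  // backing up. The call stack remembers where we still have to go.
  for (var subs = new Enumerator(folder.SubFolders) ; !subs.atEnd(); subs.moveNext())
    listFolder(subs.item());
}

listFolder(FSO.GetFolder("."));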

You don't need to use recursion to do this; recursion is just an elegant way to solve the problem. The essence of the algorithm is that information about the next node to process is stored in a stack. The algorithm is simple: pop the stack, process the directory, push all the subdirectories onto the stack, repeat until the stack is empty. Recursion just lets you do this algorithm without making the stack explicit; the stack is the call stack and doing the recursive call pushes a new frame on the call stack for you. We can easily do this without recursion by making the stack explicit:

var FSO = new ActiveXObject("Scripting.FileSystemObject");
var stack = new Array();
stack.push(FSO.GetFolder("."));
while(stack.length > 0)
{
  // Last in, first out: we always take the most recently discovered folder,
  // which is what makes this depth first.
  var folder = stack.pop();
  for (var enumtor = new Enumerator(folder.SubFolders) ; !enumtor.atEnd(); enumtor.moveNext())
  {
    try
    {
      stack.push(enumtor.item());
    }
    catch(e){}
  }
  for (enumtor = new Enumerator(folder.Files) ; !enumtor.atEnd(); enumtor.moveNext())
  {
    try
    {
      WScript.Echo(enumtor.item().Path);
    }
    catch(e){}
  }
}
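
(To try this -- or the breadth-first version below -- save it to a .js file, walk.js say, and run it under the console host with "cscript //nologo walk.js"; under wscript.exe every WScript.Echo becomes its own message box.)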

Notice that I've wrapped the file manipulation code in try-catches. As I've mentioned before, in this script if there is a problem reading a file -- like, I don't own it and therefore I get a security violation -- I want to skip the problem and continue on. I do not want to write robust error handling for this trivial one-off script.

The other day I had a problem -- a few weeks earlier I had written a little test program and stuck it in a directory somewhere very near the top of my directory hierarchy, but I couldn't remember where exactly. I already had the depth-first program above and could easily modify it to add a regular expression search of each file looking for a pattern (sketched below, after the breadth-first version). But there are some very deep directories on that drive where I knew there would be no matches but which would take a lot of time to search. I could have written code to omit those directories, but there was another way to solve this problem: since I knew that the file was going to be somewhere shallow, don't do a depth-first search in the first place. Do a breadth-first search. That is, search all the files in the top level, then all the files that are one deep, then all the files that are two deep, then three deep, and so on.

The nice thing is that it's basically the same algorithm, just with a minor change. Instead of doing a last-in-first-out stack, we'll just do a first-in-first-out queue and magically get a breadth-first traversal:

var FSO = new ActiveXObject("Scripting.FileSystemObject");
var queue = new Array();
queue[queue.length] = FSO.GetFolder(".");
var counter = 0;
while(counter < queue.length)
{
  // First in, first out: counter marks the head of the queue. We advance an
  // index rather than calling shift() because shift() is expensive in JScript.
  var folder = queue[counter];
  for (var enumtor = new Enumerator(folder.SubFolders) ; !enumtor.atEnd(); enumtor.moveNext())
  {
    try
    {
      queue[queue.length] = enumtor.item();
    }
    catch(e){}
  }
  for (enumtor = new Enumerator(folder.Files) ; !enumtor.atEnd(); enumtor.moveNext())
  {
    try
    {
      WScript.Echo(enumtor.item().Path);
    }
    catch(e){}
  }
  queue[counter] = null;
  counter++;
}

This then puts out the files in the order

c:\xyz.txt
c:\qaz\zaq.txt
c:\foo\bbb.txt
c:\qaz\lll\ggg.txt
c:\foo\abc\def.txt
c:\foo\bar\baz.txt
c:\foo\bar\blah.txt

going from shallowest to deepest gradually rather than continually diving way down and then coming back up.
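
As for the regular expression search I mentioned: the file loop in either version changes to something along these lines. This is a sketch only -- the pattern here is a placeholder for whatever you are actually looking for, binary files get no special treatment, and the try-catch is still doing all of the error handling:

var pattern = /CreateFile/i;  // placeholder pattern -- declare this once, above the while loop
for (enumtor = new Enumerator(folder.Files) ; !enumtor.atEnd(); enumtor.moveNext())
{
  try
  {
    var file = enumtor.item();
    var stream = file.OpenAsTextStream(1);  // 1 = ForReading
    if (!stream.AtEndOfStream && pattern.test(stream.ReadAll()))
      WScript.Echo(file.Path);
    stream.Close();
  }
  catch(e){}
}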

  • Back in the day, many many moons ago when the web was young and I had more time than common sense (c. 1995): I worked for a Large Domestic Auto Manufacturer that was beginning to experiment with putting documentation and instructions onto an internal web. The problem was that this largely unofficial web didn't have any kind of search facility. An "official" search solution was still many months away.

    Our group had a good network pipe, some spare unix boxes, and a dangerous amount of Perl knowledge.

    So I built my own.

    At its heart was a crawler and an indexer. The indexer was largely uninteresting as it pulled apart html pages, threw away stopwords, and built tables in the filesystem based on hashes of keywords and pointers to the original URLs.

    The crawler originally was a depth first recursive program, designed to grab the top page, linked pages, and then follow those all the way down (excluding pages already searched). I had two problems right away: I ran out of memory (swap! swap! crash!) and regularly ran web servers into the dirt.

    The solution to both was the same. I modified the crawler to keep a queue of "places that I need to explore" externally on disk. As I visited a place removed from the head of the queue, I'd put the links found onto the end of the queue. (A further refinement was to take sites that came up twice in a row in the queue and throw the second site back onto the end of the queue.) The change from recursion to a queue was trivial.

    Now the crawler worked in a mostly breadth-first fashion because the new links from the page just explored wound up at the end of the queue. If you're searching more than one site you'll alternate between them in a balanced fashion without having to put the brakes on the process to throttle it. The more important pages tended towards the top of the sites anyway and would get indexed first.

    The added benefit was that when it crashed, I knew exactly where it left off (the head of the queue). I built a restart facility using cron, and if the crawler/indexer crashed twice on the same URL, I'd throw that URL into a "broken" queue and remove it from the work queue.
  • The queue idea is very slick. Perhaps now I can convince my wife that breadth is better than depth! ;)
  • Indeed, depth-first is a bad way to search the web. There are branches of the web which are for all practical purposes infinitely deep, but very few pages are infinitely broad.
  • Why not use .push() and .shift() instead of the manual array manipulation in the second example? Unless you're using IE 5.01 of course. :)
  • Shift is an array copy. If there are a thousand items in the array, that's a thousand property creations, a thousand property deletions and a thousand value copies _every_ time it's called. _Very_ expensive.
  • This is a trade-off - like with everything else in life: would you trade the simplicity in coding for the memory eaten by your stack? When the 200G drives are becoming a normal thing, what would be the number of files and folders you have to "push" and "pop"?

    My 20G logical drive for "all things work" has 20,000 folders, sometimes up to 20 deep - I like my folders organised...

    I do agree, however, that one should look at recursive functions in the morning, somewhere between the second cup of coffee and lunch.
  • Recently I needed to extend our build process to perform a simple textual manipulation on some files: replace a placeholder with the version text. I implemented this step as a .wsf file using JScript and the BeyondJS library. The reason: BeyondJS made it very easy to perform a sequence of operations on all the files in a subdirectory that matched a particular pattern. For example, this code snippet performs the same operation as Eric's second sample (it's breadth-first):

    File.recurse(".").foreach(alert);

    "alert" in this case is a wrapper function for WScript.Echo. Yes, I've cut a few corners: it *is* recursive and I've neglected the try-catch. I still think it's cool.

    Here is the code to output only .htm or .html files:

    File.recurse(".").filter(/\.html?$/i).foreach(alert);
  • > would you trade the simplicity in coding for the memory eaten by your stack?

    I'm not following your point. Perhaps I am misunderstanding you.

    JScript isn't tail recursive, and in fact typically will blow the stack after about 600 recursions. Using an explicit "heap based" stack uses LESS memory and is MORE robust than using recursion to manage the call stack for you.
  • JScript may not have tail recursion, but Cocoon's Rhino-based JavaScript implementation does. It even sports continuations:

    http://wiki.apache.org/cocoon/RhinoWithContinuations
  • That's pretty darn cool -- but of course in this case the point is moot because this isn't an algorithm that can be made tail recursive.

    To be tail recursive, the function can't do any work that needs the state of the local variables after the recursive call, but in this algorithm, the state of the enumerator is needed after the recursive call.
  • Breadth is better than depth especially in cases where the tree being traversed is potentially infinite. Breadth-first, each node will eventually be visited. Depth-first, the algorithm will dive into the first infinite branch and never return.

    With file systems, infinite trees emerge from circular symbolic links (aka junction points). Of course, any practical file system scanner must detect loops and not visit the same directory twice. Which brings a question: What is the best way to determine that C:\foo is the same as C:\foo\foo is the same as C:\foo\foo\foo?
  • Canonicalization would be an obvious answer to that. Canonical("C:\foo\foo\foo") --> "C:\foo"

    Now with URLs you have a problem, since you can't know that "http://foo.com/bar" and "http://www.foo.com/baz" are actually the same thing - two names for the same item.

    Life is complicated... :-)
  • Eric, are you sure about how shift works? Here are my timing results for shifting/popping arrays of 5000 numbers, 5000 strings, and 5000 relatively large objects. The times are almost identical regardless of the type of element (in milliseconds):

    operation, numbers, strings, objects
    pop, 3359, 3360, 3343
    shift, 12796, 12844, 12829

    If it was actually copying the whole (remaining) array on each iteration of shift I would have thought the difference would be much more than 3.8x slower, and that times would vary by element type. Here's the code:

    var a = [];
    var n = 5000;
    var t;
    for ( var i=0; i < n; i++ )
      a[i] = i-12;
    var t1 = new Date();
    for ( var i=0; i < n; i++ )
      t = a.shift();
    var t2 = new Date();
    alert(t2.getTime()-t1.getTime());
  • * Yes I'm sure! Geez!

    * On my box, the differential is a factor of five, and my timings are slower than yours. You must have a fast box!

    * I would expect that numbers, strings and objects would give roughly the same timings. The copy itself isn't the expensive part. Copying a number is just a memory copy, copying an object is a pointer copy and a call to addref, copying a string is an allocation off the OLE Aut small string heap, which is optimized for exactly this situation. OA caches small strings; if an allocation comes in for a string of the same size as one that was just freed, it reuses the memory. Very fast. (This complicates memory leak detection, see Larry Osterman's recent blog on the subject.)

    * The reason why the differential is so low is because the fixed cost of calling any method on an array object is high. In part, it's because you could assign the pop method to an object other than an array. Let me break "pop" down for you:

    1) verify that a "this" argument was passed
    2) convert the "this" argument to its object form if it is not already
    3) verify that the object form is a JScript object, not an external COM object
    4) extract the value of the length property
    5) convert the value of the length property to an integer
    6) from the length, figure out what index string is the one to pop
    7) Look up that value -- if it exists, delete it and decrement the length property

    Actually deleting the property is a tiny amount of work compared to all the rest. Similarly with shift and unshift, except that there the big fixed overhead is followed by a big per-element cost: the copy.

    * Everything I just said in the previous point was a lie. Why? Because nowhere did I factor in the cost of doing the garbage collection. Most of the time spent in the pop operation is in the GC, cleaning up those 5000 elements.

    This needs a much deeper analysis to understand what's going on.
  • Thanks for the info, Eric. The reason I originally asked the question was that it seemed odd to avoid shift() given that it's the logical complement to pop().

    I'm working with a black box, so timing is about all I can do to see the differences. A .shift() version of your second example ran in almost exactly the same time on my box when I crawled an entire system drive (10GB and 88,000 files); it looks like disk I/O is the bottleneck in that case. For smaller tests where the disk info could be cached, the .shift() version was about 70% slower. I didn't see any major difference in memory usage observing Task Manager.
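
A note on the loop-detection question in the comments: script has no cheap way to prove that C:\foo and C:\foo\foo really are the same directory; doing it properly means dropping down to Win32 and comparing the volume serial number and file index that GetFileInformationByHandle reports for each one. For a trivial one-off script a blunter defense works: cap the depth of the breadth-first walk so that a junction loop cannot run away forever. A sketch, with maxDepth as an arbitrary placeholder:

var FSO = new ActiveXObject("Scripting.FileSystemObject");
var maxDepth = 10;  // arbitrary cutoff; anything circular eventually hits it
var queue = new Array();
queue[queue.length] = { folder: FSO.GetFolder("."), depth: 0 };
var counter = 0;
while(counter < queue.length)
{
  var entry = queue[counter];
  if (entry.depth < maxDepth)
  {
    // Only queue subfolders if we have not hit the cutoff yet.
    for (var enumtor = new Enumerator(entry.folder.SubFolders) ; !enumtor.atEnd(); enumtor.moveNext())
    {
      try
      {
        queue[queue.length] = { folder: enumtor.item(), depth: entry.depth + 1 };
      }
      catch(e){}
    }
  }
  for (enumtor = new Enumerator(entry.folder.Files) ; !enumtor.atEnd(); enumtor.moveNext())
  {
    try
    {
      WScript.Echo(enumtor.item().Path);
    }
    catch(e){}
  }
  queue[counter] = null;
  counter++;
}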