<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx</link><description>A while back the Scripting Guys blogged about using recursion to list all the files in a directory and all its subdirectories. Something that you'll notice about this very standard solution to this problem is that it is "depth first". That is, it lists</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234859</link><pubDate>Mon, 27 Sep 2004 18:51:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234859</guid><dc:creator>Clinton Pierce</dc:creator><description>Back in the day, many many moons ago when the web was young and I had more time than common sense (c. 1995): I worked for a Large Domestic Auto Manufacturer that was beginning to experiment with putting documentation and instructions onto an internal web.  The problem was that this largely unoffical web didn't have any kind of search facility.  An &amp;quot;official&amp;quot; search solution was still many months away.&lt;br&gt;&lt;br&gt;Our group had a good network pipe, some spare unix boxes, and a dangerous amount of Perl knowledge.&lt;br&gt;&lt;br&gt;So I built my own.&lt;br&gt;&lt;br&gt;At its heart was a crawler and an indexer.  The indexer was largely uninteresting as it pulled apart html pages, threw away stopwords, and built tables in the filesystem based on hashes of keywords and pointers to the original URLs.&lt;br&gt;&lt;br&gt;The crawler originally was a depth first recursive program, designed to grab the top page, linked pages, and then follow those all the way down (excluding pages already searched).  I had two problems right away: I ran out of memory (swap! swap!  crash!) and regularly ran web servers into the dirt.&lt;br&gt;&lt;br&gt;The solution to both was the same.  I modified the crawler to keep a queue of &amp;quot;places that I need to explore&amp;quot; externally on disk.  As I visited a place removed from the head of the queue, I'd put the links found onto the end of the queue.  (A further refinement was to take sites that came up twice in a row in the queue and throw the second site back onto the end of the queue.)  The change from recursion to a queue was trivial.&lt;br&gt;&lt;br&gt;Now the crawler worked in a mostly breadth-first fashion because the new links from the page explored wound up at the end of the queue.  If you're searching more than one site you'll alternate between those in a balanced fashion without having to put the breaks on the process to throttle it.  The more important pages tended towards the top of the sites anyway and would get indexed first.&lt;br&gt;&lt;br&gt;The added benefit was when it crashed, I knew exactly where it left off (the head of the queue).  I built a restart facility using cron. and if crawler/indexer crashed twice on the same URL, I'd throw that into a &amp;quot;broken&amp;quot; queue and remove it from the work queue.</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234872</link><pubDate>Mon, 27 Sep 2004 19:24:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234872</guid><dc:creator>Christian Romney</dc:creator><description>The queue idea is very slick. Perhaps now I can convince my wife that breadth is better than depth! ;)</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234878</link><pubDate>Mon, 27 Sep 2004 19:42:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234878</guid><dc:creator>Eric Lippert</dc:creator><description>Indeed, depth-first is a bad way to search the web.  There are branches of the web which are for all practical purposes infinitely deep, but very few pages are infinitely broad.&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234919</link><pubDate>Mon, 27 Sep 2004 21:29:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234919</guid><dc:creator>Dave</dc:creator><description>Why not use .push() and .shift() instead of the manual array manipulation in the second example? Unless you're using IE 5.01 of course. :)&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234935</link><pubDate>Mon, 27 Sep 2004 22:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234935</guid><dc:creator>Eric Lippert</dc:creator><description>Shift is an array copy.  If there are a thousand items in the array, that's a thousand property creations, a thousand property deletions and a thousand value copies _every_ time it's called.  _Very_ expensive.</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234939</link><pubDate>Mon, 27 Sep 2004 22:05:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234939</guid><dc:creator>B0rG</dc:creator><description>This is a trade off - like with everything else in life: would you trade the simplicity in coding for the memory eaten by your stack? When the 200G drives are becoming a normal thing what would be amount of files and folders you have to &amp;quot;push&amp;quot; and &amp;quot;pop&amp;quot;?&lt;br&gt;&lt;br&gt;My 20G logical drive for &amp;quot;all things work&amp;quot; has 20 000 folders sometimes up to 20 deep - i like my folders organised...&lt;br&gt;&lt;br&gt;I do agree, however, that one should look at recursive functions in the morning somewhere between the second cup of cofee and lunch.</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234941</link><pubDate>Mon, 27 Sep 2004 22:15:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234941</guid><dc:creator>Dan Shappir</dc:creator><description>Recently I needed to extend our build process to perform a simple textual manipulation on some files: replace a placeholder with the version text. I implemented this step as a .wsf file using JScript and the BeyondJS library. The reason: BeyondJS made it very easy to perform a sequence of operations on all the files in a subdirectory that matched a particular pattern. For example, this code snippet performs the same operation as Eric's second sample (it's breadth-first):&lt;br&gt;&lt;br&gt;File.recurse(&amp;quot;.&amp;quot;).foreach(alert);&lt;br&gt;&lt;br&gt;&amp;quot;alert&amp;quot; in this case is a wrapper function for WScript.Echo. Yes, I've cut a few corners: it *is* recursive and I've neglected the try-catch. I still think it's cool.&lt;br&gt;&lt;br&gt;Here is the code to output only .htm or .html files:&lt;br&gt;&lt;br&gt;File.recurse(&amp;quot;.&amp;quot;).filter(/\.html?$/i).foreach(alert);</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234952</link><pubDate>Mon, 27 Sep 2004 22:44:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234952</guid><dc:creator>Eric Lippert</dc:creator><description>&amp;gt; would you trade the simplicity in coding for the memory eaten by your stack?&lt;br&gt;&lt;br&gt;I'm not following your point.  Perhaps I am misunderstanding you.&lt;br&gt;&lt;br&gt;JScript isn't tail recursive, and in fact typically will blow the stack after about 600 recursions.  Using an explicit &amp;quot;heap based&amp;quot; stack uses LESS memory and is MORE robust than using recursion to manage the call stack for you.&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234958</link><pubDate>Mon, 27 Sep 2004 23:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234958</guid><dc:creator>Dan Shappir</dc:creator><description>JScript may not have tail recursion, but Cocoon's Rhino-based JavaScript implementation does. It even sports continuations:&lt;br&gt;&lt;br&gt;&lt;a target="_new" href="http://wiki.apache.org/cocoon/RhinoWithContinuations"&gt;http://wiki.apache.org/cocoon/RhinoWithContinuations&lt;/a&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#234964</link><pubDate>Mon, 27 Sep 2004 23:08:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:234964</guid><dc:creator>Eric Lippert</dc:creator><description>That's pretty darn cool -- but of course in this case the point is moot because this isn't an algorithm that can be made tail recursive.  &lt;br&gt;&lt;br&gt;To be tail recursive, the function can't do any work that needs the state of the local variables after the recursive call, but in this algorithm, the state of the enumerator is needed after the recursive call.</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#235068</link><pubDate>Tue, 28 Sep 2004 05:55:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:235068</guid><dc:creator>Centaur</dc:creator><description>Breadth is better than depth especially in cases where the tree being traversed is potentially infinite. Breadth-first, each node will eventually be visited. Width-first, the algorithm will dive into the first infinite branch and never return.&lt;br&gt;&lt;br&gt;With file systems, infinite trees emerge from circular symbolic links (aka junction points). Of course, any practical file system scanner must detect loops and not visit the same directory twice. Which brings a question: What is the best way to determine that C:\foo is the same as C:\foo\foo is the same as C:\foo\foo\foo?</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#235138</link><pubDate>Tue, 28 Sep 2004 10:46:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:235138</guid><dc:creator>Christian Mogensen</dc:creator><description>Canonicalization would be an obvious answer to that. Canonical(&amp;quot;C:\foo\foo\foo&amp;quot;) --&amp;gt; &amp;quot;C:\foo&amp;quot;&lt;br&gt;&lt;br&gt;Now with URLs you have a problem, since you can't know that &amp;quot;&lt;a target="_new" href="http://foo.com/bar&amp;quot;"&gt;http://foo.com/bar&amp;quot;&lt;/a&gt; and &amp;quot;&lt;a target="_new" href="http://www.foo.com/baz&amp;quot;"&gt;http://www.foo.com/baz&amp;quot;&lt;/a&gt; are actually the same thing - two names for the same item. &lt;br&gt;&lt;br&gt;Life is complicated... :-)</description></item><item><title>re: array pop and shift</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#236178</link><pubDate>Thu, 30 Sep 2004 15:58:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:236178</guid><dc:creator>Dave</dc:creator><description>Eric, are you sure about how shift works? Here are my timing results for shifting/popping an array of 5000 numbers and 5000 relatively large objects. The times are almost identical regardless of the type of element (in seconds):&lt;br&gt;&lt;br&gt;operation, numbers, strings, objects&lt;br&gt;pop, 3359, 3360, 3343&lt;br&gt;shift, 12796, 12844, 12829&lt;br&gt;&lt;br&gt;If it was actually copying the whole (remaining) array on each iteration of shift I would have thought the difference would be much more than 3.8x slower, and that times would vary by element type. Here's the code:&lt;br&gt;&lt;br&gt;var a = [];&lt;br&gt;var n = 5000;&lt;br&gt;for ( var i=0; i &amp;lt; n; i++ )&lt;br&gt;	a[i] = i-12;&lt;br&gt;var t1 = new Date();&lt;br&gt;for ( var i=0; i &amp;lt; n; i++ )&lt;br&gt;	t = a.shift();&lt;br&gt;var t2 = new Date();&lt;br&gt;alert(t2.getTime()-t1.getTime());&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#236208</link><pubDate>Thu, 30 Sep 2004 16:48:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:236208</guid><dc:creator>Eric Lippert</dc:creator><description>* Yes I'm sure!  Geez!&lt;br&gt;&lt;br&gt;* On my box, the differential is a factor of five, and my timings are slower than yours. You must have a fast box!&lt;br&gt;&lt;br&gt;* I would expect that numbers, strings and objects would give roughly the same timings.  The copy itself isn't the expensive part.  Copying a number is just a memory copy, copying an object is a pointer copy and a call to addref, copying a string is an allocation off the OLE Aut small string heap, which is optimized for exactly this situation.  OA caches small strings; if an allocation comes in for a string of the same size as one that was just freed, it reuses the memory.  Very fast.  (This complicates memory leak detection, see Larry Osterman's recent blog on the subject.)&lt;br&gt;&lt;br&gt;* The reason why the differential is so low is because the fixed cost of calling any method on an array object is high.  In part, it's because you could assign the pop method to an object other than an array.  Let me break &amp;quot;pop&amp;quot; down for you:&lt;br&gt;&lt;br&gt;1) verify that a &amp;quot;this&amp;quot; argument was passed&lt;br&gt;2) convert the &amp;quot;this&amp;quot; argument to its object form if it is not already&lt;br&gt;3) verify that the object form is a JScript object, not an external COM object&lt;br&gt;4) extract the value of the length property&lt;br&gt;5) convert the value of the length property to an integer&lt;br&gt;6) from the length, figure out what index string is the one to pop&lt;br&gt;7) Look up that value -- if it exists, delete it and decrement the length property&lt;br&gt;&lt;br&gt;Actually deleting the property is a tiny amount of work compared to all the rest.  Similarly with shift and unshift, except that with shift and unshift, there's a big overhead followed by a big cost.&lt;br&gt;&lt;br&gt;* Everything I just said in the previous point was a lie.  Why?  Because nowhere did I factor in the cost of doing the garbage collection.  Most of the time spent in the pop operation is in the GC, cleaning up those 5000 elements. &lt;br&gt;&lt;br&gt;This needs a much deeper analysis to understand what's going on.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#237582</link><pubDate>Mon, 04 Oct 2004 16:22:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:237582</guid><dc:creator>Dave</dc:creator><description>Thanks for the info, Eric. The reason I originally asked the question was because it seemed odd to avoid shift() given that it's the logical compliment to pop(). &lt;br&gt;&lt;br&gt;I'm working with a black box, so timing is about all I can do to see the differences. A .shift() version of your second example ran almost the exact same time on my box when I crawled an entire system drive (10GB and 88,000 files); it looks like disk I/O is the bottleneck in that case. For smaller tests where the disk info could be cached the .shift() version was about 70% slower. I didn't see any major difference in memory usage observing Task Manager.&lt;br&gt;</description></item><item><title>re: Breadth is sometimes better than depth</title><link>http://blogs.msdn.com/ericlippert/archive/2004/09/27/breadth-is-sometimes-better-than-depth.aspx#238125</link><pubDate>Tue, 05 Oct 2004 17:06:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:238125</guid><dc:creator>Eric Lippert</dc:creator><description>Indeed, questions of performance are best answered by trying it out! &lt;br&gt;&lt;br&gt;In JScript classic, asymptotic performance is n^2 for both algorithms because it's gated on the behaviour of the garbage collector.  Given n allocations, there are O(n) collections and each takes O(n) time to mark and sweep.&lt;br&gt;&lt;br&gt;In JScript .NET, which uses a more efficient generational GC, the pop algorithm has this performance on my box:&lt;br&gt;&lt;br&gt;n = 5000  : 140 ms&lt;br&gt;n = 10000 : 156 ms&lt;br&gt;n = 20000 : 171 ms&lt;br&gt;&lt;br&gt;and the shift algorithm is:&lt;br&gt;&lt;br&gt;n = 5000  : 203 ms&lt;br&gt;n = 10000 : 954 ms&lt;br&gt;n = 20000 : 2078 ms&lt;br&gt;n = 40000 : 10547 ms&lt;br&gt;&lt;br&gt;Clearly shift is growing way, way faster than pop as we go asymptotic.  If we did a proper and careful study we'd see that pop grew as O(n) and shift grew as O(n^2) once we mitigate the cost of the GC.&lt;br&gt;</description></item></channel></rss>