250% of what, exactly?


I just got a question this morning about how to take two collections of items and determine how many of those items had the same name. The user had written this straightforward but extremely slow VBScript algorithm:

For Each Frog In Frogs
  For Each Toad In Toads
    If Frog.Name = Toad.Name Then
      SameName = SameName + 1
      Exit For
    End If
  Next
Next

There were about 5000 frogs and 3000 toads, and 1500 of them had the same name. Each of the 3500 unsuccessful searches checked all 3000 toads, and the 1500 successful searches checked about 1500 toads each on average. Each pass through the inner loop costs one loop iteration, two calls to the Name property and one string comparison. Add all those up and you get roughly 50 million function calls to determine this count.
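
To spell out that arithmetic, here is a rough back-of-the-envelope tally. The breakdown is mine, based on the counts above, and WScript.Echo assumes the script runs under the Windows Script Host:

' A tally of the work done by the nested loops above.
UnsuccessfulWork = 3500 * 3000    ' frogs with no matching toad scan every toad
SuccessfulWork = 1500 * 1500      ' matching frogs scan about half the toads on average
CallsPerIteration = 4             ' loop step, two Name calls, one string comparison
WScript.Echo (UnsuccessfulWork + SuccessfulWork) * CallsPerIteration   ' 51,000,000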

This code has been somewhat simplified – the actual user code was doing more work inside the loop, including calls to WMI objects. The whole thing was taking 10+ minutes, which is actually pretty fast considering how much work was being done. Each individual function call was only taking a few microseconds, but fifty million calls adds up!

Now, we've been down this road before in this blog (here, here and here) and so of course I recommended building a faster lookup table rather than doing a full search through the collection every time.

Set FrogLookup = CreateObject("Scripting.Dictionary")
For Each Frog In Frogs
  FrogLookup(Frog.Name) = Frog
Next
For Each Toad In Toads
  If FrogLookup.Exists(Toad.Name) Then SameName = SameName + 1
Next

Which is much, much faster. That's only about 16 thousand function calls: roughly 5000 Name calls and dictionary adds to build the table, plus 3000 Name calls and Exists checks to probe it. Now, this is maybe not an apples-to-apples comparison of function calls, but we at least have good reason to believe that this is going to be several times faster.

And indeed it was. But I'm still not sure how much, which brings me to the actual subject of today's blog. The user reported that the new algorithm was "250% better". I hear results like this all the time, and I always have to ask to clarify what it means. You can't just say "n% better" without somehow also communicating what you're measuring.

(UPDATE: This reported result understandably confused some readers.  Clearly the new loop here is thousands of times faster, not a mere 250% faster. As I said before, the sketch above is highly simplified code. The real-world code included many thousands of additional calls to WMI objects which were not eliminated by this optimization. Eliminating these 50 million function calls helped -- you should always eliminate the slowest thing first!  But doing so also exposed a new "hot spot" that needed further optimization.  However, the point of this article is not the benefits of using lookup tables, but rather that using unexplained percentages to report performance results is potentially misleading.  The result above is merely illustrative.  See the comments for more details.)

Allow me to illustrate. Suppose I have a web server that is serving up ten thousand pages every ten seconds. I make a performance improvement to the page generator so that it is now serving up fifteen thousand pages every ten seconds. I can sensibly say:

  • performance has improved by 50%, because we are now serving up 5000 more pages every ten seconds, and 5000 is 50% of the original 10000. In this world, any positive percentage is good.
  • performance is now 150% of original performance, because 15000 is 150% of 10000. In this world, 0% to 100% means worse than or the same as the original, and more than 100% is good.
  • we've gone from 1000 microseconds per page to 667 microseconds per page, saving 333 microseconds per page. 333 is 33% of 1000, so we've got a 33% performance improvement. In this world, 0% is bad, 100% is perfect, and more than 100% is nonsensical.

If I can sensibly say that performance is better by 50%, 150% and 33%, all referring to exactly the same improvement, then I cannot actually be communicating any fact to my listeners! They need to know whether I'm measuring speed or time, and if speed, whether I'm comparing original speed to new speed or original speed to the difference.
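
To make the ambiguity concrete, here is a quick sketch using the hypothetical web server numbers above; the variable names are mine:

OldPagesPerSecond = 1000   ' 10000 pages every ten seconds
NewPagesPerSecond = 1500   ' 15000 pages every ten seconds

' "50% better": the gain in speed, relative to the old speed.
WScript.Echo (NewPagesPerSecond - OldPagesPerSecond) / OldPagesPerSecond * 100   ' 50

' "150% of original performance": the new speed, relative to the old speed.
WScript.Echo NewPagesPerSecond / OldPagesPerSecond * 100   ' 150

' "33% improvement": the time saved per page, relative to the old time per page.
OldMicrosecondsPerPage = 1000000 / OldPagesPerSecond   ' 1000
NewMicrosecondsPerPage = 1000000 / NewPagesPerSecond   ' about 667
WScript.Echo (OldMicrosecondsPerPage - NewMicrosecondsPerPage) / OldMicrosecondsPerPage * 100   ' about 33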

So what is "250%" better? Clearly not raw timings. If this loop took 10 minutes to run before, 250% of 10 minutes is 25 minutes, but surely it is not running in -15 minutes now! I assume that the measurement is speed -- say, the number of loops runnable per hour. If the loop took ten minutes before, then we could run it six times per hour. If it is now 250% better, well, 250% of 6 is 15, and "better" means we add, so that's 21 loops per hour, or about 3 minutes per loop. If, on the other hand, the user said "250% better" but actually meant "250% of previous performance", then we'd conclude that the new rate is 15 loops per hour, so now we've got 4 minutes per loop.

And of course, this is assuming that the person reporting the improvement is actually trying to communicate facts to the audience. Be wary! Since 33%, 50% and 150% are all sensible ways to report this improvement, which do you think someone who wants to impress you is going to choose? There are opportunities here for what Kee Dewdney calls "percentage pumping" and other shyster tricks for making weak numbers look more impressive. Professor Dewdney's book on the subject, "200% of Nothing", is quite entertaining.

The moral of the story is: please do not report performance improvements in percentages. Report performance improvements in terms of raw numbers with units attached, such as "microseconds per call, before and after change", or "pages per second, before and after change". That gives the reader enough information to actually understand the result.

(And the other moral is, of course, lookup tables are your friends.)

  • I'm reading the Dictionary docs:
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/jsobjDictionary.asp

    and I have no idea what this line does:
    FrogLookup(Frog.Name) = Frog
  • As I mentioned in yesterday's blog entry, passing an argument to an object that has a default property where the property takes an argument is equivalent to calling the default property.

    The default property of Dictionary is Item, and Item takes an argument. Therefore

    dict(x) = y

    is just syntactic sugar for

    dict.Item(x) = y

    See

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/jsproitem.asp

    for the Item documentation.

  • "The default property of Dictionary is Item..."

    How do you know what the default property of an (arbitrary) object is, if in fact there is one? The Dictionary object documentation doesn't seem to identify Item as the default property (unless I'm missing something).

    Is the existence and name of the default property something you can easily determine from VBScript? My apologies if you've answered this in the past.
  • I know this isn't the topic, but wouldn't this be even faster?

    Function FrogExists(sName)
      ' If the collection has no item with this name, Item raises an error;
      ' On Error Resume Next swallows it and the result stays False.
      On Error Resume Next
      FrogExists = False
      FrogExists = IsObject(Frogs.Item(sName))
    End Function

    For Each Toad In Toads
      If FrogExists(Toad.Name) Then
        SameName = SameName + 1
      End If
    Next
  • In general there is no way from the language itself to determine any information about what the functions, properties, events, etc, on an object are. That information can be determined by examining the type library for the object of course, but there's no facility from within the VBScript language for reflection on arbitrary objects.

    However, one of the OLE Automation design guidelines is that all collection classes have a default getter (and preferably also a setter) called Item.

    I regret that this wasn't called out in the Dictionary documentation; it probably should have been. The scripting documentation team was very, very small at the time this documentation was written, and a lot of stuff like this was missed.
  • If the Frogs or Toads collection supported a by-name indexer, then indeed, that would probably be faster than building an index. (Assuming that the people who wrote the indexer used an efficient lookup table internally!)

    However, in the actual user code, I believe that the collection only supported indexing by ordinal, not name.

    Good idea though.
  • I think the first poster is asking why you set the value of FrogLookup(Frog.Name) to Frog and not to some constant - since Exists() doesn't care about the value anyway.
  • Still too many iterations for really big collections though...

    If you sort both lists then you can compare for matches in n+m time, where n is the size of list one and m is the size of list two.

    This is because, for each member of the first list, you only have to look at the second list until you find a member that sorts after it. Since the next member of the first list sorts after the current one, you can resume from the position in the second list where you last stopped. Eventually, you will exhaust both lists.

    This gives an overall runtime of m log m + n log n + m + n, or roughly n log n, which is less for large sets than n * m (which is the complexity of the algorithm you suggested).

    I leave the code as an exercise for the reader. I wonder if Larry Wall will consider me virtuous.

    TANSTAFC!!

    Shaun Bedingfield
    blogsb.blogspot.com
    shaunbed@houston.rr.com

    Oh, by the way.. 5000 * 3000 = 15 M
    and 5000 log2 5000 ~= 13 * 5000 ~= 65 K.

    If the constant factor were equal for both algorithms, which it won't be... my algorithm would be roughly 23000% faster. Not a small gain.
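
(For concreteness, here is a rough sketch of the merge-style count described in the comment above, assuming the frog and toad names have already been sorted into zero-based VBScript arrays. The function and array names are hypothetical.)

Function CountSameNames(FrogNames, ToadNames)
  ' Walk both sorted arrays in a single pass, advancing whichever side
  ' sorts first and counting each frog with a matching toad exactly once.
  Dim i, j, SameName
  i = 0 : j = 0 : SameName = 0
  Do While i <= UBound(FrogNames) And j <= UBound(ToadNames)
    If FrogNames(i) = ToadNames(j) Then
      SameName = SameName + 1
      i = i + 1     ' count this frog once, as the original loop does
    ElseIf FrogNames(i) < ToadNames(j) Then
      i = i + 1     ' this frog has no matching toad
    Else
      j = j + 1     ' this toad sorts before the current frog
    End If
  Loop
  CountSameNames = SameName
End Function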
  • Using a constant value would work of course, and if you wanted to wring every last bit of performance out, that would probably be better. But being able to map back to the object from the name is often useful.
  • Shaun, I'm not following your logic here. How is my algorithm not O(n+m) already? We have O(n) hash table adds and O(m) hash table lookups.

    Surely your algorithm will be dominated by the sort time, which will be larger than O(n+m).
  • It depends on your hash function.. If you are only getting a 33% increase over the original algorithm, the built in hash algorithm must be missing almost everything. Hash functions are only O(1) if you have a good hash.
  • Yes, my algorithm is O(n log n).
  • Hash functions are not magic and while they can be O(1), their worst case performance is O(n).. You seem to be much closer to the worst case so you are O(m*n)
  • The algorithm I proposed is O(m+n), not O(mn) as the original algorithm. That would have about 16000 function calls instead of 50 million. Assuming an apples-to-apples comparison -- which is actually reasonable in this particular case -- we'd probably get a factor of 3000x speedup.

    For all practical purposes, the cost of the loop is now zero compared to the original cost.

    The fact that the user was reporting a 3x speedup, not a 3000x speedup tells me that the likely cost of the original algorithm was 7 minutes in the loop, 3 minutes in other WMI calls. Now we're down to 0 minutes in the loop, 3 minutes in other WMI calls, for a total speedup of 3x. Something else is the gating factor now; probably something else with thousands of WMI calls.

    The dictionary hash function is reasonably good on sets of 5000 objects. I would not be putting millions of things into there though. Essentially the hash table is an O(n) table with a very small constant factor.
  • I am sorry; however, all of the information I had pointed to a very suboptimal hash function. This can happen.

    At the end of the day, it is always best to profile before assuming that your hash function is indeed O(c) as it may not be.