250% of what, exactly?

I just got a question this morning about how to take two collections of items and determine how many of those items had the same name. The user had written this straightforward but extremely slow VBScript algorithm:

For Each Frog In Frogs
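  ' For every frog, scan the toads one at a time until a matching name turns up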
  For Each Toad In Toads
    If Frog.Name = Toad.Name Then
      SameName = SameName + 1
      Exit For
    End If
  Next
Next

There were about 5000 frogs, 3000 toads, and 1500 of them had the same name. Every one of the 3500 unsuccessful searches checked all 3000 toads, and the 1500 successful searches checked 1500 toads each on average. Each time through the inner loop we do one loop iteration, two calls to the Name property and one string comparison. Add all those up and you get roughly 50 million function calls to determine this count.
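
To make the arithmetic concrete, here is a back-of-the-envelope sketch of that count; the four operations per inner iteration are just the ones listed above, and the real code was doing considerably more work per pass:

' Rough tally of the work done by the nested loops
UnsuccessfulWork = 3500 * 3000       ' frogs with no match scan every toad: 10,500,000 iterations
SuccessfulWork = 1500 * 1500         ' matching frogs scan 1500 toads on average: 2,250,000 iterations
InnerIterations = UnsuccessfulWork + SuccessfulWork
TotalCalls = InnerIterations * 4     ' loop iteration + two Name calls + one comparison per pass
WScript.Echo TotalCalls              ' 51,000,000 -- roughly 50 million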

This code has been somewhat simplified – the actual user code was doing more work inside the loop, including calls to WMI objects. The whole thing was taking 10+ minutes, which is actually pretty fast considering how much work was being done. Each individual function call was only taking a few microseconds, but fifty million calls adds up!

Now, we've been down this road before in this blog (here, here and here) and so of course I recommended building a faster lookup table rather than doing a full search through the collection every time.

Set FrogLookup = CreateObject("Scripting.Dictionary")
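' Build a lookup table keyed on frog name; each toad then needs only a single Exists check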
For Each Frog In Frogs
  Set FrogLookup(Frog.Name) = Frog
Next
For Each Toad In Toads
  If FrogLookup.Exists(Toad.Name) Then SameName = SameName + 1
Next

Which is much, much faster. That's only about 16 thousand function calls. Now, this is maybe not an apples-to-apples comparison of function calls, but we at least have good reason to believe that this is going to be several times faster. 

And indeed it was. But I'm still not sure how much, which brings me to the actual subject of today's blog. The user reported that the new algorithm was "250% better". I hear results like this all the time, and I always have to ask to clarify what it means. You can't just say "n% better" without somehow also communicating what you're measuring.

(UPDATE: This reported result understandably confused some readers.  Clearly the new loop here is thousands of times faster, not a mere 250% faster. As I said before, the sketch above is highly simplified code. The real-world code included many thousands of additional calls to WMI objects which were not eliminated by this optimization. Eliminating these 50 million function calls helped -- you should always eliminate the slowest thing first!  But doing so also exposed a new "hot spot" that needed further optimization.  However, the point of this article is not the benefits of using lookup tables, but rather that using unexplained percentages to report performance results is potentially misleading.  The result above is merely illustrative.  See the comments for more details.)

Allow me to illustrate. Suppose I have a web server that is serving up ten thousand pages every ten seconds. I make a performance improvement to the page generator so that it is now serving up fifteen thousand pages every ten seconds. I can sensibly say:

  • Performance has improved by 50%, because we are now serving up 5000 more pages every ten seconds, and 5000 is 50% of the original 10000.  In this world, any positive percentage is good.
  • Performance is now 150% of original performance, because 15000 is 150% of 10000.  In this world, 0%-100% is worse or the same, and anything over 100% is good.
  • We've gone from 1000 microseconds per page to 667 microseconds per page, saving 333 microseconds per page. 333 is 33% of 1000, so we've got a 33% performance improvement.  In this world, 0% is bad, 100% is perfect, and more than 100% is nonsensical.

If I can sensibly say that performance is better by 50%, 150% and 33%, all referring to exactly the same improvement, then I cannot actually be communicating any fact to my listeners! They need to know whether I'm measuring speed or time, and if speed, whether I'm comparing original speed to new speed or original speed to the difference.
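
A quick sketch makes the ambiguity concrete, using the 10000-and-15000-pages-per-ten-seconds figures from the example above:

PagesBefore = 10000     ' pages served per ten seconds, before the change
PagesAfter = 15000      ' pages served per ten seconds, after the change
WScript.Echo (PagesAfter - PagesBefore) / PagesBefore * 100    ' 50 -- improvement as a percentage of the old rate
WScript.Echo PagesAfter / PagesBefore * 100                    ' 150 -- new rate as a percentage of the old rate
MicrosecondsBefore = 10000000 / PagesBefore                    ' 1000 microseconds per page
MicrosecondsAfter = 10000000 / PagesAfter                      ' about 667 microseconds per page
WScript.Echo Round((MicrosecondsBefore - MicrosecondsAfter) / MicrosecondsBefore * 100)    ' 33 -- time saved per page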

So what is "250%" better? Clearly not raw timings. If this loop took 10 minutes to run before, 250% of 10 minutes is 25 minutes, but surely it is not running in -15 minutes now! I assume that the measurement is speed -- say, number of loops runnable per hour. If before it took ten minutes and therefore we could run this loop six times per hour, and now it is 250% better, 250% of 6 is 15, and this is "better", so we need to add. So that's 21 loops per hour, or about 3 minutes per loop. Had the user accidentally said "250% faster" but meant to say "250% of previous performance" then we'd conclude that 250% of 6 per hour is 15 per hour, so now we've got 4 minutes per loop.
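
Worked out in the same style, and assuming as above that the original loop ran six times per hour:

LoopsPerHourBefore = 6    ' ten minutes per run
WScript.Echo 60 / (LoopsPerHourBefore + 2.5 * LoopsPerHourBefore)    ' "250% better": 21 loops per hour, about 3 minutes per loop
WScript.Echo 60 / (2.5 * LoopsPerHourBefore)                         ' "250% of previous performance": 15 loops per hour, 4 minutes per loop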

And of course, this is assuming that the person reporting the improvement is actually trying to communicate facts to the audience. Be wary! Since 33%, 50% and 150% are all sensible ways to report this improvement, which do you think someone who wants to impress you is going to choose? There are opportunities here for what Kee Dewdney calls "percentage pumping" and other shyster tricks for making weak numbers look more impressive. Professor Dewdney's book on the subject, "200% of Nothing", is quite entertaining.

The moral of the story is: please do not report performance improvements in percentages. Report performance improvements in terms of raw numbers with units attached, such as "microseconds per call, before and after change", or "pages per second, before and after change". That gives the reader enough information to actually understand the result.

(And the other moral is, of course, lookup tables are your friends.)

  • Your algorithm is O(n + m) AVERAGE time complexity (assuming the hash algorithm works for the problem). It is O(mn) worst-case time complexity, as a hash lookup operation can be O(n).
  • Boy, do I know that the hard way.

    As I've mentioned in this blog before, when I wrote the scripting dictionary hash algorithm, the first version had a bug, the upshot of which was that short strings consisting solely of digits would get hashed to a very small number of buckets.

    We discovered this the hard way when www.msn.com performance started to suck one day. They had a dictionary with every zip code in the US in it, and the lookup cost very quickly became O(n). Fortunately it was an easy fix!
  • Of course, there are tradeoffs there as well.

    The obvious solution is to just use a standard lookup table, because there are so few zip codes. Excluding zip+4 you have 10^5 or ~2^17 zip codes. The advantage here is that you know exactly how much time every query will take and exactly how much memory you need. Properties like this are really important in certain fields like game programming.

    Zip+4 is not practical with a lookup table, though, as you have 10^9 or ~2^30 zip codes (a good way to convert base 10 to base 2 is to remember that every factor of 1000 is roughly equal to 2^10 = 1024), and that would quickly consume almost all of a machine's memory. You can add a two-table system with a second layer of indirection, but this only aids memory usage a little, because you don't need a secondary table for every zip.
    In addition, changes to the zip system would require serious rework. You also start to run into problems as your table is no longer very page-friendly and page faults abound.

    At that stage, you pretty much need a good hash table to exploit the fact that zip codes are sparse (not every zip or zip+4 is actually used). A generic hash table has the advantage of being able to expand easily if the zip code properties change. All you really need to do is change the code to verify something IS a zip code. A more specialized function can take into account properties of the zip codes and may perform better, but when zip codes are added or if the US government decides to change the rules, rework ensues.

    When you don't know the properties of the data going into your hashtable, you are playing a probability game. Bet on the library functions first and don't customize until they fail you. Remember Knuth: "Premature optimization is the root of all evil," and don't try to optimize until the library routines have failed. Besides, not everyone can write a hash function that outperforms the library routines. Those routines are also tested.

    But always profile and make sure the code is doing what you think it is. I like to think I know everything; however, I often find the only thing more astounding than my intellect is my ignorance. Things don't always work the way you think they will, and when they don't, you need to be able to figure out why. Always learn as much as you can, and then be ready to find out you were wrong.

    Overall, the guys at Microsoft do write some nice code and make some great libraries, which is why we all like giving them such a hard time ;)

    Shaun Bedingfield
  • I assumed the problem was going to be duplicate names. If you have frogs "Abe, Bill, Bill, and Bill" and toads "Abe, Abe, Bill, and Colin", how many are the same? Two? Three? Four? Seven?
  • Good point. I believe in this particular case the collections already ensure that every name within the collection is unique. But if you couldn't make that guarantee then you would need some more specific statement of the desired output, and a different algorithm.
  • If you're only counting, it is an easy mod.

    You just have the associative array point to the number of items with that name and then add that number when you get a match.

  • Actually it is a multiply.. gr.. Looping through an associative array (size n) and getting the value of the other..

    I am just not a morning person.
  • This is an obvious case of trying to optimise the code, and not the algorithm.

    Sorting the lists and comparing them is by far the best solution to this problem in the real world (as opposed to a simple sample).

    Chances are you either need to do this test several times, or you have the data in trees, or inserting new items can be done in order, all of which make the sorting time a non-existent problem.
  • About 15 years ago, I was a programmer on IBM mainframes. I read Computerworld every week. The company Syncsort ran a full-page ad touting how much better their Syncsort product was than the built-in sort that the IBM operating system provided.

    The Syncsort ad I remember best featured a quote from a customer in large type, crowing:

    "Since we switched to Syncsort, our sort jobs take 120% less time than before!"

    I felt sure that Syncsort had some employees with mathematics aptitude, but of course they didn't vet the ads. That particular ad ran for several weeks at least.
  • One problem with not putting differences in percentages: there seem to be very few people who won't look at the specifics and turn them into a percentage or fraction anyway. For example, if I say that we used to be able to process 176/second and now we can process 323/second, I invariably get asked what percentage increase that is. That is simply how many people think about and understand data like this.
  • Statistics are fun things ... as Samuel Clemens famously said, "There are lies, darn lies, and statistics."
  • Mark Twain is indeed famous for saying that, but he was quoting someone else.

    http://www1c.btwebworld.com/quote-unquote/p0000149.htm
