Santalic tailfans, part two

As I have said before many times, there is only one sensible way to make a performant application. (As an aside: perfectly good word, performant, deal with it!) That is:

  • Set meaningful, measurable, customer-focused goals.
  • Write the code to be as clear and correct as possible.
  • Carefully measure your performance against your goals.
  • Did you meet your goal? Great! Don't waste any time on performance analysis. Spend your valuable time on features, documentation, bug fixing, robustness, security, whatever.
  • If you did not meet your goal, use tools to discover what the worst-performing fixable thing is, and fix it.

Iterate on that process, updating your goals if it proves necessary, until you either have something that meets your goals or you give up.

My explicit goal for my little dictionary search program was that it give results fast enough that I would not be sitting there impatiently waiting for more than a few seconds. That's a very customer-focused goal, me being the only customer of this program. With only a very small amount of tweaking my program met that goal right away, so why would I spend any more time on it? The program takes just under two seconds to report the results for a typical query, which is faster than I can read.

But suppose that we did want to improve the performance of this program for some reason. How? Well, let's go down the list. We have goals, we have very clear code, we can measure the performance easily enough. Suppose we didn't meet the goal. The last thing on the list is "use tools to discover what the slowest fixable thing is".

A commenter conjectured that the performance bottleneck of this program was in the disk I/O. As you can see, every time we do a query we re-read the two megabyte dictionary file dozens of times. This has the benefit of using very little memory; we never have more than one line of the dictionary in memory at a time, instead of the whole four megs it would take to store the dictionary (remember, the dictionary is in ASCII on disk but two-byte Unicode if in strings in memory, so the in-memory size will double.)

That's a conjecture -- a reasonable conjecture -- but nevertheless, it's just a guess. If I've learned one thing about performance analysis it's that my guesses about where the bottleneck is are often wrong. I'll tell you right now that yes, the disk-hitting performance is bad, but it is not the worst thing in this program, not by far. Where's the real performance bottleneck? Any guesses? Could you know without using tools?

Here's the result of a timing profile run of a seven-letter query with one blank, repeated 20 times:

43%: Contains
21%: ReadLine
14%: Sort
7%: ToUpperInvariant
15%: everything else

Holy goodness! The call to the Contains extension method in the query to test whether the dictionary word is in the rack set array is almost half the cost of this program!

Which makes sense, once you stop to think about it. The "Contains" method is by its highly general nature necessarily very naive. When given an array, 99.9% of the time it has to look through the entire 26-item array because 99.9% of the time, the word is not actually going to match any of the possible racks. It cannot take advantage of any "early outs" like you could do if you were doing a linear search on a sorted list. And each time through the array it has to do a full-on string comparison; there are no fancy-pants checks in there that take advantage of string immutability or hash codes or any such thing.
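
To make this concrete, here is roughly the work the general-purpose "Contains" has to do against our array of racks (an illustrative sketch, not the actual framework source):

    // Illustrative sketch only -- not the real framework implementation.
    static bool NaiveContains(string[] racks, string candidate)
    {
        foreach (string rack in racks)
        {
            // A full string comparison against every element, every time.
            if (rack == candidate)
                return true;
        }
        // The common case: every rack compared, no match found.
        return false;
    }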

We have a data structure that is designed to rapidly tell you whether a member is contained in a set. And even better, it already does the "distinct" logic. When I replace

    var racks = (from rack in ReplaceQuestionMarks(originalRack)
                 select Canonicalize(rack)).Distinct().ToArray();

with

    var racks = new HashSet<string>(
                from rack in ReplaceQuestionMarks(originalRack)
                select Canonicalize(rack));

suddenly performance improves massively. The "Contains" drops down to 3% of the total cost, and of course, the total cost is now half of what it was before.
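
The query that consumes the set does not need to change at all; it still just asks "racks" whether it contains the canonicalized dictionary word. The shape of that query is something like this (a sketch only; "dictionaryPath" is a stand-in name, and FileLines and Canonicalize are the helpers discussed elsewhere in this series):

    var results = from line in FileLines(dictionaryPath)
                  where racks.Contains(Canonicalize(line))
                  select line;

Because "racks" is now a HashSet<string>, that call binds to the set's own Contains, a cheap hash lookup, rather than to the linear-scanning extension method.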

Another subtle point here: notice how when I changed the type of the variable "racks" from "array of string" to "set of string", I didn't have to redundantly state the new type anywhere, thanks to implicit typing of local variables. I want to emphasize the semantics here, not the storage mechanism. If I felt that communicating the storage mechanism was an important part of this code -- because it has such a strong effect on performance -- perhaps I would choose to emphasize the storage by eschewing the "var".
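
That is, if emphasizing the storage mechanism were the priority, the declaration would simply spell out the type:

    HashSet<string> racks = new HashSet<string>(
                from rack in ReplaceQuestionMarks(originalRack)
                select Canonicalize(rack));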

With this change, the program performance improves to about one second per query and the profile now looks like this:

39%: ReadLine
23%: Sort
11%: ToUpperInvariant
7%: the iterator block goo in FileLines
5%: ToArray (called by Join)
15%: everything else

Now the bottleneck is clearly the combination of repeated file line reading (48%) and the string canonicalization of every dictionary line over and over again (37%).

With this information, we now have data with which to make sensible investments of time and effort. We could cache portions of the dictionary in memory to avoid the repeated disk cost. Is the potential increase in speed worth the potentially massive increase in memory usage? We could be smart about it and, say, only cache the seven- and eight-letter words in memory.
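
Here is a sketch of that caching idea, keeping the same FileLines helper and putting a lazily built cache in front of it; the helper name, the field, and the choice of lengths are all illustrative:

    // Illustrative sketch: read the file once, keep only the word lengths we
    // actually query, and trade memory for the repeated disk reads.
    private static List<string> cachedWords;

    private static IEnumerable<string> CachedFileLines(string filename)
    {
        if (cachedWords == null)
        {
            cachedWords = (from line in FileLines(filename)
                           where line.Length == 7 || line.Length == 8
                           select line).ToList();
        }
        return cachedWords;
    }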

We could also attack the canonicalization performance problem. Should we perhaps precompute an index for the dictionary where every string is already in canonical form? This in effect trades increased disk space, increased program complexity and increased redundancy for decreased time. Should we use a different canonicalization algorithm entirely?
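
For example, the index could be as simple as a side file, written once ahead of time, in which every line holds the canonical form and the original word; the query program would then compare canonical forms directly, with no per-query sorting or upper-casing. A sketch, with made-up names and assuming the usual System.IO and System.Linq namespaces:

    // Run once, offline, to trade disk space and redundancy for query time.
    private static void BuildCanonicalIndex(string dictionaryPath, string indexPath)
    {
        File.WriteAllLines(indexPath,
            (from word in FileLines(dictionaryPath)
             select Canonicalize(word) + "\t" + word).ToArray());
    }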

All of these questions are settled by the fact that I have already exceeded my performance goal, so the answer to each of them is "no". Good enough is, by definition, good enough. If I were using this algorithm to actually build a game AI, it would not be good enough anymore and I'd go with some more clever solution. But I'm not.

  • There's a great lesson in there about assuming where the bottleneck is.

    I'm interested in knowing how you profiled or gathered the metrics you were using.  Some tool, or just a timer routine?

    Sufficiently buff editions of Visual Studio have an "Analyze" menu item. -- Eric

  • This point about achieving customer-focused performance goals and stopping is definitely appropriate for application development. But when developing frameworks, I think you need to tweak the hell out of it. Nobody wants to be developing an application and find that they are not achieving their customer-focused performance goals because of a bottleneck in the framework. Often they don't have the option of fixing the framework. Part of the point of a framework is to take on the hard stuff that's harder to maintain, because the framework is more stable and doesn't change as often. The application, which needs more maintenance because of changing requirements, shouldn't have to deal with that stuff.

    Absolutely. And the way we make sure that the framework meets our customers' performance needs is we talk to our customers and set meaningful performance goals based on the feedback they give us.

    The hard part is that there is almost never a win without a tradeoff. Just look at the trivial example we have here. Making this go faster means making some tradeoff, usually in the "more memory, less time" direction. But not all of our customers are time constrained. Many of them are memory constrained, particularly those customers who develop programs for hand-held devices. Knowing whether we're maybe making things worse is a necessary precondition of making things better. -- Eric

  • I indeed agree that setting goals and measuring against them is the correct approach, always, for real systems.

    For blog posts, sometimes it's fun to dream about how to do things blazingly fast :).

    Building an index on the canonicalized form over the dictionary (say a B+tree or something) seems like it could be a win if you're willing to double the disk usage to reduce the execution time substantially (since you wouldn't even have to read in most of the dictionary then).

    I also tossed around the idea of using a precomputed Bloom filter to exclude most of your combinations up front without even looking at your index. This would probably allow you to have more than 2 blanks (which isn't terrifically useful for Scrabble, but is for some other applications), since 90% of the time you could throw out a word without leaving the L2 cache.
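
    To give a rough idea of what I mean, here is a minimal Bloom-filter sketch in C#; the hash functions are crude stand-ins and the whole thing is an illustration rather than production code:

        // Minimal Bloom filter: add every canonical form once, then ask
        // MightContain before doing any real lookup. It can return false
        // positives, but never false negatives.
        class BloomFilter
        {
            private readonly System.Collections.BitArray bits;

            public BloomFilter(int sizeInBits)
            {
                bits = new System.Collections.BitArray(sizeInBits);
            }

            // Two crude stand-in hash functions, for illustration only.
            private int Hash1(string s)
            {
                return (s.GetHashCode() & 0x7FFFFFFF) % bits.Length;
            }

            private int Hash2(string s)
            {
                int h = 5381;
                foreach (char c in s)
                    h = unchecked(h * 33 + c);
                return (h & 0x7FFFFFFF) % bits.Length;
            }

            public void Add(string s)
            {
                bits[Hash1(s)] = true;
                bits[Hash2(s)] = true;
            }

            public bool MightContain(string s)
            {
                return bits[Hash1(s)] && bits[Hash2(s)];
            }
        }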

    But indeed, for this application, for this purpose, there's no reason to spend any more time on performance. That's the difference between engineering and doing programming as a hobby.

  • While I agree there is more than one great lesson in here, there is one thing I believe is missing:

    For some reason I don't think this piece of code was intended to actually be used while playing Scrabble (wouldn't that be cheating?). I think it was made because Eric enjoyed writing it, and as a bonus it could serve as a subject for some interesting reading material.

    While the rationale 'good enough is good enough' applies to the code we build for our customers, I find there is a very good reason to optimize the hell out of an algorithm that we built just for ourselves: it is a lot of FUN!

    At the point where Eric said "But suppose that we did want to improve the performance", we got into the world where the customer no longer rules and we are allowed to tweak it up to the last CPU cycle! Just because we can!

  • I find the rationale "good enough is good enough" to be inappropriate for many problems I deal with. I'm a student with an interest in pattern classification problems; a faster algorithm means using a larger (better) data set, or at least using more folds in cross-validation. At the very least, it can mean more test runs, and thus a better understanding on my part of the behavior of the system. Performance has never yet been fast enough, and I doubt it ever will be in my lifetime.

    A fundamental problem with the approach of writing code and then fixing it is that it's not always easy to fix code that has the wrong performance characteristics. For instance, HashSet<string> is rather slow due to the hash-code computation. Replacing string with a custom type that precomputes the hash can improve performance many-fold; but if the application is complex, this new type must be used throughout the API to avoid recomputing the hash.
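
    Something like this rough sketch is what I have in mind (illustrative only; the type name is made up):

        // Wraps a string and computes its hash code exactly once, so that
        // repeated HashSet/Dictionary lookups don't recompute it.
        struct PrehashedString : System.IEquatable<PrehashedString>
        {
            private readonly string value;
            private readonly int hash;

            public PrehashedString(string value)
            {
                this.value = value;
                this.hash = value.GetHashCode();
            }

            public override int GetHashCode()
            {
                return hash;
            }

            public bool Equals(PrehashedString other)
            {
                return value == other.value;
            }

            public override bool Equals(object obj)
            {
                return obj is PrehashedString && Equals((PrehashedString)obj);
            }
        }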

    Recently I was looking at basic vector operations, which raises the question: which implementation of a vector gives the fastest performance? If you think you can just use an interface (say, IList) and then later replace the internals as appropriate, you'd be mistaken: a simple test shows that dot products computed on an array cast to IList are 50 times slower than when computed directly on the array. Using a C++ implementation improved performance a further 60%. In 64-bit mode the results were even more awkward: simply using a static, presumably trivially inlinable method to compute the dot product (a setup with little to no overhead in 32-bit .NET and in C++) resulted in a slowdown of more than a factor of 2.

    My conclusion would be that, particularly in .NET, semantic details such as IList vs. List vs. array make large differences. If your application is likely to be disk-bound or in some other way "complex" then sure, you can afford to ignore such differences - but that's mostly because CPU time is often not an issue at all.

    So sure, you could write an app and then fix the performance, but if you know which operations have reasonable overhead up front, you can avoid using other operations in tight loops altogether.

    I placed the results online: http://eamon.nerbonne.org/2009/02/net-numeric-performance-disappointing.html

    Two points.

    First, I agree that making good decisions up front during the early design and implementation is key, particularly for large systems. Something that I've ignored for the purposes of this discussion is that for a large system, the process I've outlined needs to start early. If an early prototype of a large system is orders of magnitude away from the performance goal, you might have a big problem. Of course, that necessarily requires having known goals!

    Second, though I agree that faster is better, obviously faster costs effort. Lots of effort.

    I'm sure you would agree that more robust is also better. And more fully featured is better, less buggy is better, more secure is better, better documented is better, all these things are better and they all cost effort. There is only a finite amount of effort to go around on any project. Effort is our most precious resource and it needs to be doled out carefully to make sure that we're spending it where it can do the most good.

    -- Eric

  • Some of us use Visual Studio Express Editions.  While we have CLRProfiler to help spot out-of-control allocations, what do you suggest we do to spot performance issues?

    I am an expert on the C# programming language, not on what toolsets are available from what providers for what price. So I do not make suggestions. -- Eric

    MSFT expects hobbyists and academia to use Visual Studio Express Editions as an entry point into building MSFT applications.  However, how do we train the next wave of engineers to measure applications without tools?

    I am an expert on the C# programming language, not on budgeting of educational institutions. So, I don't know. -- Eric

    In the workplace, there is tremendous pressure to use open source platforms because of perceived cost savings and ROI. While MSFT has great studies showing that you get tremendous ROI with the expensive editions of Visual Studio, these studies fall on the deaf ears of CFOs. What can engineers who want to keep using tools like Visual Studio for coding do when they can't produce performant applications but also cannot measure them in order to improve them?

    What do you suggest? This has plagued a department under fire to switch to "free" tools for some time.

    I am an expert on the C# programming language, not on convincing CFOs that getting the best tools money can buy is a sensible investment. So I don't have any suggestions here either.

    Sorry that I'm completely unhelpful -- these are excellent points all, but are also all questions about things I know nothing about. You should ask these questions of people like Soma; it's his job to think about this stuff. -- Eric

  • Eric, while I like your new way of inserting your answers to comments, could you make them another color/background? I think that would make the two easier to separate visually, and there would be no need to use [[ ERIC: ]].

    Good idea. How's this? -- Eric

    Also, nice illustration of how to improve perf without losing the expressiveness of the code. I might borrow that example when giving training sessions, if you don't mind ;)

  • A couple of comments.  (Zero: I like your blog.)

    1) I like your suggestion to immediately incorporate tools, if you have them.  People who haven't investigated performance issues before tend to want to guess, and it's a waste of time.

    2) I'd like to encourage beginners who are new to profilers to remember that everything has to add up to 100%. In the example, the first performance issue is taken off the table, and then others apparently take its place. A common theme is for someone to clear out the 40% item and then say "Oh my gosh! Now XYZ() is taking 50%! How did that happen?" But SOMETHING has to account for the 100%, even if it's so blazing fast that the program is done before the sound from the mouse click reaches your ears.

    You address this concern implicitly by saying "the program is fast enough now, so we don't need to tweak it any more." (advice with which I totally agree).  

    Anyway, just a reminder to beginners that just because something takes the most time doesn't mean it's slow.  Thanks.

  • Eric, I just wanted to say thank you for your response.  I appreciate your being upfront that you are not the right person to direct these questions to.  I think too many times commenters would walk away angry without realizing that just because someone makes a post about point A does not mean they are the right person to ask about tangential point B.

    Thanks again.

    You're welcome. Thanks for your comments; you raise excellent questions. -- Eric

  • Eric, you can write a program that uses high-level concepts and then use a profiler to optimize it from 2 seconds to 1 second, or you can simply write a good program to begin with, and it will take 1 microsecond to compute the solution.

    Where "good" means what? Apparently in your case good means "incorrect, unnecessarily hard to read, and unnecessarily fast". I have a different definition of what makes code "good" -- "correct" is first on my list. -- Eric

    struct Dict {
      StrMap<cstring> words;
      cstring in,temp;
      Dict() {
        cstring contents(FileToString("c:\\words.txt"));
        cstring line;
        for(StringIter iter(contents); !iter.EofQ(); ) {
          strptr orig_line=iter.GetLine();
          line=orig_line;
          line.MakeLower();
          QuickSort(line.ToStrPtr());
          words.Add(line,orig_line);
        }
      }
      void Lookup(in_strptr word, cstring &answers) {
        answers.Flush();
        in=word;
        in.MakeLower();
        int idx;
        if ((idx=in.Find('?'))!=-1) {
          int idx2=in.FindFrom('?',idx+1);
          if (idx2!=-1) {
            frep_(i,26*26) {
              temp=in;
              temp.SetAt(idx,'a'+i/26);
              temp.SetAt(idx2,'a'+i%26);
              QuickSort(temp.ToStrPtr());
              if (shstring *result = words[temp]) {
                answers<<" "<<*result;
              }
            }
          } else {
            frep_(i,26) {
              temp=in;
              temp.SetAt(idx,'a'+i);
              QuickSort(temp.ToStrPtr());
              if (shstring *result = words[temp]) {
                answers<<" "<<*result;
              }
            }
          }
        } else {
          QuickSort(in.ToStrPtr());
          if (shstring *result = words[in]) {
            answers<<" "<<*result;
          }
        }
      }
    };

    Results: 0.2secs to initialize, and 1.5microsecs for the ANALYST example.

    I know you'll say that this is not fair, and that your post makes bigger points about the strategy of writing readable code, how to allocate programmer's time, etc. Granted.

    I am confused. Why would I say it's not fair? What's unfair about it? Aside from the fact that your implementation of the algorithm gives incorrect results -- it is almost always easier to go faster when you don't worry about correctness!  -- Eric

    But I think you're missing the biggest point of all -- that picking a good algorithm before sitting down to write the program will obviate the need to optimize later, because the resulting program will simply be as fast as it theoretically should be.

    I am even more confused. Your program uses almost the same _algorithm_ mine does. You just made the performance choice to cache the entire dictionary in memory rather than reading it off of disk every time. (You have not correctly cached the dictionary, of course, but you've cached enough of the dictionary to get the right answer for the small set of test cases you've tried. Hint: what happens when you do a search on the rack "OPST"? What results should you get? What results do you get?)

    You've traded massive memory use for a great improvement in search time, which I explicitly called out was a deliberate tradeoff that I chose to not make. What would you say to someone who turned this around on you and said that they needed to optimize for minimal memory use? Your program has terrible memory performance -- it uses up megabytes of memory when it only needs to use a few KB. Why do you automatically assume that speed is the most important thing?

    Also, your program is nowhere even close to "as fast as it theoretically should be". We could make it considerably faster. You haven't thought about this problem very deeply if you think this is as fast as you can get it. How would you change this program if you needed to keep the index in the processor cache, for example? How would you change it if the word list were too large to fit into memory but you still needed the speed? The O(1) dictionary lookup depends on a hash; is the hash function the bottleneck? If so, how would you improve it? If any of these were concerns then clearly you have not written a "good" program. Your implementation makes implicit assumptions about both the size of the problem and about what constitutes a "good enough" performance result. -- Eric

    It's really a shame to use 1 or 2 seconds to do 26 dictionary lookups (or 26*26), when Google takes a tenth of that time to search *the entire Internet*. Let's be real. Performance counts.

    Of course it does; that's my point. Performance counts to consumers, but it comes at a cost. It is important to understand deeply what the tradeoffs are between constrained resources like time and memory, and to know what your customer finds acceptable, so that you'll know when you're done optimizing. In your case, you haven't even managed to write a correct program, so optimizing it is rather premature! Customers typically do not want "wrong but fast". -- Eric

  • Eric -- touché, a normalized word can obviously map to more than one dictionary word, so the code should read:

    words.Add(line)<<" "<<orig_line;

    instead of

    words.Add(line,orig_line);

    Your other objections seem to be:

    1. it uses 2MB of memory

    2. it is inefficient compared to some other solution

    I'll address these one by one. First, your program takes up 5MB of ram the moment it's loaded, before the search even runs. I don't know why -- maybe the command line prompts aren't optimized in C# yet ;-)

    How are you measuring that? I suspect that you might be confusing RAM with address space. They are very different. Of course my program consumes lots of address space -- it loads the CLR and the framework. Who cares? Address space is cheap. The relevant measure is the scarce resource. On my desktop machine, RAM is not the scarce resource. What if it were? Or what if the dictionary size were large enough that your lookup table didn't even fit into the address space available?  -- Eric

    Second, 2 megs of RAM is cheaper than 1 second of time, that should be clear to us all. Up to a large enough limit (somewhere in the gigabytes these days), memory and time are interchangeable.

    1 second of time is ~2 billion operations.

    2 megs of RAM is ~2 million operations.

    So we have a factor of 1,000 improvement in frugality, and a factor of 1,000,000 in performance.

    I am not following your train of thought in the slightest here. 2 megs of RAM is cheaper than one second of time only when you've got lots of ram and little time. There are plenty of scenarios where you've got lots of time but are memory and storage constrained. Think cell phones, for example. When writing for cell phones the relevant metric is not time at all, but rather storage and battery consumption. -- Eric

    How much faster could we get? The correct measure of program efficiency is the number of bytes output divided by time. Well, in your original example, there are 64 characters of output. If the lookup is 1.5us, the efficiency is 23 nanoseconds per character.

    No, the correct measure of program efficiency is units of useful work divided by value of resources consumed. Why do you believe that "bytes output" and "time consumed" are always the relevant metrics? The absolute utility of both the work and the resources is not determined by you or me, it is determined by our customers, who are, after all, the ones spending their money on the solution. No performance process can be successful without understanding the utility as seen from the perspective of the customer. -- Eric

    Potentially, then, there is another factor of 50 or so that could be squeezed out. But by any metric, the solution is acceptable. Why do I say that? Because in the web space, where we are currently located, the usefulness is measured by how many people you could serve if you hosted your algorithm online. With a properly "performant" solution, a million times more as it turns out.

    This metric is completely artificial and has nothing whatsoever to do with my scenario. I'm writing this program for myself, I'm the only customer, and I'm not hosting a service online. By considering only metrics relevant to a narrow set of scenarios, and by artificially restricting the bounds on the problem to one that can be solved using your caching strategy then of course your solution is superior. It is inferior when things are different. 

    I could say that usefulness has nothing whatsoever to do with "how many people I can serve online". Suppose I were to say that usefulness is defined as "how large a dictionary I can support". Does your solution scale to a dictionary that could hold the entire human genome? If I were solving a gene sequence lookup problem instead of a scrabble rack lookup problem then your solution would not be suitable. 

     Why is "millions of times more simultaneous users" the sensible thing to optimize for and not "millions of times larger dictionary", or "millions of times less available storage", or any other possible direction we could push this problem? You've just picked one of them and enshrined it as the only sensible direction to push performance. It's not. My point is that you've got to look at what the customer actually needs before you make choices on how to optimize the implementation. -- Eric

    I would love it if you published a faster solution in C#. In fact, I dare you to write an anagram lookup in under 5 nanoseconds per output character using your original AALNST? example.

    You dare me? Wow. No one has dared me to do anything in about twenty-five years. That didn't work on me when I was ten years old and it certainly doesn't work now! -- Eric

  • In argument, the spectator learns... Thanks.

  • Eric, your ideas of memory vs. CPU power and availability need to be updated.

    A phone like the BlackBerry Bold has a 624 MHz processor / 128 MB of RAM.

    An old Sony Ericsson P910 sports 160 MHz / 64 MB of RAM.

    On these phones, your program would take 10x-50x the time it takes on your desktop, or 10-50 seconds even after your optimization.

    Why? My program is disk-bound, not processor-bound. Since cell phones do not have spinning metal disks, the timings are going to be very different. But, as we'll see, that's irrelevant. -- Eric

    It's just too slow, any way you sell it.

    First off, it's not "too slow". As I keep saying, as the only customer of this program on the desktop, I am the only one who gets to decide what "too slow" is, and it isn't too slow.

    But let's consider the cell phone scenario for a moment. In that scenario speed is not particularly important to the customer. Storage consumption and battery consumption are important to the customer. To solve this problem on a cell phone I would not use your cached-dictionary approach, for two reasons. First, it is far too memory intensive, and second, setting up the dictionary is far too processor-intensive. Rather, I would solve both problems by precomputing the dictionary into a b-tree of canonical forms. Or, rather, actually probably fourteen separate b-trees, one for each word length. That data structure on disk would make it very cheap in both storage and processor time to rapidly find matches without doing any on-phone processing of the dictionary.

    (Then there's the question of how to efficiently store the "posting list" of matches. There are a number of clever things you can do to compress posting lists, but that would take me rather far afield.)

    -- Eric

  • It's miraculous that after having it so thoroughly explained to you, you still do not understand Eric's points, Smartass.

    Time is not an absolute priority. There may be uses where other factors are more important. In any case it is vital to first establish your priorities for your specific use, and only then optimise.

  • If you're suggesting that 'performant' means something other than a performer in a play or similar, then you're wrong. There's no such word. There may be references on Wikipedia or elsewhere, but they're created by programmers who don't seem to realise that 'efficient' is a perfectly good synonym. No more excuses.

    Fascinating -- you have a way of knowing when there is "such a word" or not. I would love to learn how to categorize utterances into words vs non-words. Can you give me a precise definition for what makes an utterance a word vs a non-word?

    As for your claim that "efficient" and "performant" are synonyms, clearly that is incorrect. "Efficient" is defined (by me) as "consuming a small number of relevant resources per unit of useful work". "Performant" is defined (by me) as "meeting or exceeding the measurable performance expectations of a typical user".

    Obviously there is a relationship between being efficient and performant, but a program can be efficient without being performant, and performant without being efficient, so clearly they cannot be synonyms.

    I notice however that you have made an interesting error in logic. You are claiming that "performant" and "efficient" are synonyms, and at the same time that "performant" is not a word. But surely to be "synonyms" two things have to both be words! How can two things be synonyms if one of them isn't even a word?  -- Eric

     
