Santalic tailfans, part two


As I have said before many times, there is only one sensible way to make a performant application. (As an aside: perfectly good word, performant, deal with it!) That is:

  • Set meaningful, measurable, customer-focused goals.
  • Write the code to be as clear and correct as possible.
  • Carefully measure your performance against your goals.
  • Did you meet your goal? Great! Don't waste any time on performance analysis. Spend your valuable time on features, documentation, bug fixing, robustness, security, whatever.
  • If you did not meet your goal, use tools to discover what the worst-performing fixable thing is, and fix it.

Iterate on that process, updating your goals if it proves necessary, until you either have something that meets your goals or you give up.

My explicit goal for my little dictionary search program was that it give results fast enough that I would not be sitting there impatiently waiting for more than a few seconds. That's a very customer-focused goal, me being the only customer of this program. With only a very small amount of tweaking my program met that goal right away, so why would I spend any more time on it? The program takes just under two seconds to report the results for a typical query, which is faster than I can read.

But suppose that we did want to improve the performance of this program for some reason. How? Well, let's go down the list. We have goals, we have very clear code, we can measure the performance easily enough. Suppose we didn't meet the goal. The last thing on the list is "use tools to discover what the slowest fixable thing is".

A commenter conjectured that the performance bottleneck of this program was in the disk I/O. As you can see, every time we do a query we re-read the two megabyte dictionary file dozens of times. This has the benefit of using very little memory; we never have more than one line of the dictionary in memory at a time, instead of the whole four megs it would take to store the dictionary (remember, the dictionary is in ASCII on disk but two-byte Unicode if in strings in memory, so the in-memory size will double.)
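A line-at-a-time reader of that sort is typically an iterator block; here is a sketch of what such a helper might look like (the actual FileLines helper may differ in its details):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DictionaryFile
{
    // A sketch of a line-at-a-time reader; only one line of the
    // dictionary is held in memory per enumeration, but every query
    // re-reads the whole file from disk.
    public static IEnumerable<string> FileLines(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```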

That's a conjecture -- a reasonable conjecture -- but nevertheless, it's just a guess. If I've learned one thing about performance analysis it's that my guesses about where the bottleneck is are often wrong. I'll tell you right now that yes, the disk-hitting performance is bad, but it is not the worst thing in this program, not by far. Where's the real performance bottleneck? Any guesses? Could you know without using tools?

Here's the result of a timing profile run of a seven-letter query with one blank, repeated 20 times:

43%: Contains
21%: ReadLine
14%: Sort
7%: ToUpperInvariant
15%: everything else

Holy goodness! The call to the Contains extension method in the query to test whether the dictionary word is in the rack set array is almost half the cost of this program!

Which makes sense, once you stop to think about it. The "Contains" method is by its highly general nature necessarily very naive. When given an array, 99.9% of the time it has to look through the entire 26-item array because 99.9% of the time, the word is not actually going to match any of the possible racks. It cannot take advantage of any "early outs" like you could do if you were doing a linear search on a sorted list. And each time through the array it has to do a full-on string comparison; there's no fancy-pants checks in there that take advantage of string immutability or hash codes or any such thing.

We have a data structure that is designed to rapidly tell you whether a member is contained in a set. And even better, it already does the "distinct" logic. When I replace

    var racks = (from rack in ReplaceQuestionMarks(originalRack)
                 select Canonicalize(rack)).Distinct().ToArray();

with

    var racks = new HashSet<string>(
                from rack in ReplaceQuestionMarks(originalRack)
                select Canonicalize(rack));

suddenly performance improves massively. The "Contains" drops down to 3% of the total cost, and of course, the total cost is now half of what it was before.

Another subtle point here: notice how when I changed the type of the variable "racks" from "array of string" to "set of string", I didn't have to redundantly change the type thanks to implicit typing of local variables. I want to emphasize the semantics here, not the storage mechanism. If I felt that communicating the storage mechanism was an important part of this code -- because it has such a strong effect on performance -- perhaps I would choose to emphasize the storage by eschewing the "var".

With this change, the program performance improves to about one second per query and the profile now looks like this:

39%: ReadLine
23%: Sort
11%: ToUpperInvariant
7%: the iterator block goo in FileLines
5%: ToArray (called by Join)
15%: everything else

Now the bottleneck is clearly the combination of repeated file line reading (48%) and the string canonicalization of every dictionary line over and over again (37%).

With this information, we now have data with which to make sensible investments of time and effort. We could cache portions of the dictionary in memory to avoid the repeated disk cost. Is the potential increase in speed worth the potentially massive increase in memory usage? We could be smart about it and, say, only cache the seven- and eight-letter words in memory.
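A selective cache along those lines might be sketched like this (the helper name and shape are mine, not the original program's):

```csharp
using System.Collections.Generic;

static class CachedDictionary
{
    // Hypothetical selective cache: read the dictionary once and keep
    // only the word lengths this program queries, trading a modest
    // amount of memory for the repeated disk reads.
    public static Dictionary<int, List<string>> BuildCache(
        IEnumerable<string> lines, params int[] lengths)
    {
        var wanted = new HashSet<int>(lengths);
        var cache = new Dictionary<int, List<string>>();
        foreach (int length in lengths)
            cache[length] = new List<string>();
        foreach (string line in lines)
            if (wanted.Contains(line.Length))
                cache[line.Length].Add(line);
        return cache;
    }
}
```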

We could also attack the canonicalization performance problem. Should we perhaps precompute an index for the dictionary where every string is already in canonical form? This in effect trades increased disk space, increased program complexity and increased redundancy for decreased time. Should we use a different canonicalization algorithm entirely?
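The precomputed index would amount to a one-time preprocessing step something like this (the tab-separated file format is my invention, and Canonicalize here is my reconstruction of the uppercase-and-sort canonicalization the profile shows):

```csharp
using System;
using System.IO;
using System.Linq;

static class CanonicalIndex
{
    // Hypothetical one-time preprocessing: store each word alongside
    // its canonical form so that queries never have to re-canonicalize
    // dictionary lines. Trades disk space and redundancy for time.
    public static void WriteIndex(string dictionaryPath, string indexPath)
    {
        var pairs = File.ReadLines(dictionaryPath)
                        .Select(word => Canonicalize(word) + "\t" + word);
        File.WriteAllLines(indexPath, pairs);
    }

    // Uppercase the word and sort its letters, per the profile's
    // ToUpperInvariant and Sort costs.
    public static string Canonicalize(string s)
    {
        char[] chars = s.ToUpperInvariant().ToCharArray();
        Array.Sort(chars);
        return new string(chars);
    }
}
```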

All of these decisions are driven by the fact that I have already exceeded my performance goal, so the answer is "no". Good enough is, by definition, good enough. If I were using this algorithm to actually build a game AI, it would not be good enough anymore and I'd go with some more clever solution. But I'm not.

  • To me, all these Code Wars above show just how fiercely protective we, programmers, are over our own ideas (and over our own code that realizes those ideas).

    But the same goes for our customers: they usually have their own ideas as to what's important in the applications they use, and what's not. So, while some people out there may demand instant gratification and get frustrated by the slightest delay, others may put up with a long wait, so long as a funny animation or some cool music is played for them in the meantime.

    Of course, there are situations where objective criteria outweigh user satisfaction: if an application is designed to set off an emergency shutdown of a nuclear reactor gone bananas, and does it on time, it performs just fine, no matter what anyone thinks (they have to be alive to think it, right?) But if we don't go to such extremes, at the end of the day, the performance criteria are actually subjective.

    I disagree -- the extremes are subjective as well. We have made the subjective decision that some risks of exposure to hazardous conditions are acceptable, and some are not, and we budget accordingly.

    The hard part -- particularly in your human-life-safety scenario -- is how to turn subjective feelings about what is acceptable into crisp, objective requirements that we can program against. With nuclear plant shutdown software the customer has a subjective requirement -- "safe enough" -- and then has to turn that into an objective requirement -- "software responds to outlier condition and initiates feedback-driven control in less than 20 milliseconds of sensing condition". Is that "safe enough"? That depends on what the costs are of building the system compared to the benefit of mitigating the risk.

    And on that subject, if you ever get the chance to tour the insides of a reactor safety system, I recommend it. They've got some fascinating stuff in there. (I have a friend who is a safety officer at a power plant.) I was particularly impressed by the gravity-driven reactor poisoning system. Simple and effective. -- Eric

    And so are our ideas about what is "performant", "performative", "performaniac", or "performatilicious" :-)

    And for those who like things formal, look into ITIL: utility: fit for purpose; quality: fit for use.

  • Maybe I overlooked it, but where did you get that scrabble word list from? I would really like to try it and write a similar program, too.

    Thanks!

  • Eric,

    An excellent column!

    For my take on the same theme, see http://lyontamers.com/blogs/jimlyon/archive/2008/08/25/performance-rant.aspx

  • @Max

    Although using your phrase “where did you get that scrabble word list” in Google is probably not the best place to start, I found some interesting information using the phrase “2006 scrabble tournament word list”.

    I use the word “best” in this comment thread with some trepidation; note that I will not be defending its use.

  • Great post.  Articles like this blog entry should be repeated all over the place along with those others that you linked to.  All good points.  I see all kinds of weird code added to applications "for performance" without being measured or even understood.

    "Did you meet your goal? Great! Don't waste any time on performance analysis."

    I would say one exception to this is if you are using SQL to access a database and you are concatenating user strings into SQL rather than using parameters, you have to fix that, even it performance is adequate.  If the concatenation prevents query plan reuse, you can be killing the performance of other applications (or future applications) accessing the same database server.  Not to mention that this is a huge security hole -- I think you've blogged about SQL injection before.   But even if you are building the SQL string based on something not coming from the user, and it changes from one statement to the next, it could still prevent plan reuse (Oracle is really sensitive to this), and this can kill the performance of another application sharing the same database.

    My point is that sometimes code that performs well enough still needs to be refactored for the performance impact it has on other applications.  Make sure not to test the application in isolation.

    An excellent point. Another reason why performance goals and performance tests should be based on scenarios relevant to real-world customers. -- Eric
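For concreteness, the commenter's advice amounts to preferring the second of these two shapes (classic ADO.NET here; the table and column names are illustrative):

```csharp
using System.Data.SqlClient; // assumes the classic ADO.NET SQL Server provider

static class QueryExamples
{
    // Concatenation is both an injection hole and plan-cache-hostile,
    // because every distinct input produces distinct SQL text:
    //   string sql = "SELECT Id FROM Users WHERE Name = '" + name + "'";

    // Parameterized: one SQL text, one reusable plan, no injection risk.
    public static SqlCommand FindUser(SqlConnection connection, string name)
    {
        var command = new SqlCommand(
            "SELECT Id FROM Users WHERE Name = @name", connection);
        command.Parameters.AddWithValue("@name", name);
        return command;
    }
}
```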

  • "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Donald Knuth)

  • Look, if you're going to use a word that isn't in the dictionary, then it would be useful if you actually defined it beforehand. How do I know your definitions of 'performant' or 'efficient' when you don't actually say? We're not psychic!

    But... the whole article is about what my definition of "performant" is. That's the point of this article. -- Eric  

    Anyway, I don't really care if it's not synonymous with 'efficient'. Why should I go chasing around for definitions of non-words when there are perfectly good phrases that will do, such as the one you suggested - 'exceeds expected performance' seems fine to me. For someone who seems to crave precision, for you to use a word that isn't in a trusted dictionary - which I would think is the very definition of a 'real' word - smacks of hypocrisy.

    A few things come to mind. 

    First, now you're making a distinction between "non-words" and "real words". I think there is a difference between these two things. "zqsr;7bv" is a non-word. Your claim, as I understand it, is not that "performant" is not a word. Rather, that it is a word, just not a "real word".

    Second, sure, why use a word when a phrase will do? The French, for example, have no "real" word for "sandwich". They are forced to use the equivalent phrase to "some meat and cheese between two slices of bread". True fact! The French Academy has the right idea; why pollute the language with ridiculous neologisms named after English gambling addicts when there are perfectly good short phrases that you can use instead?

    And third, finally we come to your definition of what makes a word "real". "Real" means "found in a standard dictionary" according to you. What are the implications of this definition of "real"?

    Well, for one thing it implies that English had no "real" words in it at all before 1755, when the first dictionary that could reasonably be called "standard" was published. English only had "unreal" words in 1754! How fascinating!

    Another interesting consequence of this definition is that all recent neologisms somehow made the transition from "unreal" to "real". I remember a day when "yuppie" was not in any standard dictionaries, and yet I very clearly knew what it meant. Was the widespread use of this "unreal" word in the media a bad thing? Was its promotion to "real" word a bad thing?

    I was curious about your unusual definition of "real", so I looked up "real" in some standard dictionaries. None of them listed "found in a standard dictionary" as a standard definition of "real".

    So I'm afraid I must turn this around on you and tell you that your use of "real"  is not by your definition a "real usage". And isn't that ironic? It's like ten thousand spoons when what you really need is a fork!

    (One wonders whether your "unreal" usage of "real" also "smacks of hypocrisy", but I really care very little about hypocrisy.)

    -- Eric

    I don't mind the other common Microsoft turns of phrase, like beginning every spoken sentence with "so", using "super" too often and using "ask" as a noun, but "performant" is an exception because it's undefined and acts as a barrier to understanding. The very fact that you provided your own personal definition of it in your comment speaks volumes.

    Hey, when I use a word it means exactly what I intend it to mean. Me and Humpty Dumpty.

    And please tell me, precisely what volumes does the fact that in your comment you provided your own personal definition of "real" speak? 

    Does your non-standard usage of "real" act as a barrier to understanding?

    And how do you suppose we non-psychics are supposed to know what your definition of "real" is?

    -- Eric

     

    http://weblogs.asp.net/jgalloway/archive/2007/05/10/performant-isn-t-a-word.aspx

    "Not a real word":

    http://www.urbandictionary.com/define.php?term=Performant

     

    Perhaps, given that the word "real" doesn't really mean what you think it does, the argument we should be having is about whether "performant" is a cromulent word. I believe it is. Perfectly cromulent word, performant. It embiggens the language. -- Eric

  • I've had the "Real Word" discussion many times in the past, and adopted a simple philosophy: words are simply a means of communicating ideas or emotions from one person to another. If I say it and you understand it, it's a word. Eric, I understood exactly what you meant by 'performant', therefore it's a word. And there ain't nothin' you can say to change my mind!

    If circumstances require precise understanding and minimum ambiguity, then the parties involved can agree on definitions, either directly or via reference; happens all the time in legal documents.

  • A very nice article, thank you Mr Lippert, and as always it appeared on-line just when it was pertinent to my daily work. It seems that after all you must be a psychic!

    A little off topic I know, but I sense that someone (begins with Dave and ends with R.) is really talking from somewhere that most people use for other, more base, functionality!

    Words tend to make it into a dictionary once they can be cited from multiple sources, whilst having a common intended meaning.

    A good article on how words make it into the Oxford English Dictionary can be found here

    http://www.askoxford.com/worldofwords/newwords/newwordsdict/

    But this in itself does not make a word real or unreal - or maybe it does, because we still do not know what criteria must be fulfilled to make a word either one.

    I, and several people I have interlocutionary acts with, can be heard using the word "munter" from time to time, and as far as I know it is not in any dictionary, but when people who do not know this word hear it being used  they do not say "you cannot use that word, it's not in the dictionary", they simply ask "what does that mean?"

    Your article quite clearly defines your usage of the word "performant", and therefore I find your usage of that word to be performant in those terms, as I'm sure do many.

    Language is not set in stone, be it a natural language or an artificial one -do we argue that you can't use the keyword "dynamic" in .net 4.0 just because it is not defined in .net 3.0?

    I'm not sure what volumes it speaks, but it certainly does speak volumes (I believe according to Dave's volumes at least)

    What did the Red Queen say? "A word means exactly what I mean it to mean, at exactly the time I mean it to mean it" or something along those lines, but certainly on this topic we most definitely are "running to stand still".

     

    "When I use a word," Humpty Dumpty said in a rather a scornful tone, "it means just what I choose it to mean – neither more nor less."
    "The question is," said Alice, "whether you can make words mean different things."
    "The question is," said Humpty Dumpty, "which is to be master – that's all."
     

     

    And as Columbo said so many times before, "there's just one more thing ..." NOBODY likes a smartass, I just love it when people are so far up their places that Dave R. speaks from that they can't use their real name in a post ... if someone thinks he's great, he is for the most part quite tiny!

    I make mistakes and am proud of it; how else would I learn? Notice that I'm proud that I make mistakes, I'm not proud OF them. Admitting you are wrong is a very difficult thing to do when you never are, and learning is very difficult to do when you already know everything.

    Why do people who know everything already bother reading blogs posted by mere mortals?

  • Great article, even better ripostes.

    It seems to me though, Eric - if I may call you Eric? - that the ability of customers to give clear performance requirements would seem to make our jobs easier.  So just to summarize, you're saying that the time it takes to move the code from acceptable to er... acceptable is wasted time? Surely not. ;)

  • I agree with the majority of your post (especially test first, assume never), and I appreciate that you are bringing this underappreciated topic into the light. However, in every application I have worked on performance *is* functionality. Granted, I write 'scientific computing' type code in video processing & mathematical modelling, but in every case a faster algorithm gives me more freedom to implement automated optimisation or generate more accurate results through faster polling. In my experience, if someone claims that current computers / a given algorithm are 'fast enough', I conclude that they are not thinking hard enough about what their application *could* do for their customer.

    It cannot do anything for the customer if it never ships. Yes, everything could always be better, but you can't let that stop you from shipping it when it is good enough. And how do you know when it is good enough? -- Eric

    Do you want to be renowned for 'bare minimum', or 'sheer genius'? Granted, this does not apply to nearly trivial cases of applications with clearly defined, limited scope and functionality. If you're writing a notepad for your granny, why are you reading a blog about performance anyway?

    Aside: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." is valid for a sufficiently constrained value of 'small' but is trite and misleading, say 97% of the time.

    Bottom line, if functionality != performance, you're not being creative enough.

  • Doug: Your argument doesn't generalize at all.  Pick another highly performance-centric area: video games.  If (and it's not unheard of) you blow 90% of your development time on engine optimization and have to squeeze gameplay into the remaining 10%, there's no way you hear 'sheer genius' or 'creative' describing you or your product, you just hear derision!  Optimization should be about everything, not just speed.

  • I picked a different approach.  Since you have to iterate the entire dictionary anyway, I wrote a matching function that compares the pattern to each word of appropriate size, rather than pre-computing all possible variations.  The ReadAllLines version below runs in just over 100 ms on my machine and the version using your original method of reading lines runs in around 400 ms.  (The difference being that the disk access is not happening during the timing, not that one method is faster than the other)

    I don't guarantee that it works perfectly but in the checks I have made, it appears to be working properly.

        private static IEnumerable<string> SearchDictionary(string Letters)
        {
            char[] S1 = CanonicalizeToArray(Letters.Replace("?", ""));
            return from line in FileContents
                   where line.Length == Letters.Length || line.Length == (Letters.Length + 1)
                   where CompareString(S1, CanonicalizeToArray(line))
                   select line;
        }

        private static string[] FileContents = System.IO.File.ReadAllLines(@"C:\projects\scrabble_word_finder\twl06.txt");

        private static char[] CanonicalizeToArray(string s)
        {
            char[] chars = s.ToUpperInvariant().ToCharArray();
            Array.Sort(chars);
            return chars;
        }

        private static bool CompareString(char[] S1, char[] S2)
        {
            int Wilds = S2.Length - S1.Length;
            int Counter1 = 0;
            int Counter2 = 0;
            while (Counter1 < S1.Length && Counter2 < S2.Length)
            {
                if (S1[Counter1] == S2[Counter2])
                {
                    Counter1++;
                    Counter2++;
                }
                else
                {
                    if (S1[Counter1] < S2[Counter2])
                    {
                        return false;
                    }
                    else
                    {
                        Counter2++;
                    }
                    Wilds--;
                    if (Wilds < 0)
                    {
                        return false;
                    }
                }
            }
            if (Counter1 + Wilds == S2.Length)
            {
                return true;
            }
            if (Counter1 < S1.Length)
            {
                return false;
            }
            return true;
        }

  • A while ago I had a similar need.  I needed a routine to validate whether or not what was being typed was possibly a word (given a limited dictionary) or not, while the user typed.  

    So that as they type:

    A       [Valid]

    AP     [still possible]

    APE   [Valid word, and a root for others...]

    APEF [BZZT]

    And came to the conclusion that the slow performance was looking up the (possible) word.  What I wound up doing was writing another program to take the dictionary and serialize it as a trie (http://en.wikipedia.org/wiki/Trie) and compiling it into the EXE as an embedded resource.  I'm sure it was horribly inefficient for disk space and would have won me no friends at a code review... but it worked.  Splendidly.  The customer's needs (mine) were met and I got to write a string trie implementation in C#.
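A minimal version of the trie the commenter describes might look like this (my sketch, not their implementation):

```csharp
using System.Collections.Generic;

// A minimal string trie for incremental word validation, in the
// spirit of the commenter's approach; this is an illustrative sketch.
class Trie
{
    private sealed class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEndOfWord;
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        Node current = root;
        foreach (char c in word)
        {
            Node next;
            if (!current.Children.TryGetValue(c, out next))
                current.Children[c] = next = new Node();
            current = next;
        }
        current.IsEndOfWord = true;
    }

    // True if any dictionary word starts with this prefix --
    // exactly the "still possible" check as the user types.
    public bool IsPrefix(string prefix)
    {
        Node current = root;
        foreach (char c in prefix)
            if (!current.Children.TryGetValue(c, out current))
                return false;
        return true;
    }

    public bool IsWord(string word)
    {
        Node current = root;
        foreach (char c in word)
            if (!current.Children.TryGetValue(c, out current))
                return false;
        return current.IsEndOfWord;
    }
}
```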

  • I read an interesting post on Eric Lippert’s blog recently where he was using LINQ to Objects to find
