The thing about performance work is that it's very easy to be fooled into looking at the wrong areas.  That's why you want your changes to be firmly grounded in data whenever they can be.  Or if you're planning, you want to be thinking about what your customer experience needs to be, what the key metrics will be, and how your overall architecture will support those needs.  Whatever it is that you're doing should be as grounded as you can make it for the stage you are in.  After that you should be doing whatever you can to verify your assumptions, as often as you can at reasonable cost.  The trick is to not go crazy applying too much effort doing all this, because that's just as bad.

Take this little parser example.  You could conclude that all that string allocation in GetToken() is the crux of the problem -- and you'd be partly correct -- but it's not really the full picture.

To try to see what was going on, one of the first things I did was try to eliminate the GC as a source of possible issues while leaving the rest as similar as possible.  This actually illustrates a cool technique even though it turns out that it's not really the right approach.  I wanted to eliminate all of those Substring calls while parsing -- since I expect we'll be parsing over and over again.  The substring itself is just one example of looking at the input more than once -- a definite no-no in any high performance parser.  But let's see where this leads us.

What I want to do is add a member that will hold the various fact and token strings that occur in the predicates once and for all.  I'd like to have something like this:

private static System.Xml.NameTable ntTokens = new System.Xml.NameTable();

Now the reason I like this class so much is that it's one of the few that offers you the opportunity to create strings without having to first allocate a string.  You can call it with a character buffer and get a string back.  That's just what we need... only the rest of the plumbing isn't quite right.
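To make that atomizing behavior concrete, here's a tiny standalone illustration (not part of the parser -- the buffer contents are made up for the example):

        // Standalone illustration: NameTable.Add atomizes the characters, so
        // repeated content comes back as the very same string instance -- no
        // new string allocation after the first.
        static void NameTableDemo()
        {
            System.Xml.NameTable nt = new System.Xml.NameTable();
            char[] buffer = { 'f', 'a', 'c', 't', '1' };

            string first = nt.Add(buffer, 0, 5);    // creates and atomizes "fact1"
            string second = nt.Add(buffer, 0, 5);   // returns the existing instance

            System.Console.WriteLine(object.ReferenceEquals(first, second));  // True
        }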

I want to make these changes:

<                return strInput.Substring(idx, ichInput - idx);
becomes
>                // returns an existing string if there is one
>                return ntTokens.Add(arrayInput, idx, ichInput - idx);

and

<                    return strInput.Substring(idx, 1);
becomes
>                    // returns an existing string if there is one
>                    return ntTokens.Add(arrayInput, idx, 1);


To do that I also have to change the input from a string to a char array:

<        private static string strInput = null;
becomes
>        private static char[] arrayInput = null;


<        static bool EvaluatePredicate(string predicate)
<        {
<            strInput = predicate;
becomes
>        static bool EvaluatePredicate(char[] predicate)
>        {
>            arrayInput = predicate;

There are a couple more places where strInput has to be changed to arrayInput, like the getc() function, but those changes are very straightforward.
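The original getc() isn't reproduced in this excerpt, but for illustration the array-based version ends up looking something like this sketch (the details of the real one -- return type, end-of-input convention, whether it advances the index -- may differ):

        // Hypothetical sketch only -- the real getc() isn't shown here.
        // The point is that it does nothing but a bounds check and a read
        // straight out of the in-memory char array instead of strInput.
        static int getc()
        {
            if (ichInput >= arrayInput.Length)
                return -1;                      // end of input

            return arrayInput[ichInput++];      // assuming it advances the index
        }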

For my test case I wanted the predicates in character arrays now, so I did this quick and dirty thing (in a real program you'd likely read the predicates from a file, so you could just arrange for them to be in char arrays in the first place):

        struct CharArray
        {
            public char[] ptr;
        }

        static CharArray[] predicateChars = new CharArray[predicates.Length];

and then later:

            for (i = 0; i < len; i++)
                predicateChars[i].ptr = predicates[i].ToCharArray();

            for (i = 0; i < len; i++)
                Console.WriteLine("{0} => {1}", predicates[i],
                      EvaluatePredicate(predicateChars[i].ptr));

And for the test loop itself simply call EvaluatePredicate using predicateChars.
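The timing loop itself isn't shown above; it amounts to evaluating every predicate over and over, roughly along these lines (the iteration count here is just a placeholder, not the one I actually used):

            // Rough sketch of the timing loop -- the real harness and its
            // iteration count aren't shown in this excerpt.
            for (int iter = 0; iter < 10000; iter++)
                for (i = 0; i < len; i++)
                    EvaluatePredicate(predicateChars[i].ptr);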

OK, that little experiment is about the smallest change that removes any need to create substrings while changing nothing else.

On my test system the original code was taking 4.839s to run; the new version with these improvements takes 4.680s.  That's about a 3.3% savings.

Now I have to tell you in all honesty that if you had asked me to comment on that change offhand, I think I would have told you that I'd expect a much greater impact than just 3.3%.  Not that 3.3% is anything to sneeze at (see The Performance War -- Win it 5% at a time), but that isn't really the killer issue.

Let's go back and look at what the performance report shows and see if any of this makes sense.

Well, the telling line is this one right here:

Exclusive% Inclusive% Function Name
0.01       4.62             WKS::GCHeap::Alloc(unsigned int,unsigned long)

The sum total of the allocations on that substring path, including all the garbage collections, was only 4.62%.  It's no surprise then that we got less than a 4% improvement by only eliminating the garbage and doing nothing else.  The collector was actually doing a pretty darn good job of cleaning up that trash on the cheap!

Even looking further at the string creation costs, you'll see that InternalSubStringWithChecks has an inclusive time of about 22%.  That means the bulk of the time spent creating these substrings had more to do with setting up for the array copy, moving the memory, checking the indexes for out-of-bounds, and so forth than with the allocation itself.  We can only attribute 4.62% of the time to actually allocating the memory.  All of that checking and rechecking is costing us.

Let's look elsewhere:  the getc() function is taking a total of 10.49% of the time -- that's more than the collector!  It doesn't do anything but a couple of checks into an in-memory array, but it's called an awful lot.

Looking at some other interesting operations, you can see that just hashing strings in this run, so that we can do fact lookups, is costing us 2.9%.  That's an astounding figure because you'll recall that ALL of our allocation activity was only 4.62%.  Who would have expected that the allocation figure would be so low that just the string hashes could compete with it?  And that's not all: if you look at the full comparison situation, you'll see that checking for existing entries in the hashtable accounts for 22% of the run time.  Again, that whole operation is something we do merely to find out if any given fact is true or false.

Well, all of this really changed my mind about how to attack this problem.  My plan looks like this (there's a rough sketch of the evaluation step after the list):

  • preparse the predicates so that no string processing is required for evaluation
  • assign facts a number in sequence as they are encountered in the predicates
  • the preparsed predicate is a mix of fact numbers and operators probably in postfix order
  • when looking up facts simply access an array of truth bits indexed by fact number
  • as facts change simply change the array contents and then reevaluate predicates as desired
  • do the evaluation itself with a simple stack machine; no recursion in the evaluation
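Just to make the shape of that last step concrete before the real code shows up in the next part, here's a rough sketch of a postfix stack machine over fact numbers.  The encoding here (non-negative values as fact numbers, negative sentinels as operators) is purely illustrative and not necessarily how I'll actually do it:

        // Illustrative sketch of postfix evaluation over fact numbers.
        // Assumed encoding (not the real implementation): values >= 0 are
        // fact numbers, negative sentinels are the operators.
        const int OP_AND = -1, OP_OR = -2, OP_NOT = -3;

        static bool EvaluatePostfix(int[] code, bool[] factBits)
        {
            bool[] stack = new bool[code.Length];   // big enough for any operand depth
            int sp = 0;

            foreach (int op in code)
            {
                switch (op)
                {
                    case OP_NOT: stack[sp - 1] = !stack[sp - 1]; break;
                    case OP_AND: stack[sp - 2] &= stack[sp - 1]; sp--; break;
                    case OP_OR:  stack[sp - 2] |= stack[sp - 1]; sp--; break;
                    default:     stack[sp++] = factBits[op]; break;   // fact lookup by number
                }
            }

            return stack[0];   // no strings, no hashing, no recursion
        }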

I'll blog some code that does this in the next part and we'll see how it does.

Any predictions based on what we've seen so far?

See the continuation in Performance Quiz #8 -- The problems with parsing -- Part 4