However careful you are with your performance culture when creating everyday professional code, you need to be doubly careful when creating libraries for other developers.  This is of course because when writing a library you can't know in advance exactly which functions will be used, in which ways, and how often.  Even esoteric-looking functions might become vitally important in some customer scenario -- so more than the usual caution is warranted.

Below is an edited bit of advice I gave my own team a while ago, presented for your benefit.  I present these as guidelines, but like all performance-related advice you have to take it with a grain of salt... there are times to break the rules.  These experiences are based on what I've seen in our own Base Class Libraries (BCL) but they're often applicable to other libraries.

The motivation for many of these rules comes from the fact that simple functions in the BCL frequently end up getting called O(n) times (or even O(n*lg(n)) times) in the course of larger-scale operations.  It is because so many things are built on top of what we've done that we must be extra careful.

Principle #1:  Hashing anything must be a linear time operation and must not allocate any memory

If we spend more than O(n) time (n representing the size of the object to be hashed) doing the hashing, then we risk impacting the effectiveness of the hash collection in a major way.  We actually must do better than just O(n) -- we want to make exactly one pass over what needs to be hashed.  Similarly, if we're allocating any memory we will likely churn the heap like crazy when it comes time to do those high-speed lookups.
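
By way of illustration, here's a minimal sketch of the one-pass, zero-allocation idea -- the Point3 type and the mixing constants are made up for this example, not anything from the BCL:

    // Hypothetical value type, purely for illustration.
    struct Point3
    {
        public int X, Y, Z;

        // Exactly one pass over the fields, no temporary objects at all.
        public override int GetHashCode()
        {
            int hash = 17;             // arbitrary seed
            hash = hash * 31 + X;      // 31: a cheap odd multiplier
            hash = hash * 31 + Y;
            hash = hash * 31 + Z;
            return hash;
        }
    }

Contrast that with something like (X + "," + Y + "," + Z).GetHashCode(), which churns out temporary strings on every single hash.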

Principle #2:  Comparing any two objects must be a linear time operation and must not allocate any memory

This rule is perhaps even more important than #1 -- comparison operations are frequently used in the context of sorting or hash tables, so the operation might be repeated many, many times -- O(n*log(n)) times for a sort, for instance.  An example of a bad way to compare two tree-type structures is to call ToString() on both and then compare the serialized forms.  Again, we cannot afford to make more than one pass over each of the two input objects to do the comparison, and of course comparisons that can early-out are better than comparisons that cannot.
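
Here's a sketch of the early-out idea with a made-up byte-sequence comparer (assumes using System) -- one pass over each input, no allocations, and it stops at the first difference:

    // Hypothetical comparer: exactly one pass over each input, no
    // allocations, and it bails out at the first difference (the early-out).
    static int CompareBytes(byte[] a, byte[] b)
    {
        int len = Math.Min(a.Length, b.Length);
        for (int i = 0; i < len; i++)
        {
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;   // stop at the first difference
        }
        return a.Length - b.Length;            // shorter sorts first
    }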

Principle #3:  When parsing anything, know the input language and use a suitable technique for that input

A sure sign that you're parsing something is that the name of the function you are writing includes the word Parse in it. :)

When parsing, it is vitally important that you fully describe the input language -- not only for your own benefit but for the benefit of those who come after you.  Good ways to do this include regular expressions, grammars, and so forth.  In addition to describing the input language, you must describe the semantics of that language, i.e. what the various forms mean.  This is also vitally important because others looking at the code later will otherwise have to reverse-engineer the language you parse, and will be faced with the dubious prospect of wondering whether exotic input forms accepted by your parser -- forms unlike the canonical inputs -- were really intended or are bugs.
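
For example, a header comment along these lines (the version-string grammar here is entirely made up) spares the next reader all that guesswork:

    // Input language for TryParseMajorMinor (hypothetical example):
    //
    //   version := major '.' minor
    //   major   := [0-9]+
    //   minor   := [0-9]+
    //
    // Semantics: major and minor are decimal component numbers.
    // Whitespace is NOT accepted anywhere; trimming is the caller's job.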

Having specified the language, apply a suitable technique.  Good parsers make exactly one pass over their inputs and create no, or virtually no, temporary objects in the course of parsing, i.e. they create the output data structure and nothing else.  Bad parsers do things like make one pass over the input to convert all whitespace to blanks, followed by another pass to convert backslashes to forward slashes, followed by assorted Substring calls to create string pieces which can then be passed to helper functions.  Generally speaking, the sorts of things you need to parse in a low-level API can be described by a regular expression and are therefore parseable, in place, with only a tiny amount of state, directly generating the output.
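
Here's a sketch of what parsing in place can look like, using the made-up grammar above: one forward pass, an integer cursor as the only state, and no Split/Substring temporaries (overflow handling omitted for brevity):

    // Hypothetical one-pass parser for "major.minor".
    static bool TryParseMajorMinor(string s, out int major, out int minor)
    {
        major = minor = 0;
        int i = 0;

        // major := [0-9]+
        if (i >= s.Length || !char.IsDigit(s[i])) return false;
        while (i < s.Length && char.IsDigit(s[i]))
            major = major * 10 + (s[i++] - '0');

        // '.' separator
        if (i >= s.Length || s[i++] != '.') return false;

        // minor := [0-9]+
        if (i >= s.Length || !char.IsDigit(s[i])) return false;
        while (i < s.Length && char.IsDigit(s[i]))
            minor = minor * 10 + (s[i++] - '0');

        return i == s.Length;   // accept only if all input was consumed
    }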

To underscore these first three principles: very modest transformations in BCL logic netted a savings of 1.4M (yes, that's megs) of temporary strings in a test case that creates 5 basic app domains.  That's 1.4M of the 2.2M total in that scenario.

Principle #4:  Temporary string objects are not a good thing

Naturally there are exceptions, but by and large if you are using String.Substring or String.Split in low-level APIs you've probably made a mistake.  Consider using a technique in which you keep track of offsets within your original input string, and take advantage of the String.Compare overloads that can compare in the middle of a string.  This is especially important in the context of #1 and #2 above.  Note: this advice is really unique to libraries, and even then only to certain low-level libraries like mscorlib.
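
For instance, here's a sketch of testing for a known token at an offset without carving out a substring -- IsHttpAt, input, and pos are all invented for this example, but the String.CompareOrdinal overload that takes indexes is real:

    static bool IsHttpAt(string input, int pos)
    {
        // The allocating way creates a temporary string only to discard it:
        //   return input.Substring(pos, 4) == "http";

        // The non-allocating way compares in the middle of the string:
        return pos + 4 <= input.Length
            && String.CompareOrdinal(input, pos, "http", 0, 4) == 0;
    }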

Principle #5:  Don't create abstractions to enumerate things which can be trivially enumerated without an abstraction

For instance -- if you have an ArrayList, don't use the enumerator just to walk the elements; use an index, which is much faster and requires no extra memory.  In one scenario I found that 12% of the total allocations were going to simple ArrayList enumerators -- those could have been integer indexes with no loss of clarity.
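
Concretely -- Process here is just a stand-in for whatever per-element work you're doing (assumes using System.Collections):

    // Allocates an enumerator object for every loop.
    static void WalkWithEnumerator(ArrayList list)
    {
        foreach (object o in list)
            Process(o);
    }

    // Same clarity, zero extra allocations.
    static void WalkWithIndex(ArrayList list)
    {
        for (int i = 0; i < list.Count; i++)
            Process(list[i]);
    }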

Similarly, composing a BinaryReader on top of a MemoryStream just so that you can read bytes out of a byte array is not such a good thing to do.  You could instead simply index the bytes with an int.
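
A sketch of the difference (assumes using System.IO and a byte array with at least four elements):

    // Heavyweight: allocates a MemoryStream and a BinaryReader
    // just to read four bytes.
    static int ReadInt32Heavy(byte[] bytes)
    {
        return new BinaryReader(new MemoryStream(bytes)).ReadInt32();
    }

    // Lightweight: index the array directly
    // (little-endian, matching what BinaryReader does).
    static int ReadInt32Light(byte[] bytes)
    {
        return bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | (bytes[3] << 24);
    }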

Principle #6:  When creating data structures, pretend that following a pointer is a very costly operation

Actually, you don't have to pretend, because it really is costly compared to, say, moving to the next element of an array.

Modern processors can issue multiple instructions per clock when things are going great, but following pointers means more cache misses, which means more CPU time spent doing nothing.  You especially have to beware of data structures that are built up over a long period of time -- if the allocations did not happen near each other in time, then the extra benefits of locality that the GC provides can be lost.  That means cache misses, which bring your average performance way down, and an otherwise nice-looking data structure ends up killing you.
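
A toy illustration of the difference -- the same sum computed over a pointer-linked layout and over a flat one:

    // Pointer-chasing layout: every Next hop is potentially a cache miss,
    // especially if the nodes were allocated far apart in time.
    class Node { public int Value; public Node Next; }

    static int Sum(Node head)
    {
        int sum = 0;
        for (Node n = head; n != null; n = n.Next)
            sum += n.Value;
        return sum;
    }

    // Flat layout: the values are contiguous, so the same traversal
    // streams through cache lines instead of hopping around the heap.
    static int Sum(int[] values)
    {
        int sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum;
    }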

Principle #7:  You're not done with your implementation until you've measured its performance

If you remember nothing else, remember this: it's terribly easy to hook up something like CLRProfiler to your new implementation (you don't even need to rebuild!).  Please do look at what your code did and study what got allocated.  Make sure that it is satisfactory and as you expected -- if you don't do this, the job's not finished yet.

In a case I was looking at just the other day, innocuous-looking changes for readability and better factoring of the code (laudable goals) inadvertently caused a 2x space regression (i.e. the new way used twice the memory).  Those sorts of things jump right out at you if you use an allocation profiler -- and they're almost always drop-dead simple to fix.

I'd like to close with what I began with -- these are only guidelines, and the most frustrating thing about performance work is that there are powerful secondary and tertiary effects, so "even the wisest cannot tell" what any given change might do without proper experimentation and measurement.  So take all these thoughts under advisement only, and then remember #7.

Thanks for listening,