In my role I interact with a lot of developers, and it continually amazes me how unaware an individual can be of the interactions their own code has with other components. I'll shorten this example to the critical points. In one instance, I recall talking with a developer about a performance issue in their code that was delaying responses to requests. In this code path we were looking up configuration information to decide how to handle a particular request. The lookup of the configuration information resulted in a call to Active Directory. In this customer's case, a stale DNS record was periodically impacting the lookups to GCs/DCs. Because the developer who wrote the client code wasn't aware of how the lookup worked (it was written by another team), they didn't realize that a dependency on outbound RPC, network connectivity, and inevitably DNS had been added to their code path.

This situation took over two weeks to nail down. Why did it take two weeks? Several factors: issue timing, customer scheduling, data transfer...

  1. It wasn't clear from the customer's explanation of the problem whether the performance issue was server side or client side. We needed to take a network monitor trace between the client and server. Looking at the trace, the requests were clearly making it to the server, and the server was taking a long time to respond.
  2. Hmm, so the problem is server side. Let's take a look at perfmon... yep, requests are taking a while to get a response back to the client. Guess we need to take some user mode dumps of the process.
  3. Found the client request thread... hmm... it's waiting on a DS lookup. Looking backwards in the stack, we find where the lookup is made. When I simply discussed it with the dev, they were very reluctant to believe it was occurring until I showed them the call stack. The dev had just discovered the dependency... oops. They were never aware of it because it always just "worked".
  4. Looking at a second netmon trace taken server side, we could see the server attempting to communicate with the directory and timing out, which eventually led us to the stale DNS record.

This is just one example of how important it is to be aware of who is calling your code, what components you call, and how each of them interacts. You may have the most benign piece of code, say a quota calculation that does a user lookup, that isn't itself critical; however, someone could easily pick it up and use it in a critical code path. Now that benign piece of code all of a sudden becomes critical, since it is called anytime someone saves data.
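One cheap way to make a hidden outbound dependency visible is to time every call that leaves your component. The following is only a minimal sketch in Python, not the code from the case above: `timed_dependency`, `slow_calls`, and the 0.5 second threshold are all hypothetical names and values I made up for illustration.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory record of slow outbound calls: (dependency name, seconds).
# A real service would more likely write to its log or a performance counter.
slow_calls = []


@contextmanager
def timed_dependency(name, warn_after_s=0.5, clock=time.monotonic):
    """Time the wrapped block and record it if it ran longer than warn_after_s."""
    start = clock()
    try:
        yield
    finally:
        # Runs even if the wrapped call raises, so timeouts get recorded too.
        elapsed = clock() - start
        if elapsed > warn_after_s:
            slow_calls.append((name, elapsed))


def quota_for(user, directory_lookup):
    """Benign-looking helper that quietly depends on an external directory."""
    with timed_dependency("directory"):
        return directory_lookup(user)
```

If someone later drops `quota_for` into a critical path, the slow directory call at least shows up under its own name instead of looking like a generic "the server is slow" problem.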

This might sound silly and most people may respond with a big "Duh, I know that", but it still happens more often than not.

I would highly recommend that when you look at design elements, you look at them from END to END, not just at your piece of the puzzle. Document the critical code paths and know the components used at every stage in those critical code paths. Customers don't call saying, "Hey, that configuration caching thingy is behaving poorly".
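Documenting a critical code path can be as simple as writing down who calls what and then walking the list. Here is a rough sketch in Python, assuming a hand-maintained map; the component names in `CALLS` are hypothetical stand-ins loosely modeled on the example above, not real component names.

```python
# Hypothetical map: each component -> the components it calls directly.
CALLS = {
    "handle_request": ["config_lookup"],
    "config_lookup": ["directory_lookup"],
    "directory_lookup": ["outbound_rpc"],
    "outbound_rpc": ["network", "dns"],
}


def all_dependencies(component, calls=CALLS):
    """Return every component reachable from `component`, direct or indirect."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in calls.get(current, []):
            if dep not in seen:  # the seen set also protects against cycles
                seen.add(dep)
                stack.append(dep)
    return seen


print(sorted(all_dependencies("handle_request")))
```

The owner of `handle_request` only wrote a call to `config_lookup`, yet the walk shows that DNS sits in the request path. That is the end-to-end view I am talking about.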

You need to know your Ins and Outs: What are the paths where I accept data and process it, and what are the paths where I communicate externally? How can I know how far a request has made it down a critical code path, and where did the request run into problems?
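To answer "how far did this request get?", one option is to have each request carry a small trail of the stages it has reached. This is a hedged sketch in Python; the `RequestTrace` class and the stage names are hypothetical, and a real service would more likely feed this into its existing logging or tracing.

```python
import time


class RequestTrace:
    """Records which stages a request reached and when (hypothetical helper)."""

    def __init__(self, request_id, clock=time.monotonic):
        self.request_id = request_id
        self._clock = clock
        self.stages = []  # list of (stage name, timestamp)

    def reached(self, stage):
        """Mark that the request has reached the named stage."""
        self.stages.append((stage, self._clock()))

    def last_stage(self):
        """Return the furthest stage reached, or None if nothing was recorded."""
        return self.stages[-1][0] if self.stages else None

    def slowest_step(self):
        """Return (from_stage, to_stage, seconds) for the largest gap, or None."""
        if len(self.stages) < 2:
            return None
        steps = [
            (a[0], b[0], b[1] - a[1])
            for a, b in zip(self.stages, self.stages[1:])
        ]
        return max(steps, key=lambda step: step[2])
```

Calling `trace.reached("config_lookup_start")` before the lookup and `trace.reached("config_lookup_done")` after it means a hung request reports `last_stage()` as `"config_lookup_start"`, and a slow one points at that step via `slowest_step()`.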

This leads to my next post... supportability in your code.