What is a CritSit?

It’s short for “Critical Situation”, and basically, it's when my red phone rings. It means I have to drop everything and jump on a plane because there is a SharePoint server somewhere in Europe, the Middle East or Africa that needs my help.

Needless to say, this is something we really want to avoid.

  1. It’s bad for our customers because a service important to their business is not operational.
  2. It reflects badly on the product, because the initial assumption is that it’s SharePoint at fault.
  3. It’s bad for Microsoft because it’s pretty expensive to fly me around.
  4. It’s bad for me because my dinner plans usually get cancelled.

The interesting thing is, when I look back over the CritSits that I have responded to over the past year or so, a significant number of them follow this pattern:

  1. SharePoint has been deployed, it's running nicely, people start using it
  2. Deployment grows, people have feedback and want to change aspects of the product's functionality
  3. Advanced "Tree Navigation", both at the portal level, and within team sites, is a very popular request
  4. A developer works on a fancy tree navigation control, tests it and then deploys it
  5. The SharePoint environment becomes unstable with the following symptoms:
    1. SharePoint page render times become slower
    2. Over time this gets worse and worse
    3. Portal or team sites eventually become unresponsive
    4. On the server the W3WP process is consuming lots of memory
  6. At this point, the W3WP process is recycled, either by the application pool, or manually, and the Portal comes back to life
  7. However, over time, the exact same issue repeats itself

This whole process might happen multiple times per day, or just once or twice per week. It is intermittent, so it's hard to reproduce, and often people just "live" with it.

Now, the reason it is hard to reproduce, and why it appears intermittent, is that the amount of load on the server has a big impact on how frequently it occurs. The more users on the system, the more frequently the SharePoint "crash" happens.

When I arrive onsite and find myself facing such an issue I usually start by doing the following:

  1. Build up a test environment, ensuring it resembles the production environment as closely as possible, including customisations
  2. Restore the production data into the test environment
  3. Configure at least one Application Center Test client
  4. Create a simple script, with maybe 20 users, and then use it to start hitting the server
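To give a feel for step 4, here is a minimal sketch of that kind of script. It is written in Python rather than as an Application Center Test script, purely for illustration, and it is not the script I use onsite. The function names (`fetch_page`, `simulate_user`, `run_load`), the 50 requests per user and the example URL are all assumptions. It also assumes the test page can be fetched without authentication, which a real SharePoint site usually will not allow.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url, timeout=30.0):
    """Request one page and return how long it took, in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def simulate_user(url, requests_per_user, fetch=fetch_page):
    """One simulated user: request the page repeatedly, recording each time.

    A failed request is recorded as None so the run keeps going.
    """
    timings = []
    for _ in range(requests_per_user):
        try:
            timings.append(fetch(url))
        except OSError:  # covers URLError, HTTPError and socket timeouts
            timings.append(None)
    return timings


def run_load(url, users=20, requests_per_user=50, fetch=fetch_page):
    """Run `users` simulated users concurrently and return every timing."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(simulate_user, url, requests_per_user, fetch)
            for _ in range(users)
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results


# Hypothetical usage against a test server (never production):
#   timings = run_load("http://sharepoint-test.example/default.aspx")
#   failures = timings.count(None)
```

Watching how the timings and the failure count change over the length of a run is what shows the "slower and slower, then unresponsive" pattern described above.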

In most cases I’m able to reproduce the problem fairly reliably. For example, I can say "Running test script X with 20 users for Y minutes will always cause a 'crash'". At this point I typically remove the "Tree Navigation" control and re-run the script, only this time of course there is no crash, thereby proving the control is causing the issue.

With this testing complete, and our problem control identified, we can look into it in more detail. Here there is also a pattern: on nearly every occasion the cause has been one of two things:

  1. Custom code not disposing of SPWeb or SPSite objects correctly.
  2. Portal or Team Sites that breach the capacity planning guidelines, particularly around sites and document libraries.
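The first cause is easier to see in code. The sketch below is a Python analogy only, not the SharePoint object model: `FakeSiteHandle` is a hypothetical stand-in for an object that holds server memory until it is explicitly released, and the two render functions are made up for illustration. In .NET the equivalent of the safe version is calling `Dispose()` on the SPSite or SPWeb object, or wrapping it in a `using` block.

```python
class FakeSiteHandle:
    """Hypothetical stand-in for an object that holds server memory until closed."""

    open_handles = 0  # how many handles are currently holding memory

    def __init__(self, url):
        self.url = url
        self.closed = False
        FakeSiteHandle.open_handles += 1

    def close(self):
        # Release the memory exactly once, even if close() is called twice.
        if not self.closed:
            self.closed = True
            FakeSiteHandle.open_handles -= 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


def render_node_leaky(url):
    """Opens a handle for one tree node and never releases it."""
    site = FakeSiteHandle(url)
    return f"<li>{site.url}</li>"


def render_node_safe(url):
    """Opens a handle for one tree node and releases it on the way out."""
    with FakeSiteHandle(url) as site:
        return f"<li>{site.url}</li>"
```

A tree control is likely to open one such object per node, per page view, so a leak that looks harmless in a single test adds up quickly under load. That would fit the symptoms above: memory climbs until the W3WP process is recycled, and climbs faster the more users there are.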

The good news is that we *finally* have a great whitepaper that describes, in detail, some of the key code practices you should implement in order to avoid this type of issue, and therefore avoid a CritSit:

Best Practices: Using Disposable Windows SharePoint Services Objects

Of course, the other way of avoiding such issues is to ensure your code is fully tested, including complete load testing. It is incredibly important that you create an environment, with production data, that can be used to simulate the production load. This environment then allows you to develop baselines, which you can use to determine the overall impact of a particular customisation.
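As a rough illustration of using a baseline, the sketch below compares page render times from a run without a customisation against a run with it. The function names, the choice of mean and 95th percentile as metrics, and the 20% tolerance are all assumptions made for this example rather than recommended values. Each run needs at least two timings.

```python
from statistics import mean, quantiles


def summarise(timings):
    """Return the mean and 95th-percentile render time for one test run."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(timings, n=20)[-1]
    return {"mean": mean(timings), "p95": p95}


def compare_to_baseline(baseline, candidate, tolerance=0.20):
    """Compare a candidate run to the baseline run, metric by metric.

    A metric counts as regressed when it is more than `tolerance`
    (as a fraction of the baseline value) slower than the baseline.
    """
    base = summarise(baseline)
    cand = summarise(candidate)
    report = {}
    for metric in base:
        change = (cand[metric] - base[metric]) / base[metric]
        report[metric] = {
            "baseline": base[metric],
            "candidate": cand[metric],
            "change": change,
            "regressed": change > tolerance,
        }
    return report
```

Running the same load script before and after deploying a control, and feeding both sets of timings into something like `compare_to_baseline`, gives a number to point at rather than a feeling that "it seems slower".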

Anyway, take a look at the whitepaper, and good luck coding!

 

Written and posted using Microsoft Word 2007 Beta 2