I recently sat down and thought a little about the typical user experience when troubleshooting IIS6, assuming s/he had little or none of the IIS context that long-time users have... and the picture did not look so good.

Now, I know that IIS7 will make huge improvements in this area (and will unfortunately obsolete some of this information... but not the general concepts! :-) ), but it is not available yet. I am also certain that users will continue to push us to literally start to self-diagnose the issue instead of providing useful information for users to make corrective actions (and certainly not merely dump trace data, like we do in WS03SP1...), so there has to be a good balance.

But in the meantime, I wanted to gather together some of the basic troubleshooting steps that I go through to diagnose issues which appear to involve IIS, as well as explain some of the rationale of the steps, so that the reader can better understand the process.

Preserve System State

First and foremost, when you are trying to troubleshoot why something is not working, you want to preserve the system state for examination. I cannot emphasize this enough... especially when you want someone else to help you. I can tell you that developers will not even look at your issues unless you give them the exact state that triggers the issue, and their rationale is simple - they want to diagnose the actual issue... not some pseudo or related issue. Only the Real Thing.

Thus, you want to treat the system more like a "crime scene" where you take measured steps around the chalk figures, take non-destructive samples, record the location of everything, etc... If you must make changes, make note of the order in which you make them as well as any consequential effects. I prefer to keep a little temporary "journal" in notepad.exe of my actions - so that I remember what I did, why I did it, and in what order... so that subtle patterns may emerge in later analysis without needing to gather data again.
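As a minimal sketch of such a journal (notepad.exe works just as well), the following assumes Python is available on whatever workstation you take notes from. The function names and the line layout are made up for illustration; the only point is that every entry records when, what, and why, and that entries are only ever appended.

```python
from datetime import datetime, timezone


def format_entry(action, reason, when=None):
    """Return one journal line: UTC timestamp, what was done, and why."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp}Z | {action} | {reason}"


def append_entry(path, action, reason, when=None):
    """Append one entry to the journal file; earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as journal:
        journal.write(format_entry(action, reason, when) + "\n")
```

Because the file is opened in append mode, the order of the lines is the order of your actions, which is exactly what you want to hand to whoever helps you later.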

For example, RESIST the urge to uninstall/reinstall IIS, change filesystem ACLs to wide-open, make user accounts administrators, or lower security settings on the server or in the browser just to see if things work. These sorts of actions merely destroy system state such that you may never know *why* something started working (and since you destroyed system state, no one else can help re-interpret for you)... so you never know *how* to deal with it in the future. In other words, you fail to learn and grow from the experience, and it can only cost you more time/effort in the future. Resist the urge to snap up short-term gains and focus on the long-term benefit. Simply finding solutions is ok; learning how to solve a class of problems is even better. Gee... these same ideas apply to life equally well... ;-)

This is why on newsgroups/Q&A, when someone tells me that they reinstalled, wiped ACLs, changed Administrators, etc during their "troubleshooting", I simply have to turn off and stop helping them... not because I do not care... but because at that point I am no longer dealing with the original issue but some new meta-issue.

Now, I know that some of you enjoy "troubleshooting" by tinkering around with settings in the UI to see if any combination works and if it does, great. But realize that this simply means that you condemn yourself to be limited by whatever is accessible via the UI, and in the case of an open platform like IIS, understanding fundamentals of how things plug together is really important because the UI simply cannot and does not represent all useful interactions. Thus, the astute reader should notice that I never write a blog entry from the perspective of the UI but always from the perspective of the IIS Core Server and what is going on... and the UI is merely a means to configure the necessary settings as appropriate.

In general, I prefer the classic warfare approach of first gathering and analyzing my "recon data" before formulating any strategy and making any intrusive deployments and movements...

Useful Logs

A common pitfall that I see users fall into is to claim:

I did not see any useful errors in X, so I decided to do something more drastic.

While I understand that users are trained to look for errors in a variety of places (Event Log seems to be a common place where people expect everything to be logged), and IIS does not log everything everywhere (and certainly does not flood the event log), it pays to be patient. Even when I am dealing with some system I do not fully understand (like the Windows Firewall, Exchange, SQL...), I still first search the web for any diagnostic aid, search the filesystem/registry for any log file clues, etc... before trying anything else... simply because I value preserving system state.

The most valuable log files for IIS6 (and their locations) are:

  • HTTP.SYS Error Log - %windir%\System32\LogFiles\HTTPERR (Default location; configurable)
  • IIS Website Log - %windir%\System32\LogFiles\W3SVC# (Default location; configurable) **
  • Event Log (both System and Application)

** I'm not going to complicate things with Centralized Binary Logging and such... because the point of the logfile remains the same; just location differs. Materially it does not change the discussion nor rationale.
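To read these text logs without touching anything on the server, copy them elsewhere and parse the copy. Here is a rough sketch, assuming the log is in W3C Extended format: a "#Fields:" directive names the columns and values are separated by whitespace. The function names and the sample lines in the test data are illustrative only, not real log output.

```python
def parse_w3c_log(lines):
    """Yield one dict per entry, keyed by the names in the latest #Fields: line."""
    fields = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines (#Software, #Version, #Date, #Fields) carry metadata.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Entries that do not match the declared columns are skipped, not guessed at.
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))


def entries_with_status(lines, status):
    """Return entries whose sc-status field equals the given status code."""
    return [e for e in parse_w3c_log(lines) if e.get("sc-status") == str(status)]


# Illustrative sample only; a real log declares whatever fields are configured.
SAMPLE = [
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2006-07-14 09:30:00 192.0.2.10 GET /default.htm 200",
    "2006-07-14 09:30:05 192.0.2.10 GET /app/page.asp 500",
]
```

Calling entries_with_status(SAMPLE, 500) returns just the second entry as a dictionary, so you can line up its date and time against the Event Log by hand.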

One of the more non-intuitive things when dealing with IIS log files is that there is no single location to correlate all of them. Worse, it often appears that information is haphazardly split amongst them, so it is never clear where to look for what. Barring unintentional bugs, here is how I logically classify the log files:

One common source of "debugging information" that I dissuade you from relying upon for troubleshooting is the HTTP response displayed by your web browser. Quite frankly, do not trust the browser for troubleshooting unless you have nothing better. Use a tool like WFetch from IIS Resource Kit Tools. Browsers simply have too many "usability" features that limit their usefulness, including:

  • Browsers may not display the actual HTTP Response but some rationalized response, so you never detect the flawed output of custom ISAPI code running on the server
  • Browsers may re-interpret HTTP Response and generate some other pre-canned response, so you never get the actual HTTP Response Code
  • Web Server may not send a useful Custom Error page to the Client despite logging all the information to Log files on the server

In short, first look at the aforementioned Log files on the server before doing anything else...

Use Non-Invasive Monitors

When the log files do not tell the whole story, I resort to pragmatic, non-invasive monitors like FileMon, RegMon, and NetMon to observe and record what happens in the situation in question.

  • Suppose the problem has to do with the request accessing some file or registry key and returning access denied or file not found. I suggest using FileMon (aka File Monitor) or RegMon (aka Registry Monitor) from www.sysinternals.com to track which resource is accessed and the associated error.

    Some representative blog entries:
  • Suppose the problem is a repeating user dialog popup, or the browser is "hanging" and eventually times out, or just any other unexpected HTTP sequence. I suggest using NetMon (aka Network Monitor) from Add/Remove Windows Components / Management and Monitoring Tools / Network Monitor Tools to capture the incoming/outgoing request/response.

    In particular, this approach is necessary if ISAPI Filters or ISAPI Extensions are involved in the request because they can cause arbitrary server behavior. So, you need to capture these arbitrary behaviors, determine which is wrong/right, and go from there.
  • Suppose you know the actual request which will generate the misbehaving hang, then you can consider using a tool like WFetch to independently make that request and then observe the raw HTTP Response to figure out what is wrong with it. Using browser plugins like Fiddler in IE is neither as independent nor as direct as you need.
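To show what "observe the raw HTTP Response" means in practice, here is a minimal sketch that sends a request over a plain TCP socket and returns exactly the bytes the server sent back. It is a stand-in for illustration, not WFetch and not a replacement for it; it assumes plain HTTP with no authentication or SSL, is meant to run from a client machine rather than the server, and all function names are hypothetical.

```python
import socket


def build_request(host, path="/", method="GET"):
    """Build a minimal HTTP/1.1 request as raw bytes."""
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close so we can read until EOF
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split raw response bytes into (status_line, header_lines, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    return lines[0], lines[1:], body


def fetch_raw(host, port=80, path="/", timeout=10.0):
    """Send the request over a plain TCP socket and return every byte received."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Nothing here follows redirects, substitutes a friendly error page, or retries with credentials, so the status line and headers returned by split_response(fetch_raw(...)) are what the server actually produced.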

Use Real Debuggers

Suppose the log files and pragmatic monitors fail to tell the whole story... I then attach a debugger like NTSD or WINDBG from the Microsoft Debugging Toolkit, set up symbols as directed, and then investigate the process responsible for handling the request in question. Do NOT use Visual Studio for this, because installing/using that tool changes too much machine state which may be relevant; Visual Studio is more a development platform than a debugging tool.

A couple of useful breakpoints to use include:

  • w3isapi!ServerSupportFunction - any time an ISAPI Extension makes a pECB->ServerSupportFunction call, you can trap it and based on the ID, examine every single parameter value. Now, with IIS6 on WS03SP1, this information can be obtained by turning on ETW Tracing, but I still enjoy the generic method of simply setting a breakpoint on whatever I am interested in trapping and then observing it.
  • w3core!FilterServerSupportFunction - any time an ISAPI Filter makes a pfc->ServerSupportFunction call, you can trap it and based on the ID, examine every single parameter value. With IIS6 on WS03SP1, this information can be obtained by turning on ETW Tracing as well.

In conjunction with Log files indicating the error response/codes, Network Monitor indicating what is wrong with the response, and trapping what ISAPI Filters/Extensions do on the server, you can usually track down whether the problematic response came from IIS or some particular ISAPI... which is a huge step forward in troubleshooting.

Conclusion

Ok, I am going to stop at this point because I do not want this blog entry to be some all-encompassing novel that takes me forever to write, edit, and perfect and hence never publish. ;-) I hope that this information provides a useful scaffold for any user of IIS to effectively gather data to troubleshoot a variety of IIS-related issues by employing various associated diagnosis techniques (which I intentionally did not mention... though they would be nice subjects for future blog entries).

Please note that my troubleshooting steps do not involve changing the system's or even IIS's configuration other than replaying the request or action that triggers the issue under investigation... because to me, preserving system state and recording my actions/observations is most important. Why? Well... suppose I cannot actually resolve the issue... then I definitely do not want to prevent anyone else from helping me resolve the issue and want to provide them with the best environment and all the information I had already independently gathered to save them time.

Yes, I realize that this approach does not make you feel empowered because you do not actively change anything... but please realize that troubleshooting is about making correct changes quickly; not quickly making correcting changes... ;-)

//David

[2006-07-14] Hmm... minds thinking alike. I recommend this URL:

http://weblogs.asp.net/steveschofield/archive/2006/07/08/Troubleshooting-process.aspx#comments