A case study... kind of
So, what is a typical case that I might handle?
There isn’t really a typical case but I will give you a fictional case that wouldn’t look out of place in a typical day.
I don’t generally get cases when they first come in. We have engineers in the various countries who do the initial work on the case. These are smart guys and they solve about 80% of cases. The remainder get escalated up to CPR. Before I speak to the developer with the problem, they will have spoken to at least one other engineer and their case will have been triaged by a group of engineers led by a technical lead who is another senior engineer who specialises in a particular technology. So, when I get a case, I have the initial description of the problem which is normally fairly brief plus notes from the engineer or engineers who have already worked on it. I also attend the triages (basically, brainstorming on the case) so I may have heard about the case before. If possible, I will have a chat with the previous owner to get his impression of a problem. This process can happen very rapidly for a high severity case.
So, I get a case. I see from the description that the customer has an ASP/ASP.NET hybrid solution using C# and VB6 components. It has been working fine for a while but now so requests are never coming back to the client browser. This happens several times a day. The engineer who had the case before has checked that the request got to Internet Information server and that no reply was sent. That means that whatever went wrong was in IIS or one of the related processes. This would fall in to my area because the runtimes are from the DevTools even if they ship with the operating system. In the US, this would go to an ASP.NET engineer. In EMEA, we work a little differently so this would fall in to my area as well.
So, what would I do with this? I would troubleshoot it. It is generally a good idea to approach this with an open mind. However, there is at least one rule where I tend not to be open minded. If the behaviour of the system has changed then something else has changed. It might be a big change or it might be a small one but something did change. I know a Chinese gentleman who struggles a little with idiomatic English but is an exceptionally clear thinker. He once told me “Do nothing, nothing happen. Do something, something happen”. It is a startlingly helpful rule.
So, the first question that I would ask here is “What has changed?”. This can be a difficult question to get answered. Sometimes the answer is “Nothing at all”. More often, the answer is “Nothing important”. I like the second answer. That means that we can focus on what changed. If it is the first answer then I will have to do some detective work because we know that something has changed even if no-one knows what it was yet.
So, the first thing that I would want to look at in this case would be the event log. Maybe there are fatal errors in the log that tell me that a request failed. It might seem that this would be a really obvious thing to do and that someone would have already checked that. Well, maybe they did. If so, then I haven’t wasted much time. However, it is all too easy to miss obvious things when you are in a hurry. Sometimes people look for the wrong thing. I recall one case where the developer of a custom app looked at the log and saw nothing related to his application so concluded that he was looking in the wrong place. The log was full of entries saying that the license for a third COM party component had run out but it didn’t mention his app so he ignored it.
I might also look at perfmon logs. There is a lot of good stuff in there about what is happening inside ASP.NET and classic ASP.
Let us assume for the sake of argument that I don’t find anything at all interesting in any of those. In that case, I fall back on what I know best. I get memory dumps from the customer. Memory dumps are like a box of chocolates. Well, actually, they are less fattening but like a box of chocolates, you don’t know what you will get until you get it but you may expect certain things and there will be goodies. For the best dumps, always use ADPLUS. They contain the information that we need. If you haven’t heard of ADPLUS then you might want to consider reading up on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;286350
So, what do I do when I have the dumps? I look at what they are telling me. I generally like to look at the log files first for an overview, especially in crash mode. They are not terribly revealing in hang mode. Unless there are good symbols on the system that produced the dump, it can be a bit tricky to work out what is going on from just the log. When that is done, I can open up the dumps in WinDbg and have a read through them. So, typically, with a case like this, what I would be looking for is stuck threads where they are never going to come back.
Since this is a hypothetical case, I can make the problem be whatever I want it to be. If only that were true in reality… anyway, let us have some threads which are waiting on a critical section in the CLR with no timeout. They have been waiting for a very long time. If the thread never comes back then you would expect IIS to notice and return an error page to the client. However, if the application config file has DEBUG set to true, this is disabled. The reason for that is that debugging takes a while and we don’t want to fail the request that we are debugging. I can look in to the metadata and check – and lo, all becomes clear. This is indeed set to debug even though it is a production system. This happens a great deal since it is the default configuration and people forget to change it when the system goes live. It really messes up performance and a lot of scalability problems can be solved simply by turning it off. If we turn it off here, the requests will come back as failures. This is better than not coming back at all. It will also help the system because it isn’t doing a lot of things that it really doesn’t need to do on a production server.
So, I ask the customer to turn off debugging. System performance is better but in this case, they still have a problem. That isn’t very surprising as the threads were clearly in trouble. So, time for more detective work. I look at the critical section and see that it is one that I would expect threads to use when they make a particular call in to the CLR. So far, so good. The rules on critical sections are simple. You get them as late as you can. You release them as soon as you can. You avoid calls that could result in a deadly embrace. I always like to see a critical section grabs and returned within a screenful or two of code. So, I dig around to see who owns the critical section that one of the threads is waiting on. No-one does. That isn’t so good. We call this an orphaned or leaked critical section.
This is about where I would start looking at the CLR code. I find the routine that uses the section and see that it is lovely code in every way. We grab the critical section. We do some stuff. We release it. Hmmm. Well, somehow we are not releasing it. I start walking code. Everything looks good. Nice error handling. Oh look, we can throw an exception and leave if this call fails. That makes sense. Where do we release the critical section in that case? Ah… damn. Might have been good to put that in the catch block before rethrowing the exception.
Now, this is a made up bug but it is the sort of mistake that happens. Error paths are always popular places for bugs to linger. You are developers yourself. Care to bet one whether you have tested ever error path perfectly? This wouldn’t be an easy error to spot because it would have thrown the exception just fine. It is a later request that would fail. That isn’t so easy to spot when testing.
Ok, I found the bug. This time around, I didn’t need help from the developers to find it but they would have helped me if I had needed it. A lot of the time, there is more scratching of the head and chats with colleagues. Swearing is not unknown.
Since it is error handling code that is causing the problem, maybe we can workaround the problem by avoiding the error. We still need to fix the bug though. If we can’t workaround the problem then we need to look at getting a fix urgently. I will talk about that in my next blog.
- Oops I posted this once with a different URL. That was a problem between the keyboard and the chair -