A case study... kind of

Published 01 August 05 10:30 AM

So, what is a typical case that I might handle?

 

There isn’t really a typical case but I will give you a fictional case that wouldn’t look out of place in a typical day.

 

I don’t generally get cases when they first come in. We have engineers in the various countries who do the initial work on the case. These are smart guys and they solve about 80% of cases. The remainder get escalated up to CPR. Before I speak to the developer with the problem, they will have spoken to at least one other engineer and their case will have been triaged by a group of engineers led by a technical lead who is another senior engineer who specialises in a particular technology. So, when I get a case, I have the initial description of the problem which is normally fairly brief plus notes from the engineer or engineers who have already worked on it. I also attend the triages (basically, brainstorming on the case) so I may have heard about the case before. If possible, I will have a chat with the previous owner to get his impression of a problem. This process can happen very rapidly for a high severity case.

 

So, I get a case. I see from the description that the customer has an ASP/ASP.NET hybrid solution using C# and VB6 components. It has been working fine for a while but now so requests are never coming back to the client browser. This happens several times a day. The engineer who had the case before has checked that the request got to Internet Information server and that no reply was sent. That means that whatever went wrong was in IIS or one of the related processes. This would fall in to my area because the runtimes are from the DevTools even if they ship with the operating system. In the US, this would go to an ASP.NET engineer. In EMEA, we work a little differently so this would fall in to my area as well.  

 

So, what would I do with this? I would troubleshoot it. It is generally a good idea to approach this with an open mind. However, there is at least one rule where I tend not to be open minded. If the behaviour of the system has changed then something else has changed. It might be a big change or it might be a small one but something did change. I know a Chinese gentleman who struggles a little with idiomatic English but is an exceptionally clear thinker. He once told me “Do nothing, nothing happen. Do something, something happen”. It is a startlingly helpful rule.

 

So, the first question that I would ask here is “What has changed?”. This can be a difficult question to get answered. Sometimes the answer is “Nothing at all”. More often, the answer is “Nothing important”. I like the second answer. That means that we can focus on what changed. If it is the first answer then I will have to do some detective work because we know that something has changed even if no-one knows what it was yet.

 

So, the first thing that I would want to look at in this case would be the event log. Maybe there are fatal errors in the log that tell me that a request failed. It might seem that this would be a really obvious thing to do and that someone would have already checked that. Well, maybe they did. If so, then I haven’t wasted much time. However, it is all too easy to miss obvious things when you are in a hurry. Sometimes people look for the wrong thing. I recall one case where the developer of a custom app looked at the log and saw nothing related to his application so concluded that he was looking in the wrong place. The log was full of entries saying that the license for a third COM party component had run out but it didn’t mention his app so he ignored it.

 

I might also look at perfmon logs. There is a lot of good stuff in there about what is happening inside ASP.NET and classic ASP.

 

Let us assume for the sake of argument that I don’t find anything at all interesting in any of those. In that case, I fall back on what I know best. I get memory dumps from the customer. Memory dumps are like a box of chocolates. Well, actually, they are less fattening but like a box of chocolates, you don’t know what you will get until you get it but you may expect certain things and there will be goodies. For the best dumps, always use ADPLUS. They contain the information that we need. If you haven’t heard of ADPLUS then you might want to consider reading up on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;286350

 

So, what do I do when I have the dumps? I look at what they are telling me. I generally like to look at the log files first for an overview, especially in crash mode. They are not terribly revealing in hang mode. Unless there are good symbols on the system that produced the dump, it can be a bit tricky to work out what is going on from just the log. When that is done, I can open up the dumps in WinDbg and have a read through them. So, typically, with a case like this, what I would be looking for is stuck threads where they are never going to come back.

 

Since this is a hypothetical case, I can make the problem be whatever I want it to be. If only that were true in reality… anyway,  let us have some threads which are waiting on a critical section in the CLR with no timeout. They have been waiting for a very long time. If the thread never comes back then you would expect IIS to notice and return an error page to the client. However, if the application config file has DEBUG set to true, this is disabled. The reason for that is that debugging takes a while and we don’t want to fail the request that we are debugging. I can look in to the metadata and check – and lo, all becomes clear. This is indeed set to debug even though it is a production system. This happens a great deal since it is the default configuration and people forget to change it when the system goes live. It really messes up performance and a lot of scalability problems can be solved simply by turning it off. If we turn it off here, the requests will come back as failures. This is better than not coming back at all. It will also help the system because it isn’t doing a lot of things that it really doesn’t need to do on a production server.

 

So, I ask the customer to turn off debugging. System performance is better but in this case, they still have a problem. That isn’t very surprising as the threads were clearly in trouble. So, time for more detective work. I look at the critical section and see that it is one that I would expect threads to use when they make a particular call in to the CLR. So far, so good. The rules on critical sections are simple. You get them as late as you can. You release them as soon as you can. You avoid calls that could result in a deadly embrace. I always like to see a critical section grabs and returned within a screenful or two of code. So, I dig around to see who owns the critical section that one of the threads is waiting on. No-one does. That isn’t so good. We call this an orphaned or leaked critical section.

 

This is about where I would start looking at the CLR code. I find the routine that uses the section and see that it is lovely code in every way. We grab the critical section. We do some stuff. We release it. Hmmm. Well, somehow we are not releasing it. I start walking code. Everything looks good. Nice error handling. Oh look, we can throw an exception and leave if this call fails. That makes sense. Where do we release the critical section in that case? Ah… damn. Might have been good to put that in the catch block before rethrowing the exception.

 

Now, this is a made up bug but it is the sort of mistake that happens. Error paths are always popular places for bugs to linger. You are developers yourself. Care to bet one whether you have tested ever error path perfectly? This wouldn’t be an easy error to spot because it would have thrown the exception just fine. It is a later request that would fail. That isn’t so easy to spot when testing.

 

Ok, I found the bug. This time around, I didn’t need help from the developers to find it but they would have helped me if I had needed it. A lot of the time, there is more scratching of the head and chats with colleagues. Swearing is not unknown.

 

Since it is error handling code that is causing the problem, maybe we can workaround the problem by avoiding the error. We still need to fix the bug though. If we can’t workaround the problem then we need to look at getting a fix urgently. I will talk about that in my next blog.

 

- Oops I posted this once with a different URL. That was a problem between the keyboard and the chair -

by marklon

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# Jerry Weichbrodt said on August 2, 2005 11:10 AM:
Good stuff, and quite entertaining. Would you consider throwing in some more links or at least a cheat sheet to some of the abbreviations? CLR, for instance, isn't a familiar term for me, but then I haven't worked in the deep, dark recesses of low-level coding under Windows.

Thanks, and enjoyed your blog entries very much,
Jerry
# marklon said on August 2, 2005 11:43 AM:
Thank you for your kind words. Maybe I should do a blog just on the TLAs (Three letter acronym) that we use... but for now:

CPR = Critical Problem Resolution
ASP = Active Server Pages with .NET being the managed version
VB6 = Visual Basic version 6
US = United States <smile>
EMEA = Europe, Middle East, Africa (and Scandinavia and some of Asia)
ADPLUS = AutoDump+ which is a tool to dump the memory images of a process or processes when certain triggers are hit
IIS = Internet Information Server
CLR = Common language runtime. We sometimes call it NDP for .NET Developer Platform
URL = Universal Resource Locator

I also throw around the following quite a lot

QFE = Hotfix. Short for Quick Fix Engineering. Quick is a relative term
DCR = Design change request. Like a hotfix but when we are changing something that was by design. Applies to the next version of a product.
CDCR = Critical Design Change Request. This is a change to the design of a released product. We hardly ever do this. It really isn't something that we like doing and there has to be an incredibly good reason before we will even consider it.

Hope that this helps

Mark
# marklon A case study kind of | adirondack chairs said on June 14, 2009 9:12 AM:

PingBack from http://adirondackchairshub.info/story.php?id=2240

Leave a Comment

(required) 
(optional)
(required) 

  
Enter Code Here: Required

This Blog

Syndication

Page view tracker