Larry Osterman's WebLog

Confessions of an Old Fogey
My life is a House episode

Fox TV here in the US has a show called "House".  Valorie and I started watching it sometime towards the end of the 2nd season; the 3rd season started last week.  House stars Hugh Laurie as a genius, drug-addicted, lame doctor who, with his brilliant associates, finds the root cause of impossibly complicated diseases.

Each episode starts with someone arriving at the hospital with some mysterious ailment, and House and his impossibly pretty team go to work trying to diagnose the person's problem.  They almost always succeed and the patient goes home cured (with several notable exceptions).

Last week, I realized that aspects of my life are very similar to House's (without the drug addiction, the handicap, and the crazy-good looking sidekicks part; sorry folks, but nobody on the audio team quite matches House's team, especially me :)).  I'm also not the boss of the team, just a peon.  One of the hallmarks of the show is that they perform a "differential diagnosis" - diagnosis based on the symptoms of the disease.  Their original diagnosis is almost always wrong, but they eventually find the root cause.

 

But there's so much of my life that works like a House episode.  Take last week.

One of the people on my team was looking at the Vista RC1 OCA information and noticed that we had a single crash bucket that had a significant number of hits in one of our components.

I took a look at the crash dump and immediately diagnosed a concurrency issue.  I worked up a fix based on the call stack of the crash (by default OCA crash dumps contain the call stacks for the threads in the process and the registers and not too much more), and I was done.  Nothing out of the ordinary.

I built the fix, verified it on my machine and started the checkin process (there are a number of steps that have to be taken for any checkin, including code reviews, test signoff, etc).

 

Unfortunately, I had this nagging feeling about my fix - the call stack didn't have quite enough information to completely diagnose the problem.  My fix would explain the crash, but if the problem was the one I thought it was, I would have expected that there would be side effects.  Things didn't quite add up (the doctors' original diagnosis was wrong - the patient should have had other symptoms).

So I went and I asked the internal OCA web site to collect more information from our customers - I wanted a more detailed version of the crash dump that contained the contents of the heap (the doctors asked for more tests to be performed).

It didn't take long (a day or so) for a couple of new occurrences of the crash to be reported with the heap dumps.  With the new info, I was quite surprised by what I saw (the new tests that the doctor ordered showed some data that both confirmed and disputed the diagnosis).  The crash was occurring in code that looked like the following:

 

for (i = 0 ; i < class->cElements ; i += 1)
{
    class->GetElement(i, &class->_ValueArray[i]);
}
x = class->_ValueArray[0];

The crash was occurring when accessing _ValueArray[0].  The code was:

mov ecx, [esi]+24
mov eax, [ecx]

The crash was occurring at the mov eax instruction; ecx was 0.  When I got the heap dumps, I saw that class->cElements was 8, and _ValueArray pointed to valid memory!  I looked at the code: the _ValueArray value was located 24 bytes from the start of the class, so the problem wasn't some weird compiler issue.  There was no question that the value was 0 at the time of the crash, but apparently the memory pointed to by ESI wasn't 0 (the test results were inconclusive - they didn't rule out the original diagnosis, but they didn't confirm it).
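
To make the offset check concrete, here is a minimal sketch of how a member's position can be compared against the displacement in the disassembly. The struct is hypothetical: only cElements and the value array pointer (named valueArray here, standing in for _ValueArray) come from the post, and the remaining fields are invented filler so that the pointer lands 24 bytes in. Fixed-width 32-bit fields are used to model a 32-bit x86 layout; the real class almost certainly looks different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 32-bit layout. Only cElements and valueArray (standing in
// for _ValueArray) are taken from the post; the other fields are invented
// filler so that the pointer ends up at offset 24.
struct ElementHolder32 {
    std::uint32_t vtable;       // offset 0
    std::uint32_t refCount;     // offset 4
    std::uint32_t flags;        // offset 8
    std::uint32_t cElements;    // offset 12
    std::uint32_t reserved[2];  // offsets 16 and 20
    std::uint32_t valueArray;   // offset 24: a 32-bit pointer held as an integer
};

// The displacement in "mov ecx, [esi]+24" should match the member offset.
static_assert(offsetof(ElementHolder32, valueArray) == 24,
              "valueArray must sit 24 bytes from the start of the object");

// Models the first instruction: read the 32-bit value stored `offset`
// bytes past `base` (base plays the role of ESI).
std::uint32_t loadDwordAt(const void* base, std::size_t offset) {
    assert(base != nullptr);
    std::uint32_t value = 0;
    std::memcpy(&value, static_cast<const unsigned char*>(base) + offset, sizeof value);
    return value;
}
```

If the compile-time check holds, the first mov is loading the array pointer, and a zero in ecx means that pointer was null when the faulting thread read it.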

So I went back for more information.  One of the OCA options you can do is to ask the customer to fill out a survey which can be used to help diagnose the problem.  I set up the crash bucket to ask the customers for a survey (the doctors went back and took a new version of the patient history).

Unfortunately, even with all this data, we still didn't have confirmation that my original diagnosis was accurate (there was no additional information in the patient history).  Bummer.

Fortunately, late on Thursday afternoon, I got an email from a tester in another part of the Windows organization.  She had gotten this crash running this one series of tests and was wondering if anyone on our team wanted to look at it (the patient's mother-in-law remembered something that was important).

It turns out that she had hit exactly the same bug that the customers had, and she had a live debugger attached to the machine, which meant that I could diagnose the problem directly.  And on her machine, I saw the side effects I had expected to see in the crash dumps (the doctors eventually performed exploratory surgery and identified exactly the problem that was occurring, and saved the day).

I then talked to the guys who are responsible for the OCA reports.  The reason I didn't see the expected side effects in the crash dumps was the other services that live in the same process as our service.  Because of those other services, the process of generating OCA crash dumps doesn't preserve the entire state of the process at the time of the crash - some threads continue to run after the crash occurred.  So while the information for the current thread is completely accurate as of the time of the crash, other information in the process may not reflect the state of the process at crash time (the patient had another symptom that masked the expected side effects, complicating what would normally be a simple diagnosis).
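
The effect described above can be sketched in a few lines. This is an illustrative model only, not the actual bug or the OCA dump mechanism: SharedState, Observation and simulateLateDump are hypothetical names, and the "other thread" simply publishes a pointer after the faulting thread has already read it. The point is that the value the faulting thread captured and the value a later snapshot records can disagree.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state: a pointer that another thread may set at any time.
struct SharedState {
    std::atomic<int*> values{nullptr};
};

struct Observation {
    bool faultSawNull;  // what the faulting thread had loaded
    bool dumpSawNull;   // what a later snapshot of shared memory shows
};

Observation simulateLateDump(int* storage) {
    assert(storage != nullptr);
    SharedState state;

    // 1. The "faulting" thread loads the pointer. In a real crash this value
    //    is what ends up frozen in that thread's registers.
    int* atFault = state.values.load();

    // 2. Another thread in the same process keeps running after the fault
    //    and changes the shared state.
    std::thread other([&state, storage] { state.values.store(storage); });
    other.join();

    // 3. Only now is the snapshot of shared memory taken.
    int* atDump = state.values.load();

    return Observation{atFault == nullptr, atDump == nullptr};
}
```

Called with any valid buffer, the faulting thread's view reports a null pointer while the later snapshot reports a valid one, which is the same kind of disagreement between registers and heap contents that made the dumps so hard to read.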

 

Yeah, I know the House analogy is a bit tortured, but it was all I could think of while I was looking at the problem - "Darn it, my diagnosis is good, I know I found a problem, but I can't tell if it's the root cause or not".

  • > the patient had another symptom that masked the expected side effects, complicating
    > what would normally be a simple diagnosis

    Actually, that seems about right for an episode of "House" :p
  • Larry,

    Awesome post.  I spent a good portion of my dev career in medical research and I've often thought that the similarities are striking.  No doubt, diagnosis of the problem is my favourite (yes, I'm Canadian) part of the job.

    My dad was a tech (electronics, control systems, networks) and always loved the chase of the problem.  I guess I just grew up with it.

    Anyhoo, I always love your articles/Channel9 stories.  Keep it up buddy!
  • To be perfectly honest, if a programmer hasn't felt like he is part of a House episode, then he hasn't programmed enough.

    I'm sure someone will pipe up and say "Real programmers don't even get themselves into this type of situation."  Yeah, right.
  • > The crash was occurring at the mov eax instruction, eax
    > was 0.

    Reading the rest of your report, I think you mean that ecx was 0.

    > One of the OCA options you can do is to ask the customer
    > to fill out a survey which can be used to help diagnose the
    > problem.

    Sometimes I've been prompted with surveys asking if Microsoft's suggested solutions have any value (99% of the time the answer is no) but they haven't allowed me to state which Microsoft self-signed driver caused the BSOD.  But wait a minute, these haven't been for Vista.  Vista, yeah Vista, here's what happens with Vista:

    A Microsoft self-signed video driver causes a BSOD 100% of the time when trying to shut down.  Microsoft can't diagnose it from a minidump, so Microsoft asks for a full memory dump, and I diligently waste 2 hours complying.  Microsoft rejects the full memory dump because it's larger than 50MB.  I respond with an average amount of cynicism.  End of story.

    Well, at least I can help out your analogy a bit.  As programmers of course we're not addicted to drugs (except caffeine).  We're addicted to computers.
  • One of the starting ideas for the show was "what if a doctor said in front of a patient what they normally say about them behind the scenes (out of earshot)".  There are lots of blogs, series (eg BOFH) where IT people give the "behind the scenes" talk.  I expect most would be fired saying it directly to people's faces :-)
  • I have seen that kind of strangeness while testing a Windows Update beta that shared svchost.exe with about 20 innocent victims. I noticed the problem because it would cause my VPN to hang up randomly.

    As far as the medical analogy is concerned, I think Monty Python said it best: "There's nothing wrong with you that painful and expensive surgery can't prolong."
  • This might be a little off-topic, but seeing that you are on the Audio team I thought I'd bring this up here.  I saw a previous post that talked about audio channels and Vista, but here is my problem:

    I have a DELL Latitude X1 (which comes with SigmaTel AC'97 audio), and where in XP I could choose to have a 5.1 speaker setup (along with Quadro-speaker, Stereo, etc.), I only get a "Stereo" option in the speaker setup dialog in Vista RC1.  Will this behavior change or will Vista render my Logitech 5.1 speakers useless on my laptop?
  • Miguel, you need to install the drivers that came with your Dell.  My suspicion is that you've gotten generic sigmatel AC97 adapters, not ones that are specific to your audio solution.

  • Larry,

    I am using the latest drivers from DELL.com, still the same issue, also, I think I just found a bug:

    After I reinstalled the audio drivers the audio speaker in the system tray had a red 'x' on it. I clicked on it and the audio volume dialog came up, empty, and it will NOT go away.

    I manually restarted the audio endpoint viewer/audio services and while Windows now reports that the sound card is working and I can access the speaker setup dialog (only to confirm that the problem persists, I don't get 5.1 speaker setup, only the Stereo option) the ghost dialog still won't go off my screen.

    The way I got it off was by manually killing "SoundVol.exe" from task manager.

    Anyway, problem persists with the drivers from DELL.
  • So, I couldn't reproduce that "bug" I wrote about in my other post; I guess it was just a fluke, because the other times that I reinstalled the drivers (or the audio service was stopped) the sndvol.exe app wouldn't start.  Anyway, after replacing my drivers with the stock drivers from DELL yesterday (and noting that it still didn't work), Windows Update informed me of updated SigmaTel drivers today (dated a month after the ones from DELL). I installed them and still nothing.  All I get in speaker setup is the "Stereo" option.  I would appreciate some help on this because it really is the most annoying thing in Vista for me at the moment, and seeing that Vista is out of beta I wouldn't want to let that slip by.
  • The line that basically sums up House is where he says "$50!". The other doctor says, "You're going to bet on a patient's health?", and then House says, "Is that bad luck?"
  • > So I went and I asked the internal OCA web site to collect
    > more information from our customers

    If only there was a way for ISV's to request full dumps instead of minidumps..... :(

    > It didn't take long (a day or so) for a couple of new
    > occurances of the crash to be reported with the heap dumps.

    If only (external?) WER data didn't have an 8-10 day lag on it.
  • Monday, September 18, 2006 5:37 PM by Sam
    > If only there was a way for ISV's to request full dumps
    > instead of minidumps..... :(

    I could just imagine.

    Microsoft asked me for a full dump instead of minidump, and I wasted hours complying.  Microsoft rejects full dumps that are larger than 50MB.  Transmission doesn't stop at 50MB.  The user gets to wait for the entire transmission and then days later if they're lucky they'll get feedback about the failure.

    Just imagine if Microsoft themselves weren't the only party that could inflict such waste on gullible volunteers.

    Later Microsoft closed the bug because it's 100% reproducible and 100% unavoidable only outside of Microsoft.

    Oooh, this just in:  When Windows Media Centre displays a message saying it won't run because it didn't install necessary files, and it tells the user to restart Windows Media Centre or restart the PC, but it doesn't tell the user how to install necessary files and it doesn't tell the user how to exit this infinite loop of restarts, Microsoft says this is BY DESIGN.  The correct solution is to run Windows Media Player because Windows Media Player works[*] where Windows Media Centre doesn't, but Microsoft doesn't tell the user that, BY DESIGN.

    Mr. Osterman, how in the world did you ever obtain permission to fix any bugs?  How did the bugs avoid being closed down as not reproducible and being implemented by design?

    [* Well there are probably some exceptions.  Remembering Microsoft's self-signed video drivers in Windows 2000, now undergoing a sequel, there are probably some exceptions.  Those who study history get to relive it.]
  • <<without the drug addition, the handicap, and the crazy-good looking sidekicks part>>

    Carefully keeping the genius part :-)
    I guess you are just like me: smart, good-looking and modest :-)