Bug Psychology

Bug Psychology

Rate This
  • Comments 30

Fixing bugs is hard.

For the purposes of this posting, I’m talking about those really “crisp” bugs -- those flaws which are entirely due to a failure on the developer’s part to correctly implement some mechanistic calculation or ensure some postcondition is met. I’m not talking about oops, we just found out that the product name sounds like a rude word in Urdu, or the specification wasn’t quite right so we changed it or the code wasn’t adequately robust in the face of a buggy caller. I mean those bugs where you were asked to compute some value and you just plain get the result wrong for some valid inputs.

Let me give you an example.

The first bug I ever fixed at Microsoft as a full-time employee was one of those. To understand the context of the bug, start by reading this post from the early days of FAIC, and then come back.

Welcome back, I hope you enjoyed that little trip down memory lane as much as I did.

Now that you understand how a VT_DATE is stored, that explains this bizarre behaviour in VBScript:

print DateDiff("h", #12/31/1899 18:00#, #12/30/1899 6:00#) / 24
print DateDiff("h", #12/31/1899 18:00#, #12/29/1899 6:00#) / 24

This prints –1.5 and –2.5, as you’d expect. There’s a day and a half between 6 AM December 30th and 6 PM December 31st, and two and a half days between the other two dates. This is perfectly understandable. But if you just subtract the dates:

print #12/31/1899 18:00# - #12/30/1899 6:00#
print #12/31/1899 18:00# - #12/29/1899 6:00#

You get 1.5 and 3, not 1.5 and 2.5. Because of the bizarre date format that VT_DATE chooses, when you convert dates to numbers, you cannot safely subtract them if they straddle the magic zero date. That’s why you need the helpful “DateDiff”, “DateAdd” and so on, methods.

The bug I was assigned was that testing had discovered a particular pair of dates which DateDiff was not subtracting correctly. I took a look at the source code for one of the helper methods that DateDiff used to do one of the computations it needed along the way. To my fresh-out-of-college eyes it looked something like this:

if (frob(x) > 0 && blarg(y)) return x – y;
else if (frob(x) < blarg(y) && blah_blah(x) > 0 || blah_de_blah_blah_blah(x,y)) return frob(x) – x + y + 1;
else if…

There were seven such cases.

My urge was to dive right in and add an eighth special case that fixed the bug. But my ability to get it right in the face of all this complexity concerned me. It seemed like this was an awfully complicated function already for what it was trying to do.

I researched the history of the code a bit and discovered that in fact variations on this bug had been entered… seven times. Each special case in the code corresponded to a particular bug that had been “fixed”, a term I use guardedly in this case. A great many of those “fixes” had actually introduced new bugs, regressing existing correct behaviour, which then in turn were “fixed” by adding special cases on top of the broken special cases that had been added to “fix” previous bugs.

I decided that this coding horror would end here. I deleted all the code (all seven lines of it! I was bold!) and started over.

Deep breath.

Spec the code requirements first. Then design the code to meet the spec. Then write the code to the design.

Spec:

  • Input: two doubles representing dates in VT_DATE format.
  • VT_DATE format: signed integer portion of double is number of days since 12/30/1899, unsigned fractional part is portion of day gone by. For example: –1.75 = 12/29/1899, 6 PM.
  • Output: double containing number of days, possibly fractional, between two dates.  Differences due to daylight savings time, and so on, to be ignored.

Design strategy:

  • Problem: Some doubles cannot simply be subtracted because negative dates are not absolute offsets from epoch time
  • Therefore, convert all dates to a more sensible date format which can be simply subtracted.

Code:

double DateDiffHelper(double vtdate1, double vtdate2)
{
  return SensibleDate(vtdate2) – SensibleDate(vtdate1);
}
double SensibleDate(double vtdate)
{
  // negative dates like –2.75 mean “go back two days, then forward .75 days”:
  // Transform that into –1.25, meaning “go back 1.25 days”.
  return DatePart(vtdate) + TimePart(vtdate);
}

I already had helper methods DatePart and TimePart, so I was done. The new code was shorter, far more readable, generated smaller, faster machine code and most important, was clearly correct. No special cases; no bugs.

It’s not that my coworkers were dummies. Far from it. These were smart people. But computer geek psychology is such that it is very easy to narrow-focus on the immediately wrong thing, and try to tweak it until it does the right thing.

When faced with these sorts of “crisp” bugs, I try to restrain myself from diving right in. Rather, I try to psychoanalyze the person – who is, of course, usually my past self – who caused the bug. I ask myself “how was the person who wrote the buggy code fooled into thinking it was correct?” Did they not have a clear specification of what the method was supposed to do? Was it misleading? Did they have a clear plan for how to proceed? If so, where did it go wrong?

If there never was either a spec or a plan, then for all you know the whole thing might only be working by sheer accident. There could be any number of design flaws in the thing that just haven’t come to light yet. Editing such a beast means adding unknown to unknown. which seldom leads to good results. Sometimes coming up with a new spec, a new plan and scrapping an existing bug farm is the best way to proceed.

For many years after that, I would ask how to implement DateDiffHelper as my technical question for fresh-out-of-college candidates that I was interviewing for the scripting dev team. I reasoned that if that was the sort of problem I was given on my first day in the office, then maybe that would be a reasonable question to ask a candidate.

When you ask the same question over and over again, you really get to see the massive difference in aptitude between candidates. I had some candidates who just picked up a marker, wrote a solution straight out on the board, wrote down the test cases they’d use to verify it, mentally ran a few of the tests in their head, and then we’d have another half hour to chat about the weather. And I had some candidates who tried earnestly to write the version using special cases, despite my specifically telling them “you might consider transforming this bad format into something more pleasant to work with”, after they got stuck on the third special case. I’d point out a bug and immediately they’d write down code for another special case, rather than stopping to think about the fact that they’d just written buggy code three times already and told me it was correct three times.

  • Looks like this has been on your mind a lot recently: it's basically a longer version of something I saw you write on StackOverflow recently:

    http://stackoverflow.com/questions/921180/c-round-up

    Indeed, that was the inspiration. -- Eric

  • Of course, you are still using cases. It's just that by focusing on one VT_DATE at a time to translate it into your sensible format, you only have to deal with 2 cases, rather than the many cases that come from the combined properties of 2 VT_DATEs, apparently so many they have not been successfully enumerated.

    But why were you using the horrible VT_DATE format in the first place?

    VBScript uses that format to be compatible with VB. VB uses that format to be compatible with OLE Automation. OLE Automation uses that format to be compatible with Excel. Excel uses that format to be compatible with Lotus 1-2-3. With hindsight, the right thing to do would have been to say "Lotus chose a lousy way to represent dates, let's write a translation tool that turns their goofy date format into something sensible and industry-standard". But we did not do so when we had the chance and we're stuck with it now. -- Eric (PS: As a historical note, VB and OLEAUT were owned by the same team, so it is perhaps not quite fair to say that VB does it that way "because of OLEAUT". They both evolved that behaviour at the same time.)

  • That's amazing that the Lotus bug dictated the new Day One, unbelievable! I just read about a 1000 line Switch block the other day. So the case by case approach is very much alive and cultivated.

  • Why even have the unnecessary (vtdate >= 0.0) check when everything can fall into 1 case:

    return DatePart(vtdate) + TimePart(vtdate);

    Good point; my way is a potentially premature optimization. I'll fix it. Though in fact, given that probably 99.99% of dates we see are positive, and given how tight the rest of the code was already, it was actually probably a reasonable optimization to take the early out. I don't recall the details; this was 13 years ago after all. - Eric

  • The Lotus bit reminds me of this story: http://www.astrodigital.org/space/stshorse.html

    Finding a way to replace useful but flawed code seems to be one of Microsoft's greatest challenges.

  • So, do these candidates that do *exactly the same as your smart coworkers* fail immediately or do they have a chance?

    When I specifically tell the candidate "this problem is difficult to solve by cases, consider transforming the data into a better format for arithmetic", and they keep on trying to solve it by cases, no hire. When I write a bunch of interesting test cases on the board and they write me code that does not pass two of the four test cases, and they tell me it is correct, no hire. Heck, I've had candidates who were unable to write the code that computes the absolute value of the difference between two doubles. Those people will not be successful on a compiler team. No hire. -- Eric 

  • "we just found out that the product name sounds like a rude word in Urdu"

    with Urdu as my native language, I would love to hear this story.

    Regrettably, there's no story there. I just made that up as characteristic of the sorts of geopolitical issues we sometimes run into. Usually we don't have too much difficulty with geopolitical issues on the programming language tools teams, but anyone who, say, displays a map of the world as part of their user interface is required to ensure that the map is carefully designed to not implicitly "take sides" in geopolitical disputes. I'm sure you can think of examples. -- Eric

  • "...immediately they’d write down code for another special case, rather than stopping to think..."

    ...they are just trying to SERVE!!! How many times have we all been told that SOFTWARE DEVELOPMENT IS A SERVICE INDUSTRY? :-) Some people just confuse "service" with "subservience" or -- worse -- "servility". So, I've writtent some buggy code?! Sorry, Master! Right away, Master! Whatever you say, Master! Here is another "if-else," Master...

    A sad state of affairs, I'd say... :-(

    @configurator: I'd probably ask them to 1)relax, 2)shape up a bit, and 3)try again, calmly. Then I'd decide whether they deserve a chance or not...

  • I'm sure I remember Code Complete containing a warning against the root cause of all these VT_DATE issues - combining two pieces of data into one data item. The VT_DATE format would be better formulated as a tuple or struct containing (days_offset, offset_in_day), but instead those two parts have been shoehorned into a single double, leading to the mistaken assumptions about how arithmetic works with a VT_DATE because its representation is a familiar type.

  • Off topic, but any news of Wes Dyer? Very quiet on his blog...

    Wes drops by the C# language design committee meeting every now and then but I haven't heard much from him either lately. He seemed to be very well when last I saw him though. -- Eric

  • Great post. I always thought that a large IF is 90% a result of a bad design.

  • I really like this line:

      I ask myself "how was the person who wrote the buggy code fooled into thinking it was correct?"

    When you comes across things you wrote that are clearly not working right, especially when someone else is looking over your shoulder, it's not always easy to admit you goofed.  (OTOH, it does seem to be rather easy to point out that someone else goofed.)

    No matter who goofed, this is a good time to start with some assumptions:  

     - We are all smart people (at least somewhere on the upper end of that continuum)

     - We all really try to do the right thing (i.e. not deliberately mis-coding)

    The comment from @Denis points out how knee-jerk we become when encountering errant code.  When you step back from the quick fix and assume that someone reasonably intelligent was trying their best to do the right thing, you have to say "How did they (I) get fooled?"

    When you can stop and do this, you might find out something.  It could be a bad spec or the need to be compatible with some legacy feature.  Or, it could be they never stopped to analyze the underlying issue, like you are now doing.

    Either way, you'll be ahead of the game if you ask the question.

  • Of course the Chevy Chase approach to Jane Curtain during Weekend Update on SNL (in the 1970's) is an alternative approach to dealing with peoples mistakes [You should hear some of the names I have called myself].

  • if (frob(x) > 0 && blarg(y)) return x – y;

    else if (frob(x) < blarg(y) && blah_blah(x) > 0 || blah_de_blah_blah_blah(x,y)) return frob(x) – x + y + 1;

    else if…

    Ha ha! So true.

  • // negative dates like –2.75 mean “go back two days, then forward .75 days”

    Sweet. Did you ever consider making VT_DATE turing-complete?

    Lol. That almost made me spit-take my Diet Dr. Pepper. -- Eric

     

Page 1 of 2 (30 items) 12