Larry Osterman's WebLog

Confessions of an Old Fogey

Larry's rules of software engineering #2: Measuring testers by test metrics doesn't.


This one’s likely to get a bit controversial :)

There is an unfortunate tendency among test leads to measure the performance of their testers by the number of bugs they report.

As best as I’ve been able to figure out, the logic works like this:

Test Manager 1: “Hey, we want to have concrete metrics to help in the performance reviews of our testers.  How can we go about doing that?”
Test Manager 2: “Well, the best testers are the ones that file the most bugs, right?”
Test Manager 1: “Hey that makes sense.  We’ll measure the testers by the number of bugs they submit!”
Test Manager 2: “Hmm.  But the testers could game the system if we do that – they could file dozens of bogus bugs to increase their bug count…”
Test Manager 1: “You’re right.  How do we prevent that then? – I know, let’s just measure them by the bugs that are resolved “fixed” – the bugs marked “won’t fix”, “by design” or “not reproducible” won’t count against the metric.”
Test Manager 2: “That sounds like it’ll work, I’ll send the email out to the test team right away.”

Sounds good, right?  After all, the testers are going to be rated by an absolute value based on the number of real bugs they find – not the bogus ones, but real bugs that require fixes to the product.

The problem is that this idea falls apart in reality.

Testers are given a huge incentive to find nit-picking bugs – instead of finding significant bugs in the product, they try to find the bugs that increase their number of outstanding bugs.  And they get very combative with the developers if the developers dare to resolve their bugs as anything other than “fixed”.

So let’s see how one scenario plays out using a straightforward example:

My app pops up a dialog box with the following:

 

            Plsae enter you password:  _______________ 

 

Where the edit control is misaligned with the text.

Without a review metric, most testers would file a bug with a title of “Multiple errors in password dialog box” which then would call out the spelling error and the alignment error on the edit control.

They might also file a separate localization bug because there’s not enough room between the prompt and the edit control (separate because it falls under a different bug category).

But if the tester has their performance review based on the number of bugs they file, they now have an incentive to file as many bugs as possible.  So the one bug morphs into two bugs – one for the spelling error, the other for the misaligned edit control. 

This version of the problem is a total and complete nit – it’s not significantly more work for me to resolve one bug than it is to resolve two, so it’s not a big deal.

But what happens when the problem isn’t a real bug?  Remember – bugs that are resolved “won’t fix” or “by design” don’t count against the metric, so the tester can’t flood the bug database with bogus bugs to artificially inflate their bug count.

Tester: “When you create a file when logged on as an administrator, the owner field of the security descriptor on the file’s set to BUILTIN\Administrators, not the current user”.
Me: “Yup, that’s the way it’s supposed to work, so I’m resolving the bug as by design.  This is because NT considers all administrators as idempotent, so when a member of BUILTIN\Administrators creates a file, the owner is set to the group to allow any administrator to change the DACL on the file.”

Normally the discussion ends here.  But when the tester’s going to have their performance review score based on the number of bugs they submit, they have an incentive to challenge every bug resolution that isn’t “Fixed”.  So the interchange continues:

Tester: “It’s not by design.  Show me where the specification for your feature says that the owner of a file is set to the BUILTIN\Administrators account”.
Me: “My spec doesn’t.  This is the way that NT works; it’s a feature of the underlying system.”
Tester: “Well then I’ll file a bug against your spec since it doesn’t document this.”
Me: “Hold on – my spec shouldn’t be required to explain all of the intricacies of the security infrastructure of the operating system – if you have a problem, take it up with the NT documentation people”.
Tester: “No, it’s YOUR problem – your spec is inadequate, fix your specification.  I’ll only accept the “by design” resolution if you can show me the NT specification that describes this behavior.”
Me: “Sigh.  Ok, file the spec bug and I’ll see what I can do.”

So I have two choices – either I document all these subtle internal behaviors (and security has a bunch of really subtle internal behaviors, especially relating to ACL inheritance) or I chase down the NT program manager responsible and file bugs against that program manager.  Neither of which gets us closer to shipping the product.  It may make the NT documentation better, but that’s not one of MY review goals.
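As an aside, the ownership behavior in the exchange above is easy to check for yourself.  Here’s a minimal sketch (illustrative only – the file path is a placeholder, error handling is abbreviated, and you’d link with advapi32.lib) that prints the owner SID of a file via the Win32 GetNamedSecurityInfo API.  Per the behavior described above, for a file created by a member of the Administrators group you’d expect to see S-1-5-32-544, which is BUILTIN\Administrators:

    /* Sketch: print the owner SID of a file.  "C:\\test.txt" is just a placeholder. */
    #include <windows.h>
    #include <aclapi.h>
    #include <sddl.h>
    #include <stdio.h>

    int main(void)
    {
        PSID owner = NULL;
        PSECURITY_DESCRIPTOR sd = NULL;
        LPSTR ownerString = NULL;

        /* Ask for just the owner portion of the security descriptor. */
        DWORD err = GetNamedSecurityInfoA("C:\\test.txt", SE_FILE_OBJECT,
                                          OWNER_SECURITY_INFORMATION,
                                          &owner, NULL, NULL, NULL, &sd);
        if (err != ERROR_SUCCESS) {
            printf("GetNamedSecurityInfo failed: %lu\n", err);
            return 1;
        }

        /* S-1-5-32-544 is BUILTIN\Administrators. */
        if (ConvertSidToStringSidA(owner, &ownerString)) {
            printf("Owner SID: %s\n", ownerString);
            LocalFree(ownerString);
        }

        LocalFree(sd);
        return 0;
    }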

In addition, it turns out that the “most bugs filed” metric is often flawed in the first place.  The tester who files the most bugs isn’t necessarily the best tester on the project.  Often the most valuable tester on the team is the one who goes the extra mile – who spends the time investigating the underlying causes of bugs and files reports with detailed information about the likely cause.  But they’re not the most prolific testers, because they take the time to verify that they have a clean reproduction and good information about what is going wrong.  The time they would otherwise have spent finding nit bugs goes into making sure that the bugs they do find are high quality – the bugs that would have stopped us from shipping, not the “the florblybloop isn’t set when I twiddle the frobjet” bugs.

I’m not saying that metrics are bad.  They’re not.  But basing people’s annual performance reviews on those metrics is a recipe for disaster.

Somewhat later:  After I wrote the original version of this, a couple of other developers and I discussed it a bit at lunch.  One of them, Alan Ludwig, pointed out that one of the things I missed in my discussion above is that there should be two halves of a performance review:

            MEASUREMENT:       Give me a number that represents the quality of the work that the employee is doing.
            And EVALUATION:    Given the measurement, is the employee doing a good job or a bad job?  In other words, you need to assign a value to the metric – how relevant is the metric to the employee’s performance?

He went on to discuss the fact that any metric is worthless unless it is periodically reevaluated to determine how relevant it still is – a metric is only as good as its validity.

One other comment that was made was that absolute bug count metrics cannot be a measure of the worth of a tester.  The tester that spends two weeks and comes up with four buffer overflow errors in my code is likely to be more valuable to my team than the tester that spends the same two weeks and comes up with 20 trivial bugs.  Using the severity field of the bug report was suggested as a metric, but Alan pointed out that this only worked if the severity field actually had significant meaning, and it often doesn’t (it’s often very difficult to determine the relative severity of a bug, and often the setting of the severity field is left to the tester, which has the potential for abuse unless all bugs are externally triaged, which doesn’t always happen).
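To make that trade-off concrete, here is a toy sketch – purely illustrative, with invented weights, and (as Alan pointed out) only meaningful if the severity field actually means something – of how a severity-weighted score ranks four severe bugs above twenty trivial ones, where a raw count says the opposite:

    /* Toy sketch: raw bug count vs. a severity-weighted score.
       Hypothetical severity scale: 1 = trivial nit ... 4 = ship-stopper. */
    #include <stdio.h>

    static const double weights[5] = { 0.0, 1.0, 3.0, 8.0, 20.0 };

    static double WeightedScore(const int *severities, int count)
    {
        double score = 0.0;
        for (int i = 0; i < count; i++)
            score += weights[severities[i]];
        return score;
    }

    int main(void)
    {
        int testerA[4] = { 4, 4, 4, 4 };       /* four buffer overflows */
        int testerB[20];
        for (int i = 0; i < 20; i++)
            testerB[i] = 1;                    /* twenty trivial nits */

        printf("Tester A:  4 bugs, weighted score %.1f\n", WeightedScore(testerA, 4));
        printf("Tester B: 20 bugs, weighted score %.1f\n", WeightedScore(testerB, 20));
        return 0;   /* A scores 80.0, B scores 20.0 -- raw counts say 4 vs. 20. */
    }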

By the end of the discussion, we had all agreed that bug counts were an interesting metric, but they couldn’t be the only metric.

Edit: To remove extra <p> tags :(

  • And even worse is when testers get docked pay for filing duplicate bugs. I've never had more shit from testers than when I've tried to return bugs as duplicates.
  • I'll bite on the controversy.

    That's a bit one-sided (says Drew the tester).
    Blindly-applied metrics like that can motivate developers to be unreasonable, too. When a dev manager says "if you have more than X open bugs you can't do new work on project Y" (yes- this can happen), we have devs who refuse to fix real bugs, resolving "won't fix" or "by design" or even "no repro" so that their bug counts can remain low. That's incredibly frustrating as a tester. Especially when the dev is in a different org or even a separate division of the company - we testers can't even get away with schmoozing to get the fix then.
    For the record, I tend to agree that lumping several related problems into one bug is easier to track. The flip side is that I've also had bugs resolved as fixed when not all of the problems in the bug report were fixed yet. The right style seems to depend on the developer the bug is assigned to as much as the tester that filed the bug.

    I'm not so sure the validity of the metrics is the problem either. A true measurement could still be meaningless. IMHO, the root of the problem is determining how to make information from data.
  • Ok, the wife just has to chime in here....

    Yup, I think that using the number of bugs filed metric as *the* most important part of a tester's review is silly. I'm a crappy tester from that standpoint. I'm even a crappy tester from Larry's standpoint of spending the extra time to figure out what's going on. Sorry dear, I don't write code, I just break it.

    What's missing from the review process is exactly the input that's most important: the customer of the tester or developer who is being reviewed. Why can't a tester be graded using the metric of all bugs filed (by design, won't fix or spec included) and require the developer/closer of the bug to assign a "relevance of bug" scale and an "effort involved" scale? If you find a nit bug (like a spelling mistake), the relevance should be medium but the effort should be minor. If someone finds a nasty race condition, the relevance should be high and the effort (should hopefully) be high. This allows the customers of the testers (aka the developers) to have input in the tester's review. Testers get almost immediate feedback from the developers about what kind of testing they are doing and how much effort the developers perceive they are putting into their job. Wouldn't this help testers do their jobs better and forge better relationships with the developers? Is it more time and energy? Yes, but the goal is to make better testers (and better developers) on the way to making better products. The adversarial relationship between many testers and developers is nothing more than a time and energy waster.


    The same thing can be said of the review of developers. Part of being in a group at Microsoft is the sense of family that develops. I would hope that everyone at Microsoft in any sort of a development position should have the ability to fill out a short questionnaire for some number of people in their immediate group. Something simple like
    1) I find the individual's code to be understandable.
    2) I find the individual easy to work with.
    3) I find the individual takes responsibility for errors or omissions in their code.
    4) I find the individual to write code appropriate to their level.
    5) I find the individual to be an asset to their group.
    (I know, there's a good reason I'm not an HR kinda person...)

    Larry should be filling out a short questionnaire for everyone in his group as well as anyone he's had a serious technical discussion with in the last month. He should be filling out a questionnaire for his primary tester(s) and anyone who has written code that he's had to integrate into his stuff. I would think that everyone who's been at Microsoft for over a year could come up with 5 or 10 individuals about whom they could write some sort of quickie review. These reviews should help a manager get a better view of their employee. I know a good manager doesn't really need this as much as one who is overworked, but I don't perceive it harming the good managers.

    Ok, so I gave you $2.00 rather than $.02. It's the decimals that do me in.
  • Being an STE myself, I must agree with Mrs. Osterman. I have to say that basing evaluations of a tester solely on bug stats seems completely shortsighted, even if you are taking into account the priority and/or severity of the bugs found.

    There is more to being a software tester than writing bugs – quantifiable stats such as test cases written, test scenarios written, test cases completed, and test scenarios completed, just to name a few.

    Evaluation becomes more difficult when we attempt to quantify the more intangible quality criteria such as leadership, communication, thoroughness, etc.
  • I guess you meant "omnipotent", not "idempotent".
  • No, I meant idempotent :) From Dictionary.Com:
    http://dictionary.reference.com/search?q=idempotent

    Having said that, a better word would have been "interchangeable". My usage follows from the jargon usage (a header file is considered idempotent if it can be safely included twice - thus w.r.t. order of inclusion it is interchangeable) but that's a stretch.
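    For readers who don't know the header-file jargon: a header is "idempotent" in that sense when an include guard makes a second #include a no-op. A minimal, made-up example (the names are placeholders):

        /* mywidget.h -- the guard makes including this header twice
           equivalent to including it once. */
        #ifndef MYWIDGET_H
        #define MYWIDGET_H

        typedef struct Widget {
            int id;
        } Widget;

        #endif /* MYWIDGET_H */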
  • Hmmm. Maybe I'm missing something, but there seems to be an even more obvious flaw with such a metric: *if* each tester is working on different pieces of code or testing different apps or modules of an app, then there will exist a correlation between the number of bugs found by a tester and the developer responsible for the code being debugged. That is, you can't separate the coder from the tester. If a group of testers (T1...TN) are testing code written by a group of coders (C1...CN) then it could be possible (given the nature of the org or some such) that some subset of testers reviews the code from a subset of coders more often than others. That is, T1-T3 may get code from C2,C4, C5 more often than from C1,C3,C6-CN. The number of bugs reported by T1-T3 is then proportional to the number of bugs produced by the C2,C4 and C5 coders.

    The point being, you would need to ensure that there is no statistically significant correlation between the tester and the person coding the app being tested. Otherwise, you could be measuring the coder's skill at coding as well.
  • An interesting point Quentin.
    I think that in general, most metrics of this type ignore quality differences between different members of a given development team, which is, of course, silly.

    Which gets back to the question of "Is the metric valid?" In the case you're describing above, it's not clear that it is. In other circumstances, it may be (the testers might be testing a completely new product developed by developers whose defect rate is constant). It depends on the circumstances.
  • Chiming in way late here, following a link from Joel's page of last week, not sure if you've discussed this further elsewhere, but:
    Valorie, isn't the customer of the tester the customer of the business? So rather than bug counts or dev reviews, shouldn't the tester's rewards be based on bug reports filed by clients?

    This'd have the positive impact that the quality of the team is measured, rather than the quality of the dev or of that individual tester - both of which are subjective measures anyway.

    It would also make your assessment of severity easier : a spelling mistake mentioned by your 10,000 seat client *may* have more business impact than when the 20 seat client screams at you about the false positive on file save.
  • Isn't it a little bit moronic to discuss how testers spend their time? It is their time that they waste anyway! And you are being paid for fixing those bloody bugs, you should concentrate on fixing them, or better yet, not produce them.

    I know I'm asking too much...

    If developers like you were not so incompetent, then testers could concentrate on the important stuff. I'm not saying that testers are brilliant, but your jerk attitude isn't helping either.

    And of course you can invent the most important metrics for testers, but it will just get fatter and fatter until it explodes on your face. Developers who produce buffer overruns should not be allowed to modify metrics, because you must be a lot more careful when designing metrics. You should get a book on statistics if you insist. But anyway, it would be like saying that someone who can't balance his own checkbook should run for president. Resolve your own misunderstandings before you try to resolve someone else's.

    Peace.
  • Larry, you are so obviously right that even a discussion about it is sick.

    I've been a volunteer Mozilla contributor, and now consultant, since 1999, mainly as a developer, but I also hunted bugs, mainly during dogfood. As it turned out, I was among the top 12 bug filers ever, and in the top 3 of non-Netscape people, measured by reported bugs (valid and invalid). Top 1-3 are well known (at least by me) for filing a mass of duplicate and stupid bugs, some even wrong.

    Without exaggeration, Top 1 actually filed a number of bugs "http://mozilla.org should be http://www.mozilla.org/", one for each instance of the URL in the product. (For those who don't know: the slash is optional, the URLs are semantically identical. I prefer the version without slash.)

    Top 2 was a Netscape developer; he filed bugs without checking for existing bugs at all, didn't think things through, and sometimes even fixed the bug right away and marked it fixed. Quite often, it was later discovered that it was a dup, and that he had missed important facts that were already considered in the initial description or discussion of the older bug.

    If you count only bugs still open or fixed, with dups and wontfix counting negative, etc., the numbers shuffle around a bit (Top 1 becomes Top 3, I drop to Top 18 or so), but not all that much. Top 1-3 are still the same people and no better than before.

    As you rightly said, the correctness, clarity/reproducibility and severity of a bug are very important for the usefulness of a bug report. But if you try to measure this merely by bug DB fields, you'll get quarrels about Severity: major vs. normal instead of people fixing and finding bugs.



    Judging programmers by lines of code produced is just as stupid. The best solutions are often the shortest.
    A manager who proposes or uses something like this should be fired.
    Any attempt to measure quality (of people, products etc.) based on statistics is bound to fail, in my observation, and does serious harm. Nothing can replace a rational, well-intended and -directed mind in judging.
  • Correction, sorry, it should read: "http://www.mozilla.org should be http://www.mozilla.org/", i.e. he only wanted the slash added.
  • Great post, Larry. Thanks for picking up this "burning issue" that test leads/managers worldwide face today.

    My thoughts are:
    Though it has been fiercely fought, extensively debated, and beaten to death – "we should not measure a tester's performance only on the basis of the number of bugs they log" – I think that if we were to put a metric around testing to make it SMART, the number of bugs may be the closest one can think of. Among the several work products that testers produce, bugs are the single most important and have a direct impact on the quality of the final product that goes out to the customer.

    So are there any efforts to classify the bugs logged in a standardized and normalized way so that we can compare two bugs? Once we have a decent framework for classifying bugs that takes care of all the possible variables that make one bug different from another, it would be fair to compare the performance of two testers.

    Having said all this, I strongly believe that there should be a way to quantify and measure the output of a tester.



    Shrini