One of the benefits of Watson, combined with our internal use of OneNote, is that we get a different set of feedback from internal users whenever there is a new build of OneNote for everyone to use. It works like this:

  1. We build a new version of OneNote each day and in some cases several times per day.
  2. We evaluate each build to see if it is "good enough for dogfood" - in other words, can folks around Office use this build of OneNote to get their work done? If it has some rough edges, that is OK, but we don't want anything to block people from using OneNote.
  3. If it meets this standard, we roll it out.
  4. Ultimately, the goal is for every daily build to get rolled out.
  5. Once we deploy it, our work on the test team is not done.

Here's where we look at the Watson data for any new spikes or hits. This can be a little tricky since not everyone updates at the same time - some folks delay updating for weeks, and so on. The end result is that we may only have a small number of users on any given build, so we have to take that into account when reading the Watson reports.

For example, suppose we look at the data for the first 50 people that upgrade. If even one person hits a crash, that represents a 2% failure rate, so we have to take it very seriously and start an investigation. As more people upgrade, we get more reports, and then we can start prioritizing according to how many people are hitting each error.
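To see why a single crash among 50 users needs care, it helps to put a confidence interval around that observed 2% rate. Here's a minimal sketch (my own illustration, not the actual tooling we use) using the Wilson score interval, which behaves well at small sample sizes; the function name `wilson_interval` is a hypothetical helper:

```python
import math

def wilson_interval(hits, users, z=1.96):
    """95% Wilson score interval for a failure rate observed
    in a small sample (hypothetical helper for illustration)."""
    if users == 0:
        return (0.0, 0.0)
    p = hits / users
    denom = 1 + z**2 / users
    center = (p + z**2 / (2 * users)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / users + z**2 / (4 * users**2)
    )
    return (max(0.0, center - margin), min(1.0, center + margin))

# One crash among the first 50 upgraders: 2% observed,
# but the plausible true rate spans a wide range.
low, high = wilson_interval(1, 50)
print(f"observed: {1/50:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

With these numbers the interval runs from well under 1% to roughly 10%, which is exactly why we can't dismiss a single early report - the true failure rate could be far worse than the point estimate suggests.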

Having fewer users makes this much more critical to get right, since any one user can represent a large percentage of the reports. We also have to examine each report thoroughly so that we don't miss a bug that a whole lot of people will eventually hit - we can't afford to let these initial reports slip by.

It's a challenge we face each day on the test team, and I thought it would be interesting for you to think about. There's more I could say here from a statistical viewpoint if you like - just let me know.

Questions, comments, concerns and criticisms always welcome,