Larry Osterman's WebLog

Confessions of an Old Fogey

It's on the whiteboard



Way back when, when we were first shipping NT 3.1, checking files into the source tree was pretty easy.  You made your changes and checked them in. Not a big deal, since there were only 20 or so people working on the code base - the chances of collision were relatively small, and the codebase was pretty manageable.  There was a small team of people who had the job of doing nightly builds; it was their responsibility to ensure that a build was done every day, and that the build worked and passed BVTs (build verification tests).  The team was something like 5 people, if I recall.

At some point, a number of groups joined the core NT team, and the NT team grew to a couple of hundred developers.  Not surprisingly, the system that had worked for the 20 or so people didn't scale to the hundreds of people who were now using the system.  It got so bad that we often went for days at a time without being able to have a good build.

We tried community shame (if I had a scanner here at work, I'd scan in the picture of me from back in those days wearing goat horns), but it didn't work (I have no shame).  We tried staged checkins (each team gets a dedicated hour to check in).  We tried darned near everything, but the problem was that our old system simply didn't scale to the size of the group.

Eventually things got so bad that Dave Cutler ended up moving into the NT build lab to directly supervise the builds.  It was a variant of the "community shame" solution, but instead of being forced to wear a silly costume, you had to explain why you screwed up to Dave directly, and it was FAR more effective (there's nothing like being grilled by Dave Cutler to instill fear into a developer).

In order to manage the volume of changes, Dave instituted the "Whiteboard".  Basically he got Microsoft to buy a 4 foot by 12 foot whiteboard, and had them mount it vertically in the build lab across from where he sat.  When you had a change ready to check in, you went to the whiteboard and wrote your name, the bug #, the module being changed, and a contact number.  Dave would then periodically run down the board and call individuals to get them to check in their changes.  The cool thing about this mechanism was that Dave could control the build process - he could do sanity builds after individuals (like me) who had a propensity for breaking the build, he could batch changes from the same group together, etc.

It also provided a dramatic visual representation of the state of NT - when the whiteboard was full, the product had lots of bugs, when it was clear, we were close to being done.  And when it was empty, we had shipped the product.
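
As a rough illustration only, the whiteboard can be thought of as a small queue of check-in requests that one person walks through in order. Everything in the sketch below is hypothetical: the `group` and `risky` fields, the `plan_checkins` function, and the sample entries are assumptions made for the example, not a record of how the build lab actually worked.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Entry:
    """One line on the whiteboard: name, bug #, module, contact number."""
    name: str
    bug: int
    module: str
    contact: str
    group: str = ""      # hypothetical: the team the developer belongs to
    risky: bool = False  # hypothetical: has a history of breaking the build


def plan_checkins(board):
    """Turn the board into an ordered list of steps.

    Changes from the same group are batched together, and a sanity
    build is scheduled right after any developer flagged as risky.
    """
    steps = []
    # sorted() is stable, so entries keep their whiteboard order within a group.
    by_group = sorted(board, key=lambda e: e.group)
    for _, entries in groupby(by_group, key=lambda e: e.group):
        for e in entries:
            steps.append(
                f"call {e.name} at {e.contact}: check in {e.module} (bug {e.bug})"
            )
            if e.risky:
                steps.append("sanity build")
    return steps


# Made-up sample board; names, bug numbers, and modules are placeholders.
board = [
    Entry("alice", 101, "module_a", "x1234", group="fs"),
    Entry("larry", 202, "module_b", "x5678", group="net", risky=True),
    Entry("bob", 303, "module_c", "x9999", group="fs"),
]

for step in plan_checkins(board):
    print(step)
```

The point of the sketch is only that whoever owns the list decides the order: batching and extra sanity builds are scheduling decisions, not something each developer makes on their own.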

 

Of course, the whiteboard didn't really scale, even to a project the size of NT 3.1.  And today, Vista is vastly more complicated - there are several thousand developers contributing code into a bazillion binaries composed of a gajillion source files (I don't know how many, but there's a lot of them).  There's no way that the whiteboard could conceivably scale today.  Instead, we have a main build lab (which produces the final bits of the product) and a series of "virtual build labs", each of which is responsible for aggregating changes from a set of Windows developers.  It's far more scalable than the old system, and significantly more flexible (at a minimum, it doesn't require that a Senior Distinguished Engineer spend all his time making sure that the build completes successfully).

 

  • I met Dave Cutler at the NT announcement event in San Francisco in the Summer of 1992. What stood out for me based on lobby conversations and chatting was how he showed a very distinctive trait of system architects. He was very clear on the invariants that he would hold onto no matter what, confident that anything else would be fixable.

    It is no surprise that as the development process moved toward shipping, Dave would be involved in the operational end, finding the key thing to keep tied down.

    That's a great story. Thanks.
  • > when the whiteboard was full, the product
    > had lots of bugs, when it was clear, we were
    > close to being done. And when it was empty,
    > we had shipped the product.

    When the whiteboard was full, you were far from shipping. When it was clear, you were close to shipping. And when it was empty, you had shipped the product. The number of bugs fluctuates independently of that status.
  • Norman, every software product ever shipped has shipped with known bugs.

    Every single one of them. The only question is if the bugs are sufficiently bad to justify holding up the shipment.

    So as long as there are bugs bad enough to justify holding the product up, there will be fixes for those bugs (as we find the bugs, we fix them).

  • Nice posting, very informative. Please blog more of these stories
  • hmm, well I guess I should be paying attention in /Software Engineering/ then... However much I happen to hate it.
  • Manip:

    I'm not exactly sure what your "Software Engineering" course might teach, but a quick scan of Google makes me think you REALLY need to pay attention to that class.

    My biggest day to day issues that I have with my level 1 newbie programmers all the way up to my level 4 Greek God programmers are failings in software engineering. Requirements gathering, testing, playing nice with others, maintainability are things I see day to day.

    We all tend to come out of school thinking we are the smartest and best programmers who will magically produce the best code that the user could ever hope to have. It never works out that way. My job as a lead is to nicely (well, sometimes not so nicely) beat that out of people. My goal isn't to teach my guys how to produce software that is the coolest software ever created. My goal is to teach my guys how to deliver functional and maintainable software that meets the needs of the user so we can all sit on the lawn at 1:30 drinking beer because the user is happy and the software "just works".
  • "you had to explain why you screwed up to Dave directly, and it was FAR more effective"

    That was a much better explanation than G. Pascal Zachary was able to manage in 20 pages of his Showstopper! book. Thanks for the information and history.

    I have to give in to a bit of pedantry: Cutler wasn't a distinguished engineer back then, was he?
  • I have to sit there and laugh; actually it is amazing you could do that with 20 people, let alone thousands. We still have problems sometimes with a 5 man team tripping over each other. As for shipping software with bugs: yep, I have had to do the same myself. Sometimes it has to happen.

  • I'd love to hear how you got past the whiteboard and scaled up to hundreds of developers. In other words, it sounds like at one time it was very dependent on the Dave Cutler personality, so who was instrumental in moving beyond that to a more scalable build arrangement? Great story, thanks.
  • The vbl system isn't all skittles and beer... The multiple layers of branch hierarchy and the bureaucracy involved in getting fixes integrated between them mean it can literally take months for a fix to propagate through the system. Even so, it's better than what we had before, where you counted yourself lucky to get a set of source that didn't have a build break.
  • Thursday, October 13, 2005 1:30 AM by LarryOsterman
    > Norman, every software product ever shipped
    > has shipped with known bugs.

    There are a few that ship only with unknown bugs. I think we can agree by deleting one word from your sentence: every software product ever shipped has shipped with bugs.

    > The only question is if the bugs are
    > sufficiently bad to justify holding up the
    > shipment.

    Sure, but we have pretty big differences of opinion over what kind of bug is sufficiently bad. I think that a bug which destroys all files in a disk partition is an example of sufficiently bad.

    But get this: even when Windows 95 did that to me, even after I got it tracked down, I felt somewhat understanding because I know that every product ships with bugs. What made the difference was your company's reneging on warranties, refusal to allow bug reports to be submitted without the victim paying a fee, and then when your company accidentally allowed a contact which led to discussion of this bug, your company told lies denying it, and then switched to lies saying that it wasn't serious because the number of victims was low and the product was old. (The number of victims was not low, only the number of victims who understood it was low. And Windows 95 was still being shipped to corporate customers.)

    So we agree on the fact of bugs, but we have widely differing opinions on what kind of bug is serious and whether fixes should be delivered to customers.
  • Perhaps the whiteboard could be scaled using a "draft" repository and a "real" one. Checking a change into the former queues up a whiteboard-style entry on an integration list; the integration team pulls in the changes at their discretion, as they did with the whiteboard, and anything that breaks the build gets dumped back in the lap of the engineer who did the check-in, at which point he has to provide a replacement or followup.

    I've never worked with that large an organization, so this is purely speculation.
  • Of course, this procedure must surely have prompted the question, "What if Dave Cutler was hit by a bus?".

    There has been a similar question asked in the GNU/Linux world for years. It was finally answered by the following experiment:

    http://web.archive.org/web/20000422005356/http://segfault.org/story.phtml?mode=2&id=38b40d78-087dd360
    [via geekz.co.uk]

    Were similar studies performed at M$?
  • Monica, in fact, for NT 3.51 to somewhere during the Win2K process, the whiteboard was replaced with three Exchange public folders that comprised the "electronic whiteboard". You submitted your change (using a custom form) to the "pending" queue, and the build lab picked it up and put it in the "processing" queue. When the change was committed, it went to the "completed" queue and mail was automatically sent out.

    That worked for NT 4, but by the time W2K came out, it was unworkable. For Win2K, Mark Lucovsky and several others designed a totally different system. He went into some detail in his USENIX talk at: http://www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky_html/Lucovsky.ppt

    For Win2K, there were about 1400 developers working on Windows. There are significantly more developers than that contributing to Vista, which is why the new structure is so critical.

    Mike's also right - the branching architecture adds reliability at a cost of bug fix latency. Fixes can live in a build lab for three or four weeks before they hit the main tree, which can be a big deal. On the other hand, if a particular vbl is seeing a particular problem, it's usually pretty easy to make the fix in their branch and wait for it to propagate into your branch.

    On the other hand, winmain builds are known to be of VERY high quality, which helps a LOT.