Plumbing channels waste water into a series of larger and larger pipes until it is expelled. That's because sewage flows downstream, which explains the quality of goods that test, operations, and sustained engineering teams receive. After all, they are downstream of design and development.

I've written about pushing quality upstream for testers in "Feeling testy" (chapter 4), and making services resilient to failure for operations using the five Rs in "Crash dummies." Like most engineers, I've neglected sustained engineering (SE), also known as the engineering sewage treatment center. No, on second thought, that analogy implies that what we release to customers has been cleansed. SE is more akin to environmental cleanup after an oil spill—thankless, difficult, and messy.

Imagine what must go through the minds of those cleanup crews as they wash oil from the feathers of sea birds. Naturally, there's empathy for the birds (customers). There's frustration at the inevitability of mistakes that lead to tragedy (buggy software). And there's a palpable desire to have the jackasses who caused the spill be forced to take your place (the engineers who let the bugs slip by).

You make the call

Should the engineers who design, construct, and test the code be the same engineers who fix the bugs found after release? This is the quintessential question of SE.

If the engineers who built the release fix the post-release bugs, you typically get better fixes, the engineers feel the pain of their mistakes, and the fixes can be integrated into the next release. Then again, the next release may not happen because its engineers are being randomized.

If you have a dedicated SE team, you build up knowledge of the code base outside the core engineering team, you can potentially pay a vendor to sustain old releases, and you don't distract or jeopardize progress on new releases. Then again, SE teams get little love, their fixes can be misinformed, you duplicate effort, and the core engineering team isn't held accountable for its mistakes.

Tough call, huh? Nope, not at all. While both models can work, having the engineers who build the release also fix post-release bugs is far better. Only idiots believe a lack of accountability leads to long-term efficiency and high quality. Of course, the world is full of idiots, but I digress.

Someone's got to take responsibility

Yes, a dedicated SE team can work, but long term it will only cause grief for team members and customers. Why? Because you can mitigate the distraction that post-release fixes cause the core team, but you can't mitigate the problems inherent in a dedicated SE team.

Let's go through those dedicated SE team problems again.

§  Little love. What would it take for the dedicated SE team to be appreciated as much as the core engineering team? A disaster, right? And what would it take on a day-to-day basis? Non-stop disasters. In other words, the conditions for loving the SE team are undesirable.

§  Misinformed fixes. To get a fix right, recognizing all the implications of changes, you need to deeply understand the impacted portion of the code base. Let's fantasize that the core engineering team has that level of depth. Even then, the core team is always considerably larger than the SE team, so the SE team has no hope of truly appreciating the impact of fixes. Reality is only worse. Sure, you can have the SE team consult with the core team, but doing that all the time defeats the purpose.

§  Duplicate effort. Whenever you have two teams fixing issues in the same code you duplicate effort, by construction. You've got two teams learning the same code, debugging the same code, changing the same code, and testing the same code. There's no getting around it, unless you neglect to incorporate the fixes into the next release, which is even worse.

§  Accountability for mistakes. The whole point of the dedicated SE team is to avoid derailing the core engineering team, protecting them from dealing with fixes. The core team doesn't correct its mistakes in the old code, and doesn't know to prevent those mistakes from recurring in the new code. What's worse is that there's no reinforcement of good and bad behavior. Conscientious heroes aren't rewarded with writing more new code, and careless villains aren't forced to fix their past mistakes. Thus, we can never expect to improve. A great recipe for joyful competitors and sorrowful customers.

What do I do now?

In contrast, there's plenty you can do to avoid jeopardizing future releases while the core engineering team fixes prior mistakes. Let's run through the relentless, randomizing requests and resolve them.

§  Triviality. How do you avoid wasting the core team's time with issues that aren't software bugs or that have trivial workarounds? You have a small dedicated team triage the issues. Note this team isn't a development team. It's purely an evaluation team that determines which issues are worth fixing. That way, only worthwhile work is passed on to the core team.

§  Prioritization. How do you balance bug fixes for the last release with work on the new release? You have the dedicated evaluation team prioritize the fixes. There are four buckets: immediate fix (the rare "call the VP now" issue); urgent fix (next scheduled update); clear fix (next service pack or update); and don't fix. These buckets send clear signals to the core team about which bugs to fix at what time.

§  Unpredictability. How do you make inherently unpredictable post-release issues easy for the core team to schedule around? You make them regular events. Deploy one update per month. The urgent fixes each month are queued up by the evaluation team. The core team sets aside the necessary time each month and the fixes are designed, implemented, tested, and deployed on a predictable schedule. This is just as good for customers as it is for the core engineering team. Everyone likes predictability.

In addition, the evaluation team can create virtual images for easy debugging by the core team, improve the update experience for customers, and reflect customer needs and long-term sustainability features back into future releases.
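The triage and prioritization flow above can be sketched in code. This is a minimal illustration, not anything from the original column; the issue fields, severity scale, and function names are all hypothetical assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class Bucket(Enum):
    """The four priority buckets assigned by the evaluation team."""
    IMMEDIATE = 1   # the rare "call the VP now" issue
    URGENT = 2      # queued for the next scheduled monthly update
    CLEAR = 3       # rolled into the next service pack or update
    DONT_FIX = 4    # not a software bug, or has a trivial workaround

@dataclass
class Issue:
    id: int
    title: str
    is_software_bug: bool
    has_trivial_workaround: bool
    severity: int  # hypothetical scale: 1 (catastrophic) .. 4 (cosmetic)

def triage(issue: Issue) -> Bucket:
    """Evaluation-team triage: only worthwhile work reaches the core team."""
    if not issue.is_software_bug or issue.has_trivial_workaround:
        return Bucket.DONT_FIX
    if issue.severity == 1:
        return Bucket.IMMEDIATE
    if issue.severity == 2:
        return Bucket.URGENT
    return Bucket.CLEAR

def monthly_queue(issues: list[Issue]) -> list[Issue]:
    """Urgent fixes queued up for the core team's next monthly update."""
    return [i for i in issues if triage(i) is Bucket.URGENT]
```

The point of the sketch is that the evaluation team owns `triage`, while the core team only ever sees the predictable output of `monthly_queue`.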

Eric Aside

Of course, it isn't as simple as a small evaluation team prioritizing issues. There's a bunch of orchestration and system support necessary to make SE run smoothly. That part is unavoidable. What is avoidable is duplicating effort, misinformed fixes, and ignoring accountability.

This won't hurt a bit

See, it's not that complicated. You save on staff. You get better fixes. You catch similar issues in advance. You achieve predictability. And you ensure the core engineering team is accountable for quality and learns from its mistakes. All it costs is a relatively tiny dedicated team to manage the monthly update process by evaluating and prioritizing issues. Even that team feels valued due to their differentiated and important role and their direct engagement with solving customer problems.

Yes, sewage flows downstream and no one likes cleaning it up. However, by putting some simple processes in place, you can reduce the sewage and have those responsible mop up the mess. To me that smells like justice.

Eric Aside

What do you do if you are stuck on a dedicated SE team and are experiencing little love, misinformed fixes, duplicate effort, and no accountability from the core team? Here are a few ideas:

§  Create a rotational program with the core team. Everyone spends a month or two a year on the SE team. It's not ideal, but I've already established that point.

§  Measure your efficiency and effectiveness, perhaps by the average time to resolve issues for each bucket, the regression rate, team morale, and customer acceptance of fixes (a balanced scorecard). Optimize, publish your results, and show the core engineering team how great work gets done.

§  You ship updates once a month—celebrate once a month.
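That balanced scorecard could start as something this simple. The record layout and field names here are hypothetical assumptions, just to show how little tooling the measurement requires:

```python
from statistics import mean

# Hypothetical resolved-issue records: (bucket, days_to_resolve, caused_regression)
resolved = [
    ("immediate", 2, False),
    ("urgent", 12, False),
    ("urgent", 20, True),
    ("clear", 45, False),
]

def avg_days_by_bucket(records):
    """Average time to resolve issues, broken out per priority bucket."""
    by_bucket = {}
    for bucket, days, _ in records:
        by_bucket.setdefault(bucket, []).append(days)
    return {bucket: mean(days) for bucket, days in by_bucket.items()}

def regression_rate(records):
    """Fraction of fixes that introduced a regression."""
    return sum(1 for _, _, bad in records if bad) / len(records)
```

Add morale and customer-acceptance numbers from surveys, publish the trend lines each month, and you've got your scorecard.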