A recent flood of build breaks triggered a wave of tool suggestions to plug the cracks in our code. Some argued for faster builds. Some argued for deeper branching. Some argued for a “gauntlet” service that simulates official builds and blocks problem code submissions. All of these suggestions are awash in the seeping sewage of the flood—none of them address the root cause, and many would only pressurize the leak until it truly exploded in a tsunami of stifled stench.
Build breaks and other quality issues aren’t created or resolved by tools any more than the code is. Problems are created by people, and they are resolved by people. Misguided, well-meaning nitwits protest, “Of course people create and resolve problems, but tools can have a huge impact.” Heck yeah! Tools can have an enormous impact—they can make checkins slower, costlier, and less frequent; they can frustrate engineers to the point of leaving the project and company; and they can remove all creativity, agility, and pride from development until our code is a mindless mush molded to match the meaningless mechanisms of our monochromatic, masochistic machine.
Tools serve us, not the other way around. Before you suggest a tool, before you jump to a solution, before you make mayhem with mechanism, start with the human problem. What are people trying to accomplish? What’s getting in the way of success? What alternatives are available for the range of situations? Once you understand the true goals, then you can ask how tools might help. Don’t start with a tool. Don’t be a tool.
The problem in this case is a bunch of build breaks. Actually, the problem is that you need your product to build or you can’t deploy it or sell it—so you build it all the time (a best practice). When someone checks in code that breaks the build, then no one on the team can retrieve the current code and build it. That can slow down value being added to your product in the form of code enhancements.
What are you really trying to accomplish? You’re trying to maintain the pace of value being added to your built product. Bad code checkins slow that pace for the individual and the team.
What alternatives are available to avoid bad code checkins and maintain the pace of value being added to your built product?
Eric Aside
Gauntlet systems only succeed when their results match official build and test results. Of course, the gauntlet system can’t be identical to the official build system (different queuing mechanism, different code signing, different build machines and environment, different publishing, and different performance optimizations). Maintaining identical results for separate systems isn’t feasible, so gauntlet systems often don’t work properly, and they often add hours to each checkin while they perform their validation.
Each alternative has its strengths and weaknesses. Which one is right? Ah, we haven’t considered the range of situations.
The complexity of your build system itself could be causing breaks. Many of these systems evolved slowly over time and haven’t received the engineering rigor we take for granted with modern production software. Investing in build systems isn’t sexy work, but it has a huge force multiplier when every engineer gains an hour or more of productivity per day.
There are three general categories of changes that can cause build breaks. Each has a different risk profile.
A gauntlet system can protect you from all three categories, but it’s only worth it for the last category, and it’s prone to failure. A private branch helps the last two categories, but it’s overkill for the first category and just moves the pain around. Relying on LKG (last known good) builds and trusting engineers to be diligent works for all three categories, but it will let some build breaks through.
Wow, what do you do? Oh wait, that’s right—it’s a people problem, not a tool problem.
Tools can help, but you start with people. Too much reliance on tools quickly makes them a crutch, causing people to shut off their brains and hit the button instead of applying discretion.
How do you best avoid bad code checkins and maintain the pace of value being added to your built product? Allow and expect people to use engineering discretion.
How do you keep people from breaking the LKG? You don’t. As long as breaks are rare, the pace of value being added to your built product will be high. So you trust engineers to be professionals and to apply the appropriate level of verification for their changes.
When you make the impact and choices clear to people, and you publicly expect and trust them to make the right decisions, they feel empowered and responsible for their work. The right example is set and folks fall in line—for each other as teammates. The risk is minimized, build breaks are rare, and value to customers is maximized. That’s what happens when people are the solution.
Many teams highlight the trust placed in engineers by conferring a lighthearted, embarrassing token on folks who fall short of earning that trust. My old team used a big stuffed bear called “Buster, the Build Break Bear.” The last person to leave the build broken for an hour had to keep Buster on display in their office. As long as it’s not mean-spirited, this kind of token is an effective reminder.
For enormous teams, the sheer number of engineers causes even rare events to become common. To protect the LKG, you’ll need separate private branches and LKGs for each large subgroup. You’ll want rolling build integrations in both directions to keep the private branches in sync.
How large should subgroups be? You want as few private branches as possible because each extra branch slows down code movement. So you want subgroups as large as you can make them while still having only a few LKG breaks a week (100 to 500 people each, in my experience).
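To see why rare events become common on enormous teams, and how a subgroup size falls out of a weekly break budget, here is a rough back-of-the-envelope sketch. The function names and every rate in it (five checkins per engineer per week, two breaks per thousand checkins, a budget of three breaks a week) are made-up assumptions for illustration, not measurements from any real team. Plug in your own numbers.

```python
def expected_breaks_per_week(engineers, checkins_per_engineer, breaks_per_thousand):
    """Average LKG breaks per week for a group sharing one branch.

    All inputs are hypothetical: checkins_per_engineer is per week, and
    breaks_per_thousand is how many of every 1,000 checkins break the build.
    """
    return engineers * checkins_per_engineer * breaks_per_thousand / 1000


def largest_subgroup(max_breaks_per_week, checkins_per_engineer, breaks_per_thousand):
    """Largest group size that stays within the weekly break budget."""
    return (max_breaks_per_week * 1000) // (checkins_per_engineer * breaks_per_thousand)


# Assumed rates: 5 checkins per engineer per week, 2 breaks per 1,000 checkins.
print(expected_breaks_per_week(300, 5, 2))    # 3.0 breaks a week: tolerable
print(expected_breaks_per_week(3000, 5, 2))   # 30.0 breaks a week: always broken
print(largest_subgroup(3, 5, 2))              # 300 engineers per subgroup
```

The same diligent engineers who break the build three times a week as a group of 300 break it thirty times a week as a group of 3,000. Nobody got sloppier. The headcount did the damage, which is why the fix is splitting the branch, not blaming the people.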