Chaos Returns

In my previous newsletter I noted that things did not go as expected when I implemented new bug spawning logic.

It was déjà vu. Back in May 2021 I wrote a post that I originally titled “Chaos” (but later renamed “Sweet Spot”) about a simulation scenario where a tester found thousands of bugs and overwhelmed the programmer with work.

The scenario I'm working on now is different from that one. It is an exercise in estimation. I wanted there to be bugs to make it feel realistic, but I didn’t want to add other elements (like testers) to the setup. So I wrote new bug spawning logic that randomly generated bugs based solely on a reliability score.

The first time I ran the new scenario, it generated no bugs. “That’s weird,” I thought. So I lowered the starting reliability, thus increasing the odds of bugs spawning. Suddenly the team was awash with bugs and couldn’t finish the release. Chaos all over again.

I raised the reliability score a little. Again, no bugs. Lowered it a bit. The number of bugs exploded. After some more experimentation I determined there was a tipping point. If reliability was above the tipping point, there were no bugs. If it was below the tipping point, there were so many bugs the team couldn’t finish the release before the end of the simulation. There was no middle ground, no reliability score where the team could deliver the release, just a little more slowly than if they were working with a higher quality code base.

This was not what I intended. Clearly something was broken, but what?

I had test-driven the code. It has 100% code coverage. (No, that's not a typo. It really is 100%.) So I was confident that the code was doing what I’d written it to do. But if the code was doing what I’d intended, how could the outcome be so surprising?

It was time to get another perspective. Fortunately, Abhi Hiremagalur and I had a scheduled pairing session coming up. Abhi is observant, insightful, wicked smart, and a highly skilled, very experienced developer.

We worked through the code and wrote additional tests to characterize the behavior we were seeing. At each step we saw that the implementation was what I’d originally intended, but the outcome remained quite surprising.

“Hold up,” Abhi said. “Tell me again the conditions under which a bug should spawn?”

We walked through it one more time.

“Here’s the thing that bothers me,” he continued. “Over the course of the entire simulation, if we look at the probability of a single event spawning a bug, and the number of those events, what does statistics say about the typical number of bugs we should see overall?”

I thought I knew the answer, but clearly I was wrong. So we sketched out some equations, modeling the probabilities given the logic I had coded.

The reliability score is a decimal number ranging from 0 to 1. Bugs had a chance to spawn any time the team delivered work. If a spawning event occurred, and a feature had a 0.999 reliability, theoretically it would have a one in a thousand chance of spawning a bug. Only it wasn’t that simple. The way I had written the code, each and every delivery—whether a feature or a bug—had a chance of spawning a new bug against each and every one of the existing items already delivered.
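Here is a simplified sketch of that logic in Python. It is not the simulation’s actual code (the names and numbers are made up for illustration), but it captures the shape of it:

```python
import random

def spawn_bugs(already_delivered: int, reliability: float) -> int:
    """Roll once against each already-delivered item.
    Each roll has a (1 - reliability) chance of spawning a new bug."""
    return sum(1 for _ in range(already_delivered) if random.random() > reliability)

def run_release(feature_count: int, reliability: float, max_deliveries: int = 10_000) -> int:
    """Deliver features (and any spawned bugs) until the backlog is empty
    or we hit the end of the simulation. Returns the total bugs spawned."""
    delivered = 0
    backlog = feature_count
    total_bugs = 0
    while backlog > 0 and delivered < max_deliveries:
        backlog -= 1                                   # deliver one item...
        new_bugs = spawn_bugs(delivered, reliability)  # ...which may spawn bugs
        delivered += 1
        backlog += new_bugs
        total_bugs += new_bugs
    return total_bugs
```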

Intuitively, that model matches reality. When you make a change to an existing system, the biggest risk is usually not that the change is buggy by itself but that it interacts with other parts of the system in surprising ways. The larger and more complex the existing system, the more bugs a given change is likely to spawn.

(As an aside, this is why so many organizations have historically had a change control board: if you can't reduce the size or complexity of the existing system, at least you can theoretically reduce risk by managing the number of changes you make to it. Ironically that strategy increases rather than decreases risk. But that is a discussion for another time. Back to the story at hand.)

Let’s walk through the bug spawning logic for the first ten things the team delivers. The first item has no chance at all of spawning any bugs because there are no other delivered items to spawn a bug against. The second item has one chance to spawn a bug, with the probability of a bug actually spawning controlled by the reliability score. The third item has two chances. The fourth item has three. And so on. Each one of those chances could produce a bug. That means the tenth item delivered could theoretically spawn up to nine bugs.

Although my algorithm felt intuitively correct, mathematically the progression of potential spawning events (0 chances, 1 chance, 2 chances, and so on) meant that the expected number of bugs in the scenario was the sum, across all deliveries, of each delivery’s expected bug count. And that series does not converge; it grows without bound as the release gets bigger.
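In back-of-the-envelope terms: if p = 1 - reliability is the chance of any single pairing spawning a bug, and we ignore for a moment that the spawned bugs must themselves be delivered, the expected bug count after N deliveries is roughly:

```latex
E[\text{bugs}] \;\approx\; p \sum_{k=0}^{N-1} k \;=\; p \cdot \frac{N(N-1)}{2}
```

That sum grows quadratically with the number of deliveries, and since every spawned bug becomes yet another delivery, the real behavior is even worse than the formula suggests.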

Abhi and I used our equation to calculate the number of bugs we should expect in the scenario given the number of features in the release. The math predicted our observation: with a reliability score below the tipping point, any given delivery late in the release was expected to spawn more than one bug. It was the proverbial one step forward, two steps back. The amount of work remaining in the release would grow quadratically instead of burning down. No wonder the simulation had two modes: no bugs or too many bugs.
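If you want to poke at that cliff yourself, the back-of-the-envelope check takes only a few lines of Python. (The counts and reliability values below are made up for illustration; they are not the scenario’s actual numbers.)

```python
def expected_new_bugs(already_delivered: int, reliability: float) -> float:
    """Expected bugs spawned by the next delivery under the pairwise model."""
    return (1 - reliability) * already_delivered

# Late in a release, suppose 100 items have already been delivered.
for reliability in (0.999, 0.995, 0.98):
    rate = expected_new_bugs(100, reliability)
    trend = "backlog shrinks" if rate < 1 else "backlog grows"
    print(f"reliability {reliability}: {rate:.2f} expected bugs per delivery ({trend})")
```

Once the expected bugs per delivery crosses 1, each item the team finishes adds more work than it removes. And because that crossover depends on how much has already been delivered, a small change in reliability is enough to flip the whole release from one mode to the other.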

I still haven’t fixed the simulation. But as I pondered the problem, I realized that there is a deeper lesson to learn here about the nature of software projects.

The knob we were turning in the simulation, the reliability score, has a linear scale, but the corresponding effect was a quadratic growth curve in the work remaining. When I tried to fiddle with the reliability score to tame that curve, I couldn’t.

The implication hit me like a ton of bricks: the simulation showed that linear knobs may be able to trigger quadratic effects, but they can’t control them.

Back in the real world, it turns out all we really have are linear knobs. We don’t even have that many of them. Sure, we can pour effort into improving code quality, increasing test coverage, or getting better requirements. Or we can grow the team if there aren’t enough people on staff to do all the work. But once a code base has gone past the tipping point, there aren’t any easy answers. There is no magic dial we can turn to fix the situation.

In short, it is much easier to create a mess than to clean it up. The only real solution is not to let things get that bad to begin with.

This is far from a new insight. Agile processes may have given us new practices to support simplicity, frequent delivery, and fast feedback, but our industry recognized the need to manage complexity long before that fateful February day in 2001 when the original authors of the Agile Manifesto gathered at Snowbird.

Yet despite our best intentions, complexity keeps sneaking back in. It’s so easy to slide down the slippery slope to working on larger and larger deliverables in the name of “efficiency,” or to accept cutting corners as a necessary expedient.

Seeing the wild and uncontrollable pace at which the amount of work to do multiplied in the simulation gave me a new appreciation for how critical it is to keep our problems small. Left unchecked, complexity will grow unbounded. It takes enormous effort, and sometimes tremendous intestinal fortitude, to constrain the size of the problems we tackle. But it’s the only way to keep the problems small enough that they remain tractable.

This is one of my big motivations for creating this simulation. My hope is that it can give leaders a visceral sense of the tradeoffs they’re making when they choose to push for features over other concerns like investing in higher quality, knowledge sharing, a more maintainable code base, or better infrastructure to support faster feedback.

So what’s next? After I fix the estimation scenario and push the updated version to the staging site, I plan to spend a little time experimenting with the simulation engine. I have a number of scenarios in mind, but I want to see how they play out in pure code before I invest in creating an interface for them.

It’s also worth noting that I have been spending time on other projects recently. I am continuing to work on the simulation, just more slowly. So while I don’t expect to let another four months elapse between updates, I also don’t expect to return to a weekly publication schedule.

I remain incredibly grateful to everyone who has spent time playing with the existing app and offering feedback. All that input has been invaluable in shaping the future of the simulation.

Stay curious,

Elisabeth