For the last few weeks I have been working on both building out and exercising the (still unnamed) simulation engine. This week’s newsletter walks through the simulation results for a particular scenario.
First, I have to set the stage.
Flashback. Many years ago. I was leading development of an enterprise product. We’ll call it “InfraDuck” (not its real name).
It had been a rough several weeks. Everyone was frustrated.
Engineering was busy working on the next major InfraDuck release. At the same time, customers were calling the support hotline about issues in the current release. One engineering team in particular was feeling the pressure. They were running behind and their big feature was at risk of missing the next major release.
“Something has to give,” the product manager grumbled at me. “At this rate, this feature is never going to ship.”
“Yes, something has to give,” agreed the support manager. “We have too many calls from customers and we don’t have good answers for them.” To his credit, he maintained a professional demeanor as he spoke. I wondered how much self-control that took.
“The next release of InfraDuck has a vastly improved architecture,” a senior engineer spoke up. “It is resilient to many more corner cases than the current version. All this time we are spending on support tickets is time we are not spending on the new architecture so we can eliminate these recurring problems for good.”
Of course, the InfraDuck team was far from the first to experience the tension between building the future and supporting the present. It is a common situation, and it is at the root of the frequently asked question: “When estimating, how do you account for interruptions from unscheduled work like support tickets or bugs?”
Now that the simulation engine can model an increasingly wide variety of situations, I wanted to experiment with a scenario involving competing priorities to see what I could learn.
Here’s the simulation scenario setup.
A team of 6 engineers pulls work from a common backlog. The product manager adds 10 new stories a week for a total of 130 stories, each requiring 1 to 5 days’ worth of work. The numbers work out such that, in theory, the team should be able to finish all the feature work in approximately 13 weeks (1 quarter). There’s also a support manager who may or may not be filing customer support tickets in the team backlog.
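The 13-week figure can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes story sizes average out to the midpoint of the 1 to 5 day range and that a week has 5 working days, neither of which is stated outright above.

```python
# Back-of-the-envelope capacity check for the scenario.
# Assumptions (not stated in the text): the average story is 3 days
# (midpoint of 1-5) and a work week has 5 days.
ENGINEERS = 6
STORIES = 130
MIN_DAYS, MAX_DAYS = 1, 5
WORKDAYS_PER_WEEK = 5

avg_story_days = (MIN_DAYS + MAX_DAYS) / 2             # 3.0 days per story
total_effort_days = STORIES * avg_story_days           # 390.0 person-days
capacity_per_week = ENGINEERS * WORKDAYS_PER_WEEK      # 30 person-days per week
weeks_needed = total_effort_days / capacity_per_week   # 13.0 weeks

print(f"Expected effort: {total_effort_days:.0f} person-days")
print(f"Team capacity:   {capacity_per_week} person-days per week")
print(f"Ideal schedule:  {weeks_needed:.1f} weeks")
```

Note that this leaves no slack at all: the feature work alone consumes the team's entire theoretical capacity for the quarter.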
I ran the scenario in three configurations:
- No support tickets: the team is blissfully uninterrupted as they work to deliver the release.
- Non-prioritized support tickets: the support manager files 1 support ticket per day, adding it to the end of the team backlog after all the other currently-scheduled feature work. Each ticket requires 2-4 hours of work.
- Escalated support tickets: the support manager files 1 ticket per day as before, but prioritizes the ticket at the top of the backlog.
In order to get a good sampling, I ran the simulation in each configuration fifty times and took averages. (For good measure, I did that twice.)
Before I tell you what happened, it’s worth noting that the simulation is an abstraction of the real world. It makes some simplifying assumptions:
- All work is discrete and independent. There are no circumstances in which two programmers could accidentally stomp on each other’s changes, or where the work for one story would cause problems in another. (This is obviously totally unrealistic.)
- Each team member works alone: no pairing, mobbing, or swarming. (Alas, this is realistic, but I'll save my opinions about teams that don't actually collaborate for another time.)
- The team members are all equally capable of doing any of the work. That means they can perfectly parallelize the work and no team member is a bottleneck.
- Even with escalated support tickets, team members work on one thing at a time, finish it, then pick up the next thing from the top of the backlog. They do not pick up and put down work in progress and there is no context recovery overhead.
- Unlike other scenarios I’ve created, this scenario does not include testers. That means the team does not have to contend with bug reports on top of everything else.
In short, this is very much a frictionless-universe scenario. In the real world, context switching, bugs, and coordination overhead would probably increase the time required to deliver the release. However, since that overhead would apply to both feature work and support tickets, I don’t think including those additional real-world considerations would yield better insights in this scenario.
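For readers who want to tinker, a scenario like this can be approximated in a few dozen lines of Python. To be clear, this is not the simulation engine described in this newsletter. It is a minimal stand-in built on assumptions of its own: story and ticket sizes are drawn uniformly from their ranges, time is counted in working hours only (8-hour days, 40-hour weeks), stories arrive on Monday mornings, and one ticket arrives each working morning. Every name in it (`simulate`, `average_runs`, the mode strings) is hypothetical, and the numbers it prints should not be expected to match the results reported here.

```python
import heapq
import random
from collections import deque
from statistics import mean

HOURS_PER_DAY = 8      # assumption: 8-hour working day
HOURS_PER_WEEK = 40    # assumption: 5-day working week
ENGINEERS = 6
WEEKS = 13


def simulate(mode, seed):
    """Run one release. mode: 'none', 'back' (tickets last) or 'front' (tickets first)."""
    rng = random.Random(seed)
    arrivals = []  # (arrival time in working hours, kind, duration in hours)
    for week in range(WEEKS):
        for _ in range(10):  # 10 stories every Monday, 1-5 days each
            arrivals.append((week * HOURS_PER_WEEK, "story",
                             rng.randint(1, 5) * HOURS_PER_DAY))
    if mode != "none":
        for day in range(WEEKS * 5):  # 1 ticket per working day, 2-4 hours each
            arrivals.append((day * HOURS_PER_DAY, "ticket", rng.uniform(2, 4)))
    arrivals.sort(key=lambda item: item[0])  # stable: stories before same-time tickets

    free_at = [0.0] * ENGINEERS  # min-heap of times at which each engineer is free
    heapq.heapify(free_at)
    backlog = deque()
    waits = {"story": [], "ticket": []}
    finish = 0.0
    i = 0
    while i < len(arrivals) or backlog:
        now = heapq.heappop(free_at)  # next engineer to become free
        # File everything that has arrived by now into the backlog.
        while i < len(arrivals) and arrivals[i][0] <= now:
            item = arrivals[i]
            i += 1
            if item[1] == "ticket" and mode == "front":
                backlog.appendleft(item)  # escalated: top of the backlog
            else:
                backlog.append(item)      # end of the backlog
        if not backlog:
            # Nothing to do: idle until the next arrival, then look again.
            heapq.heappush(free_at, arrivals[i][0])
            continue
        arrived, kind, duration = backlog.popleft()  # take the top item, finish it
        waits[kind].append(now - arrived)
        finish = max(finish, now + duration)
        heapq.heappush(free_at, now + duration)
    return {
        "weeks": finish / HOURS_PER_WEEK,
        "story_wait": mean(waits["story"]),
        "ticket_wait": mean(waits["ticket"]) if waits["ticket"] else 0.0,
    }


def average_runs(mode, runs=50):
    """Average the per-run metrics over several seeded runs."""
    results = [simulate(mode, seed) for seed in range(runs)]
    return {key: mean(r[key] for r in results) for key in results[0]}


for mode in ("none", "back", "front"):
    avg = average_runs(mode)
    print(f"{mode:>5}: finished in {avg['weeks']:.1f} weeks, "
          f"story wait {avg['story_wait']:.1f} h, "
          f"ticket wait {avg['ticket_wait']:.1f} h")
```

The only difference between the two ticket configurations is whether a ticket is added with `append` or `appendleft`. Everything else (one item at a time, no context switching, interchangeable engineers) follows the simplifying assumptions listed above.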
So, what happened?
The first thing I noticed was that adding support tickets did not change the release schedule as much as I had expected. Without any support tickets, the team finished all 130 stories in an average of 13 weeks and 1 day (very close to the theoretical schedule). Both support ticket configurations typically extended the schedule by only one to three days.
Although I was initially surprised, I realized that the numbers made sense. The support tickets added only about 200 hours of additional effort (65 tickets at 2-4 hours each), spread across 13 weeks, 6 programmers, and 65 small, perfectly parallelizable tickets.
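To put that load in perspective, here is a rough sketch of the same arithmetic. It assumes 8-hour days, 5-day weeks, and an average ticket of 3 hours (the midpoint of the 2-4 hour range); those figures are assumptions for illustration, not outputs of the simulation.

```python
# Rough size of the support load relative to the team's capacity.
# Assumptions: 8-hour days, 5-day weeks, average ticket = 3 hours.
ENGINEERS = 6
WEEKS = 13
DAYS_PER_WEEK = 5
HOURS_PER_DAY = 8
TICKETS_PER_DAY = 1
AVG_TICKET_HOURS = (2 + 4) / 2

tickets = TICKETS_PER_DAY * DAYS_PER_WEEK * WEEKS                 # 65 tickets
support_hours = tickets * AVG_TICKET_HOURS                        # 195.0 hours
team_hours = ENGINEERS * WEEKS * DAYS_PER_WEEK * HOURS_PER_DAY    # 3120 hours
share_of_capacity = support_hours / team_hours                    # 0.0625
hours_per_engineer_week = support_hours / (ENGINEERS * WEEKS)     # 2.5

print(f"Tickets over the quarter: {tickets}")
print(f"Support effort:           {support_hours:.0f} hours")
print(f"Share of team capacity:   {share_of_capacity:.2%}")
print(f"Per engineer, per week:   {hours_per_engineer_week:.1f} hours")
```

Under those assumptions, support work comes to a couple of hours per engineer per week, which is why it barely moves the release date when it can be spread evenly.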
This suggests that if you happen to be able to recreate the frictionless-universe conditions of the simulation (low context-switching costs, a relatively small amount of work from support tickets, and the ability to spread the load), there does not have to be an inherent contradiction between an engineering team delivering software and providing second-tier support.
However, where things became really interesting is in the latency between when something enters the team backlog and when someone starts working on it. In the context of the simulation, how would support tickets affect that latency?
First, a quick digression on Little’s Law.
Little’s Law expresses the relationship between the wait time in a queue, the arrival rate, and the number of items in the queue:
Queue Length = Arrival Rate * Wait Time in Queue
The genius of Little’s Law is that you do not need to know how long each item in the queue will take. It all works out in the averages. So if you want to know how long it will take (on average) from the time a product manager adds a story to the time a programmer picks it up, you just have to know how many stories the product manager typically adds in a given time interval and the average queue size.
Consider our simplest variation of the scenario where there are no support tickets. The product manager adds 10 stories a week. I measured the backlog length periodically and found the average backlog length was 3. Little’s Law tells us that a new story would typically wait approximately 3/10 of a week, or 1.5 working days (about 12 working hours), before a programmer picks it up. The simulation results roughly bear this out: the average wait time for a story in the simplest scenario was approximately 10 hours.
(With a larger sample size taken at more granular intervals, I suspect the data would conform much more closely to Little’s Law. The simulation has some lumpiness in the data because the product manager adds all 10 stories on Mondays. For now the numbers were close enough to suggest the metrics I gathered from the simulation were not wildly off base. I will leave further simulation experiments around flow and Little’s Law as an exercise for the future.)
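The Little’s Law estimate for the no-support-tickets case can be reproduced in a few lines. The helper name `average_wait` is made up for this illustration, and the conversion to hours assumes a 5-day week of 8-hour days.

```python
def average_wait(avg_queue_length, arrival_rate):
    """Little's Law rearranged: wait time = queue length / arrival rate.

    The result is in whatever time unit the arrival rate uses.
    """
    if arrival_rate <= 0:
        raise ValueError("arrival rate must be positive")
    return avg_queue_length / arrival_rate


# Figures from the scenario: average backlog of 3 stories, 10 stories added per week.
wait_weeks = average_wait(avg_queue_length=3, arrival_rate=10)
wait_days = wait_weeks * 5    # assumption: 5 working days per week
wait_hours = wait_days * 8    # assumption: 8 working hours per day

print(f"{wait_weeks:.1f} weeks = {wait_days:.1f} days = {wait_hours:.0f} hours")
```

That puts the predicted wait at about 12 working hours, against roughly 10 hours measured in the simulation.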
If you apply Little’s Law to a simple scenario, you don't need a simulation engine. You can just commit math.
But what happens when work comes from two sources, particularly when one source of work is prioritized over another? That’s much harder to reason about from first principles or simple calculations. And as it turns out, the average wait time for a story changed quite a bit between the simulation configurations.
When I ran the simulation with the support manager filing support tickets at the end of the team backlog, the average latency for features went from 10 to 17 hours. But the real problem was with the latency for support tickets: they waited on average about 22 hours before being picked up. At 8 working hours a day, that would be nearly 3 days before anyone even looked at the issue. Yikes!
Going back to my real world story for a moment, one of the key points of contention between the InfraDuck support and engineering teams was that tickets tended to languish. “We need you to give us a guaranteed turnaround time!” the support manager was emphatic. “When customers don’t get answers, they just keep calling us back. We have to tell them something, but we need input from engineering.” He shook his head.
I believe it’s better to set priorities than deadlines. So instead of defining a service level agreement (SLA) between engineering and support, I proposed that we prioritize support tickets higher. The InfraDuck support manager begrudgingly agreed. The product manager was skeptical but willing to go along. The team was the most resistant of all, but they eventually agreed that we could not afford to wait for the next major release of InfraDuck to solve all the open customer tickets.
The third simulation configuration implemented that prioritization approach: support tickets went to the top of the backlog. When I ran that simulation variation, the wait time for support tickets went down from nearly 3 days to just over 3 hours. Much better! Oh, but there was a tradeoff: the latency before a team member picked up a feature story went up to about 23 hours, or nearly 3 working days.
In short, the simulation replicated the tensions I witnessed working on InfraDuck all those years ago: you can optimize for turnaround time on support tickets or on features, but not both.
This outcome raises a question: should you set up a dedicated maintenance team to deal with the support requests?
Maybe. Maybe not.
In my real world story, InfraDuck was deeply, technically complex. That meant the people who built the product were in the best position to diagnose difficult customer issues. If we had tried to establish a separate maintenance team, we would have struggled to staff it with people who had enough expertise. So we optimized for improving current customer satisfaction and sacrificed the focus on next-generation feature delivery.
That strategy worked for us. Support (and thus our customers) were happier with engineering’s improved responsiveness. Engineering learned a lot about what could go wrong with InfraDuck out in the real world. We sped up the cadence of bug fix releases. And thanks to an improved partnership between engineering and support, the support engineers learned more about the internals of the system. Better yet, more frequent bug fix releases and support’s deepening expertise made it so that support needed less help from engineering over time. If we had established a dedicated maintenance team I doubt we would have seen that result.
Just because that’s how my story turned out doesn’t mean that approach will work for you. There are no absolutes, no right answers. Every context is different and every choice is a tradeoff. The trick is to find the right set of tradeoffs to optimize for the outcomes that matter most in your context.
Ultimately that's why I’m having so much fun building out this simulation: it allows me to run a series of what-if scenarios to explore those tradeoffs. Other scenarios I plan to explore in coming weeks involve issues with flaky tests, bottlenecks in CI, long (loooong) feedback cycles, technical debt, and having too much work in progress (WIP).
Do you have a scenario in mind that you’d like to see modeled? Please reach out! I’d love to hear from you. The more scenarios the better!