BACKGROUND
Our team has been using Kanban for a while now, with pretty good results I must say. As we are developing and maintaining a long-term project, we deal with both new features development and bug fixes. Bug fixes are divided into three classes of service : “regular” (they actually have no name), “urgent” and “panic”. Regular bugs flow normally and urgent ones must be pulled before regular ones : as their impact is more important on customers, we want them to be fixed before. Panic bugs are critical, they possibly mean crashing the whole system, a loss of customers’ data or so. Dealing with a panic bug fix can break the WIP limits and usually stop the whole “line”, triggering the swarming effect.
SWARMING
If you’ve never heard of swarming applied to software development teams before, you might want to read the following articles:
- Swarming: What’s the Point?, Dan Puckett (InfoQ)
- What is “Swarming”?, Tom Perry (Agile Tools)
- Should a Team Swarm on to One Backlog Item at a Time?, Mike Cohn (Succeeding with Agile)
- The Kanban Story: Swarming, Pawel Brodzinski (Software Project Management)
- Last year a controversial article by Jim Coplien praised the one-piece flow as an alternative to Kanban, and presented it as the good way to do Scrum. In short, if we put aside the you-are-a-bunch-of-idiots-and-I-am-the-only-one-who-knows-what-lean-is part, the article pretends that the team should focus on only one piece, one backlog item, at a time, and swarm around it. This article is definitely worth reading too.
In short swarming occurs when quite everybody on a team focuses on a specific backlog item.
A SIMPLE OBSERVATION
In our team, a “panic” item is usually analysed-developed-validated-released within one or two working days, whereas a regular bug fix takes an average of three to nine working days to be deployed. So one might come to the conclusion that we should swarm more items than just panic bugs and thus improve the average cycle time…but this is not that simple. We must study the reasons why swarming is efficient on panic items to know if we could swarm around other, lower priority, items.
NORMAL FLOW VS SWARMING
When a bug is reported to our team, we first try to evaluate the possible consequences of it. This often implies that we need to find the causes of the bug. When the consequences are critical enough, we name the issue a “panic” bug, otherwise the bug goes to the backlog and a fix will be developed for it later. This part is always kind of swarmed as everybody is interested in finding what is wrong with the product so, when a bug is found to be a “panic” bug, the environment is already set up in swarming mode and the team easily self-organizes to develop a fix. In fact analyse and development are often one single stage if the fix is trivial, like “The problem comes from here : this condition is wrong as it does not handle this case (that we never thought about before…). We just need to replace it by blah blah”, and the “dev” part is nearly done.
For the “QA” part this is slightly different. Indeed the validation on its own is the same – and must be the same – as for any development. It takes the same amount of time as for a regular bug fix. The difference here is that in swarming mode, as the QA guys are involved from the beginning, they can do some parallel work to speed up validation, like preparing a specific test configuration. It is also possible to validate the fix step-by-step, part-by-part if the implementation is not trivial.
Once the fix has been validated, we immediately release, following the normal procedure. Still we can prepare some steps of the release before the end when in swarming mode
BENEFITS
The first benefit we can point out is that we avoid any kind of buffer. When a fix is developed – when it is considered as “done” regarding the development stage – it sometimes has to stand a little while before someone from the QA validates it. Similarly when a fix has been validated, it might have to wait before we release it. Even if the flow is usually pretty smooth the cycle time will necessarily be shorter in swarming mode.
The other big difference between the normal flow and the panic flow is parallel work. When in the normal flow there is no parallel work at all. First we analyse the problem, then we develop and test a fix, then we validate it at a higher level, and then we release it. When in swarming mode some of the tasks are paralleled. We can see that as a sort of read-ahead, or maybe a read-above : “as we know what you are developing we can set up the right testing configuration”, “as we know that there is an emergency release, let’s start the release process right now”. Of course daily stand-up meetings also provide a kind of read-above, but not as much as swarming
THE EMERGENCY EFFECT
I think that most of the speed gain also comes from what I call the emergency effect. This is some kind of distributed adrenaline rush that explodes when the team is dealing with a “panic” bug. Actually I think that “panic” is not the right term, as nobody is panicking, nobody is scared as we know this can happen, the team is prepared for it. Let’s say we are…excited. That’s why “panic” is definitely not the right word : the team as a whole is just excited about a challenging bug to fix.
Anyway the emergency effect that accompany a panic issue contributes to the success of swarming, as it catalyses self-organization. But what if we change our policies to swarm around every kind of bug? The emergency effect will vanish and we will lose its extra piece of motivation.
DRAWBACKS
The main drawback I see here is throughput. Indeed, by swarming around a single item at a time you miss the opportunity to pipelining. So unless avoiding buffers and doing parallel work can give you enough speed, you won’t have the same throughput as with the normal flow.
In this article, Mike Cohn also point out that swarming “introduces too many opportunities to be in someone else’s way as they try to make progress”. There might be some big team management problems coming from swarming too often.
CONCLUSION
Swarming is efficient. It really is. It contributes to drastically decrease the cycle time. But swarming should not be enforced too often as it relies on a natural and spontaneous cohesion of the team. You know a swarm is not a flock of sheep. I think that one cannot force a team to gather and perfectly work together around any kind of PBI. There must be some kind of emergent behavior to make swarming really efficient. That’s why a critical bug is a good candidate to swarming : the whole bee hive is in danger and must be protected. But when the team does not naturally swarm, then it must be because there is no need to swarm.
One thought on “To swarm or not to swarm : return on experience”