Thanks David, I will look up CONWIP - it sounds like an M/M/1 queue. In our case, though, I think it was an M/M/15 multi-server queue - rather like an airline check-in with 15 serving terminals but all passengers lined up in a single queue.
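To make the single-queue, many-servers comparison concrete, here is a minimal sketch using the standard Erlang C formula for an M/M/c queue. The arrival and service rates are made-up illustrative numbers, not measurements from the team described in this thread, and real escalations are unlikely to have exponentially distributed service times, so treat it as a rough intuition aid only.

```python
from math import factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving job has to wait in an M/M/c queue.

    offered_load is arrival_rate / service_rate and must be below `servers`.
    """
    if offered_load >= servers:
        raise ValueError("unstable queue: offered load must be below the server count")
    rho = offered_load / servers
    tail = offered_load ** servers / factorial(servers) / (1.0 - rho)
    head = sum(offered_load ** k / factorial(k) for k in range(servers))
    return tail / (head + tail)

def mean_wait(servers: int, arrival_rate: float, service_rate: float) -> float:
    """Mean time a job waits in the queue (excluding service) for M/M/c."""
    offered_load = arrival_rate / service_rate
    return erlang_c(servers, offered_load) / (servers * service_rate - arrival_rate)

# Assumed numbers: 12 escalations arrive per week, each engineer closes 1 per week.
pooled = mean_wait(15, 12.0, 1.0)       # one shared queue feeding 15 engineers
separate = mean_wait(1, 12.0 / 15, 1.0) # 15 private queues, each getting 1/15 of the arrivals
print(f"shared queue wait: {pooled:.2f} weeks, private queues wait: {separate:.2f} weeks")
```

With the same total load, the shared queue gives a much shorter average wait than fifteen private queues, which is the airline check-in effect described above.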
It worked extremely well for us. Tech Support was able to get a definitive date for shipping the fix to customers, which was far better than an unpredictable date, even an earlier one. Team members were not pulled in every direction by rapidly changing priorities, which kept them motivated (we had the least attrition in this team, which is very unusual considering that sustaining engineering is not always the first choice of engineers). The Program Manager was actually the most critical role in effective gatekeeping - he was like the usher at this multi-server queue, tracking every service completion and directing the next assignment to that server, and not allowing anyone to break the queue unless absolutely critical (and also doing other things like varying the commitments based on the actual availability of people, e.g. when some people were on vacation).
We continuously improved the process. For example, we found that Dev teams could really pick up another task the moment QA sign-off happened (i.e., before Tech Support had verified the fix). This was based on the very high success rates of QA-signed-off fixes at the Tech Support verification stage. This allowed us to shorten the cycle by reducing the number of steps an engineer had to wait for. After some time, we found that most bug fixes were not failing even at the QA sign-off stage - so we shortened the process again: if an experienced developer said QA was not required and the QA engineer also agreed, we would skip the QA cycle and send the fix directly to Tech Support. This allowed us to work on more than one escalation at any time (e.g. a likely scenario might be: Tech Support is looking into escalation 'n', QA is validating escalation 'n+1' while the Dev team is working on escalation 'n+2') - I guess that is what you meant by building WIP limits at each step. We did not formalize any WIP limits because the Dev, QA and Tech Support effort on any escalation was unpredictable - rather, the process was driven by whether a step was needed or not, as opposed to whether there was WIP capacity available at the next step. Since we had achieved a steady state, this was quite efficient and stable, and it was never felt that we had introduced a major risk into the process by cutting corners.
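The two shortcuts described above (releasing the developer at QA sign-off, and skipping QA when both developer and QA engineer agree) can be sketched as a small routing rule. All names here are hypothetical, and the rule that a developer is released right after the Dev stage when QA is skipped is an assumption, since the mail does not say so explicitly.

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    ident: str
    dev_waives_qa: bool = False  # experienced developer says QA is not required
    qa_agrees: bool = False      # QA engineer agrees that QA can be skipped

def stages_for(esc: Escalation) -> list:
    """Return the stages this escalation passes through, in order."""
    if esc.dev_waives_qa and esc.qa_agrees:
        return ["dev", "tech_support"]
    return ["dev", "qa", "tech_support"]

def developer_released_after(esc: Escalation) -> str:
    """Stage after which the developer may pull the next escalation.

    Normally that is QA sign-off; when QA is skipped we assume it is
    the end of the Dev stage (an assumption, not stated in the mail).
    """
    return "qa" if "qa" in stages_for(esc) else "dev"

routine = Escalation("E-100")
trivial = Escalation("E-101", dev_waives_qa=True, qa_agrees=True)
print(stages_for(routine), developer_released_after(routine))
print(stages_for(trivial), developer_released_after(trivial))
```

Note that both people have to agree before QA is skipped; a waiver from the developer alone leaves the full three-stage path in place.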
Once the backlog came down to single digits, there was less need to maintain a large team. We gradually reduced the number of people on that team (because even with a larger team, we could not speed up the bug fix turnaround time beyond what we were already achieving) while maintaining the same average turnaround times for our customers. Currently, we have all but disbanded the team - individual product teams are responsible for keeping a virtual buffer in the mainstream Dev and QA teams to address customer escalations as and when they arrive. Since the incoming rate is very low, that is working for us.
On Mon, Jun 22, 2009 at 5:37 PM, David Peterson <david@...> wrote:
Thanks for summarising.
To me, this sounds like CONWIP (CONstant Work In Process) with a 15 work item limit across the entire system. With Kanban you would expect there to be WIP limits at each stage in the process (e.g. QA would have its own limit).
How did it work out?
Was the program manager a bottleneck?
Was the process tuned over time? If so, how?
David
2009/6/22 Tathagat Varma <tathagat.varma@...>
Sure David.
We implemented a system of working on customer escalations based on what I think might be a pull system. I would like to know from other practitioners if my understanding is correct.
We had a pool of engineers allocated full-time to handling customer escalations on multiple products of the company, but they were assigned one escalation at a time by a program manager. When the engineer was done (meaning the development was done, code was checked in, the integration team had done the build, QA had tested the fix and Tech Support had certified it for customer release), he/she would be assigned a new escalation. As soon as an escalation was 'done', it was released to the customer. In this team of 15 engineers, we had some 15 'flows' at any time, and customer escalations were handled asynchronously, non-timeboxed and in an iterationless manner. When a recently freed-up engineer was asked to work on an escalation that he was not an expert on, he could choose from other equally critical escalations. This allowed us to ensure there was only a single escalation per engineer in the system at any point in time, and that all engineers were always allocated work based on their expertise. Unless the existing escalation was completed (meaning delivered to the customer), no new escalation was even assigned for analysis. To handle business criticalities, there was flexibility to stall a work in progress and take up the most urgent issue, with everyone consciously aware of the delay to the escalation currently being worked on.
For more details, please see my original mail.
On Mon, Jun 22, 2009 at 3:48 PM, David Peterson <david@...> wrote:
Hi Tathagat, can you summarise this a bit? It's rather long to read. I think we'll get a better discussion going if you can summarise your key points / questions down to a couple of paragraphs.
David
2009/6/22 Tathagat Varma <tathagat.varma@...>
I have been on this list to learn about the application of kanban in software, and have a few questions for people who have used kanban. About 5 years back, we were sitting on a huge pile of customer escalations. We had serious product quality problems that were exacerbated by a high field incoming rate (higher than the bug fix rate). There were other organizational issues (like no dedicated team to handle customer escalations, etc.) that were seriously hampering our ability to work down the backlog.
Here is what we did to address the problem:
- Created a new team, carved out of the Dev team, just to handle customer escalations (one might disagree with that move, but the key motivation was to ensure 100% availability of engineers on the task as opposed to a common buffer, or some other model that had not worked in the past)
- A program manager was identified to run the customer escalation program (along with other dot releases, so he clearly knew the business priorities on given features) and identify release vehicles (patches, MRs, SPs depending on the business criticality of an escalation). He was based out of San Jose, and was located close to product management and tech support, who were other key stakeholders in the customer escalation process
- My team was responsible for sustaining engineering and based out of Bangalore. It included 15 Dev and some 3-4 QA engineers, and a sustaining team manager responsible for overall coordination of the sustaining engineering efforts.
- The Program Manager would have a list of customer escalations based on severity (Sev 1, Sev 2, Sev 3) and business criticality that he would prioritize in consultation with the Product Managers, Tech Support and senior management in Engineering, and Sales as required.
- In the initial years, we started doing Service Packs (SPs) to clear the huge backlog. (Meanwhile, the QA dept was augmented, which helped us improve software quality steadily.) We were also doing small Maintenance Releases (MRs) but the most interesting discussion is perhaps on patches.
- The sustaining team manager was involved in coordinating team activities along with the program manager. The team was composed of engineers with varying levels of competency and specializations, but essentially in a flat hierarchy. There was some element of 'replaceability' of engineers in the context of product knowledge and generic skills, but expert knowledge was certainly rarer.
- Doing SPs and MRs required taking up a lot of customer escalations and running them like another dot release. The configuration management model was based on engineers checking in their code and the builds being done a little later in the project cycle to give drops to the QA team. Sometime ~3 years back, we improved the model to do continuous releases: we could take one check-in, do a complete build, test whether that fix worked on the target version and release it. This made getting back to the lead customer on that escalation extremely fast as he did not have to wait for all the other escalations in the next MR or SP.
- Here was the process we adopted for patches: the program manager would share the current open list with the sustaining team manager, who would discuss with his team who was the next person getting free (from the current task). He would ask that engineer to analyze the problem, identify hardware requirements to reproduce the problem and validate it, fix it and then get the QA turnaround time. These dates would be given back to the program manager, who would inform tech support. If tech support wanted it earlier, they could discuss with the program manager how to speed it up. If the team could speed it up by themselves, that would be done. If not, either the request would fall back to the original commitment from engineering, or Tech Support was free to escalate to upper management to reprioritize and reschedule the engineering effort. It was a highly iterative negotiation process, especially for real hot fixes. In case expert knowledge was not available to analyze an escalation immediately, it would get a quick assessment by one of the senior engineers.
- Eventually, the escalation would get assigned to one of the engineers, who would complete the task and check in, and the integration team would make a formal release for QA. Once QA certified it, it would go to Tech Support and then to the customer. If it failed the tests at any stage, it would come back to the Dev team. The different levels of testing have different levels of rigor and closeness to the customer's actual network: QA can't simulate all customer scenarios and will try its level best to simulate the customer's network traffic conditions, but Tech Support would have the most rigorous setup, resembling the customer's environment as much as possible. (We tracked metrics on rejection rates to assess the health of each stage gate.)
- All this was tracked and reported by the Program Manager on a weekly basis to all engineering teams and right up to the CEO.
- In essence, we were able to create a process where there was one task per engineer at any given time, and the next task was assigned (and analyzed) only when the first one was completed. The Program Manager was the gatekeeper of the process who ensured no backdoor entries were allowed directly to engineering teams. Invariably, there would be some backdoor entries: things like deal-blocking issues that had to be fixed in order to bid for a new customer. We worked on institutionalizing the process through quick-and-dirty (more quick than dirty, actually) analysis, and if there was a mandate from upper management, the issue would go to the top of the stack. Very rarely would we have to stop ongoing work to accommodate those so-called 'customer specials'. Overall, there was no timeboxing because every customer escalation is different and the amount of time, effort and resources it needs could turn out to be very different. Some escalations were too big to be completed by just one engineer, but in terms of scheduling, that meant assigning the work to a group of engineers and tracking them as if they were one single 'engineer', just like any other working on a single problem.
- Today, that team doesn't exist. What started out as a dedicated 15-member Dev team is now down to 3-4 part-time Dev engineers because that backlog doesn't exist anymore and, thanks to significant improvements in product quality on subsequent dot releases, our field incoming rate is down to a handful. We still use essentially the same process, but it continues to be fine-tuned based on current needs.
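The core pull rule in the list above (one escalation per engineer, the next one assigned only when the current one is finished, with a free engineer allowed to take an equally critical escalation that matches their expertise) can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the tooling the team used: the class names, product names and the idea of keeping the backlog as a list ordered by the program manager are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Escalation:
    ident: str
    severity: int  # 1 = Sev 1 (most severe), 3 = Sev 3
    product: str

@dataclass
class Engineer:
    name: str
    products: set                         # products this engineer is an expert on
    current: Optional[Escalation] = None  # at most one escalation in progress

def pull_next(engineer: Engineer, backlog: list) -> Optional[Escalation]:
    """Assign the next escalation to a free engineer; return None if the backlog is empty."""
    if engineer.current is not None:
        raise RuntimeError(f"{engineer.name} already has an escalation in progress")
    if not backlog:
        return None
    top = min(e.severity for e in backlog)
    candidates = [e for e in backlog if e.severity == top]
    # Prefer an equally critical escalation within the engineer's expertise;
    # otherwise fall back to the first one in the program manager's order.
    chosen = next((e for e in candidates if e.product in engineer.products), candidates[0])
    backlog.remove(chosen)
    engineer.current = chosen
    return chosen

def complete(engineer: Engineer) -> Optional[Escalation]:
    """Mark the engineer's escalation as delivered, freeing them to pull again."""
    done, engineer.current = engineer.current, None
    return done

# Hypothetical backlog, already ordered by the program manager.
backlog = [
    Escalation("E1", 1, "router"),
    Escalation("E2", 1, "switch"),
    Escalation("E3", 2, "switch"),
]
asha = Engineer("asha", {"switch"})
first = pull_next(asha, backlog)  # E2: same severity as E1, but within her expertise
print(first.ident)
```

A 'customer special' that preempts ongoing work is deliberately left out of the sketch; modelling it would mean letting the gatekeeper clear an engineer's `current` slot and push the stalled escalation back onto the backlog.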
From my limited understanding of the application of kanban in software development so far, this was a kanban-based system for fixing customer escalations on software products. Of course, we did not know anything about kanban at the time; what we did was driven purely by some serious performance issues. Do you agree that this was (is) a good adaptation of how a kanban-based system could (or should) work in software development, or at least in the context of customer escalations?