Day 228
Week 33 Day 4: How to Diagnose Recurring Fires
The first step in building systems is diagnosing which fires are recurring. Not all problems need systems. Only the ones that keep coming back.
A one-time problem deserves a one-time fix. A recurring problem deserves a system. The skill is distinguishing between the two. If the same type of problem has occurred three or more times, it is recurring and needs a systemic solution. If a problem has occurred once, fix it and move on. Investing in systems for one-time problems is over-engineering. Ignoring systems for recurring problems is negligence.
Here is the diagnostic framework for recurring fires.
Step one: create a fire log. For the next 30 days, log every unplanned interruption, incident, escalation, and emergency. Record the date, the nature of the problem, the impact (how many people were affected and for how long), and the resolution. Do not filter -- log everything.
Step two: categorize the fires. After 30 days, group similar fires together. You will likely find 5-8 categories. Common categories in engineering teams: deployment failures, customer escalations about the same feature, cross-team dependency delays, environment configuration issues, access/permission requests, data inconsistencies, and monitoring false alarms.
Step three: count the frequency and total impact of each category. Multiply the number of occurrences by the average impact (person-hours lost per occurrence) to get the total person-hours lost to that category. Sorting the categories by that total produces a priority ranking.
Step four: identify the top three categories by total impact. These are your systemic fire sources. Everything else is noise that does not warrant system investment.
Step five: for each of the top three, run a '5 Whys' analysis. Ask 'why did this happen?' five times, each time drilling deeper into the root cause. The surface cause is always obvious (the deployment failed). The root cause is usually structural (the deployment process has no automated validation step, so human error goes undetected until production).
I ran this diagnostic on my team and found that 73% of our firefighting hours were caused by three root causes: no automated deployment validation (causing deployment incidents), no customer-facing error documentation (causing support escalations), and no cross-team dependency tracking (causing blocking surprises). Three systems -- a deployment pipeline check, a customer help center, and a dependency board -- eliminated 73% of our firefighting load within three months.
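Steps one through four are mostly bookkeeping and arithmetic, so they can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed tool: the `Fire` record, its field names, the `rank_categories` helper, and the sample entries are all hypothetical, made up for illustration. It assumes impact is measured as people affected times hours each person lost, and it borrows the three-occurrence threshold from earlier in the lesson to drop one-off problems before ranking.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Fire:
    """One fire-log entry (step one). Field names are illustrative assumptions."""
    day: date
    category: str           # assigned when grouping similar fires (step two)
    people_affected: int
    hours_each: float       # how long each affected person was pulled away
    resolution: str = ""

    @property
    def person_hours(self) -> float:
        return self.people_affected * self.hours_each


def rank_categories(log, min_occurrences=3):
    """Steps three and four: total person-hours per category, largest first.

    Categories seen fewer than `min_occurrences` times are treated as
    one-time problems and left out of the ranking.
    """
    by_category = defaultdict(list)
    for fire in log:
        by_category[fire.category].append(fire)

    ranking = []
    for category, fires in by_category.items():
        occurrences = len(fires)
        if occurrences < min_occurrences:
            continue
        total = sum(f.person_hours for f in fires)
        ranking.append({
            "category": category,
            "occurrences": occurrences,
            "avg_person_hours": total / occurrences,
            # occurrences * average impact, i.e. the step-three product
            "total_person_hours": total,
        })
    ranking.sort(key=lambda row: row["total_person_hours"], reverse=True)
    return ranking


# Hypothetical sample log, invented purely to exercise the functions.
log = [
    Fire(date(2024, 1, 3), "deployment failure", 4, 2.0),
    Fire(date(2024, 1, 10), "deployment failure", 3, 2.0),
    Fire(date(2024, 1, 17), "deployment failure", 5, 2.0),
    Fire(date(2024, 1, 5), "access request", 1, 0.5),
    Fire(date(2024, 1, 8), "access request", 1, 0.5),
    Fire(date(2024, 1, 12), "access request", 1, 0.5),
    Fire(date(2024, 1, 20), "access request", 1, 0.5),
    Fire(date(2024, 1, 9), "data inconsistency", 2, 3.0),  # happened once
]

ranking = rank_categories(log)
top_three = ranking[:3]  # step four: the categories that get a '5 Whys'
for row in top_three:
    print(row["category"], row["occurrences"], row["total_person_hours"])
```

A spreadsheet with the same columns does the job equally well. The point is that the ranking comes from logged totals rather than from memory of the most dramatic incident.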
The fire diagnostic framework implements what quality management calls the 'Plan-Do-Study-Act' (PDSA) cycle (Deming, 1993) applied to organizational failure analysis. The fire log is the 'Study' phase -- systematic data collection to replace anecdotal impressions with empirical evidence. Research by Reason (1997) on organizational accidents demonstrates that managers' perceptions of their top problems are poorly correlated (r = 0.25) with actual data on problem frequency and impact, because perception is biased toward dramatic/memorable incidents rather than frequent/impactful ones. The categorization and Pareto analysis in steps two through four implement what Juran (1951) called 'separating the vital few from the trivial many' -- the principle that quality improvement should focus on the small number of root causes that produce the majority of failures. Research by Bicheno and Holweg (2009) on lean manufacturing found that organizations that implemented systematic fire logging and Pareto analysis reduced unplanned work by 35-50% within six months. The '5 Whys' technique was popularized by Ohno (1988) at Toyota as the foundation of the Toyota Production System's continuous improvement methodology. Research on the effectiveness of the 5 Whys (Murugaiah, Benjamin, Marathamuthu, and Muthaiyah, 2010) found that the technique identified root causes that surface-level analysis missed in 80% of cases, and that root cause fixes reduced problem recurrence by 70%, compared with only 20% for surface-level fixes.