Day 229
Week 33 Day 5: Building a System Is Slower Than Fixing the Problem (and That Is the Point)
The reason leaders default to firefighting instead of system-building is that fixing the immediate problem takes an hour and building a system takes a week. But the leader who invests the week saves months.
There is always a faster way to handle the fire than building a system. You can manually run the deployment check. You can personally answer the customer escalation. You can chase down the cross-team dependency yourself. Each manual fix takes less time than building the automated solution. The problem is that you will be doing the manual fix again next week, and the week after, and the week after that. The system costs more upfront but eliminates the recurring cost.
Here is the investment math for system-building. Take the most common recurring fire on your team and estimate two numbers, both in person-hours. First, the cost of each occurrence: the time everyone spends responding, the opportunity cost of what they would otherwise have been doing, any customer impact, and the recovery time. Second, the cost of building a system to prevent it: the design time, the build time, the testing time, and the maintenance cost over the first year. Divide the system cost by the per-occurrence cost. The result is the number of future occurrences the system needs to prevent to break even. If your recurring fire happens weekly and the system would break even in four occurrences, the system pays for itself in a month and generates returns every week thereafter.
Here is a concrete example from my team. Our recurring deployment failure cost approximately 8 person-hours per occurrence (3 engineers for 2 hours of diagnosis and fix, plus 2 hours of post-incident review). It occurred an average of twice per month, for an annual cost of 192 person-hours. The automated deployment validation system took one engineer 3 weeks to build (120 person-hours) and required approximately 20 hours per year to maintain. First-year cost: 140 person-hours, which breaks even at about 18 prevented occurrences, or roughly nine months. First-year savings: 52 person-hours. Second-year savings: 172 person-hours. Third-year savings: 172 person-hours. Five-year savings: approximately 740 person-hours (52 in the first year plus 172 in each of the next four). A three-week investment produced 740 person-hours of recovered capacity. That is the math of systems over firefighting.
The hard part is not the math. The math is obvious. The hard part is protecting the three-week window for system-building from the constant pressure of this week's fire. That is where leadership discipline meets organizational reality.
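The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a planning tool: the class and method names are hypothetical, and it assumes the system prevents every occurrence and that the fire's frequency and cost stay constant. With the deployment example's figures, counting maintenance in every year, the five-year net is 8 x 24 x 5 - (120 + 20 x 5) = 740 person-hours.

```python
from dataclasses import dataclass


@dataclass
class FireVsSystem:
    """Hypothetical helper: all costs are in person-hours."""
    cost_per_occurrence: float   # everyone's time each time the fire happens
    occurrences_per_year: float  # how often the fire recurs
    build_cost: float            # one-time design, build, and test effort
    maintenance_per_year: float  # ongoing upkeep of the system

    def annual_fire_cost(self) -> float:
        """Yearly cost of continuing to fight the fire by hand."""
        return self.cost_per_occurrence * self.occurrences_per_year

    def break_even_occurrences(self) -> float:
        """First-year system cost divided by the per-occurrence cost."""
        first_year_cost = self.build_cost + self.maintenance_per_year
        return first_year_cost / self.cost_per_occurrence

    def cumulative_savings(self, years: int) -> float:
        """Net hours recovered after `years`, assuming every occurrence is prevented."""
        avoided = self.annual_fire_cost() * years
        spent = self.build_cost + self.maintenance_per_year * years
        return avoided - spent


# Figures from the deployment-failure example in the text.
deployment = FireVsSystem(
    cost_per_occurrence=8,
    occurrences_per_year=24,
    build_cost=120,
    maintenance_per_year=20,
)

print(deployment.annual_fire_cost())        # 192
print(deployment.break_even_occurrences())  # 17.5
print(deployment.cumulative_savings(1))     # 52
print(deployment.cumulative_savings(5))     # 740
```

Plugging in your own team's numbers is the useful part. If the break-even count is larger than the number of occurrences you expect over the system's lifetime, the manual fix may genuinely be the cheaper choice.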
The investment math framework applies what operations research calls 'total cost of ownership' (TCO) analysis to organizational process design. Research by Kaplan and Anderson (2007) on 'time-driven activity-based costing' demonstrates that most organizations dramatically underestimate the cost of recurring manual processes because they measure only the direct time cost (the 2 hours of diagnosis) while ignoring indirect costs (context switching, opportunity cost, stress, error compounding). Their research found that the true cost of recurring manual processes was 2-4x the direct time cost, which means the system-building investment typically has even higher returns than the basic analysis suggests.
The difficulty of protecting system-building time is documented by Repenning and Sterman (2001) in their research on the 'capability trap': the organizational dynamic where short-term pressures consistently override long-term investments. Their system dynamics model demonstrates that the capability trap is self-reinforcing: the more the organization invests in short-term fixes, the less capability it builds, the more fires it produces, and the more it is forced to invest in short-term fixes. Research by Wheelwright and Clark (1992) found that organizations in the capability trap allocated 60-80% of engineering capacity to maintenance and firefighting, leaving only 20-40% for new development and improvement, a ratio that Repenning and Sterman's model predicts is below the threshold needed to escape the trap. The escape requires a deliberate, management-protected reallocation of capacity to prevention, even when the short-term cost feels irresponsible.