One of the hardest parts of building a business is understanding how to balance the trade-offs between up-front cost (Capital Expenditure, aka CapEx in finance terms) and ongoing cost (Operational Expenditure, aka OpEx in finance terms) when trying to add something to the business. Often, getting this right is the difference between improving the business in some material way and drowning in work while your customers suffer.
Over the last few years, first at Daybreak Health and now at The Commons, I’ve come to realize that you can make better decisions if you can predict When it will break and How it will break. The exact implementation of the “it” doesn’t really seem to matter.
The following is my current model for predicting When and How it will break.
When Will It Break?
The when seems to follow a simple rule → Orders of Magnitude.
A system that works at 1 won’t work at 10, 10 won’t work at 100, 100 won’t work at 1000, and so on.
To put this differently, a system that delivers a high-quality service at a given throughput will not deliver the same service at the same quality at an order of magnitude higher throughput for the same fully loaded cost. It “won’t work”.
Fully Loaded Cost doesn’t just mean CapEx or OpEx in dollars; it’s the combination of dollars, time and energy needed to deliver the service across both CapEx and OpEx.
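As a rough sketch of the rule, the expected break point is simply ten times the throughput the system was built for. The function names and the "runway" idea below are illustrative assumptions, not part of the model itself.

```python
def expected_break_point(designed_for: float) -> float:
    """Throughput at which a system built for `designed_for` is expected to stop working."""
    return designed_for * 10


def runway_remaining(designed_for: float, current: float) -> float:
    """Throughput growth left before the expected break point (never negative)."""
    return max(expected_break_point(designed_for) - current, 0)


# Hypothetical example: onboarding built for 10 customers a month, now handling 40.
print(expected_break_point(10))   # 100
print(runway_remaining(10, 40))   # 60
```

The useful number is usually the runway: how much growth is left before the next capital investment has to be ready.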
How Will It Break?
Ok, so we know roughly when it’ll break, but knowing when something might break isn’t enough information to make a decision. We need to know how it’ll break so we can respond appropriately.
Complex Systems are more like steel and less like glass. Glass is perfectly fine under load until all of a sudden it explodes in a shower of shards. Steel slowly yields, deforming, until it snaps in half. Both break, but if you watch steel carefully you’ll notice it deforming before it snaps.
Systems under load experience deformation in these areas:
Quality - The quality of the service provided by the system
Operational Cost - The cost in time, energy and dollars to provide the service
Time - The time, per instance, required to offer the service
Staff Emotional Energy - The amount of emotional energy required by the operators of the service to provide an instance of the service.
Dollars - The volume of dollars, per instance, required to offer the service
If load isn’t removed (throughput lowered) then the system might snap. There are two common ways systems snap:
A failure cascade is created. This occurs when a failure causes the capacity of the system to be exceeded, thus causing more strain, thus causing more failures, and on and on.
e.g. A customer support team at an e-commerce site at capacity during Black Friday / Cyber Monday has a critical employee quit. The capacity is now lower and quality drops which leads to employees feeling they aren’t doing their job. Lack of satisfaction in their work results in another employee quitting, further reducing capacity.
Demand exceeds the capacity of the infrastructure, requiring new infrastructure.
e.g. A startup crosses the magical “Product Market Fit” line and is inundated with new customers. The tooling and systems that onboard new customers were originally built for a handful of customers (<10) per month and the sales team is now trying to onboard 100 or more a month. No amount of all-hands-on-deck effort or new hires will make it possible to achieve the previous quality of service until new infrastructure is built.
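The first of the two failure modes above, the failure cascade, can be sketched as a small simulation. This is a toy model under stated assumptions: every person has the same capacity, demand stays fixed, and one person quits each step whenever the load per person exceeds a hypothetical `quit_threshold`. Real teams are messier, but the feedback loop is the same.

```python
def simulate_cascade(staff, demand, capacity_per_person, quit_threshold=1.1, max_steps=50):
    """Return the staff count after each step until the team stabilizes or is empty."""
    history = [staff]
    while staff > 0 and len(history) <= max_steps:
        # Load is demand relative to total capacity; 1.0 means exactly at capacity.
        load = demand / (staff * capacity_per_person)
        if load <= quit_threshold:
            break  # strain is tolerable, so the system holds
        staff -= 1  # assumption: sustained overload costs one person per step
        history.append(staff)
    return history


# A team of 10 exactly at capacity holds steady.
print(simulate_cascade(10, demand=100, capacity_per_person=10))
# After one critical departure, the same demand tips the team into a cascade.
print(simulate_cascade(9, demand=100, capacity_per_person=10))
```

In this sketch the only ways out of the cascade are lowering `demand` or raising `capacity_per_person`, which matches the choice between removing load and investing in infrastructure.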
Relationships between Quality, Time, Dollars & Staff Emotional Energy
What is Staff Emotional Energy?
The concept of Emotional Energy is something I stole from Ruby K. Payne’s A Framework for Understanding Poverty. In the book, “Emotional Energy” is a battery, and that battery is charged and discharged throughout the day at different rates by different tasks. Thus Staff Emotional Energy is a subcycle within the broader cycle we’re discussing.
Dollars (Labor Cost) by Available Work Hours
More Available Work Hours can be purchased for Dollars (labor cost). It’s generally a sawtooth relationship, though, where one spends more money to buy hours in a batch.
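One way to read the batch-buying relationship is that hours arrive one hire at a time: total cost climbs in steps, and the unused hours trace the sawtooth. The sketch below assumes a hypothetical 160 working hours per hire per month at a hypothetical monthly cost of 8000; both numbers are placeholders.

```python
import math


def labor_cost(hours_needed, hours_per_hire=160, cost_per_hire=8000):
    """Return (hires, total_cost, idle_hours) when hours can only be bought per hire."""
    hires = math.ceil(hours_needed / hours_per_hire)
    total_cost = hires * cost_per_hire
    idle_hours = hires * hours_per_hire - hours_needed  # purchased but unused
    return hires, total_cost, idle_hours


for hours in (160, 161, 320):
    print(hours, labor_cost(hours))
```

Needing a single hour beyond a batch doubles the cost here, which is why the moment just after a hire is when the system has the most slack.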
Available Staff Emotional Energy by Quality
Quality and Staff Emotional Energy can be traded off to a point. This relationship is generally fraught since Emotional Energy is one of the few things that cannot generally be purchased; it must be inspired. This is often what people refer to when they say “Discretionary Effort”.
Demanded Work Pace by Available Staff Emotional Energy
One can drain Staff Emotional Energy at a lower rate with a lower Demanded Pace of Work or drain it faster with a higher Demanded Pace of Work.
Non-Labor Cost by Quality
As Quality decreases the Non-Labor Cost per transaction will trend up due to mistakes driving customer attrition or additional work. As Quality increases, to a point, Non-Labor Costs per transaction will go down.
Time / Transaction by Quality
More time can be spent per-transaction to increase Quality, though there are diminishing returns. Conversely, less time can be spent and Quality will go down.
Stability
The relationship between Quality, Time, Dollars & Staff Emotional Energy is only stable within a narrow range at the center of the graph. Beyond this point the system becomes unstable and a Failure Cascade is likely.
Well spent capital, in the form of process or automation (aka infrastructure), can shift the point of equilibrium.
All Together Now
Systems often break at orders of magnitude. They appear to break catastrophically when they encounter hidden infrastructure limits or are mismanaged into a failure cascade. Fortunately, systems tend to deform before they snap; careful leaders can observe Quality, Time, Dollars and Staff Emotional Energy to see this deformation.
Systems should be built such that either they ship with this observational ability or they’re able to build these observational tools before the next transaction magnitude is reached. By doing so leaders will be able to determine when and how to make capital investments to avoid a catastrophic failure. Systems that do not have these observational tools will suffer extreme deformation leading to catastrophic failure and that failure will be a surprise.
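A minimal sketch of such an observational tool is a comparison of current per-transaction metrics against a baseline taken when the system was healthy. The metric names, the sample numbers, and the 20% tolerance below are all assumptions for illustration; how to measure something like Staff Emotional Energy is left open.

```python
def deformation(baseline, current, tolerance=0.2):
    """Return metrics whose relative change from baseline exceeds `tolerance`."""
    drift = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            drift[name] = round(change, 2)
    return drift


# Hypothetical readings: quality and dollars look fine, time and energy do not.
baseline = {"quality": 4.5, "minutes_per_transaction": 30,
            "dollars_per_transaction": 20, "staff_energy": 8}
current = {"quality": 4.4, "minutes_per_transaction": 42,
           "dollars_per_transaction": 21, "staff_energy": 6}

print(deformation(baseline, current))
```

In this example Quality is still holding, but only because time per transaction and staff energy are absorbing the load, which is the kind of early deformation worth acting on.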
Hopefully this helps you avoid the mistakes I’ve made.
A nice article, Luke, and it highlights the importance of socio-technical awareness.