Feb 19, 2024

Perverse incentives and DORA metrics

“Move fast and break things” isn’t what anyone wants. Plan for complex systems to break. Reframe the rallying cry to “move fast with reliability” and stop wasting time measuring failure.

Measure what matters or end up with folks gaming the system.

I’ve never been a fan of DORA metrics. I appreciate the copasetic thought leadership behind them and their intended purpose of tracking a company's ability to reliably move fast. But it’s always struck me as odd that one of the metrics is Change Failure Rate. This is an unnecessary metric that works against the other metrics.

If you track Change Failure Rate, it really means that you’re afraid of change. Moreover, if you’re afraid of change, you can’t move fast. If you can't move fast, your business will lose to one that can.

What are DORA metrics?

Briefly, DORA metrics are four key performance indicators (KPIs) that attempt to measure the effectiveness of a software delivery process. Those KPIs are:

Deployment Frequency: how often an organization deploys code into production.
Lead Time for Changes: the average time it takes for a code change to go from commit into production.
Change Failure Rate: the percentage of deployments that fail.
Mean Time to Restore: the average time it takes to recover from a failed deployment. This is often abbreviated as MTTR.
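
To make those four definitions concrete, here's a minimal sketch of how they might be computed from a list of deployment records. The `Deployment` record, its field names, and the sample data are all hypothetical, and the sketch assumes a failure begins at deploy time; real tooling defines these boundaries in its own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; field names are assumptions for illustration.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did this deployment fail?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four KPIs over a reporting window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    # Lead time: commit to production, in hours.
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    # Restore time: failed deploy to restored service, in hours.
    restore_times = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": total / window_days,
        "lead_time_hours": sum(lead_times) / total,
        "change_failure_rate_pct": 100 * len(failures) / total,
        "mttr_hours": sum(restore_times) / len(restore_times) if restore_times else 0.0,
    }


# Made-up sample data: four deployments in four days, one of which failed.
t0 = datetime(2024, 2, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=2)),
    Deployment(
        t0 + timedelta(days=2),
        t0 + timedelta(days=2, hours=6),
        failed=True,
        restored_at=t0 + timedelta(days=2, hours=7),
    ),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=4)),
]
metrics = dora_metrics(deployments, window_days=4)
print(metrics)
```

Notice that Change Failure Rate is the only one of the four with the deployment count in its denominator and a count of "bad" deployments in its numerator, which is why it is the easiest one to move by simply reporting fewer failures.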

The folks behind DORA metrics are respected luminaries, and their recommendations come from an incredibly large set of studies and surveys across a diverse audience of global companies. Sadly, while their recommendations are quite popular, it's my experience that only one of the four metrics really matters. What's more, Change Failure Rate creates a tension with Deployment Frequency and results in a perverse incentive that can ultimately increase Lead Time for Changes.

Competitive advantages

Speed is the primary competitive advantage a company has in any market. Therefore, the only metric that matters to all parties in a business is Lead Time for Changes (commonly shortened to Lead Time). This metric is broadly the time it takes for a business initiative to reach customers.

Lead Time for Changes can be reduced by making it fast and safe to deploy (Deployment Frequency) along with having the safeguards in place to rapidly detect and fix any issues that arise (MTTR). When Deployment Frequency is high and MTTR is low, the business is happy.

A low MTTR, however, is easier said than done. Indeed, if you want to move fast with some semblance of reliability, you’ll need to invest in three disciplines: observability, failure domains, and fast fixes via rollbacks or via rolling forward. If you can detect issues quickly (the time this takes is often called Mean Time to Detect, or MTTD), have thought through how to limit the blast radius of errors (like avoiding global changes), and can rapidly fix those issues by either reverting the change or deploying a fix, you’ll have control over MTTR. If you want to be groovy and move fast, you need confidence that you can rapidly resolve issues. Consequently, you have to know when an issue arises, have the ability to limit its scope, and quickly rectify it.
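
One way to see why detection matters so much: time to restore is detection time plus fix time, so every minute shaved off Mean Time to Detect (MTTD) comes straight off MTTR. Here's a rough sketch, assuming you keep a simple timeline for each incident. The `Incident` record, its field names, and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    started_at: datetime   # when the bad change began affecting users
    detected_at: datetime  # when monitoring or a human noticed
    resolved_at: datetime  # when the rollback or fix landed


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def mttd_and_mttr(incidents: list[Incident]) -> tuple[float, float]:
    """Return (MTTD, MTTR) in minutes; MTTR here runs from start to resolution."""
    mttd = mean_minutes([i.detected_at - i.started_at for i in incidents])
    mttr = mean_minutes([i.resolved_at - i.started_at for i in incidents])
    return mttd, mttr


# Made-up sample data: two incidents with different detection delays.
t0 = datetime(2024, 2, 19, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=40)),
]
mttd, mttr = mttd_and_mttr(incidents)
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min")
```

In this made-up data a third of the recovery time is spent just noticing the problem, which is the part observability attacks. The rest is the part rollbacks and roll-forwards attack.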

Fear-based metrics

If you're tracking Change Failure Rate, you’re afraid. Tracking this metric means your MTTD and MTTR are high. It means you have poor observability. It probably also means you lack automation and the necessary machinery to fix things without a lot of firefighting. It means your business panics when things break because there's a lack of confidence that engineering can fix the issue quickly. It means the business doesn't trust engineering.

Here’s the thing: complex systems break. They break all the time, too. All high-speed companies with massive scale (like Google, Netflix, Meta) have figured out a way to find and fix those issues quickly. Indeed, failure is inevitable in complex distributed systems; therefore, you can’t eliminate all failures. It’s far more efficient to accept failures as an outcome of moving fast and put resources towards figuring out how to deal with them quickly. Tracking Change Failure Rate is a distraction.

When the measure becomes a target

Tracking Change Failure Rate is a perverse incentive. What's more, the word “failure” is massively negative. There are few outcomes in the context of software processes that end well when the word failure is used. No one wants to have their name associated with a metric that uses the word failure.

If you judge people for failing, it’s a certainty that they’ll game the system to avoid the stigma associated with the metric. In this case, they’ll hide failures or slow deployment velocity so as to limit the growth of this nefarious metric. This is Goodhart's law in action, baby. The metric ceases to be a good measure because as soon as folks are afraid to report failures, you don’t learn. And you don't move fast.

Failures happen. Great engineering cultures celebrate and learn from failures while awful cultures punish for failures. Focus on what matters: moving fast with reliability.

Measure what really matters

DORA metrics represent a framework for optimizing a company's software delivery process and boosting its overall performance. Every business wants to move fast as it's the only way to win. Accordingly, the only metric that matters is Lead Time.

If you want to understand this metric, ask the business if it feels Engineering is moving fast enough. It's every business leader's bag to want to move faster, so listen carefully to the answer. If the answer is a frustrating "no" then you have a lot of work to do. You'll need to improve things in short order if you want to survive. If the answer is a "yes" then you might want to also worry. Complacency leads to stagnation. Anything in between is the sweet spot for increasing overall engineering transparency and working to ensure that your company can reliably move fast.

Measure what matters by thinking through incentives. Moreover, acknowledge and accept that failures will occur. Invest in detecting failures and rapidly rectifying them. What's more, invest in learning from failures. Get behind the rallying cry of “move fast with reliability” and stop wasting time measuring failure.

Can you dig it?