The Four Key Metrics explained
Lead time
Lead time is a traditional development and production metric that measures the time from when something enters the production process to the moment it reaches customers in the market. The challenge with lead time is that the term has been overused to the point where most people using it are likely talking about different things. The definition used in Accelerate is (thankfully) more specific. In the context of the book and this post, lead time refers to the period between code being committed and that code being delivered into production.
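To make the commit-to-production definition concrete, here is a minimal Python sketch. The timestamps are hypothetical, and the `lead_time` helper is only an illustration of the definition above, not something taken from Accelerate.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time between a commit and the production deployment that contains it."""
    return deployed_at - committed_at


# Hypothetical (commit time, production deploy time) pairs, for illustration only.
changes = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 15, 0)),   # 6 hours
    (datetime(2021, 3, 2, 10, 0), datetime(2021, 3, 3, 10, 0)),  # 24 hours
    (datetime(2021, 3, 4, 8, 0), datetime(2021, 3, 4, 11, 0)),   # 3 hours
]

lead_times_hours = [
    lead_time(committed, deployed).total_seconds() / 3600
    for committed, deployed in changes
]

print(f"mean lead time:   {mean(lead_times_hours):.1f} h")    # 11.0 h
print(f"median lead time: {median(lead_times_hours):.1f} h")  # 6.0 h
```

A single slow change pulls the mean well above the median here, so it can be worth looking at both.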
Deployment frequency
Deployment frequency is probably the most straightforward metric, but it is also one whose value is commonly challenged by both executives and developers. The correlation is nonetheless very clear: teams who deploy to production more frequently (multiple times per day) are more likely to fall into the high- or elite-performing category.
Change fail rate
How often do changes deployed to production fail? This metric was the hardest to define and measure practically, but from a high level it indicates the efficacy of your quality-control process. How many changes introduce major bugs or regressions in your production environment?
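As a rough illustration, change fail rate can be computed as failed deployments divided by total deployments. The sketch below assumes each deployment has already been labelled as having caused a failure or not; the `Deployment` class and its `caused_failure` field are hypothetical, and producing that label reliably is the hard part.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Deployment:
    id: str
    # Assumption: some process has already labelled whether this
    # deployment caused a failure in production.
    caused_failure: bool


def change_fail_rate(deployments: Iterable[Deployment]) -> float:
    """Fraction of production deployments that led to a failure."""
    deployments = list(deployments)
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)


# Hypothetical data: 20 deployments, 3 of which caused a failure.
history = [
    Deployment(id=f"deploy-{n}", caused_failure=n in {4, 11, 17})
    for n in range(20)
]
print(f"change fail rate: {change_fail_rate(history):.0%}")  # 15%
```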
Mean time to restore (MTTR)
Mean time to restore (MTTR) is the average time it takes to recover from a service outage, failed deploy, or incident. It is a great operational metric because, beyond pure availability metrics (which are also critical), it measures a kind of resilience in the face of failures.
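A minimal sketch of that average, assuming each incident is recorded as a (started, restored) pair of timestamps. The incident data is made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents) -> timedelta:
    """Average of (restored - started) across all incidents."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum(
        (restored - started for started, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents as (started, restored) pairs.
incidents = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 30)),  # 30 minutes
    (datetime(2021, 3, 9, 14, 0), datetime(2021, 3, 9, 15, 30)),  # 90 minutes
    (datetime(2021, 3, 20, 2, 0), datetime(2021, 3, 20, 3, 0)),   # 60 minutes
]
print(mean_time_to_restore(incidents))  # 1:00:00
```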
More DevOps metrics? I thought we were talking about business impact...
While it’s clear that these metrics give us insight into a few important aspects of delivery team performance (specifically time-to-market, cadence, resilience, and quality), it’s a little less clear how they impact the health of teams and of the business. One of the more interesting outcomes from the Accelerate study was a strong statistical correlation between performing well on these key metrics and performing better on high-level business outcomes, such as:
Number of customers
Quality of products or services (as evaluated by customers)
Customer satisfaction (NPS or equivalent)
Quantity of products or services provided
Achieving organizational and mission goals
Better yet, the authors’ analyses are (to date) some of the largest and most scientifically credible studies done in the IT industry around delivery performance and DevOps capabilities.
To summarize, rather than relying on anecdotal narratives about the teams “feeling slow” or ambiguous, relative data like story point throughput, Accelerate offers hope of finding a clear path toward measurable software delivery metrics that really impact your business. Most importantly, these metrics are highly correlated with positive outcomes that make sense to executives outside of product development.
Our journey at Contentful
As a director of engineering for groups that own two critical customer personas (the developer and the non-technical creator), I work with teams that own a lot of customer-facing features. Because of this, it made sense for me to take ownership of the Quality Assurance (QA) chapter when I started at Contentful in 2017.
At the time, we had reached a point where we were ready to rethink how we approached our QA practice. We wanted to ensure that our engineering teams as a whole — not just the engineers hired to do QA — owned the quality of the work that they produced and delivered to our customers.
When we looked a bit more closely at our software delivery processes, one thing that stood out was that we would have to make some changes to our test automation and delivery pipeline to facilitate this.
If we wanted to get engineers more engaged in quality for our customers, our automation and tooling needed to provide faster feedback and better confidence, be more reliable and be closer to the engineers’ day-to-day development process.
To help facilitate this, we put together a small team with two senior engineers who had deep experience with test automation and delivery infrastructure, and a keen interest in process improvement and engineering productivity. This was the beginning of our “Test, Automation and Tools” team, also known as aTools (secretly, in my head: the A-Team).
While our first task was to consolidate and improve our test automation, we made the explicit choice from the outset to connect our goals to the broader mission of engineering productivity and effectiveness. Specifically, we wanted to focus on reducing lead time and the risk of getting changes into production.
Measuring success with the Four Key Metrics
Once our mission was clear, the next logical question became, “How do we measure success?” A few of us had been interested in the work that had been done around the Four Key Metrics after reading Accelerate. These metrics seemed to be a great fit for a team focused on improving the delivery pipeline. In particular, as we dug into refining the Four Key Metrics to be more specific to our organization and use case, the first three seemed like must-includes:
Code delivery time
Code delivery time is the name we settled on for “lead time” to avoid an overloaded term. It measures the average time from a code commit (on a developer’s local computer) to the production deployment of the artifact which contains that commit in its changeset. This captures the entire change management and deployment pipeline. It also captures some of the variance from development time, although much of that can happen prior to the first commit. We felt that it was valuable to be able to surface how much of this time was spent in automation, since this helped us understand which parts of the pipeline had the most impact on the value stream.
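One way to surface how much of the code delivery time is spent in automation is to split it into phases between milestones. The sketch below is only an illustration: the milestone names (`committed`, `merged`, `build_finished`, `deployed`), the phase names, and the timestamps are all hypothetical.

```python
from datetime import datetime

# Hypothetical phases, each defined by a (start milestone, end milestone) pair.
PHASES = [
    ("review", "committed", "merged"),
    ("automation", "merged", "build_finished"),
    ("release", "build_finished", "deployed"),
]


def phase_durations_minutes(milestones: dict) -> dict:
    """Minutes spent in each phase for a single change."""
    return {
        name: (milestones[end] - milestones[start]).total_seconds() / 60
        for name, start, end in PHASES
    }


# Hypothetical milestone timestamps for one change.
change = {
    "committed": datetime(2021, 3, 1, 9, 0),
    "merged": datetime(2021, 3, 1, 13, 0),
    "build_finished": datetime(2021, 3, 1, 13, 30),
    "deployed": datetime(2021, 3, 1, 14, 0),
}

durations = phase_durations_minutes(change)
total = sum(durations.values())
for name, minutes in durations.items():
    print(f"{name:<10} {minutes:>5.0f} min ({minutes / total:.0%})")
```

In this made-up example the change takes 300 minutes end to end, of which only 30 are spent in automation, which would suggest looking at the review phase first.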
Deployment frequency
Deployment frequency is a common metric for build automation, and is often used as a proxy for batch size. If our build automation is fast and easy to use with confidence, we should see more changes, with smaller batch sizes, being deployed into production (normalized per developer).
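A minimal sketch of a per-developer normalization: count production deployments per ISO week and divide by team size. The deployment dates and the `team_size` value are hypothetical, and dividing by a fixed headcount is just one possible way to normalize.

```python
from collections import Counter
from datetime import date


def weekly_deploys_per_developer(deploy_dates, team_size: int) -> dict:
    """Deployments per developer, keyed by (ISO year, ISO week)."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    per_week = Counter(d.isocalendar()[:2] for d in deploy_dates)
    return {week: count / team_size for week, count in sorted(per_week.items())}


# Hypothetical production deployment dates.
deploy_dates = [
    date(2021, 3, 1), date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 4),
    date(2021, 3, 8), date(2021, 3, 9),
]

print(weekly_deploys_per_developer(deploy_dates, team_size=4))
# {(2021, 9): 1.0, (2021, 10): 0.5}
```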
Change fail rate
Change fail rate helped us understand whether or not changes to our testing approach impact quality. Some types of automated testing (e.g. an automated end-to-end test which exercises the UX and all systems beneath it) are more expensive than others. We wanted a metric that could help us understand if moving some of the test coverage further down the automation pyramid would have any impact on quality.
Mean time to restore (MTTR)
Our infrastructure team was already capturing MTTR, so we decided to leave it out of scope for the initial version.
As we dug deeper into the problem, it became clear that these metrics could be strongly impacted not just by tools, but also by development practices. The availability of feature toggles, for instance, impacts both code delivery time and deployment frequency. If developers are not able to integrate partial changes into mainline and production hidden behind feature toggles, the additional coordination of packaging the features into releases will impact both metrics.
Even though we knew we had some gaps in our continuous delivery practice that would impact these metrics, we explicitly made the choice to use the “higher-level” metrics. We felt that they were important because they would highlight and incentivize change — even though a few of the most impactful (mostly process- and culture-related) changes would be outside of the team’s control.
Dev KPIs: Our implementation of the Four Key Metrics at Contentful
So what did we end up with? As we dug into development, we made a few changes to support Contentful’s specific environment and context.
Our first major challenge was defining and measuring “change fail rate”. Not all defects are detected or cause failures, and incidents are not always connected to software defects; they are often caused by load, novel usage patterns, and system interactions. To keep the task relatively simple, we targeted two situations:
Rollbacks of production deployments. If we need to roll back a change, then we can probably correlate that change with some sort of failure in the previous deployment.
Commits with “fix” in the description. We initially wanted to capture hotfixes, but that was a bit too complex to reliably measure. We settled on splitting out a separate measurement from rollbacks called “fix rate”, which measures the ratio of changes with “fix” in the description. While we know that this metric isn’t ideal, we communicated this to the dev team and went ahead with it because we were interested in surfacing rework.
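The two measurements above can be sketched roughly as follows. This is an illustrative approximation rather than our actual implementation: the `Change` class, its fields, and the sample descriptions are hypothetical, and the plain substring match is a deliberately naive assumption.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    rolled_back: bool = False  # assumption: rollbacks are recorded per change


def rollback_rate(changes) -> float:
    """Share of changes that were rolled back in production."""
    return sum(c.rolled_back for c in changes) / len(changes) if changes else 0.0


def fix_rate(changes) -> float:
    """Share of changes whose description contains "fix" (case-insensitive).

    A plain substring match also counts words like "prefix", which is one
    reason this kind of metric is only a rough signal of rework.
    """
    if not changes:
        return 0.0
    return sum("fix" in c.description.lower() for c in changes) / len(changes)


# Hypothetical changes, for illustration only.
changes = [
    Change("Add locale switcher"),
    Change("Fix pagination offset"),
    Change("fix: handle empty payload"),
    Change("Bump dependency versions", rolled_back=True),
    Change("Refactor webhook retries"),
]

print(f"rollback rate: {rollback_rate(changes):.0%}")  # 20%
print(f"fix rate:      {fix_rate(changes):.0%}")       # 40%
```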
Another major challenge was trying to define teams. While our organization has evolved from a chapter model to a more explicit team ownership model, there are still quite a few code bases that are worked on by multiple teams.
We use GitHub extensively to generate these development metrics, but our distributed code ownership can make team-based views difficult or impossible. While we’ve made improvements in this area by doing a major overhaul of team definitions and access in GitHub, there are still a lot of blurry lines. In the end, our current stance is that encouraging collaboration is more important than being overly strict about ownership. So we accept some blurry lines as a fair exchange.
As a next step, the team dug into the APIs of our change management and CI/CD service providers and started surfacing the data via Grafana to produce our internal Dev KPIs dashboard. The purpose of this dashboard is to help teams identify and surface areas for improvement. As a side effect of this work, we were able to make major improvements to our CI/CD pipeline, revision control and test automation infrastructure and performance.
Insights and next steps
Where are we now? We now have a pretty full-featured version of the tool and have been working over the past few months on rollout and adoption.
Here are a few things that we learned along the way:
Adoption is challenging! It’s not always obvious to stakeholders or developers why these particular metrics are valuable and it’s a bit of a stretch to ask everyone to read Accelerate (that said, you should really ask everyone to read it). We’ve managed to get the dashboards into our monthly delivery review to encourage teams to look at and reflect on the data on a regular cadence. This has definitely improved visibility.
On the point of adoption: these metrics tend to favor particular lean processes like short-lived branches, early integration, feature toggles, etc. While these are generally considered best practices, some aspects of lean product development, like limiting WIP and treating code as inventory, are not intuitive. Bringing your team along will take advocacy, passion, and a whole lot of work. Find influencers in your organization and talk to them. Listen to them, but be willing to challenge orthodoxy. Consider starting a book club. Be voracious as you read. Listen, learn, and share.
Once you can start to see in the metrics where you have room for improvement, it’s not always easy to turn that somewhat abstract number into something specific and actionable. We’ve had to continue to refine the metrics (breaking the code delivery time into multiple phases, for instance) in order to make them more tangible to teams and (hopefully) actionable.
The four key metrics don’t tell the whole story. We’re looking at deeper integration with additional tools like Jira to continue to improve our understanding of the whole product development value stream.
Finally, it’s important to understand that correlation is not causation. Blind fixation on these metrics alone is not likely to improve your organization. The best organizations also tend to have invested deeply in their technical, process, and cultural capabilities. It’s the culmination of all these investments that is reflected in both their successes and these metrics.
As I mentioned above, we’re still in the early stages of learning how to drive change with our Dev KPIs. We’ll continue to share our journey as we learn more.
Have you built something similar? You can reach out to me on Twitter — I would love to hear your stories and feedback!