What is Change Failure Rate?
Change Failure Rate (CFR) is one of the four key DORA metrics. It measures how frequently changes to production result in degraded service that requires remediation. To calculate it, Uplevel analyzes the sequence of deployments to production in a given repository.
Consider a scenario where a code change was deployed to production, but it contained a defect that was missed during review. The defect degraded service, so the next day a hotfix PR was written and deployed to production to restore service. In this scenario, the first deployment (containing the defect) would be considered a "failure".
Uplevel classifies each deployment to production within a repository into three categories:
Failures
These are deployments to production environments that are followed by a subsequent deployment containing work that signifies defect remediation. Uplevel estimates this by analyzing the PR branch name and title, as well as linked Jira ticket titles and issue types, looking for the following keywords: bug|incident|security risk|defect|vulnerability|hotfix|coldfix|patch|fix forward|rollback
Uplevel also looks for rollback deployments that could indicate that there was a problem with a previous deployment.
Additionally, Uplevel takes the timestamps of work into account to avoid counting bug-fix or polish work that was started before the code change was live in a production environment.
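The keyword and timestamp checks described above can be sketched roughly as follows. This is an illustration only: the function name looks_like_repair, the case-insensitive substring matching, and the simple "started after the previous deployment" rule are assumptions, not Uplevel's actual implementation. Only the keyword list comes from the text.

```python
import re
from datetime import datetime

# Keyword list taken from the article; case-insensitive matching is an assumption.
REPAIR_KEYWORDS = re.compile(
    r"bug|incident|security risk|defect|vulnerability|hotfix|coldfix"
    r"|patch|fix forward|rollback",
    re.IGNORECASE,
)

def looks_like_repair(texts, work_started_at, previous_deploy_at):
    """Hypothetical check: True if any text field (branch name, PR title,
    Jira title, issue type) contains a repair keyword AND the work started
    after the previous deployment went live."""
    has_keyword = any(REPAIR_KEYWORDS.search(t) for t in texts if t)
    started_after_deploy = work_started_at > previous_deploy_at
    return has_keyword and started_after_deploy

# Example: a hotfix branch started the day after a deployment.
deployed = datetime(2024, 3, 1, 12, 0)
hotfix = looks_like_repair(
    ["hotfix/login-crash", "Restore login flow"],
    datetime(2024, 3, 2, 9, 0),
    deployed,
)
# Example: bug work that began before the deployment is not counted.
older_bug = looks_like_repair(
    ["bug/typo-in-footer"], datetime(2024, 2, 27, 9, 0), deployed
)
```

Plain substring matching like this would also flag unrelated words such as "debug", so a real implementation would likely need stricter matching.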
Successes
These are deployments to production that are not followed by repair work. These represent the majority of deployments that are observed by Uplevel.
Fix-only
These are deployments that are observed to contain only repair work: every PR in the deployment meets the criteria described above, or the deployment is a rollback. Since these deployments aren't planned changes to production, they are excluded from the calculation of CFR described below.
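To make the three categories concrete, here is a minimal sketch of how a chronological list of deployments might be labeled. The Deployment record and its field names are hypothetical. The sketch also assumes that only the immediately following deployment is inspected and that "fix-only" takes precedence over "failure"; the text does not specify either point.

```python
from dataclasses import dataclass, field

@dataclass
class Deployment:
    # Hypothetical record; these fields are not Uplevel's schema.
    pr_is_repair: list = field(default_factory=list)  # one bool per PR
    is_rollback: bool = False

def is_fix_only(d):
    # A rollback, or a deployment in which every PR is repair work.
    return d.is_rollback or (len(d.pr_is_repair) > 0 and all(d.pr_is_repair))

def contains_repair(d):
    # A rollback, or a deployment with at least one repair PR.
    return d.is_rollback or any(d.pr_is_repair)

def classify(deployments):
    """Label each deployment, given in chronological order."""
    labels = []
    for i, d in enumerate(deployments):
        if is_fix_only(d):
            labels.append("fix-only")
        elif i + 1 < len(deployments) and contains_repair(deployments[i + 1]):
            labels.append("failure")
        else:
            labels.append("success")
    return labels

# A routine deployment, a defective one, then a rollback.
labels = classify([
    Deployment(pr_is_repair=[False]),
    Deployment(pr_is_repair=[False, False]),
    Deployment(is_rollback=True),
])
```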
How is CFR Calculated?
In a given time period, Uplevel considers all deployments that are attributed to a group of people, either because someone in the group initiated the deployment or authored a PR that was included in it. CFR is then calculated using counts (n) of the various deployments:
$$\mathrm{CFR} = \frac{n_{\text{failures}}}{n_{\text{failures}} + n_{\text{successes}}}$$
Fix-only deploys are removed from the denominator of this equation for two reasons:
Consider the following sequence of deployments:
- A successful, routine deployment.
- A deployment that is estimated to include a defect (i.e., a failure).
- A rollback, which would only have occurred to remediate the defect of the previous deployment.
In the above scenario, there are two intended changes to production, one of which failed, and thus a 50% CFR. This aligns with the DORA question: "what percentage of changes to production or releases to users result in degraded service".
Additionally, if fix-only deployments were included in the total number of deploys, it might be tempting to game this metric by issuing several fix-only deployments to increase the denominator and artificially lower the CFR for a team. For example, if there was a problematic deployment, a team might issue three different "fix-only" deployments to fix it, reducing the team's CFR from 100% down to 25%.
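The arithmetic behind both reasons can be checked with a short sketch. The function names are hypothetical, and naive_rate exists only to show what the metric would look like if fix-only deployments were counted in the denominator.

```python
def change_failure_rate(n_failures, n_successes):
    """CFR with fix-only deployments excluded from the denominator."""
    planned = n_failures + n_successes
    if planned == 0:
        return None  # no planned changes in the period
    return n_failures / planned

def naive_rate(n_failures, n_successes, n_fix_only):
    """Counterexample: fix-only deployments included in the denominator."""
    total = n_failures + n_successes + n_fix_only
    if total == 0:
        return None
    return n_failures / total

# Sequence from the text: one success, one failure, one rollback.
sequence_cfr = change_failure_rate(n_failures=1, n_successes=1)

# Gaming scenario: one failure followed by three fix-only deployments.
gamed = naive_rate(n_failures=1, n_successes=0, n_fix_only=3)
actual = change_failure_rate(n_failures=1, n_successes=0)
```

Here sequence_cfr is 0.5, gamed is 0.25, and actual is 1.0, matching the 50%, 25%, and 100% figures in the text.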
The purpose of this metric is not to penalize teams, but rather to highlight areas for improvement and facilitate data-driven conversations.
How are Deployments to Production Defined?
Deployments to production are estimated for each repository from the environments configured within GitHub Actions. Uplevel uses an organization-wide regular expression (regex) to match production environments while excluding non-production environments; specific repositories can also be configured individually. Your organization's global definition is visible by visiting the configuration page and clicking on an environment. Learn more about how to use this page here.
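As a rough illustration of environment matching, the sketch below applies an organization-wide pattern with optional per-repository overrides. The patterns, repository names, and the is_production function are all hypothetical; your organization's actual regex is whatever appears on the configuration page.

```python
import re

# Hypothetical organization-wide pattern, anchored so that names such as
# "pre-production" or "staging" are not treated as production.
GLOBAL_PATTERN = r"^prod(uction)?$"

# Hypothetical per-repository overrides that replace the global pattern.
REPO_PATTERNS = {
    "acme/payments": r"^live-(us|eu)$",
}

def is_production(repo, environment):
    """Return True if the environment name matches the pattern that
    applies to this repository (override first, else the global one)."""
    pattern = REPO_PATTERNS.get(repo, GLOBAL_PATTERN)
    return re.search(pattern, environment, re.IGNORECASE) is not None
```

Anchoring the pattern with ^ and $ is one simple way to include production environments while excluding similarly named non-production ones.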