Mean Time to Recovery (MTTR) measures the average time to restore normal operation after a production incident. It's the operational-resilience metric in the DORA set.
What it measures
The average elapsed time from when an incident starts to when it's resolved, across all incidents in the period.
How Leanmote calculates it
mttr = sum(resolved_at - incident_started_at) / incident_count
Includes incidents tied to service degradation, outage, rollback, or urgent hotfix.
Both mean and median are tracked: the median shows the typical recovery time, while the mean exposes a single bad outage that a run of frequent minor incidents would otherwise hide.
Severity labels (sev1, sev2, etc.) let you analyze recovery quality by impact.
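As a rough illustration of the calculation described above, the sketch below computes the mean and median recovery time overall and per severity label. It is not Leanmote's implementation: the record layout, the "severity" key, the helper names, and the sample incidents are all assumptions made for the example. Only incident_started_at and resolved_at come from the formula.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident records. The timestamp field names follow the formula
# above; the "severity" key and the sample data are illustrative assumptions.
incidents = [
    {"severity": "sev1", "incident_started_at": datetime(2024, 5, 1, 10, 0),
     "resolved_at": datetime(2024, 5, 1, 16, 0)},   # 360 minutes
    {"severity": "sev2", "incident_started_at": datetime(2024, 5, 3, 9, 0),
     "resolved_at": datetime(2024, 5, 3, 9, 30)},   # 30 minutes
    {"severity": "sev2", "incident_started_at": datetime(2024, 5, 7, 14, 0),
     "resolved_at": datetime(2024, 5, 7, 14, 20)},  # 20 minutes
    {"severity": "sev3", "incident_started_at": datetime(2024, 5, 9, 8, 0),
     "resolved_at": datetime(2024, 5, 9, 8, 10)},   # 10 minutes
]


def recovery_minutes(incident):
    """Elapsed minutes between incident start and resolution."""
    elapsed = incident["resolved_at"] - incident["incident_started_at"]
    return elapsed.total_seconds() / 60


def mttr_summary(records):
    """Mean and median recovery time in minutes, or None with no incidents."""
    durations = [recovery_minutes(r) for r in records]
    if not durations:
        return None  # avoid dividing by an incident_count of zero
    return {
        "count": len(durations),
        "mean_minutes": mean(durations),
        "median_minutes": median(durations),
    }


def mttr_by_severity(records):
    """Group incidents by severity label and summarize each group."""
    groups = {}
    for record in records:
        groups.setdefault(record["severity"], []).append(record)
    return {severity: mttr_summary(group) for severity, group in groups.items()}


overall = mttr_summary(incidents)
print(overall["mean_minutes"], overall["median_minutes"])  # 105.0 25.0
print(mttr_by_severity(incidents)["sev2"]["mean_minutes"])  # 25.0
```

In this sample the single six-hour sev1 outage lifts the mean to 105 minutes while the median stays at 25, which is the gap that tracking both numbers is meant to surface.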
How to interpret it
A downward trend means recovery is getting faster, which is a sign of operational health.
Median much lower than mean tells you most incidents recover quickly but a few drag the average up. Look at the outliers.
If you only track sev1, MTTR will look better than reality. Include sev2 and below for a complete picture.
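One way to act on the mean-versus-median reading is to list the incidents responsible for the gap. The sketch below is a simple heuristic, not a Leanmote feature: the function name and the outlier_factor threshold (a duration counts as an outlier when it exceeds three times the median) are assumptions chosen for illustration.

```python
from statistics import mean, median


def skew_report(durations_minutes, outlier_factor=3.0):
    """Compare mean and median and list the durations that pull the mean up.

    outlier_factor is an illustrative threshold, not a Leanmote setting:
    a duration is flagged when it exceeds outlier_factor * median.
    """
    if not durations_minutes:
        raise ValueError("need at least one incident duration")
    med = median(durations_minutes)
    avg = mean(durations_minutes)
    outliers = sorted(d for d in durations_minutes if d > outlier_factor * med)
    return {"mean": avg, "median": med, "outliers": outliers}


# Hypothetical recovery times in minutes for one reporting period.
report = skew_report([12, 15, 18, 20, 25, 480])
print(report)  # {'mean': 95, 'median': 19.0, 'outliers': [480]}
```

Here five incidents recover in under half an hour, yet the mean is 95 minutes because of one 480-minute incident. That incident is the one worth a closer post-incident review.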
What to do about it
Improve observability — alerts that detect incidents earlier shorten the clock.
Clarify ownership and escalation paths so the right people are paged immediately.
Invest in runbooks for the most common failure modes.
Review post-incident: was the recovery time spent mostly diagnosing or mostly fixing? Time lost to diagnosis points to an observability bottleneck; time lost to the fix points to change velocity.
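The post-incident question about where recovery time went can be answered with a phase breakdown, provided the team records intermediate timestamps. The sketch below assumes two extra timestamps, detected_at and diagnosed_at, that are not part of the formula in this article. They and the function name are hypothetical and would need to come from your own incident records.

```python
from datetime import datetime


def phase_breakdown(started_at, detected_at, diagnosed_at, resolved_at):
    """Split one incident's recovery time into detect, diagnose, and fix minutes.

    detected_at and diagnosed_at are hypothetical fields for this sketch.
    """
    if not (started_at <= detected_at <= diagnosed_at <= resolved_at):
        raise ValueError("timestamps must be in chronological order")
    minutes = {
        "detect": (detected_at - started_at).total_seconds() / 60,
        "diagnose": (diagnosed_at - detected_at).total_seconds() / 60,
        "fix": (resolved_at - diagnosed_at).total_seconds() / 60,
    }
    # The longest phase is the first place to look for improvement.
    bottleneck = max(minutes, key=minutes.get)
    return minutes, bottleneck


minutes, bottleneck = phase_breakdown(
    started_at=datetime(2024, 5, 1, 10, 0),
    detected_at=datetime(2024, 5, 1, 10, 10),
    diagnosed_at=datetime(2024, 5, 1, 11, 10),
    resolved_at=datetime(2024, 5, 1, 11, 40),
)
print(minutes)     # {'detect': 10.0, 'diagnose': 60.0, 'fix': 30.0}
print(bottleneck)  # diagnose
```

In this example an hour of the 100-minute recovery went to diagnosis, which suggests better observability or runbooks would help more than a faster deploy path.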
Related metrics
Change Failure Rate
Deployment Frequency
DORA metrics overview
