MTTA vs MTTR: What Alert Fatigue Does to Each (Incident Response 2026)
Updated May 2026. Sources: PagerDuty 2023 State of Digital Operations, Atlassian incident management documentation, DORA 2024 State of DevOps, Google SRE Book Chapter 11.
Definitions That Often Get Conflated
MTTA (Mean Time To Acknowledge) measures how quickly an on-call engineer acknowledges a page in the pager tool. The clock starts when the page fires and stops when the engineer clicks the acknowledge button. MTTR (Mean Time To Resolve, sometimes Mean Time To Recover) measures the total incident duration from page firing to confirmed restored service health. The two metrics measure adjacent but distinct properties of incident response, and they are often conflated in practice because both vendors and engineering teams have inconsistent conventions about what each metric covers.
PagerDuty and Atlassian both publish MTTA in their incident metric APIs, and both define it consistently as page-to-acknowledgment time. MTTR is more fragmented. PagerDuty measures to the point at which the incident is marked resolved in the tool. Atlassian Statuspage measures to confirmed restored health if Statuspage is wired into the incident workflow. Datadog and other monitoring tools sometimes report MTTR as time-to-recovery of the underlying signal, which is a different measurement that does not include the human investigation time. When comparing MTTR across organisations, confirm the definition first; comparisons across inconsistent definitions are not meaningful.
A third metric that often gets bundled with MTTA and MTTR is MTTD (Mean Time To Detect), measuring time from incident occurrence to the page firing. MTTD is determined by your monitoring instrumentation, not by your incident response capability; it is a separate diagnostic that points at monitoring gaps rather than response problems. Healthy MTTD for production-impacting incidents is under 5 minutes from incident onset to detection.
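As a rough illustration of the three definitions above, the sketch below computes MTTD, MTTA, and MTTR from per-incident timestamps. The `Incident` record and its field names are hypothetical, not any vendor's export schema, and MTTR here follows this article's page-to-restored-health definition; swap in your tool's convention if it differs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; map them to whatever your pager tool exports.
    started_at: datetime       # when the underlying problem began (often estimated afterwards)
    paged_at: datetime         # when the page fired
    acknowledged_at: datetime  # when the on-call engineer acknowledged
    resolved_at: datetime      # when restored service health was confirmed


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def response_metrics(incidents):
    """Return MTTD, MTTA, and MTTR in minutes for a list of incidents."""
    return {
        "mttd": mean_minutes([i.paged_at - i.started_at for i in incidents]),
        "mtta": mean_minutes([i.acknowledged_at - i.paged_at for i in incidents]),
        "mttr": mean_minutes([i.resolved_at - i.paged_at for i in incidents]),
    }


# Two made-up incidents, both starting at the same wall-clock time.
t0 = datetime(2026, 5, 1, 14, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6), t0 + timedelta(minutes=44)),
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=6), t0 + timedelta(minutes=62)),
]
metrics = response_metrics(incidents)
```

Note that both MTTA and MTTR start from `paged_at`, so the time before the page fired shows up only in MTTD.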
Healthy vs Noisy Thresholds
| Metric + condition | Healthy | Median | Noisy |
|---|---|---|---|
| MTTA, business-hours paging | Under 3 min | 5 to 10 min | Above 15 min |
| MTTA, after-hours paging | Under 5 min | 8 to 15 min | Above 25 min |
| MTTA, night paging (22:00 to 06:00) | Under 8 min | 12 to 20 min | Above 30 min |
| MTTR, Sev-1 (customer-impacting) | Under 30 min | 60 to 180 min | Above 4 hours |
| MTTR, Sev-2 (partial impact) | Under 2 hours | 4 to 12 hours | Above 1 day |
| MTTR, Sev-3 (operational) | Under 1 business day | 2 to 5 days | Above 1 week |
The night MTTA degradation relative to business-hours MTTA is the most diagnostic metric for sleep-disruption-driven alert fatigue. A team with healthy business-hours MTTA (under 3 minutes, per the table above) but night MTTA above 20 minutes is signalling that the night rotation is suppressing pager acknowledgment, either intentionally (engineers know that night pages are usually false positives and delay engagement) or unintentionally (sleep disruption is degrading response capability). Either way, the night MTTA gap is a stronger predictor of attrition risk than aggregate MTTA.
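One way to surface the night gap is to bucket pages by local time before averaging. The sketch below is a minimal version under stated assumptions: the night window (22:00 to 06:00) comes from the table, but the business-hours window (weekdays 09:00 to 18:00) and all function names are illustrative choices, not something the sources define.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def paging_window(paged_at):
    """Bucket a page by local time, mirroring the three MTTA rows of the table."""
    if paged_at.hour >= 22 or paged_at.hour < 6:
        return "night"
    # Assumption: business hours are weekdays 09:00-18:00; adjust to your rotation.
    if paged_at.weekday() < 5 and 9 <= paged_at.hour < 18:
        return "business"
    return "after_hours"


def mtta_by_window(pages):
    """pages: iterable of (paged_at, acknowledged_at) datetime pairs in local time."""
    buckets = defaultdict(list)
    for paged_at, acked_at in pages:
        minutes = (acked_at - paged_at).total_seconds() / 60
        buckets[paging_window(paged_at)].append(minutes)
    return {window: mean(values) for window, values in buckets.items()}


def night_mtta_gap(pages):
    """Night MTTA minus business-hours MTTA in minutes, or None if either is missing."""
    by_window = mtta_by_window(pages)
    if "night" not in by_window or "business" not in by_window:
        return None
    return by_window["night"] - by_window["business"]


def page(year, month, day, hour, minute, ack_minutes):
    """Build one made-up (paged_at, acknowledged_at) pair."""
    fired = datetime(year, month, day, hour, minute)
    return fired, fired + timedelta(minutes=ack_minutes)


pages = [
    page(2026, 5, 4, 10, 0, 2),    # Monday, business hours
    page(2026, 5, 4, 15, 0, 4),    # Monday, business hours
    page(2026, 5, 4, 19, 0, 9),    # Monday evening, after hours
    page(2026, 5, 5, 2, 0, 20),    # Tuesday, night
    page(2026, 5, 5, 23, 30, 26),  # Tuesday, night
]
windows = mtta_by_window(pages)
gap = night_mtta_gap(pages)
```

With this sample data the business-hours MTTA is 3 minutes and the night MTTA is 23 minutes, the pattern the paragraph above describes. Timestamps need to be in the on-call engineer's local time for the buckets to mean anything.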
How Alert Fatigue Affects Each Independently
Alert fatigue degrades MTTA through learned delay: engineers under sustained alert noise learn that most pages are false positives, and they delay acknowledgment expecting the alert to self-resolve before they need to engage. This is rational behaviour in the face of noisy alerting; it is also the mechanism by which the real signal gets missed. Each individual delay is small (an engineer waits 30 seconds longer to check the page); aggregated across a noisy quarter, MTTA can degrade from, say, 3 minutes to 12 minutes and stay there even after the underlying noise is reduced. The learned behaviour persists.
Alert fatigue degrades MTTR through cognitive overload rather than delay. An engineer who has been paged 8 times in their primary week, even if most pages auto-resolve, carries accumulated context-switching debt. When a real incident occurs, the engineer's ability to reason cleanly about the system is impaired by the prior context-switches. The Gloria Mark 23-minute refocus penalty is per-interruption; an engineer recovering from many recent interruptions is operating below their baseline capability for system reasoning. Mitigation that would normally take 20 minutes takes 35 minutes; the marginal 15 minutes is not separately visible in the metric but is real in cumulative impact.
Separating the two effects is critical for diagnostic purposes. A team that has good MTTA but bad MTTR likely has a runbook coverage or training problem: the response starts promptly but the mitigation is slow. A team that has bad MTTA but good MTTR likely has an alert fatigue or pager hygiene problem: when the team actually engages, they mitigate well, but they are slow to engage. A team where both metrics are bad has multiple compounding problems and should sequence fixes carefully. The MTTR-only view collapses these distinct failure modes into one number and produces misallocated fix investment.
Action by Metric Quadrant
Good MTTA, good MTTR: maintain. The system is operating well; focus engineering attention on prevention rather than response. Invest in SLO-based alerting (read /slo-vs-threshold) to keep this state robust as the system grows.
Good MTTA, bad MTTR: runbook coverage and training gap. Engineers respond promptly but cannot mitigate quickly. Highest-leverage investments: runbook coverage on top-50 alert classes (read /runbooks-oncall), incident response training, possibly two-tier on-call (read /two-tier-on-call-cost) so deep-knowledge engineers escalate cleanly.
Bad MTTA, good MTTR: alert fatigue and pager hygiene problem. When the team engages they fix things well, but they delay engagement. Highest-leverage investments: alert audit (read /alert-tuning), correlation and dedup (read /correlation-dedup), and reducing the false-positive rate. The metric improves only after the noise reduces; the learned-delay behaviour persists for weeks after volume reduction.
Bad MTTA, bad MTTR: compounding failure. Both the engagement and the mitigation are impaired. Diagnose the dominant cause: if the team is small and noisy, alert hygiene first; if the team is large and uncoordinated, runbook coverage and rotation structure first. Sequence carefully; trying to fix both simultaneously often produces neither result.
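The four cases above reduce to a lookup on two booleans. The sketch below is one possible encoding, not a prescribed rule: the default limits borrow the "healthy" bounds for business-hours MTTA (3 minutes) and Sev-1 MTTR (30 minutes) from the table, and treating everything at or above those bounds as "bad" is an assumption you would tune per severity and paging window.

```python
# Keys are (mtta_is_good, mttr_is_good); values paraphrase the four actions above.
ACTIONS = {
    (True, True): "maintain: invest in prevention and SLO-based alerting",
    (True, False): "runbook coverage and incident response training",
    (False, True): "alert audit, correlation and dedup, false-positive reduction",
    (False, False): "compounding failure: find the dominant cause and sequence fixes",
}


def recommend(mtta_min, mttr_min, mtta_limit=3.0, mttr_limit=30.0):
    """Map a team's MTTA and MTTR (in minutes) to one of the four quadrant actions.

    The limits are illustrative defaults, not universal thresholds.
    """
    good_mtta = mtta_min < mtta_limit
    good_mttr = mttr_min < mttr_limit
    return ACTIONS[(good_mtta, good_mttr)]


prompt_but_slow = recommend(mtta_min=2, mttr_min=90)   # acknowledges fast, mitigates slowly
slow_but_capable = recommend(mtta_min=12, mttr_min=25)  # acknowledges late, mitigates fast
```

Keeping the limits as parameters makes it easy to run the same classification separately for night paging or for Sev-2 incidents, where the table's bounds differ.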
The Trap of MTTR-Only Dashboards
Many incident-management vendor dashboards default to MTTR as the headline metric and bury MTTA in a secondary view. The implicit assumption is that customers care about total incident duration, not about whether the engineer engaged promptly. This is correct from a customer-impact perspective and misleading from an operational-diagnosis perspective. The two metrics serve different decision-making purposes and both belong on the headline dashboard.
Without MTTA on the dashboard, two failure modes that have similar customer impact but different fixes are not distinguishable. Failure mode one: engineers acknowledge promptly but mitigation is slow (the fix is runbook coverage and training). Failure mode two: engineers acknowledge late but mitigation is fast (the fix is alert tuning and pager hygiene). Both can produce MTTR in the 90-minute range; the metric on its own says "we have an MTTR problem" without telling you whether to invest in runbook authoring or in alert tuning. Investment in the wrong fix wastes engineering attention and does not move the metric.
Configure your dashboard to show MTTA, MTTR, and the gap between them on the same view. Trend each over the same time window. When MTTA and MTTR move together, you have aligned signal: improvements are real, regressions are real. When MTTA and MTTR diverge, you have diagnostic signal: the failure mode is shifting and the fix priority should shift accordingly. This is cheap dashboard work that produces consistently better diagnostic decisions over time.
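A minimal sketch of that comparison, assuming you already have MTTA and MTTR per period in minutes: it adds the gap (MTTR minus MTTA, roughly the time from acknowledgment to resolution) and labels a period-over-period change as aligned, diverging, or flat. The function names and the 10% tolerance are hypothetical choices for illustration.

```python
def with_gap(period):
    """Return a copy of the period's metrics with the MTTR-minus-MTTA gap added."""
    return {**period, "gap": period["mttr"] - period["mtta"]}


def direction(old, new, tolerance):
    """+1 if the metric rose by more than the tolerance, -1 if it fell, else 0."""
    if old == 0:
        return 0 if new == 0 else 1
    change = (new - old) / old
    if change > tolerance:
        return 1
    if change < -tolerance:
        return -1
    return 0


def signal(previous, current, tolerance=0.10):
    """Classify how MTTA and MTTR moved between two periods.

    Assumption: a relative change within the tolerance counts as no movement.
    """
    d_mtta = direction(previous["mtta"], current["mtta"], tolerance)
    d_mttr = direction(previous["mttr"], current["mttr"], tolerance)
    if d_mtta == 0 and d_mttr == 0:
        return "flat"
    if d_mtta == d_mttr:
        return "aligned"
    return "diverging"


# MTTA more than doubles while MTTR barely moves: the failure mode is shifting.
last_month = {"mtta": 4, "mttr": 60}
this_month = {"mtta": 9, "mttr": 62}
trend = signal(last_month, this_month)
row = with_gap(this_month)
```

Here `trend` comes out as "diverging", which on the reasoning above points at acknowledgment behaviour rather than mitigation skill as the thing to investigate first.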