pingfatigue.com is an independent, vendor-neutral reference on alert fatigue. Not affiliated with PagerDuty, Atlassian, Splunk, or any other vendor. Tool comparisons may contain affiliate links, clearly labelled.
CANONICAL DEFINITION

What Is Alert Fatigue? A Plain-Language Definition for DevOps, SRE, and SecOps Teams (2026)

Updated April 2026 | Citations: Google SRE Book, DORA 2024, incident.io 2024, Joint Commission NPSG.06.01.01

"Alert fatigue is the desensitisation that occurs when humans are exposed to too many alerts of varying quality, causing them to miss, delay, or ignore important ones."

The Four Types of Alert Fatigue

The term covers four related but distinct phenomena across different industries. This taxonomy is not published elsewhere in a unified form.

| Type | Domain | Primary victim | Primary cost | Structural fix |
|---|---|---|---|---|
| Alert fatigue | DevOps / SRE | On-call engineers | Missed P1s, MTTR, retention loss | SLO-based alerting + correlation |
| Alarm fatigue | Healthcare (ICU) | Nursing, ICU staff | Patient safety, sentinel events | Customised alarm parameters + tiers |
| Notification fatigue | Knowledge work | Office workers, PMs | Productivity, deep work loss | Batching + async norms |
| Security alert fatigue | SOC / SecOps | Analysts, threat hunters | Missed true positives, breach dwell | SOAR automation + risk-based triage |

Root Causes

Alert fatigue in DevOps and SRE environments is rarely a single cause. It is typically the compounding result of several structural deficiencies. Addressing one in isolation has limited impact.

01. Threshold-based alerting without SLOs

Alerts fire when a metric crosses an arbitrary threshold (CPU > 80%, disk > 90%) rather than when users are experiencing degraded service. Most of these alerts self-resolve before any human can act. The Google SRE Book calls this 'alerting on causes rather than symptoms'. Migration to burn-rate alerting on SLOs eliminates this category.

02. Copy-paste monitor rules

Engineers clone alert templates from Stack Overflow, vendor docs, or previous employers. Rules arrive without context: what does 'API latency > 500ms' mean for this specific service? Should a 3am page really be the response? Copy-paste rules are rarely reviewed because no one knows who wrote them.

03. No deduplication or correlation

A single infrastructure failure (a host going down, a network partition) can trigger 50-100 alerts from monitoring, APM, synthetic checks, uptime monitors, and log alerts - all simultaneously. Without correlation, each fires as a separate page. Enabling grouping in PagerDuty or Opsgenie reduces this to one incident.
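The grouping logic itself is simple; a minimal sketch, assuming hypothetical alert dicts rather than a real PagerDuty or Opsgenie payload, is to bucket alerts by a shared key (here the host) within a time window:

```python
# Sketch of time-window alert correlation (hypothetical alert format).
# Alerts arriving within `window` seconds that share a grouping key
# collapse into one incident instead of one page each.
from collections import defaultdict

def correlate(alerts, key="host", window=300):
    """Group alerts by (key value, time bucket); one incident per group."""
    incidents = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["ts"]):
        bucket = a["ts"] // window          # 5-minute buckets by default
        incidents[(a[key], bucket)].append(a)
    return incidents

# One node failure seen by three tools within 30 seconds:
storm = [
    {"ts": 100, "host": "node-7", "source": "datadog"},
    {"ts": 101, "host": "node-7", "source": "uptime"},
    {"ts": 130, "host": "node-7", "source": "synthetic"},
]
print(len(correlate(storm)))  # 1 incident instead of 3 pages
```

Production grouping engines use fuzzier keys (service, alert text similarity, topology), but the payoff is the same: the on-call gets one page with three corroborating signals attached.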

04. Absent severity tiering

When everything is P1, nothing is P1. Teams without a defined severity taxonomy default to paging for everything, burning on-call engineers on low-urgency issues. The fix: define clear P1 (immediate customer impact, page now), P2 (business hours response), P3 (ticket, no page) policies and route accordingly.
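Routing on that taxonomy is a lookup, not a judgment call. A minimal sketch (the policy names are assumptions, and defaulting unclassified alerts to a ticket rather than a page is one deliberate design choice, not the only defensible one):

```python
# Sketch of severity-based routing (policy names are assumptions).
SEVERITY_POLICY = {
    "P1": "page_oncall_now",        # immediate customer impact
    "P2": "notify_business_hours",  # respond during working hours
    "P3": "create_ticket",          # no page at all
}

def route(alert: dict) -> str:
    # Unclassified alerts fall through to a ticket: the chosen failure
    # mode here is under-paging plus a triage review, not a 3am wake-up.
    return SEVERITY_POLICY.get(alert.get("severity"), "create_ticket")

print(route({"severity": "P1"}))  # page_oncall_now
print(route({"severity": "P4"}))  # create_ticket
```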

05. No runbooks

An alert without a runbook is noise. If the on-call engineer cannot complete the response procedure in the time it takes to read the page, they will acknowledge the alert, attempt to diagnose from scratch, and either miss the resolution or create an incident of their own. Every alert that remains active must have a linked runbook.
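That "must have a linked runbook" rule is easy to enforce in CI. A sketch, assuming a made-up rule schema rather than any real tool's alert format:

```python
# Sketch of a CI lint that rejects alert rules without a runbook link.
# The rule schema (name, runbook_url) is an assumption for illustration.
def missing_runbooks(rules):
    """Return names of alert rules with no usable runbook URL."""
    return [r["name"] for r in rules
            if not r.get("runbook_url", "").startswith("https://")]

rules = [
    {"name": "api-latency-p99", "runbook_url": "https://wiki/rb/latency"},
    {"name": "disk-80-percent"},  # no runbook: fails the lint
]
print(missing_runbooks(rules))  # ['disk-80-percent']
```

Wire this into the pipeline that deploys monitor definitions and the runbook gap stops growing: a rule without a documented response path never ships.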

06. No audit cadence

Alert rules accumulate over time. Engineers add monitors when investigating incidents and never remove them. The monitoring tool becomes an archaeological record of every concern anyone ever had. Without a monthly or quarterly audit with a mandate to kill 20% of rules, the list grows indefinitely. Noise compounds.
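The audit can start from data rather than opinion. A sketch (field names are assumptions): flag rules that never fired at all, and rules that fired constantly but almost never led to action. Both are deletion or demotion candidates:

```python
# Sketch of a quarterly alert audit over 90-day firing stats.
# The per-rule fields (fires_90d, actioned_90d) are assumptions.
def audit(rules, min_fires=1, min_action_rate=0.2):
    """Split rules into never-fired and fired-but-never-actioned lists."""
    stale, noisy = [], []
    for r in rules:
        fires = r.get("fires_90d", 0)
        actioned = r.get("actioned_90d", 0)
        if fires < min_fires:
            stale.append(r["name"])          # archaeological record
        elif actioned / fires < min_action_rate:
            noisy.append(r["name"])          # humans ack it and move on
    return stale, noisy

rules = [
    {"name": "old-batch-check", "fires_90d": 0,   "actioned_90d": 0},
    {"name": "cpu-80",          "fires_90d": 120, "actioned_90d": 3},
    {"name": "slo-burn",        "fires_90d": 4,   "actioned_90d": 4},
]
print(audit(rules))  # (['old-batch-check'], ['cpu-80'])
```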

What It Looks Like in Practice

The 3am disk alert

Page fires: disk at 81%. Engineer wakes, SSHes in. Disk at 79% by the time they connect. Self-healed. Engineer goes back to sleep. Repeat four nights later. On the fifth night, they acknowledge without connecting. This time the disk is at 99%; it fills three days later, causing real data loss.

The #incidents nobody reads

Slack channel receives 400 alerts per day from 6 different tools. The engineering team has it muted. The real P1 - a payment processor timing out - sits in the channel for 11 minutes before anyone notices via a customer support ticket.

Seven duplicate pages

A Kubernetes node drains. Datadog fires. PagerDuty fires. The uptime monitor fires. The synthetic check fires. The SLO burn-rate fires. The log error-rate fires. The dashboard alert fires. Seven pages, one incident. Six are suppressed by the on-call without investigation because they assume they are all the same thing - and they are right, this time.

The 'high CPU' ritual

Every Monday morning a batch job runs, CPU spikes to 95%. Alert fires. On-call acknowledges it without investigation because they know from experience it will resolve in 20 minutes. This has happened for 8 months. Nobody has silenced the alert because nobody is sure what would happen if they did.

Why It Matters

Reliability

Fatigued engineers miss P1 alerts. MTTA degrades. MTTR degrades. Error budgets burn faster. Incident severity increases because detection lag amplifies blast radius.

People and retention

62% of on-call engineers report sleep disruption weekly (incident.io 2024). 41% have considered leaving specifically because of alert load. Replacing a senior SRE costs $200K-$300K.

Cost

At 42 pages/week and a $180K fully-loaded cost, direct alert-handling time alone exceeds $61,000/year per engineer. The calculator on the homepage shows your team's number.
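The arithmetic behind that figure, made explicit. The per-page handling time is an assumption on our part (roughly 20 minutes for acknowledgement, triage, and the context switch); the source states only the page volume and salary:

```python
# Worked cost arithmetic for the figure above.
PAGES_PER_WEEK = 42
FULLY_LOADED_SALARY = 180_000
HOURS_PER_YEAR = 2_080        # 52 weeks x 40 hours
MINUTES_PER_PAGE = 20         # assumption: ack + triage + context switch

hourly = FULLY_LOADED_SALARY / HOURS_PER_YEAR            # ~$86.54/hour
hours_on_pages = PAGES_PER_WEEK * 52 * MINUTES_PER_PAGE / 60  # 728 hours
annual_cost = hours_on_pages * hourly
print(f"${annual_cost:,.0f} per engineer per year")      # $63,000
```

Even at this conservative 20 minutes per page, alert handling consumes 728 hours a year, roughly 35% of a working year, before counting sleep disruption or the cost of interrupted deep work.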


Historical Origin

1970s-1990s Healthcare

Medical alarm fatigue is documented as early as the 1970s as electronic patient monitoring becomes standard. By the 1990s hospital ICUs have 30-40 monitoring devices per bed, each with its own alarm thresholds. Studies begin documenting 80-99% false-positive alarm rates in critical care. The Joint Commission identifies alarm safety as a national patient safety priority in 2013 (NPSG.06.01.01).

2014-2016 DevOps/SRE

As modern distributed systems scale, monitoring proliferates. Google's Site Reliability Engineering book (2016) articulates the first widely read alert philosophy for software: alerts must be actionable, urgent, and require human judgment. The DORA State of DevOps reports begin measuring MTTR and deployment frequency, providing the empirical link between on-call discipline and engineering performance.

2017-2020 Observability

Charity Majors and Liz Fong-Jones at Honeycomb publish the canonical essays on observability-driven alerting and SLO-based alerting. The concept of alerting on symptoms (user experience) rather than causes (resource utilisation) enters mainstream SRE practice. The term 'alert fatigue' consolidates around the DevOps meaning.

2021-present SecOps

Security Operations Centres begin documenting alert fatigue as a distinct discipline as SIEM tools generate thousands of daily alerts. Ponemon, Mandiant, and Devo publish SOC-specific research. SOC alert fatigue is recognised as a contributing factor to high-profile breaches where threat indicators were present in the alert queue but not investigated in time.

How to Measure Alert Fatigue on Your Team

1.
Pages per engineer per week

Export from PagerDuty/Opsgenie. Google SRE threshold: 14. incident.io 2024 median: 42. Healthy target: under 5 actionable pages per week.

2.
False-positive ratio

Count alerts resolved without human action in a 90-day period. Industry median: 60-80%. Healthy target: below 20%.

3.
MTTA trend over time

Is your mean-time-to-acknowledge improving or degrading? Degrading MTTA on stable systems signals growing fatigue, not growing incident volume.

4.
Acknowledged-but-not-actioned rate

Alerts that were acknowledged but closed without any documented action. PagerDuty and Opsgenie both report this. High rate = engineers clicking to silence, not to resolve.

5.
Voluntary rotation exits

Engineers who request to be removed from on-call rotations, transfer teams, or leave the company citing on-call load. 41% report attrition intent in incident.io 2024. Track this quarterly.

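Signals 1, 2, and 4 above fall out of one pass over an exported alert log. A sketch, assuming hypothetical record fields (real PagerDuty and Opsgenie exports differ, and per-engineer numbers mean dividing by rotation size):

```python
# Sketch computing three fatigue signals from a 90-day alert export.
# Record fields (acknowledged, resolved_without_action) are assumptions.
def fatigue_signals(alerts, weeks=13):   # 90-day export ~= 13 weeks
    """Return (pages/week, false-positive ratio, ack-but-not-actioned rate)."""
    n = len(alerts)
    pages_per_week = n / weeks
    false_pos = sum(a["resolved_without_action"] for a in alerts) / n
    ack_only = sum(a["acknowledged"] and a["resolved_without_action"]
                   for a in alerts) / n
    return pages_per_week, false_pos, ack_only

log = [
    {"acknowledged": True,  "resolved_without_action": True},
    {"acknowledged": True,  "resolved_without_action": False},
    {"acknowledged": False, "resolved_without_action": True},
    {"acknowledged": True,  "resolved_without_action": True},
]
ppw, fp, ack = fatigue_signals(log, weeks=1)
print(ppw, fp, ack)  # 4.0 0.75 0.5
```

Run it quarterly on the same export you already pull for on-call reports, and compare against the targets above: under 5 actionable pages per week, false positives below 20%.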

FAQ

What is the difference between alert fatigue and alarm fatigue?
Alert fatigue is the DevOps/SRE term for desensitisation to monitoring system pages. Alarm fatigue is the healthcare term for the same phenomenon in ICU environments. Both describe the same cognitive mechanism: humans exposed to too many low-signal interruptions start ignoring even important ones.
What causes alert fatigue in DevOps?
The root causes are: threshold-based alerting without SLOs (alerts fire on cause, not user impact), copy-paste monitor rules from templates, no deduplication so one failure fires 50 alerts, absent severity tiering (everything is P1), no runbooks (no documented response path), and no audit cadence so rules accumulate indefinitely.
How do you measure alert fatigue?
Five measurable signals: (1) pages per engineer per week compared to the Google SRE threshold of 14, (2) false-positive ratio (alerts requiring no action), (3) MTTA trend over time, (4) acknowledged-but-not-actioned rate, and (5) voluntary on-call rotation exits in a 12-month period.
Is alert fatigue in cybersecurity the same as DevOps alert fatigue?
SOC alert fatigue is a parallel problem. Security Operations Centre analysts process hundreds to thousands of SIEM alerts daily. Ponemon research found 55% of SOC alerts are false positives, and analysts increasingly ignore low-severity alerts after being desensitised. The Equifax breach dwell time (78 days) is a famous consequence. The interventions differ slightly - SOAR (automation) is the primary SOC remedy, whereas SLO-based alerting is the primary SRE remedy.
When did alert fatigue become a recognised problem in DevOps?
Google's SRE Book (2016) introduced the first widely-read framework for alert philosophy in software operations. Honeycomb's observability-driven alerting essays (2017 onwards, Charity Majors and Liz Fong-Jones) popularised SLO-based alerting as the structural fix. The DORA State of DevOps reports (annually from 2014) provided the empirical backbone linking deployment practices to MTTR and alerting discipline.