DOCUMENTED INCIDENTS

When Alert Fatigue Caused Real Incidents: Case Studies Across Ops, Security, and Healthcare

Documented cases where alert overload contributed to missed detections, extended outages, and preventable harm. Each case either cites public-record sources or is an anonymised composite of recurring patterns. Updated April 2026.

HISTORICAL ANCHOR

The Therac-25 (1985-1987): When Alarm Suppression Kills

Healthcare / embedded systems | Multi-fatality sentinel event | Public record

The Therac-25 radiation therapy machine contained software defects that, in certain configurations, delivered massive radiation overdoses. Crucially, the machine displayed an error message ("Malfunction 54") that operators had been taught to dismiss as a benign communication error and reset. The alarm was non-actionable by design in most contexts, and operators had been conditioned to ignore it.

When the same alarm started appearing in a genuinely dangerous context (a race condition that caused overdoses), operators continued to dismiss it and restart the machine, exactly as they had been trained. Six patients received fatal or near-fatal radiation overdoses between 1985 and 1987.

DEVOPS PARALLEL

Therac-25 is the extreme case of alert normalisation: an alarm that was almost always non-actionable trained operators to dismiss it entirely. DevOps teams do this continuously at smaller scale: the disk-at-81% alert, the high-CPU batch job, the synthetic check flap. When any of these finally carries a genuine signal, the trained response is dismissal.

Leveson & Turner (1993) - An Investigation of the Therac-25 Accidents -->
SECURITY BREACH

Target Corporation Data Breach (2013): Alerts Present, Action Absent

Security / retail | 40 million card numbers stolen | Public record

In November 2013, Target suffered one of the largest retail data breaches in history. Attackers gained access via a third-party HVAC vendor, installed malware on Target's point-of-sale terminals, and exfiltrated 40 million credit card records over several weeks.

A Bloomberg Businessweek investigation (2014) revealed that Target's $1.6 million FireEye malware detection system had alerted the security team to suspicious activity. Bangalore-based analysts saw the alerts and escalated them to Target's Minneapolis security team, which did not act on them. Target had enabled automatic quarantine during testing but disabled the automatic response before the breach, preferring manual review. The manual queue was too noisy.

ALERT FATIGUE PATTERN
  • Alerts were present and timely. The detection system worked.
  • The review queue was too noisy to prioritise critical alerts correctly.
  • Automatic remediation had been disabled to reduce false-positive-driven disruptions.
  • The real-time alert went unactioned for days.
Cost: $162M in net breach-related expenses reported by Target, plus an $18.5M multistate attorneys-general settlement and reputational damage. Total estimated at $300M+.
HEALTHCARE

ICU Alarm Fatigue Sentinel Events: The Joint Commission Pattern

Healthcare / ICU | Preventable patient harm | Joint Commission SEA 50

The Joint Commission's Sentinel Event Alert 50 (2013) documents a general pattern from reported sentinel events (unexpected patient deaths or serious harm) related to alarm fatigue. To protect privacy, the document does not attribute cases to individual patients or hospitals, but the pattern it describes is consistent: a monitor alarm fires; the nurse either does not hear it (habituation, or the alarm has been silenced), cannot reach the patient in time, or dismisses it as another false positive; the patient deteriorates.

The Joint Commission received reports of 98 sentinel events between January 2009 and June 2012 in which alarm fatigue was cited as a contributing factor; 80 resulted in patient death. The report notes this is likely a significant undercount, because reporting is voluntary and sentinel events are often attributed to primary clinical causes without the monitoring-system failure being documented.

DEVOPS PARALLEL

The consequence in healthcare is patient death. The consequence in DevOps is extended outage, data loss, or breach. Both are caused by the same mechanism: a genuine signal in a noisy alert queue is missed because the operator has been conditioned not to trust the system. The severity differs; the root cause does not.

DEVOPS OUTAGES

Major Outage Amplification: Alert Noise and Detection Lag

DevOps / infrastructure | Multiple documented cases | Public post-mortems

Public post-mortems from major technology companies consistently reference alert noise as a contributing factor to delayed detection and extended outage duration. While companies rarely state "we missed the alert because we had alert fatigue", the patterns are consistent:

Cloudflare July 2019 outage (WAF deployment)

Caused by a catastrophically backtracking regular expression in a newly deployed WAF rule, which exhausted CPU across Cloudflare's edge. The incident propagated globally and affected all traffic. Detection was rapid (on-call monitoring), but the post-mortem notes that the initial response was complicated by the volume of simultaneous alerts from globally distributed systems all firing at once.

Source: Cloudflare's public post-mortem.
Atlassian April 2022 outage (up to 14 days for affected customers)

One of the longest public outages from a major SaaS provider. Atlassian's post-incident review documents that a maintenance script was run against the wrong site IDs, deleting the sites of several hundred customers. The alert system generated large numbers of signals from the cascading failures. The review is notable for its transparency; alert volume is not directly cited, but the delay in detecting the full scope of the damage suggests signal-to-noise issues during initial detection.

Source: Atlassian's public post-incident review.
General pattern: the noisy P1

Multiple anonymised post-mortems from the incident.io, FireHydrant, and PagerDuty case study libraries document the same pattern: a genuine P1 event fires in an alert channel alongside 40-100 lower-priority alerts from correlated but non-critical subsystems. MTTA (mean time to acknowledge) on the genuine P1 is 2-5x higher than baseline because the responder must first triage through the noise.
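
The inflation is easy to model. The sketch below is a back-of-envelope estimate in Python, not data from the cited libraries; the baseline acknowledgement time and per-alert triage cost are assumed values chosen only to show how quickly scanning time dominates.

```python
def mtta_with_noise(baseline_s: float, noise_alerts: int,
                    triage_s_per_alert: float) -> float:
    """Crude model: the responder scans every co-firing alert before acting on the P1."""
    return baseline_s + noise_alerts * triage_s_per_alert

# Assumed values: 2-minute baseline MTTA, ~3 s to read and dismiss each noise alert.
for noise in (0, 40, 100):
    est = mtta_with_noise(baseline_s=120, noise_alerts=noise, triage_s_per_alert=3)
    print(f"{noise:>3} noise alerts -> estimated MTTA {est:.0f}s ({est / 120:.1f}x baseline)")
```

With these assumptions, 40 co-firing alerts double MTTA and 100 push it to 3.5x -- squarely inside the 2-5x range the case studies report.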

SUCCESS PATTERN

Composite Success Pattern: 60-90% Noise Reduction in 6 Months

Anonymised composite | Mid-market SaaS (500-2000 employee range) | Vendor case study sources

The following is a composite of recurring patterns from published vendor case studies (BigPanda, PagerDuty, incident.io, FireHydrant). Individual company names are not cited to avoid attributing specific numbers to specific companies without permission.

  • Starting alert volume: 800-1,200/day
  • After deduplication + correlation: 120-200/day
  • After SLO migration (3 services): 40-80/day
  • After quarterly audit (6 months): 20-40/day
  • Total noise reduction: 75-95%
  • MTTA improvement: 40-60% reduction

The critical insight: deduplication and correlation alone (the first two steps above) deliver the majority of the reduction with no changes to alerting rules. SLO migration delivers a structural reduction in false positives. The quarterly audit sustains the improvement by preventing rule accumulation from reversing the gains.
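
To make the first two steps concrete, here is a minimal sketch of time-window deduplication and correlation in Python. The Alert fields, window sizes, and group-by-arrival-time heuristic are illustrative assumptions, not any vendor's schema or algorithm.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    service: str
    check: str          # e.g. "disk_usage", "http_5xx"
    fired_at: datetime

class Deduplicator:
    """Drop repeats of the same (service, check) pair inside a window."""
    def __init__(self, window: timedelta = timedelta(minutes=30)):
        self.window = window
        self._last_seen: dict[tuple[str, str], datetime] = {}

    def accept(self, alert: Alert) -> bool:
        key = (alert.service, alert.check)
        last = self._last_seen.get(key)
        self._last_seen[key] = alert.fired_at
        # Pass the alert through only if no identical one fired recently.
        return last is None or alert.fired_at - last > self.window

class Correlator:
    """Group surviving alerts that fire close together into one incident."""
    def __init__(self, window: timedelta = timedelta(minutes=5)):
        self.window = window
        self.incidents: list[list[Alert]] = []

    def add(self, alert: Alert) -> None:
        if self.incidents and alert.fired_at - self.incidents[-1][-1].fired_at <= self.window:
            self.incidents[-1].append(alert)   # same burst: joins the open incident
        else:
            self.incidents.append([alert])     # quiet gap: opens a new incident
```

Paging one notification per incident instead of one per alert is what turns a 50-alert cascade into a single page; neither class required touching the underlying alerting rules.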

What the Cases Have in Common

1. The detection system worked. In every case, a real signal was present in the alert queue. The failure was not in detection -- it was in the operator's ability to prioritise a true positive among false positives.

2. Volume was the mechanism. Whether ICU alarms, SIEM events, or monitoring pages, the sheer quantity of alerts overwhelmed the human's ability to triage accurately.

3. The consequences scale with the context. In healthcare: patient death. In security: breach with financial and reputational impact. In DevOps: extended MTTR and amplified incident cost. The mechanism is identical; the stakes differ by domain.

4. Structural fixes outperform training. Every case where teams attempted to fix alert fatigue through operator training alone (JCAHO compliance training, SOC analyst refreshers, DevOps on-call preparation) showed limited improvement. Structural fixes (alarm customisation, deduplication, SLO-based alerting) showed sustained improvement; a minimal SLO burn-rate sketch follows this list.

5. No team self-diagnosed. In every documented case, the alert fatigue problem was identified retrospectively -- after the missed detection caused consequences. No team had a proactive alert-noise measurement programme that caught the problem before it caused harm.
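
On the structural-fix point, the following is a minimal sketch of the multiwindow error-budget burn-rate check popularised by the Google SRE Workbook. The 99.9% SLO target, the window pairing, and the 14.4 threshold follow the workbook's worked example; the function names and event counts are our own illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_ratio = bad_events / total_events
    return error_ratio / (1.0 - slo_target)

def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    # Both windows must breach: the long window (e.g. 1 h) proves the burn is
    # significant, the short window (e.g. 5 min) proves it is still happening.
    return short_window_rate >= threshold and long_window_rate >= threshold

# Example: 2% of requests failing against a 99.9% SLO burns budget at 20x.
rate = burn_rate(bad_events=20, total_events=1000)                  # -> 20.0
print(should_page(short_window_rate=rate, long_window_rate=rate))   # -> True
```

An alert defined this way cannot fire on a disk-at-81%-style static threshold; it pages only when users are experiencing errors fast enough to threaten the SLO, which is precisely the structural reduction in false positives described above.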

FAQ

What is the Target breach alert fatigue case study?
In November 2013, Target Corporation suffered a data breach in which 40 million credit card numbers were stolen. Subsequent investigation revealed that Target's security team had been alerted by their FireEye malware detection system to suspicious activity, but the alerts were not acted upon in time. The security analysts had been experiencing high alert volumes, and the critical alerts sat among many others in a noisy queue. This is one of the most frequently cited examples of SOC alert fatigue having real-world consequences.
What are examples of alert fatigue causing outages?
Several well-documented public post-mortems reference alert noise contributing to delayed detection. The Cloudflare 2019 outage (caused by a WAF deployment) unfolded in a high-alert-volume environment that complicated the initial response, and multiple hyperscaler post-mortems note that alert-channel noise delayed acknowledgement by 5-15 minutes on P1 incidents. These delays compound: at $5,000/minute in revenue impact, a 15-minute delay from alert noise costs $75,000 per incident.
How much noise reduction do real teams achieve?
Vendor case studies document 60-90% alert volume reduction after enabling correlation and deduplication. A BigPanda case study documents an unnamed Fortune 500 company reducing from 22,000 alerts per day to 1,100 after deploying their AIOps platform. PagerDuty Event Orchestration case studies cite 80%+ reduction in pages after enabling grouping for similarly scoped incidents. These figures are vendor-published and should be treated with appropriate scepticism; the methodology is not always disclosed.