POSTMORTEM ANALYSIS

Alert Fatigue as Root Cause: 5 Public Postmortems

Updated June 2026. Sources: published company postmortems (GitLab, Cloudflare engineering blogs), SEC and FINRA filings on financial-sector incidents, conference talks at SREcon and DevOps Enterprise Summit, Google SRE Book Chapter 15. All claims about specific incidents are sourced to the public postmortem; we do not invent incident detail.

Why Postmortems Rarely Name Alert Fatigue Directly

Public postmortems usually identify the proximate technical cause of an incident: a deployment, a configuration change, a hardware failure, a regex update, a regulatory-triggered logic path. Alert fatigue is rarely the named root cause because the immediate cause of an incident is rarely a missed alert; it is whatever change introduced the failure path. What alert fatigue typically does is delay recognition, lengthen mean time to acknowledge (MTTA), contribute to misdirected investigation, or in some cases cause an operator to ignore the alert that surfaced the actual problem. These effects show up in postmortems as contributing factors rather than root causes.

The blameless postmortem culture (Google SRE Book Chapter 15) encourages explicit naming of contributing factors that involve alert handling, operator workload, or monitoring gaps. The strongest engineering organisations do this transparently in public postmortems, treating the disclosure as part of the operational learning that the broader community benefits from. The five cases below are drawn from organisations that have publicly described how alerting, operational signal, and operator response shaped their incident outcomes.

A caveat on this kind of analysis: incident postmortems are imperfect records, written shortly after the event by people who experienced it, often with incomplete information. Reading alert fatigue into a postmortem is interpretive work and can be over-claimed. The cases below are written conservatively: where the public postmortem explicitly mentions alert handling or operator signal, we cite the source; where we are interpreting, we say so.

1. Knight Capital, August 2012

The Knight Capital trading-loss incident of 1 August 2012 is one of the most-cited operational failures in financial-services history. A code deployment to the firm's automated trading system enabled an old, dormant code path on a subset of servers. Over 45 minutes, the system executed millions of unintended trades, producing losses of approximately $460 million and effectively ending the firm as an independent entity. The SEC subsequently fined the firm and the incident triggered industry-wide review of pre-trade controls.

The alert-handling angle: the SEC's administrative proceedings file (Release No. 70694, October 2013) notes that, before the market opened on 1 August, an internal Knight system generated 97 automated email messages that referenced the order router involved and an error condition. According to the order, these messages were not designed as system alerts, and staff generally did not review them when they arrived. Once trading began, it took roughly 45 minutes to get from recognising that something was wrong to halting the unintended orders, by which point major loss had been incurred. The public SEC findings do not name alert fatigue specifically, and reading it in is our interpretation. The pattern (a relevant signal present in a stream that staff did not treat as actionable, with the underlying condition not understood until material loss had occurred) is consistent with operator difficulty distinguishing an unusual condition from routine notification noise.

The lesson the engineering community has taken from Knight is that operational alerts are not self-actuating: alerts that fire and are acknowledged are not equivalent to alerts that produce understood operational action. The Knight Capital case is referenced in many subsequent SRE talks as evidence that alert hygiene matters in part because high-stakes operational alerts can be lost in routine noise.

2. GitLab, 31 January 2017

GitLab's public postmortem of the 31 January 2017 database loss incident is one of the most candid public engineering postmortems ever published. During incident response triggered by a database replication problem, an on-call engineer ran a database command on the wrong host, destroying production data. GitLab restored most of the data from backups, live-streamed part of the recovery, and published a detailed post-incident write-up that became a reference for blameless postmortem culture.

The alert-handling angle: the GitLab postmortem notes that multiple monitoring tools were active during the incident response window, and that the cognitive overload contributed to the operator's confusion about which host they were operating on. The postmortem explicitly discusses pager noise and operational signal as contributing factors. GitLab's subsequent remediation included clearer terminal-session indicators of which host an engineer was operating on, runbook improvements, and changes to the alerting practice to reduce signal noise during incident response.

The GitLab case is widely cited in the SRE community precisely because the postmortem named the contributing factors honestly. Most organisations would have stopped at "engineer ran command on wrong host" as the root cause and treated the rest as background. GitLab's framing of the cognitive overload and alerting context as contributing factors is the model for how to surface alert-fatigue contributions in your own postmortems.

3. Cloudflare, 2 July 2019

Cloudflare's outage on 2 July 2019 was caused by a regular-expression update that consumed excessive CPU on the firm's edge servers, producing a global 27-minute outage of CDN and protection services. Cloudflare published a detailed engineering blog post describing the technical cause, the response, and the changes made to prevent recurrence.

The alert-handling angle: the Cloudflare engineering write-up included reflection on the operational signal during the incident and on the alert design that the team would adopt going forward. While the incident itself was not caused by alert fatigue, the post-incident analysis prompted broader thinking about how alert design and operator signal interact, and Cloudflare's subsequent talks at engineering conferences (notably at SREcon) elaborated on the alerting hygiene work that followed.

The Cloudflare case is worth including in this list less because alert fatigue was a primary contributor and more because the post-incident reflection elevated alerting hygiene as an explicit operational concern within a company whose product depends on operational availability. The pattern of treating incident response as an opportunity to revisit alerting design is the practice that mature engineering organisations institutionalise.

4. British Airways IT Outage, 27 May 2017

British Airways' IT outage on 27 May 2017 grounded flights worldwide for several days and was traced to a power supply failure at a data centre followed by uncontrolled restoration that produced cascading failures. The incident was investigated by both internal and external parties, with limited specific public engineering detail compared to the GitLab or Cloudflare cases.

The alert-handling angle: public reporting and subsequent academic analysis discussed the broader challenge of operating critical IT under high alert volume in regulated industries. Specific commentary on alert fatigue at British Airways is not part of the public record we can cite directly, so this case is included here as a pointer to a broader pattern (large-scale infrastructure incidents in regulated industries often involve compromised alert hygiene as a contributing factor) rather than as a specific case study of named alert-fatigue contribution.

The honest framing is that British Airways is suggestive rather than evidentiary: the pattern of cascading failure under high operational load is consistent with alert fatigue contributing, but the public record does not let us claim more. Treat it as a case study in regulated-industry incident scope rather than as a named alert-fatigue postmortem.

5. Datadog Outage, March 2023

Datadog's multi-day outage in March 2023 affected the firm's monitoring platform globally and was particularly notable because it impaired the operational visibility of many of Datadog's customers during the window. Datadog published a public postmortem describing the technical cause (an automatic operating-system update applied across a large part of the fleet, which restarted a network service and removed routes that the container networking layer depended on) and the incident response.

The alert-handling angle: the incident raised broader industry conversations about monitoring-of-monitoring (how do you know your monitoring is healthy when monitoring is down) and about the alert hygiene implications when a major monitoring platform is itself unavailable. Datadog customers reporting on their own response described the experience of operating without their primary monitoring tool, which surfaced previously-tolerated alert noise (alerts they had relied on Datadog to triage suddenly became visible to operators in raw form) and produced significant short-term alert volume spikes for affected customers.

The lesson the broader community took from the Datadog 2023 outage is that alert hygiene needs to be robust to monitoring infrastructure disruption. Teams that rely heavily on a single monitoring tool to filter and triage alerts can find themselves overwhelmed when that tool is unavailable, even if their underlying systems are healthy. The case argues for layered alert hygiene that is not fully dependent on the primary monitoring tool's correlation infrastructure.
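
One common way to approach monitoring-of-monitoring is a heartbeat check (sometimes called a dead man's switch) that runs outside the primary monitoring platform. The sketch below is illustrative only: the class name HeartbeatWatchdog, the source names, and the 120-second silence limit are assumptions, not details from the Datadog postmortem or any vendor API.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat per source and reports sources that went quiet.

    Hypothetical helper: it is meant to run somewhere that does not depend
    on the monitoring platform it is watching.
    """

    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen = {}

    def record(self, source, now=None):
        # Store the time of the most recent heartbeat from this source.
        self.last_seen[source] = time.time() if now is None else now

    def stale_sources(self, now=None):
        # A source is stale when its last heartbeat is older than the limit.
        now = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
        )


# Explicit timestamps (seconds) keep the example deterministic.
watchdog = HeartbeatWatchdog(max_silence_seconds=120)
watchdog.record("primary-monitoring", now=1000)
watchdog.record("log-pipeline", now=1100)
stale = watchdog.stale_sources(now=1150)
print(stale)  # ['primary-monitoring']
```

The inversion is the point of the design: instead of waiting for the monitoring platform to report a problem, the watchdog treats silence from it as the problem.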

How to Surface Alert-Fatigue Contributions in Your Own Postmortems

The most actionable use of these public cases is to inform how your own organisation writes postmortems. Three structural prompts in the postmortem template surface alert-fatigue contributing factors without forcing the framing.

Prompt one: how many pages did the responding on-call engineer receive in the 24 hours before this incident, and how many in the four hours after the page fired for this incident? Capturing this data factually surfaces cognitive-load context without framing it as blame. If the engineer had received 25 pages in the prior 24 hours, the postmortem reader can draw their own conclusions about cognitive overload.
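
The counts for prompt one can usually be pulled from a paging tool's export. A minimal sketch, assuming the export has already been reduced to a list of page timestamps for the responding engineer; the function name page_load_context and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def page_load_context(page_times, incident_page_time,
                      before=timedelta(hours=24), after=timedelta(hours=4)):
    """Count pages in the windows before and after the incident page.

    The incident page itself is excluded from both counts.
    """
    pages_before = sum(
        1 for t in page_times
        if incident_page_time - before <= t < incident_page_time
    )
    pages_after = sum(
        1 for t in page_times
        if incident_page_time < t <= incident_page_time + after
    )
    return {"pages_before": pages_before, "pages_after": pages_after}


incident = datetime(2024, 3, 5, 14, 0)
pages = [
    datetime(2024, 3, 4, 13, 0),   # 25 hours before: outside the window
    datetime(2024, 3, 4, 15, 30),
    datetime(2024, 3, 5, 2, 10),
    datetime(2024, 3, 5, 13, 45),
    incident,                      # the incident page itself
    datetime(2024, 3, 5, 14, 20),
    datetime(2024, 3, 5, 19, 0),   # 5 hours after: outside the window
]
context = page_load_context(pages, incident)
print(context)  # {'pages_before': 3, 'pages_after': 1}
```

Recording the two numbers as plain fields in the postmortem keeps the observation factual and leaves the interpretation to the reader.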

Prompt two: at what point in the incident timeline did the operator first see the relevant signal in the noise? If the signal that pointed at the actual incident was present in alert form 30 minutes before the operator engaged with it, the postmortem captures that delay factually. The delay may have been because of alert noise, because of misclassification, or because of the operator's attention being elsewhere; the structural fix differs depending on the cause, and the prompt forces the question.
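
The delay in prompt two can be measured from the incident timeline. The sketch below assumes the timeline is available as (timestamp, kind, alert name) tuples, where kind is either "fired" or "engaged"; that shape, the function name signal_delay, and the alert names are illustrative assumptions.

```python
from datetime import datetime


def signal_delay(events, relevant_alert):
    """Measure the gap between the relevant alert first firing and first engagement.

    Also counts other alerts that fired inside that gap, as a rough
    indicator of how much noise surrounded the signal.
    Returns None if the alert never fired or was never engaged with.
    """
    fired = [t for t, kind, name in events
             if kind == "fired" and name == relevant_alert]
    if not fired:
        return None
    first_fired = min(fired)
    engaged = [t for t, kind, name in events
               if kind == "engaged" and name == relevant_alert and t >= first_fired]
    if not engaged:
        return None
    first_engaged = min(engaged)
    competing = sum(
        1 for t, kind, name in events
        if kind == "fired" and name != relevant_alert
        and first_fired <= t <= first_engaged
    )
    delay_minutes = (first_engaged - first_fired).total_seconds() / 60
    return {"delay_minutes": delay_minutes, "competing_alerts": competing}


day = (2024, 3, 5)
timeline = [
    (datetime(*day, 10, 0), "fired", "disk-latency"),
    (datetime(*day, 10, 2), "fired", "cpu-high"),
    (datetime(*day, 10, 5), "fired", "replication-lag"),    # the relevant signal
    (datetime(*day, 10, 10), "fired", "cpu-high"),
    (datetime(*day, 10, 20), "fired", "cert-expiry"),
    (datetime(*day, 10, 35), "engaged", "replication-lag"),
    (datetime(*day, 10, 40), "fired", "cpu-high"),
]
result = signal_delay(timeline, "replication-lag")
print(result)  # {'delay_minutes': 30.0, 'competing_alerts': 2}
```

The competing-alert count does not explain the delay on its own; it only tells the team whether noise is a plausible cause worth discussing alongside misclassification and divided attention.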

Prompt three: would different alert tuning have surfaced this incident earlier or more clearly? This is the speculative prompt; it invites the team to consider alerting changes as remediation action items rather than only focusing on the proximate technical fix. Read /blameless-postmortem-template for the full template that integrates these prompts with the broader Google SRE Chapter 15 structure.
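
One way to make prompt three less speculative is to replay the incident's metric samples against the current alert rule and a candidate rule, then compare when each would have fired. This is a rough sketch under stated assumptions: the error-rate samples, the thresholds, and the function name first_breach are invented for illustration.

```python
def first_breach(samples, threshold, sustained=1):
    """Return the timestamp at which an alert rule would first have fired.

    The rule fires once the value has exceeded `threshold` for `sustained`
    consecutive samples. Returns None if it would never have fired.
    """
    run = 0
    for timestamp, value in samples:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return timestamp
    return None


# (minutes since the triggering change, error rate in percent); hypothetical data
samples = [(0, 0.4), (5, 0.6), (10, 1.3), (15, 2.4), (20, 2.9), (25, 3.8), (30, 6.2)]

current_rule = first_breach(samples, threshold=5.0, sustained=1)
candidate_rule = first_breach(samples, threshold=2.0, sustained=2)
print(current_rule, candidate_rule)  # 30 20
```

A candidate rule that fires earlier on the incident data is only half the answer. The same rule should also be replayed against ordinary, quiet periods to check how often it would have paged for nothing.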

Frequently Asked Questions

Is alert fatigue usually the named root cause in postmortems?

Rarely the primary named cause; more commonly a contributing factor. Public postmortems usually identify the proximate technical cause (a deployment, a configuration change, a hardware failure) and list contributing factors that include alert handling, on-call response, or monitoring gaps. The blameless postmortem culture (Google SRE Book Chapter 15) encourages naming these contributing factors honestly. Of the cases cited here, GitLab's postmortem does so most directly; for the others, the alert-handling reading is partly or wholly our interpretation.

Knight Capital 2012: what happened with alerting?

The 45-minute trading-loss incident in August 2012 was preceded by automated messages that pointed at the problem but were not acted on. SEC findings on the incident, while not naming alert fatigue specifically, noted that an internal system emailed error messages to staff before the market opened, that these messages were not designed as system alerts and were generally not reviewed, and that the underlying condition was not understood until after major financial loss had occurred. The pattern is consistent with alert fatigue contributing to delayed recognition of the unusual condition.

GitLab January 2017: what happened?

The publicly documented GitLab database loss incident of January 2017 involved an engineer running database commands on the wrong host during an incident response. GitLab's own postmortem, posted publicly, noted that multiple monitoring tools were paging different things during the incident and that pager noise contributed to the cognitive overload that led to the wrong-host command. The candour of the GitLab write-up is the standard for blameless postmortem disclosure.

Cloudflare July 2019: what was the alerting commentary?

Cloudflare's postmortem on the 2 July 2019 global outage (caused by a regular-expression deployment) included reflection on alert volume and operational signal during the incident. The Cloudflare engineering blog post and subsequent talks discussed how the team thought about alert design after the incident, including the gap between alerts firing and operational action being taken. The incident itself was not caused by alert fatigue, but the response phase prompted broader thinking about alerting hygiene.

Are there clinical-side parallels?

Yes. The Joint Commission has linked clinical alarm fatigue to sentinel events (preventable patient deaths) in multiple Sentinel Event Alert publications. The ECRI Top 10 Health Technology Hazards has named alarm fatigue or alarm hazards as a top-three concern in multiple years. The clinical literature is more mature in tracing alarm fatigue to outcomes than the DevOps literature, partly because patient outcomes are more visible than operational outcomes. Read /joint-commission-npsg-06-01-01 for the regulatory response.

How should you write your own postmortems to surface alert-fatigue contributing factors?

Include three structural prompts in the template. First: how many pages did the on-call engineer receive in the 24 hours before this incident, and how many in the four hours after the page fired? Second: at what point in the incident did the operator first see the relevant signal in the noise? Third: would different alert tuning have surfaced this incident earlier? These three prompts make alert-fatigue contributing factors visible without forcing the framing. Read /blameless-postmortem-template for the full template.

What is the difference between blame and root cause analysis?

Blame attributes the incident to an individual's decision or error. Root cause analysis attributes the incident to system conditions that made the failure path possible. Google SRE Book Chapter 15 is the canonical reference for blameless postmortems: the explicit assumption is that engineers act in good faith with the information they have, and any incident is evidence that the system enabled the failure. Naming alert fatigue as a contributing factor is a system-condition observation, not a blame statement; it identifies how the system enabled the operator to miss or delay action.