pingfatigue.com is an independent, vendor-neutral reference on alert fatigue. Not affiliated with PagerDuty, Atlassian, Splunk, or any other vendor. Tool comparisons may contain affiliate links, clearly labelled.
REMEDIATION PLAYBOOK

How to Reduce Alert Fatigue: A Practical Tuning Playbook (2026)

14 steps with worked examples and an estimated noise reduction for each. Updated April 2026.

The Three Principles

From the Google SRE Book (Beyer et al., 2016). Every alert rule in your system must satisfy all three:

01. Every alert must require human action

If an alert can be handled by automation or requires no action, it should not page a human.

02. Every alert must have a runbook

If the response is not documented, the alert is not operational-ready. No runbook = no alert.

03. Every alert must come from a symptom, not a cause

Alert on user-visible impact (error rate, latency, availability), not on resource metrics (CPU, memory).

The 14-Step Tuning Checklist

01. Export every alert rule and its 90-day fire history

Pull the list from PagerDuty/Opsgenie/Datadog. Most teams discover 2-3x more rules than they expected. This is your baseline.
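As a sketch of the baseline step, assuming the fire history has been exported to CSV with hypothetical columns `rule`, `fired_at`, and `actioned` (the column names are illustrative, not any vendor's export schema):

```python
import csv
from collections import Counter

def baseline(csv_path):
    """Build the tuning baseline: 90-day fire count and action count per rule."""
    fires = Counter()
    actions = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            fires[row["rule"]] += 1
            if row["actioned"] == "yes":
                actions[row["rule"]] += 1
    # Most-fired rules first: this ranked list is what the rest of
    # the playbook works down.
    return sorted(fires.items(), key=lambda kv: -kv[1]), actions
```

The ranked list is the artifact to keep: every later step (kill, dedupe, demote) starts from it.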

02. Kill anything that fired >20 times with zero action in 90 days (est. noise reduction: 15%)

If an alert has fired 20+ times and nobody has ever taken action, it is noise by definition. Delete it or disable it today. Do not archive it; delete it.
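The filter for this step is one line, assuming per-rule fire and action counts over the trailing 90 days (field names are illustrative):

```python
def zero_action_noise(rules, min_fires=20):
    """Rules that fired at least `min_fires` times with no recorded human
    action: noise by definition, candidates for immediate deletion."""
    return [r["name"] for r in rules
            if r["fires"] >= min_fires and r["actions"] == 0]
```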

03. Add burn-rate alerts for your top 3 SLOs (est. noise reduction: 20%)

Replace the noisiest cause-based alerts (CPU, memory, disk) with SLO burn-rate alerts, using the multi-window, multi-burn-rate approach from Google SRE Workbook Chapter 5.
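The burn-rate arithmetic can be sketched as follows. The factor 14.4 and the 1h/5m window pair are the SRE Workbook's example for paging on a 30-day, 99.9% SLO (burning 2% of the monthly budget in one hour: 0.02 × 30 × 24 = 14.4); your SLOs and windows may differ.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    return error_ratio / (1 - slo_target)

def should_page(err_long, err_short, slo_target=0.999, factor=14.4):
    """Multi-window rule: both the long (e.g. 1h) and short (e.g. 5m)
    windows must burn fast, so the page fires on sustained user impact
    and stops firing quickly once the system recovers."""
    return (burn_rate(err_long, slo_target) >= factor and
            burn_rate(err_short, slo_target) >= factor)
```

The short window is what kills the noise: a transient blip fails the 5m condition, so it never pages.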

04. Deduplicate identical alerts from redundant tools (est. noise reduction: 10%)

Count how many monitoring tools fire for the same service. Consolidate to one authoritative source per service type, or enable correlation grouping.
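A minimal fingerprint-based dedupe, assuming each alert carries a service, check, and environment (illustrative field names):

```python
def dedupe(alerts):
    """Keep one alert per (service, check, environment) fingerprint,
    preferring the first seen; later matches are duplicates fired by
    redundant monitoring tools."""
    seen = {}
    for a in alerts:
        key = (a["service"], a["check"], a["env"])
        seen.setdefault(key, a)
    return list(seen.values())
```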

05. Enable correlation and grouping in your pager tool (est. noise reduction: 15%)

PagerDuty Event Orchestration, Opsgenie Alert Policies, or any equivalent. Group by service + environment + time window. Grouping reduces ticket count by 60-90%.
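The service + environment + time-window grouping logic can be sketched independently of any vendor (field names are illustrative; `ts` is a Unix timestamp):

```python
def group_alerts(alerts, window_s=300):
    """Merge alerts that share service+environment and arrive within
    `window_s` seconds of the previous alert in the group. Each group
    becomes one incident instead of N pages."""
    groups = []
    for a in sorted(alerts, key=lambda a: (a["service"], a["env"], a["ts"])):
        key = (a["service"], a["env"])
        if (groups and groups[-1]["key"] == key
                and a["ts"] - groups[-1]["alerts"][-1]["ts"] <= window_s):
            groups[-1]["alerts"].append(a)
        else:
            groups.append({"key": key, "alerts": [a]})
    return groups
```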

06. Tier severity: P1 (page now), P2 (ticket), P3 (business hours) (est. noise reduction: 10%)

Define explicit criteria for each tier. Most 'P1' alerts in most teams are actually P2 or P3 by any reasonable definition. Re-tier before you do anything else.
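One illustrative set of explicit criteria, expressed as code so there is no room for argument at 3am (the two flags are assumptions, not a standard):

```python
def tier(alert):
    """Illustrative tiering: only user-visible AND actionable impact
    pages immediately; everything else waits."""
    if alert["user_visible"] and alert["actionable"]:
        return "P1"   # page now
    if alert["actionable"]:
        return "P2"   # ticket, handled within the day
    return "P3"       # business-hours review queue
```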

07. Route by service ownership, not by rota (est. noise reduction: 5%)

Alerts should go to the team that owns the service. Generic 'all engineers' rotas create responsibility that is shared by everyone and owned by nobody. Service ownership mapping reduces resolution time.

08. Write a runbook for every remaining P1 and P2 alert (est. noise reduction: 5%)

No runbook = no alert. If you cannot document what the on-call engineer should do when it fires, the alert is not actionable and should be demoted or deleted.

09. Set quiet-period windows for known maintenance (est. noise reduction: 5%)

Silence alerts during planned maintenance. Every deployment, batch job, or scheduled restart should have a maintenance window that suppresses the affected alerts.

10. Enable auto-resolve when the metric recovers (est. noise reduction: 5%)

Alerts that auto-resolve stop waking engineers for transient blips. Enable auto-resolve thresholds: if the metric returns to normal for 5 minutes, the alert closes itself.
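The auto-resolve check is simple to state precisely, assuming the alert tracks recent metric samples as (timestamp, value) pairs (an illustrative shape, not any vendor's API):

```python
def auto_resolved(samples, threshold, hold_s=300):
    """True if the metric has stayed below `threshold` for the last
    `hold_s` seconds (default 5 minutes), so the alert can close itself."""
    if not samples:
        return False
    last_ts = samples[-1][0]
    breaches = [ts for ts, v in samples if v >= threshold]
    if not breaches:
        return True
    return last_ts - max(breaches) >= hold_s
```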

11. Make every alert include the last N log lines and a trace link (est. noise reduction: 3%)

Rich alert context reduces mean time to diagnose. Engineers who open an alert with log context and a trace link investigate 30-50% faster than engineers starting from scratch.

12. Review noisy alerts weekly in a 15-minute standing meeting (est. noise reduction: 3%)

A weekly review: which alerts fired most? Which had no action? What is the false-positive rate this week vs last? Track the trend. Publish the number. Improvement becomes visible.

13. Track pages per engineer per week and make the number visible (est. noise reduction: 2%)

What gets measured gets managed. A dashboard showing pages/week per engineer, updated daily, makes the problem concrete and creates organisational pressure to fix it.
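The dashboard feed behind this is a one-line aggregation, assuming a log of (timestamp, engineer) page events:

```python
from collections import Counter

def pages_per_engineer_week(pages, week_s=7 * 24 * 3600):
    """pages: iterable of (unix_ts, engineer). Returns a Counter keyed by
    (week_index, engineer) -- the number that goes on the dashboard."""
    return Counter((ts // week_s, eng) for ts, eng in pages)
```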

14. Run quarterly alert audits and kill 20% of rules (est. noise reduction: 2%)

Institutionalise the audit. Every quarter: sort by frequency, identify the noisiest 20% of rules, delete or demote them. The pile does not shrink on its own.
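Producing the quarterly delete-or-demote list is mechanical, given per-rule fire counts for the quarter:

```python
import math

def kill_candidates(fire_counts, fraction=0.2):
    """The noisiest `fraction` of rules by fire count: the quarterly
    delete-or-demote list. `fire_counts` maps rule name -> count."""
    ranked = sorted(fire_counts, key=fire_counts.get, reverse=True)
    return ranked[:math.ceil(len(ranked) * fraction)]
```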

Alert Hygiene Scorecard

A structured audit tool. Answer yes or no to each item, score your team honestly, and share the results in your next engineering planning meeting.

01. We know how many pages each engineer receives per week
02. Our false-positive ratio is below 50%
03. Every P1/P2 alert has a runbook linked in the alert body
04. We have explicit P1 / P2 / P3 severity definitions
05. We have SLO-based burn-rate alerts for our top 3 services
06. Alert deduplication or correlation is enabled in our pager tool
07. We run a monthly or quarterly alert audit
08. Every alert auto-resolves when the underlying metric recovers
09. Our on-call engineers sleep through the night at least 5 nights per week
10. Nobody has requested removal from the on-call rotation in the last quarter

Common Anti-Patterns

ANTI-PATTERN: Alerting on everything just in case
FIX: Adopt the SRE principle: every alert must require human action. If it might not require action, it should not be a page.

ANTI-PATTERN: Lowering thresholds after incidents
FIX: Post-incident threshold adjustments without SLO context create permanent noise. Re-examine the metric choice, not just the threshold value.

ANTI-PATTERN: Copy-pasting thresholds across environments
FIX: Staging environments should not page on-call. Production thresholds should be meaningful. Copy-paste rules across environments multiply noise.

ANTI-PATTERN: Multiple tools monitoring the same thing
FIX: Consolidate to one authoritative monitoring source per service type. Cross-tool deduplication is hard; elimination is easy.

ANTI-PATTERN: Alert rules as tribal knowledge
FIX: If the person who wrote the alert has left the company and nobody else knows what it means, delete it. Orphaned rules are always noise.

ANTI-PATTERN: Treating MTTA as 'time to acknowledge', not 'time to act'
FIX: Acknowledging an alert to stop the escalation is not the same as starting the response. Track the acknowledged-but-no-action rate separately.
