pingfatigue.com is an independent, vendor-neutral reference on alert fatigue. Not affiliated with PagerDuty, Atlassian, Splunk, or any other vendor. Tool comparisons may contain affiliate links, clearly labelled.
REMEDIATION PLAYBOOK

How to Reduce Alert Fatigue: A Practical Tuning Playbook (2026)

14 steps with worked examples and an estimated noise reduction for each. Updated April 2026.

The Three Principles

From the Google SRE Book (Beyer et al., 2016). Every alert rule in your system must satisfy all three:

01. Every alert must require human action

If an alert can be handled by automation or requires no action, it should not page a human.

02. Every alert must have a runbook

If the response is not documented, the alert is not operational-ready. No runbook = no alert.

03. Every alert must come from a symptom, not a cause

Alert on user-visible impact (error rate, latency, availability), not on resource metrics (CPU, memory).

The 14-Step Tuning Checklist

01. Export every alert rule and its 90-day fire history

Pull the list from PagerDuty/Opsgenie/Datadog. Most teams discover 2-3x more rules than they expected. This is your baseline.
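As a sketch of the baseline step, assuming the fire history has been exported to CSV with hypothetical columns `rule`, `fired_at`, and `actioned` (the column names are illustrative, not any vendor's export schema):

```python
import csv
from collections import Counter

def baseline(csv_path):
    """Build the tuning baseline: 90-day fire count and action count per rule."""
    fires = Counter()
    actions = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            fires[row["rule"]] += 1
            if row["actioned"] == "yes":
                actions[row["rule"]] += 1
    # Most-fired rules first: this ranked list is what the rest of
    # the playbook works down.
    return sorted(fires.items(), key=lambda kv: -kv[1]), actions
```

The ranked list is the artifact to keep: every later step (kill, dedupe, demote) starts from it.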

02. Kill anything that fired >20 times with zero action in 90 days (est. noise reduction: 15%)

If an alert has fired 20+ times and nobody has ever taken action, it is noise by definition. Delete it or disable it today. Do not archive it; delete it.
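The filter for this step is one line, assuming per-rule fire and action counts over the trailing 90 days (field names are illustrative):

```python
def zero_action_noise(rules, min_fires=20):
    """Rules that fired at least `min_fires` times with no recorded human
    action: noise by definition, candidates for immediate deletion."""
    return [r["name"] for r in rules
            if r["fires"] >= min_fires and r["actions"] == 0]
```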

03. Add burn-rate alerts for your top 3 SLOs (est. noise reduction: 20%)

Replace the noisiest cause-based alerts (CPU, memory, disk) with SLO burn-rate alerts, using the multi-window, multi-burn-rate approach from Google SRE Workbook Chapter 5.
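The burn-rate arithmetic can be sketched as follows. The factor 14.4 and the 1h/5m window pair are the SRE Workbook's example for paging on a 30-day, 99.9% SLO (burning 2% of the monthly budget in one hour: 0.02 × 30 × 24 = 14.4); your SLOs and windows may differ.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    return error_ratio / (1 - slo_target)

def should_page(err_long, err_short, slo_target=0.999, factor=14.4):
    """Multi-window rule: both the long (e.g. 1h) and short (e.g. 5m)
    windows must burn fast, so the page fires on sustained user impact
    and stops firing quickly once the system recovers."""
    return (burn_rate(err_long, slo_target) >= factor and
            burn_rate(err_short, slo_target) >= factor)
```

The short window is what kills the noise: a transient blip fails the 5m condition, so it never pages.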

04. Deduplicate identical alerts from redundant tools (est. noise reduction: 10%)

Count how many monitoring tools fire for the same service. Consolidate to one authoritative source per service type, or enable correlation grouping.
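A minimal fingerprint-based dedupe, assuming each alert carries a service, check, and environment (illustrative field names):

```python
def dedupe(alerts):
    """Keep one alert per (service, check, environment) fingerprint,
    preferring the first seen; later matches are duplicates fired by
    redundant monitoring tools."""
    seen = {}
    for a in alerts:
        key = (a["service"], a["check"], a["env"])
        seen.setdefault(key, a)
    return list(seen.values())
```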

05. Enable correlation and grouping in your pager tool (est. noise reduction: 15%)

PagerDuty Event Orchestration, Opsgenie Alert Policies, or any equivalent. Group by service + environment + time window. Grouping reduces ticket count by 60-90%.
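The service + environment + time-window grouping logic can be sketched independently of any vendor (field names are illustrative; `ts` is a Unix timestamp):

```python
def group_alerts(alerts, window_s=300):
    """Merge alerts that share service+environment and arrive within
    `window_s` seconds of the previous alert in the group. Each group
    becomes one incident instead of N pages."""
    groups = []
    for a in sorted(alerts, key=lambda a: (a["service"], a["env"], a["ts"])):
        key = (a["service"], a["env"])
        if (groups and groups[-1]["key"] == key
                and a["ts"] - groups[-1]["alerts"][-1]["ts"] <= window_s):
            groups[-1]["alerts"].append(a)
        else:
            groups.append({"key": key, "alerts": [a]})
    return groups
```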

06. Tier severity: P1 (page now), P2 (ticket), P3 (business hours) (est. noise reduction: 10%)

Define explicit criteria for each tier. Most 'P1' alerts in most teams are actually P2 or P3 by any reasonable definition. Re-tier before you do anything else.
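One illustrative set of explicit criteria, expressed as code so there is no room for argument at 3am (the two flags are assumptions, not a standard):

```python
def tier(alert):
    """Illustrative tiering: only user-visible AND actionable impact
    pages immediately; everything else waits."""
    if alert["user_visible"] and alert["actionable"]:
        return "P1"   # page now
    if alert["actionable"]:
        return "P2"   # ticket, handled within the day
    return "P3"       # business-hours review queue
```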

07. Route by service ownership, not by rota (est. noise reduction: 5%)

Alerts should go to the team that owns the service. Generic 'all engineers' rotas create responsibility that is shared by everyone and owned by nobody. Service ownership mapping reduces resolution time.

08. Write a runbook for every remaining P1 and P2 alert (est. noise reduction: 5%)

No runbook = no alert. If you cannot document what the on-call engineer should do when it fires, the alert is not actionable and should be demoted or deleted.

09. Set quiet-period windows for known maintenance (est. noise reduction: 5%)

Silence alerts during planned maintenance. Every deployment, batch job, or scheduled restart should have a maintenance window that suppresses the affected alerts.

10. Enable auto-resolve when the metric recovers (est. noise reduction: 5%)

Alerts that auto-resolve stop waking engineers for transient blips. Enable auto-resolve thresholds: if the metric returns to normal for 5 minutes, the alert closes itself.
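The auto-resolve check is simple to state precisely, assuming the alert tracks recent metric samples as (timestamp, value) pairs (an illustrative shape, not any vendor's API):

```python
def auto_resolved(samples, threshold, hold_s=300):
    """True if the metric has stayed below `threshold` for the last
    `hold_s` seconds (default 5 minutes), so the alert can close itself."""
    if not samples:
        return False
    last_ts = samples[-1][0]
    breaches = [ts for ts, v in samples if v >= threshold]
    if not breaches:
        return True
    return last_ts - max(breaches) >= hold_s
```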

11. Make every alert include the last N log lines and a trace link (est. noise reduction: 3%)

Rich alert context reduces mean time to diagnose. Engineers who open an alert with log context and a trace link investigate 30-50% faster than engineers starting from scratch.

12. Review noisy alerts weekly in a 15-minute standing meeting (est. noise reduction: 3%)

A weekly review: which alerts fired most? Which had no action? What is the false-positive rate this week vs last? Track the trend. Publish the number. Improvement becomes visible.

13. Track pages per engineer per week and make the number visible (est. noise reduction: 2%)

What gets measured gets managed. A dashboard showing pages/week per engineer, updated daily, makes the problem concrete and creates organisational pressure to fix it.
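The dashboard feed behind this is a one-line aggregation, assuming a log of (timestamp, engineer) page events:

```python
from collections import Counter

def pages_per_engineer_week(pages, week_s=7 * 24 * 3600):
    """pages: iterable of (unix_ts, engineer). Returns a Counter keyed by
    (week_index, engineer) -- the number that goes on the dashboard."""
    return Counter((ts // week_s, eng) for ts, eng in pages)
```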

14. Run quarterly alert audits and kill 20% of rules (est. noise reduction: 2%)

Institutionalise the audit. Every quarter: sort by frequency, identify the noisiest 20% of rules, delete or demote them. The pile does not shrink on its own.
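Producing the quarterly delete-or-demote list is mechanical, given per-rule fire counts for the quarter:

```python
import math

def kill_candidates(fire_counts, fraction=0.2):
    """The noisiest `fraction` of rules by fire count: the quarterly
    delete-or-demote list. `fire_counts` maps rule name -> count."""
    ranked = sorted(fire_counts, key=fire_counts.get, reverse=True)
    return ranked[:math.ceil(len(ranked) * fraction)]
```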

Alert Hygiene Scorecard

A structured audit tool. Answer yes or no to each item, score your team honestly, and share the results in your next engineering planning meeting.

01. We know how many pages each engineer receives per week
02. Our false-positive ratio is below 50%
03. Every P1/P2 alert has a runbook linked in the alert body
04. We have explicit P1 / P2 / P3 severity definitions
05. We have SLO-based burn-rate alerts for our top 3 services
06. Alert deduplication or correlation is enabled in our pager tool
07. We run a monthly or quarterly alert audit
08. Every alert auto-resolves when the underlying metric recovers
09. Our on-call engineers sleep through the night at least 5 nights per week
10. Nobody has requested removal from the on-call rotation in the last quarter

Common Anti-Patterns

ANTI-PATTERN: Alerting on everything just in case
FIX: Adopt the SRE principle: every alert must require human action. If it might not require action, it should not be a page.

ANTI-PATTERN: Lowering thresholds after incidents
FIX: Post-incident threshold adjustments without SLO context create permanent noise. Re-examine the metric choice, not just the threshold value.

ANTI-PATTERN: Copy-pasting thresholds across environments
FIX: Staging environments should not page on-call. Production thresholds should be meaningful. Copy-paste rules across environments multiply noise.

ANTI-PATTERN: Multiple tools monitoring the same thing
FIX: Consolidate to one authoritative monitoring source per service type. Cross-tool deduplication is hard; elimination is easy.

ANTI-PATTERN: Alert rules as tribal knowledge
FIX: If the person who wrote the alert has left the company and nobody else knows what it means, delete it. Orphaned rules are always noise.

ANTI-PATTERN: Treating MTTA as 'time to acknowledge', not 'time to act'
FIX: Acknowledging an alert to stop the escalation is not the same as starting the response. Track the acknowledged-but-no-action rate separately.
