DIAGNOSTIC METRIC

Alert-to-Incident Ratio: The Single Number That Diagnoses Alert Fatigue

Updated June 2026. Sources: Google SRE Book Chapter 6 (Practical Alerting), Catchpoint 2024 SRE Report, PagerDuty 2023 State of Digital Operations.

The Most Useful Single Number

If you can track only one metric to diagnose whether your alerting practice is healthy or noisy, track the alert-to-incident ratio: the number of paging alerts received divided by the number of incidents declared in the same time window. A 5:1 ratio means 5 paging alerts produced 1 real incident, so, counting one page per incident, 4 of those 5 alerts were noise. A 50:1 ratio means 50 paging alerts produced 1 real incident, so 49 of 50 were noise. The metric relates alert volume (how many alerts fire) to alert precision (how many correspond to real incidents) in a single number that anyone can interpret.

The ratio is diagnostic in a way that raw page count is not. A team paging 500 times per week can have a healthy ratio (3:1, meaning 167 real incidents) if the team is genuinely operating at incident-heavy scale, or an unhealthy ratio (50:1, meaning 10 real incidents) if they are drowning in noise. The page count alone does not distinguish these two cases; the ratio does. A team paging 50 times per week with a 50:1 ratio has the same noise problem as a team paging 500 times per week with a 50:1 ratio, just at different scale.
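The arithmetic in these examples can be sketched in a few lines of Python. The function names are illustrative, and the noise fraction assumes each real incident accounts for exactly one page, as the examples above do.

```python
def implied_incidents(pages: int, ratio: float) -> float:
    """Declared incidents implied by a page count and an alert-to-incident ratio."""
    return pages / ratio


def noise_fraction(ratio: float) -> float:
    """Share of pages that were noise, assuming one page per real incident."""
    if ratio < 1:
        raise ValueError("a ratio below 1:1 means more incidents than pages")
    return 1 - 1 / ratio


# The three teams discussed above: same ratio means same noise share, at any scale
for pages, ratio in [(500, 3), (500, 50), (50, 50)]:
    print(f"{pages} pages at {ratio}:1 -> "
          f"{implied_incidents(pages, ratio):.0f} incidents, "
          f"{noise_fraction(ratio):.0%} noise")
```

The last two rows print the same noise share, which is the point of the paragraph: page count changes, the ratio does not.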

The metric also creates a clear improvement narrative. "We are going to halve our alert-to-incident ratio this quarter" is a target that engineers understand intuitively, that translates to specific tuning work, and that produces measurable results within a single quarter. Compare that with softer goals such as "we are going to reduce alert fatigue", which lack the metric anchor that makes progress trackable.

Computing It

Both PagerDuty and Opsgenie expose the necessary data via their analytics APIs. The math is intentionally simple: count paging alerts received in a window, count declared incidents in the same window, divide. Most teams compute weekly and monthly. PagerDuty's analytics dashboard does this calculation natively if you configure the report. Opsgenie's incident metric endpoints provide the same data; a simple dashboard widget computes the ratio. incident.io and Rootly both expose ratios in their default dashboards.
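As a rough sketch of the count-and-divide step, the snippet below buckets timestamps by ISO week. It assumes you have already exported page and incident timestamps from your pager tool as Python datetime objects; the export itself is not shown, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def iso_week(ts: datetime) -> tuple[int, int]:
    """Return (ISO year, ISO week) so weeks that span New Year bucket correctly."""
    year, week, _ = ts.isocalendar()
    return (year, week)


def weekly_ratios(page_times, incident_times):
    """Map each ISO week to pages / incidents; None when no incident was declared."""
    pages = Counter(iso_week(t) for t in page_times)
    incidents = Counter(iso_week(t) for t in incident_times)
    ratios = {}
    for week in sorted(pages.keys() | incidents.keys()):
        n_inc = incidents[week]
        ratios[week] = pages[week] / n_inc if n_inc else None
    return ratios


# Illustrative data: six pages and two declared incidents in the same week
pages = [datetime(2026, 6, d, 3, 0) for d in (1, 2, 3, 4, 5, 6)]
incidents = [datetime(2026, 6, 2, 3, 5), datetime(2026, 6, 5, 3, 5)]
print(weekly_ratios(pages, incidents))
```

Returning None for a week with pages but no incidents is a design choice here: it keeps a division by zero from masquerading as a healthy number.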

The non-trivial work is defining "incident" consistently across teams. If incident creation is automatic (every Sev-2-or-higher alert auto-creates an incident), the ratio looks artificially favourable because the count of incidents inflates. If incident creation requires manual declaration (an engineer must mark the page as a real incident), the ratio reflects reality more accurately but requires discipline to capture. Most mature operations require manual incident declaration with a low-friction Slack command (incident.io and Rootly both excel at this); auto-creation tends to corrupt the metric.

Compute the ratio at two grains: aggregate across the team for the headline diagnostic, and per-service for the action surface. The aggregate ratio answers "how noisy is our alerting practice overall"; the per-service ratio answers "which services should we tune first". A team with a 4:1 aggregate but a 30:1 outlier on the checkout service has a clear next move that the aggregate alone would not surface.
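A minimal sketch of the two grains, assuming per-service page and incident counts for one window are already available as plain dictionaries. The service names and counts are made up to reproduce the 4:1 aggregate with a 30:1 checkout outlier.

```python
def ratios_by_service(pages, incidents):
    """pages and incidents map service name -> count for the same window.

    Returns (aggregate_ratio, per-service ratios sorted noisiest first).
    """
    per_service = {}
    for service, n_pages in pages.items():
        n_inc = incidents.get(service, 0)
        # Pages with no declared incident are all noise; report as infinity
        per_service[service] = n_pages / n_inc if n_inc else float("inf")
    total_inc = sum(incidents.values())
    aggregate = sum(pages.values()) / total_inc if total_inc else float("inf")
    ranked = sorted(per_service.items(), key=lambda kv: kv[1], reverse=True)
    return aggregate, ranked


# Hypothetical week: a healthy-looking aggregate hides one noisy service
pages = {"checkout": 30, "search": 45, "auth": 45}
incidents = {"checkout": 1, "search": 15, "auth": 14}
aggregate, ranked = ratios_by_service(pages, incidents)
print(f"aggregate {aggregate:.1f}:1")
for service, ratio in ranked:
    print(f"{service}: {ratio:.1f}:1")
```

The aggregate is computed from summed counts rather than by averaging the per-service ratios, so a low-volume service cannot dominate the headline number.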

Target Ratios and What They Feel Like

Ratio	State	What it feels like	Realistic time to fix
1:1 to 2:1	Healthy	Each page is real signal; engineer engages promptly without resentment; on-call week ends without fatigue residue	Maintain via quarterly audit
3:1 to 5:1	Acceptable	Most pages are real; some triage required; on-call is workload but not burnout	1 to 2 quarters of focused tuning
5:1 to 10:1	Stressed	Engineer learns to triage quickly; real incidents emerge from noise; tolerable for a quarter, attrition risk over a year	2 to 3 quarters of focused work
10:1 to 20:1	Painful	Dread before primary week; constant context-switching; real incidents recognisable but cognitive load is high	3 to 6 quarters of structural intervention
20:1 to 50:1	Critical	Pattern-matching dominates investigation; missed signals; sustained sleep disruption; high attrition pressure	6 to 12 months structural plus tooling intervention
50:1+	Crisis	Engineers cannot reliably distinguish signal from noise; missed Sev-1 incidents; team morale collapsing	Emergency response, suspend non-critical alerts, leadership escalation
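For a dashboard label, the table can be turned into a simple lookup. This sketch makes a few assumptions the table leaves open: a ratio that sits exactly on a boundary (5:1, 10:1, 20:1, 50:1) is assigned to the healthier band, and ratios between 2:1 and 3:1, which the table does not list, are treated as Acceptable.

```python
# (inclusive upper bound, state) pairs taken from the table above
BANDS = [
    (2, "Healthy"),
    (5, "Acceptable"),
    (10, "Stressed"),
    (20, "Painful"),
    (50, "Critical"),
]


def classify(ratio: float) -> str:
    """Return the table's state label for an alert-to-incident ratio."""
    for upper, state in BANDS:
        if ratio <= upper:
            return state
    return "Crisis"  # anything above 50:1


for r in (1.5, 4, 8, 15, 30, 75):
    print(f"{r}:1 -> {classify(r)}")
```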

Weekly Review Cadence

The alert-to-incident ratio drifts unless it is reviewed regularly. The cheapest and most reliable mechanism is a 30-minute weekly review with three standing items. Item one: aggregate ratio for the week, compared to the four-week rolling average and the target. A spike (for example, a week that lands 25 percent or more above the rolling average) triggers investigation. Item two: per-service ratios for the top-10 noisiest services. The noisiest service typically warrants an action item before the next review. Item three: any new alert rules added since the last review, evaluated for whether they are likely to maintain or degrade the ratio.
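The spike check in item one can be sketched as below. It assumes the rolling average covers the four weeks before the current one (so the current week does not dilute its own baseline) and uses a plain mean of weekly ratios; both are illustrative choices, as are the names.

```python
from statistics import mean


def is_spike(weekly_ratios, threshold=0.25, window=4):
    """True if the latest week is at least `threshold` above the mean
    of the `window` weeks that precede it."""
    if len(weekly_ratios) < window + 1:
        return False  # not enough history to compare against
    *history, latest = weekly_ratios
    baseline = mean(history[-window:])
    return latest >= baseline * (1 + threshold)


print(is_spike([8, 8, 8, 8, 9]))     # 12.5 percent above baseline: no spike
print(is_spike([8, 8, 8, 8, 10.5]))  # about 31 percent above baseline: spike
```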

The meeting needs a clear owner. At small scale (under 30 engineers) this can be the SRE lead or the most experienced senior engineer rotating monthly. At larger scale, the alert review board (read /alert-fatigue-scale-up-50-engineers) takes ownership and the weekly review can be lighter. The key is consistency: a weekly review that happens for three months and then drifts produces worse outcomes than no review at all, because the team learns that the metric does not actually matter to leadership.

For the dashboard backing this review, the recommended layout is: ratio chart over the last 12 weeks with target line, top-10 services by ratio with both ratio and absolute page count, list of new alert rules added since last review, and a list of action items from the previous review with status. This is standard dashboard infrastructure that takes a day to build and pays back across many years of operational discipline. The common pager tools (PagerDuty Analytics, Opsgenie Reports, incident.io Insights) provide most of these views out of the box; the work is in configuring them and committing to the review cadence.

Dashboard Sketch

A minimal but effective alert-to-incident dashboard contains four panels and is the only dashboard most teams need for alert hygiene. Panel one (top-left): aggregate weekly ratio over the last 12 weeks, with a horizontal line showing the team target (typically 3:1 or 5:1). Panel two (top-right): per-service ratio bar chart for the top-10 services, sorted by ratio descending. Panel three (bottom-left): list of new alert rules added in the last 30 days, with the per-rule ratio for each.

Panel four (bottom-right): list of high-ratio services with the most recent action items from the alert review and their status (open, in progress, closed). This panel is the closed loop: the dashboard does not just measure, it tracks what is being done about the measurement. Most teams underinvest in this panel and pay for it later with measurement-without-action drift. The bar to ship is low: a Google Sheet or a Linear list view linked from the dashboard is enough to start. The work is the discipline of using it.

Frequently Asked Questions

What is the alert-to-incident ratio?

Alert-to-incident ratio is the count of paging alerts received divided by the count of declared incidents in the same window. A ratio of 5:1 means 5 paging alerts produce 1 real incident; 4 of the 5 were noise. A ratio of 50:1 means 50 paging alerts produce 1 real incident; 49 of 50 were noise. The ratio is a strong single-number diagnostic for alert fatigue because it relates alert volume to alert precision in one figure.

How do you compute it from PagerDuty data?

PagerDuty's analytics API exposes both alert count and incident count per service per time window. Divide alert count by incident count for the same service over the same window (typically weekly or monthly). A simple report or dashboard view does this automatically. Opsgenie exposes equivalent data via its incident metric endpoints. The math is intentionally simple; the work is in defining "incident" consistently across teams.

What target ratio should we aim for?

Per Google SRE Book Chapter 6, the implicit target is that on-call engineers should rarely be paged for non-incidents; the healthy ratio is close to 1:1 or 2:1. Ratios of around 3:1 to 5:1 are still acceptable and are a realistic intermediate goal. Per practitioner experience at noisy organisations, ratios of 20:1 to 100:1 are common before tuning. A useful target for a team starting from a noisy state is to halve the ratio each quarter for two to four quarters.
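To see where a halve-each-quarter target lands, a short projection helps. The starting ratio of 40:1 is an arbitrary example, and the floor of 1:1 is an assumption (it holds when every declared incident produces at least one page).

```python
def quarterly_targets(start_ratio: float, quarters: int, floor: float = 1.0):
    """Ratio target at the end of each quarter when halving every quarter."""
    targets = []
    ratio = start_ratio
    for _ in range(quarters):
        ratio = max(ratio / 2, floor)  # never target below the assumed 1:1 floor
        targets.append(ratio)
    return targets


print(quarterly_targets(40, 4))  # [20.0, 10.0, 5.0, 2.5]
```

A team starting at 40:1 reaches the acceptable band after three quarters and approaches the healthy band after four, which is why two to four quarters is a workable planning horizon.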

What does a 50:1 ratio actually feel like for an engineer?

Brutal. The engineer is paged frequently for noise that requires acknowledgment but not investigation, and learns to delay engagement to avoid burning time on noise. The real incidents are embedded in the noise and easy to miss in the first 5 to 10 minutes. Acknowledgment latency degrades, refocus cost compounds, and the engineer ends the on-call week genuinely tired even if they did not actively work many hours. This is the alert-fatigue spiral.

What does a 10:1 ratio feel like?

Painful but manageable. The engineer is paged often enough to dread it, but the noise is recognisable enough that triage is fast. Real incidents emerge clearly from the noise. The cognitive cost is the constant context-switching rather than the investigation work. Most teams in the 10:1 range can be made healthier with focused alert tuning over a quarter or two without major structural changes.

What does a 2:1 ratio feel like?

Healthy. Each page is likely to be real, the engineer can engage promptly without resentment, and the on-call week ends without significant fatigue residue. This is the healthy steady state and is achievable for most teams that do the engineering hygiene work consistently. Maintaining 2:1 requires ongoing audit cadence; without maintenance the ratio drifts upward as new alert rules accumulate.

How does the ratio differ from false-positive rate?

False-positive rate measures the fraction of pages that were not real signal. Alert-to-incident ratio measures the count of pages per declared incident. The two are related but not identical: for example, a 70 percent false-positive rate corresponds to a 3.3:1 ratio if each true-positive alert is exactly one incident, but real-world incidents often produce multiple correlated alerts, so the ratio is typically higher than the false-positive math would predict. The ratio is more directly diagnostic; the false-positive rate is easier to communicate to non-engineers.
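The conversion in that example can be written out as a small function. If a fraction fpr of pages is false and each incident produces k true-positive pages on average, the ratio is k / (1 - fpr). The parameter name alerts_per_incident is illustrative, and treating it as a simple average is an assumption.

```python
def ratio_from_false_positive_rate(fpr: float, alerts_per_incident: float = 1.0) -> float:
    """Alert-to-incident ratio implied by a false-positive rate.

    alerts_per_incident is the average number of true-positive pages each
    incident generates; 1.0 reproduces the one-alert-per-incident case.
    """
    if not 0 <= fpr < 1:
        raise ValueError("fpr must be in [0, 1)")
    return alerts_per_incident / (1 - fpr)


print(f"{ratio_from_false_positive_rate(0.7):.1f}:1")     # one page per incident
print(f"{ratio_from_false_positive_rate(0.7, 3):.1f}:1")  # three correlated pages per incident
```

With three correlated pages per incident, the same 70 percent false-positive rate yields roughly 10:1 rather than 3.3:1, which is the gap the paragraph describes.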

Should the ratio be computed per service or aggregated?

Both. The aggregated ratio gives the headline diagnostic; the per-service ratio gives the action surface. A team with a healthy aggregate ratio (3:1) but one outlier service at 30:1 has a clear next move: tune the alerts on that service. A team at a uniform 8:1 across all services has a different problem that calls for a broader tuning practice. Always compute per-service ratios for the top-10 noisiest services in a weekly review.