pingfatigue.com is an independent, vendor-neutral reference on alert fatigue. Not affiliated with PagerDuty, Atlassian, Splunk, or any other vendor. Tool comparisons may contain affiliate links, clearly labelled.
ALERTING ARCHITECTURE

SLO-Based vs Threshold Alerting: Which Ends Alert Fatigue (and How to Migrate)

The structural fix for alert fatigue. Updated April 2026 | Sources: Google SRE Workbook Chapter 5, Honeycomb observability essays, Nobl9 SLO research

The Core Difference

THRESHOLD ALERTING
CPU > 80% for 5 minutes -> PAGE
  • Fires on cause, not user impact
  • Most self-resolve before action
  • Highly sensitive to transient spikes
  • Default for most monitoring tools
  • Creates the bulk of false positives
SLO-BASED ALERTING
Error budget burning 14x faster than sustainable -> PAGE
  • Fires on symptom: user experience degraded
  • By definition actionable: budget is burning
  • Immune to transient spikes (multi-window)
  • Requires SLO definition up front
  • Eliminates most false positives structurally
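The contrast above can be sketched as two predicates (a minimal sketch; the metric values and function names are illustrative, not from any particular tool):

```python
def threshold_alert(cpu_percent: float, minutes_sustained: int) -> bool:
    """Classic threshold rule: fires on a cause (CPU), not user impact."""
    return cpu_percent > 80 and minutes_sustained >= 5

def slo_alert(burn_rate_1h: float) -> bool:
    """SLO rule: fires only when the error budget burns at least 14x
    faster than the sustainable rate, i.e. users are already affected."""
    return burn_rate_1h >= 14

# A transient CPU spike pages the threshold rule but not the SLO rule.
print(threshold_alert(92, 6))   # True  -> page, even if no user saw an error
print(slo_alert(0.8))           # False -> no page, the budget is healthy
```

The structural point: the threshold rule can be true while users are unaffected; the SLO rule cannot.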

SLO Budget Calculator

Worked example with the calculator's defaults: a 99.9% SLO over a rolling 30-day window.
  • Monthly error budget: 43.2 minutes (0.72 hours) of downtime or equivalent error rate
  • Current burn rate: 14x the sustainable rate
  • Time to budget exhaustion: 720 hrs / 14 ≈ 51.4 hrs (about 2.1 days)
  • Alert warranted? YES. Per the Google SRE Workbook, page when burn rate exceeds 14.4x on a 1-hour window or 6x on a 6-hour window.
For reference, the budget at other targets: 99% allows 7.3 hrs/month; 99.99% allows only 4.4 min/month.
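The arithmetic behind these numbers is small enough to write out (a sketch; the function names are illustrative, not from any particular library):

```python
def monthly_budget_minutes(slo: float, days: int = 30) -> float:
    """Error budget in minutes for a given SLO over a rolling window."""
    return (1 - slo) * days * 24 * 60

def hours_to_exhaustion(burn_rate: float, days: int = 30) -> float:
    """At burn rate 1 the budget lasts the whole window; at burn rate N
    it lasts window/N, regardless of the SLO target."""
    return days * 24 / burn_rate

budget = monthly_budget_minutes(0.999)   # 0.1% of 43,200 min = 43.2 min
hours = hours_to_exhaustion(14)          # 720 / 14, roughly 51.4 hours
print(f"{budget:.1f} min budget, exhausted in {hours:.1f} h")
```

Note that exhaustion time depends only on the burn rate, not the SLO target: a tighter SLO has a smaller budget, but "sustainable" burn shrinks with it.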

What Is an SLO? (3-Minute Primer)

SLI (Service Level Indicator)

A quantitative measure of service behaviour. Examples: request success rate (errors / total requests), latency at 99th percentile, availability (uptime / total time). SLIs must directly reflect user experience.
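The two SLIs named above can be computed from raw counters and samples; a minimal sketch (the counter values are made up, and the percentile uses a simple nearest-rank method):

```python
def success_rate(total: int, errors: int) -> float:
    """Request success rate: the share of requests that did not fail."""
    return (total - errors) / total

def p99_latency_ms(samples_ms: list[float]) -> float:
    """99th-percentile latency via nearest-rank on sorted samples."""
    ordered = sorted(samples_ms)
    rank = max(0, int(0.99 * len(ordered)) - 1)
    return ordered[rank]

print(success_rate(1_000_000, 800))   # 0.9992 -> meets a 99.9% SLO
```

In practice these come from your metrics backend rather than hand-rolled code, but the definitions are this simple.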

SLO (Service Level Objective)

The target value for an SLI over a rolling time window. Example: 'Request success rate >= 99.9% over the previous 30 days.' This is your contract with your users (implicit or explicit). Setting it requires stakeholder agreement on what 'good enough' means.

Error budget

The complement of an SLO. A 99.9% SLO means 0.1% of requests can fail. Over 30 days (43,200 minutes), the budget is 43.2 minutes of downtime or equivalent error rate. With a calendar window the budget resets each month (or week); with a rolling window it replenishes continuously as old failures age out.

Burn rate

How fast the error budget is being consumed relative to the sustainable rate. Burn rate of 1 = budget is exactly used up at the end of the window. Burn rate of 14 = budget exhausted in 1/14 of the window (approximately 2.1 days of a 30-day month). Burn rate of 14.4 sustained for 1 hour consumes 2% of a 30-day budget, which is the Google SRE Workbook's paging threshold for the 1-hour window.

Migration Path: 5 Steps

1
Define SLIs for your top 3 services

Start with success rate and latency at the edge (not internal components). User experience is the only signal that counts. Use existing metrics if available; do not add new instrumentation in step 1.

2
Set SLOs with stakeholders

SLOs are not purely technical decisions. Product, engineering, and on-call must agree. A 99.99% SLO for a non-critical service means a burned engineer every time a batch job runs. Start conservative (99.5%) and tighten over time.

3
Compute burn-rate thresholds

Use the Google SRE Workbook formula: burn-rate threshold = (fraction of budget consumed) x (SLO period / alert window). Paging when 2% of a 30-day budget burns within 1 hour gives 0.02 x 720 / 1 = 14.4; paging when 5% burns within 6 hours gives 0.05 x 720 / 6 = 6. Document these per service.
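The Google SRE Workbook recipe (page when a given fraction of the budget is consumed within a window) reduces to one line; a sketch, with the function name being illustrative:

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the whole period's
    budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours

short = burn_rate_threshold(0.02, 1)   # 2% of budget in 1 hour -> 14.4
long = burn_rate_threshold(0.05, 6)    # 5% of budget in 6 hours -> 6.0
print(round(short, 1), round(long, 1))
```

The threshold does not depend on the SLO target itself, only on how much budget you are willing to lose before a human looks: the period and window do the rest.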

4
Configure multi-window burn-rate alerts

In your monitoring tool: alert when BOTH the short-window (1 hr) burn rate exceeds its threshold AND the long-window (6 hr) burn rate exceeds a lower one. A transient spike can clear the short window but not the long one, so this suppresses nearly all spike-driven false positives.
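The combined condition is a single predicate; a sketch using the Workbook's default thresholds for a 30-day SLO (your monitoring tool evaluates the equivalent over its own query language):

```python
def should_page(burn_1h: float, burn_6h: float,
                short_threshold: float = 14.4,
                long_threshold: float = 6.0) -> bool:
    """Page only when BOTH windows exceed their thresholds: the short
    window proves the problem is happening now, the long window proves
    it is not a transient spike."""
    return burn_1h >= short_threshold and burn_6h >= long_threshold

print(should_page(burn_1h=20.0, burn_6h=8.0))   # True: sustained burn
print(should_page(burn_1h=20.0, burn_6h=0.5))   # False: brief spike
```

The short window alone would page on spikes; the long window alone would keep paging long after recovery. The AND of the two gives both fast detection and fast reset.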

5
Run parallel for 30 days, then retire threshold alerts

Keep your existing threshold alerts active but silent (suppressed) during the parallel period. Validate that every real incident triggered at least one SLO-based alert. After 30 days, delete the threshold alerts.
