SLO-Based vs Threshold Alerting: Which Ends Alert Fatigue (and How to Migrate)
The structural fix for alert fatigue. Updated April 2026 | Sources: Google SRE Workbook Chapter 5, Honeycomb observability essays, Nobl9 SLO research
The Core Difference
Threshold Alerting
- Fires on cause, not user impact
- Most self-resolve before action
- Highly sensitive to transient spikes
- Default for most monitoring tools
- Creates the bulk of false positives

SLO-Based Alerting
- Fires on symptom: user experience degraded
- By definition actionable: budget is burning
- Immune to transient spikes (multi-window)
- Requires SLO definition up front
- Eliminates most false positives structurally
What Is an SLO? (3-Minute Primer)
SLI (Service Level Indicator)
A quantitative measure of service behaviour. Examples: request success rate (successful requests / total requests), latency at the 99th percentile, availability (uptime / total time). SLIs must directly reflect user experience.
SLO (Service Level Objective)
The target value for an SLI over a rolling time window. Example: 'Request success rate >= 99.9% over the previous 30 days.' This is your contract with your users (implicit or explicit). Setting it requires stakeholder agreement on what 'good enough' means.
Error Budget
The complement of an SLO. A 99.9% SLO means 0.1% of requests can fail. Over 30 days (43,200 minutes), the budget is 43.2 minutes of downtime or the equivalent error rate. The budget resets with the window (monthly or weekly, depending on how you define it).
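The budget arithmetic above is simple enough to script. A minimal sketch (the function name and defaults are illustrative, not from any library):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime (or equivalent error volume) the SLO permits
    over the rolling window: (1 - SLO) * window length in minutes."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days permits 43.2 minutes of full downtime;
# 99.5% over the same window permits 216 minutes (3.6 hours).
```

The second example is why "start conservative" later in this article is cheap insurance: a half-point looser SLO buys five times the budget.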
Burn Rate
How fast the error budget is being consumed relative to the sustainable rate. A burn rate of 1 means the budget is used up exactly at the end of the window. A burn rate of 14 exhausts it in 1/14 of the month (approximately 2 days). A burn rate of 14.4 is the classic fast-burn alert threshold: sustained, it exhausts a 30-day budget in 50 hours, and it corresponds to consuming 2% of the budget in a single hour.
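Time-to-exhaustion falls straight out of the burn rate. A sketch (the helper name is mine):

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is gone if this burn rate is sustained:
    window length in hours divided by the burn rate."""
    return window_days * 24 / burn_rate

# burn rate 1    -> 720 hours (the full 30-day window)
# burn rate 14.4 -> 50 hours (~2 days)
```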
Migration Path: 5 Steps
Step 1: Define your SLIs
Start with success rate and latency at the edge (not internal components). User experience is the only signal that counts. Use existing metrics if available; do not add new instrumentation in step 1.
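From existing edge counters, the success-rate SLI is just a ratio. A sketch with hypothetical counter values:

```python
def success_rate_sli(total: int, errors: int) -> float:
    """Edge success-rate SLI: good requests / total requests."""
    if total == 0:
        return 1.0  # no traffic means no observed failures
    return (total - errors) / total

# e.g. 1,000,000 edge requests with 800 5xx responses -> 0.9992 (99.92%)
```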
Step 2: Set SLO targets with stakeholders
SLOs are not purely technical decisions. Product, engineering, and on-call must agree. A 99.99% SLO for a non-critical service means a burned-out engineer every time a batch job runs. Start conservative (99.5%) and tighten over time.
Step 3: Calculate burn-rate thresholds
Use the Google SRE Workbook formula: burn rate threshold = (fraction of budget you are willing to spend) * (SLO window / alert window). For a 99.9% SLO over 30 days (720 hours), the Workbook's defaults are 2% of budget in a 1-hour window (0.02 * 720 / 1 = 14.4) and 5% of budget in a 6-hour window (0.05 * 720 / 6 = 6). Document these per service.
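The step 3 formula as a helper (a sketch; the parameter names are mine):

```python
def burn_rate_threshold(budget_fraction: float,
                        slo_window_hours: float,
                        alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the error budget
    within a single alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours

# 30-day SLO window (720 h):
#   2% of budget in 1 h -> 14.4 (fast burn)
#   5% of budget in 6 h -> 6.0  (slow burn)
```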
Step 4: Configure multi-window alerts
In your monitoring tool, alert only when BOTH the short-window (1 h) burn rate exceeds its threshold AND the long-window (6 h) burn rate exceeds a lower one. Requiring sustained burn across both windows structurally eliminates transient-spike false positives.
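The two-window condition can be sketched as a predicate, using the 14.4 and 6 thresholds the Google SRE Workbook recommends for a 30-day window (function and parameter names are illustrative):

```python
def should_page(burn_1h: float, burn_6h: float,
                fast: float = 14.4, slow: float = 6.0) -> bool:
    """Multi-window burn-rate alert. The short window shows the burn is
    happening now; the long window shows it is sustained, so a brief
    spike that inflates only the 1 h rate does not page anyone."""
    return burn_1h > fast and burn_6h > slow

# sustained outage: both windows elevated -> page
# 5-minute spike: 1 h rate jumps, 6 h rate barely moves -> no page
```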
Step 5: Run both systems in parallel
Keep your existing threshold alerts active but silent (suppressed) during the parallel period. Validate that every real incident triggered at least one SLO-based alert. After 30 days, delete the threshold alerts.
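The validation in step 5 can be checked mechanically. A sketch, assuming you can export incident start times and SLO alert firing times as lists of datetimes (the input format is hypothetical):

```python
from datetime import datetime, timedelta

def uncovered_incidents(incidents, slo_alerts, grace=timedelta(minutes=15)):
    """Return incidents that no SLO-based alert fired within `grace` of.
    An empty result over the whole parallel period is the signal that
    the old threshold alerts are safe to delete."""
    return [i for i in incidents
            if not any(abs(i - a) <= grace for a in slo_alerts)]
```

Run it over the full 30-day parallel window; any incident it returns is a gap in your SLO coverage to investigate before cutting over.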