SLO-Based vs Threshold Alerting: Which Ends Alert Fatigue (and How to Migrate)
The structural fix for alert fatigue. Updated April 2026 | Sources: Google SRE Workbook Chapter 5, Honeycomb observability essays, Nobl9 SLO research
The Core Difference
Threshold Alerting
- Fires on cause, not user impact
- Most self-resolve before action
- Highly sensitive to transient spikes
- Default for most monitoring tools
- Creates the bulk of false positives

SLO-Based Alerting
- Fires on symptom: user experience degraded
- By definition actionable: budget is burning
- Immune to transient spikes (multi-window)
- Requires SLO definition up front
- Eliminates most false positives structurally
What Is an SLO? (3-Minute Primer)
SLI (Service Level Indicator)
A quantitative measure of service behaviour. Examples: request success rate (successful requests / total requests), latency at the 99th percentile, availability (uptime / total time). SLIs must directly reflect user experience.
SLO (Service Level Objective)
The target value for an SLI over a rolling time window. Example: 'Request success rate >= 99.9% over the previous 30 days.' This is your contract with your users (implicit or explicit). Setting it requires stakeholder agreement on what 'good enough' means.
Error Budget
The complement of an SLO. A 99.9% SLO means 0.1% of requests can fail. Over 30 days (43,200 minutes), the budget is 43.2 minutes of downtime or the equivalent error rate. The budget resets with the window (monthly or weekly, depending on how you define it).
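The budget arithmetic above is simple enough to script. A minimal sketch (the function name and defaults are illustrative, not from any library):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime (or equivalent error volume) the SLO permits
    over the rolling window: (1 - SLO) * window length in minutes."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days permits 43.2 minutes of full downtime;
# 99.5% over the same window permits 216 minutes (3.6 hours).
```

The second example is why "start conservative" later in this article is cheap insurance: a half-point looser SLO buys five times the budget.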
Burn Rate
How fast the error budget is being consumed relative to the sustainable rate. A burn rate of 1 means the budget is used up exactly at the end of the window. A burn rate of 14 exhausts it in 1/14 of the month (approximately 2 days). A burn rate of 14.4 is the classic fast-burn alert threshold: sustained, it exhausts a 30-day budget in 50 hours, and it corresponds to consuming 2% of the budget in a single hour.
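Time-to-exhaustion falls straight out of the burn rate. A sketch (the helper name is mine):

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is gone if this burn rate is sustained:
    window length in hours divided by the burn rate."""
    return window_days * 24 / burn_rate

# burn rate 1    -> 720 hours (the full 30-day window)
# burn rate 14.4 -> 50 hours (~2 days)
```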
Migration Path: 5 Steps
Step 1: Define your SLIs
Start with success rate and latency at the edge (not internal components). User experience is the only signal that counts. Use existing metrics if available; do not add new instrumentation in step 1.
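From existing edge counters, the success-rate SLI is just a ratio. A sketch with hypothetical counter values:

```python
def success_rate_sli(total: int, errors: int) -> float:
    """Edge success-rate SLI: good requests / total requests."""
    if total == 0:
        return 1.0  # no traffic means no observed failures
    return (total - errors) / total

# e.g. 1,000,000 edge requests with 800 5xx responses -> 0.9992 (99.92%)
```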
Step 2: Set SLO targets with stakeholders
SLOs are not purely technical decisions. Product, engineering, and on-call must agree. A 99.99% SLO for a non-critical service means a burned-out engineer every time a batch job runs. Start conservative (99.5%) and tighten over time.
Step 3: Calculate burn-rate thresholds
Use the Google SRE Workbook formula: burn rate threshold = (fraction of budget you are willing to spend) * (SLO window / alert window). For a 99.9% SLO over 30 days (720 hours), the Workbook's defaults are 2% of budget in a 1-hour window (0.02 * 720 / 1 = 14.4) and 5% of budget in a 6-hour window (0.05 * 720 / 6 = 6). Document these per service.
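The step 3 formula as a helper (a sketch; the parameter names are mine):

```python
def burn_rate_threshold(budget_fraction: float,
                        slo_window_hours: float,
                        alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the error budget
    within a single alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours

# 30-day SLO window (720 h):
#   2% of budget in 1 h -> 14.4 (fast burn)
#   5% of budget in 6 h -> 6.0  (slow burn)
```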
Step 4: Configure multi-window alerts
In your monitoring tool, alert only when BOTH the short-window (1 h) burn rate exceeds its threshold AND the long-window (6 h) burn rate exceeds a lower one. Requiring sustained burn across both windows structurally eliminates transient-spike false positives.
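The two-window condition can be sketched as a predicate, using the 14.4 and 6 thresholds the Google SRE Workbook recommends for a 30-day window (function and parameter names are illustrative):

```python
def should_page(burn_1h: float, burn_6h: float,
                fast: float = 14.4, slow: float = 6.0) -> bool:
    """Multi-window burn-rate alert. The short window shows the burn is
    happening now; the long window shows it is sustained, so a brief
    spike that inflates only the 1 h rate does not page anyone."""
    return burn_1h > fast and burn_6h > slow

# sustained outage: both windows elevated -> page
# 5-minute spike: 1 h rate jumps, 6 h rate barely moves -> no page
```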
Step 5: Run both systems in parallel
Keep your existing threshold alerts active but silent (suppressed) during the parallel period. Validate that every real incident triggered at least one SLO-based alert. After 30 days, delete the threshold alerts.
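The validation in step 5 can be checked mechanically. A sketch, assuming you can export incident start times and SLO alert firing times as lists of datetimes (the input format is hypothetical):

```python
from datetime import datetime, timedelta

def uncovered_incidents(incidents, slo_alerts, grace=timedelta(minutes=15)):
    """Return incidents that no SLO-based alert fired within `grace` of.
    An empty result over the whole parallel period is the signal that
    the old threshold alerts are safe to delete."""
    return [i for i in incidents
            if not any(abs(i - a) <= grace for a in slo_alerts)]
```

Run it over the full 30-day parallel window; any incident it returns is a gap in your SLO coverage to investigate before cutting over.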