pingfatigue.com is an independent, vendor-neutral reference on alert fatigue. Not affiliated with PagerDuty, Atlassian, Splunk, or any other vendor. Tool comparisons may contain affiliate links, clearly labelled.
OPERATIONAL TEMPLATES

Runbooks and On-Call Design: Templates That Reduce Alert Fatigue (2026)

Copy-paste-ready templates. Updated April 2026 | Sources: Google SRE Book runbook chapter, PagerDuty schedule documentation, incident.io on-call design guides

Why Runbooks Reduce Alert Fatigue

An alert without a runbook forces the on-call engineer to reconstruct the investigation path from scratch every time the alert fires. That means longer mean time to resolution (MTTR), higher cognitive load, a higher chance of escalation, and, most importantly, an engineer who learns to dread the alert rather than handle it confidently. The inverse also holds: a well-maintained runbook turns a 2am page from an anxiety event into a 10-minute procedure.

- 30% MTTR reduction from runbooks (PagerDuty, 2023)
- 3x faster P1 resolution with a linked runbook vs. none (FireHydrant, 2023)
- 78% of incidents have no runbook at the time of first occurrence (Catchpoint, 2024)

Runbook Template (Copy-to-Clipboard)

runbook-template.md
# Runbook: [ALERT NAME]

## What this alert means
Briefly: what is firing and why does it matter to users?

## Severity
P1 / P2 / P3 (delete as appropriate)
On-call action required: YES / NO (business hours only)

## First responder checklist
- [ ] Check [dashboard URL] to confirm alert is real
- [ ] Check [status page / deployment log] for recent changes
- [ ] Run: `[diagnostic command]`
- [ ] Expected output: [what normal looks like]
- [ ] If output shows [X]: proceed to Resolution

## Resolution steps
1. [Specific action]
2. [Specific action]
3. Verify recovery: [check command or dashboard link]
4. Resolve the alert and close the incident

## Escalation
If not resolved within [N] minutes, escalate to: [name / team / on-call tier]
Escalation channel: [Slack channel or phone]

## Rollback
If the issue was caused by a recent deploy:
`[rollback command]`

## Post-mortem
Required if: P1 severity, or if resolution took > 30 minutes.
Template: [link to post-mortem template]

## Alert history
Last reviewed: [date]
Owner: [team name]
Related alerts: [list]
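A template only reduces fatigue if runbooks actually follow it. One cheap enforcement mechanism is a lint step in CI that flags runbooks missing required sections. A minimal sketch, assuming runbooks live as .md files in a runbooks/ directory (the directory name and the exact required-section list are illustrative choices, not part of the template above):

```python
from pathlib import Path

# Section headings every runbook must contain, per the template above.
REQUIRED_SECTIONS = [
    "## What this alert means",
    "## Severity",
    "## First responder checklist",
    "## Resolution steps",
    "## Escalation",
]

def missing_sections(text: str) -> list[str]:
    """Return the required headings absent from a runbook body."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

def lint_runbooks(directory: str = "runbooks") -> dict[str, list[str]]:
    """Map each non-conforming runbook file to its missing sections."""
    report = {}
    for path in Path(directory).glob("*.md"):  # assumed layout
        gaps = missing_sections(path.read_text())
        if gaps:
            report[str(path)] = gaps
    return report
```

Wiring `lint_runbooks` into CI (failing the build when the report is non-empty) keeps new alerts from shipping without a runbook skeleton.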

On-Call Rotation Patterns

Follow-the-Sun
Best for: Teams of 12+ with global distributed engineering

3 regional pods (Americas, EMEA, APAC) each take an 8-10 hour window. No engineer is paged outside working hours.

PROS
  • No night pages for any engineer
  • Fresh team each window
  • Best for global customer base
CONS
  • Requires minimum ~12 engineers (4 per region)
  • Handoff documentation is critical
  • Complex to maintain across time zones
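The routing logic above reduces to a window lookup. A sketch with three fixed 8-hour UTC windows; the boundaries here are illustrative assumptions, since real pods anchor windows to local working hours:

```python
# Hypothetical follow-the-sun windows (start hour inclusive, end exclusive).
PODS = [
    ("APAC", 0, 8),        # 00:00-08:00 UTC
    ("EMEA", 8, 16),       # 08:00-16:00 UTC
    ("Americas", 16, 24),  # 16:00-24:00 UTC
]

def pod_on_call(utc_hour: int) -> str:
    """Return the regional pod covering the given UTC hour."""
    for name, start, end in PODS:
        if start <= utc_hour < end:
            return name
    raise ValueError("utc_hour must be in 0-23")
```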

Primary / Secondary
Best for: Teams of 6-12 with moderate page volume

Two on-call engineers at all times. Primary handles all pages. Secondary covers if primary is unavailable or escalates.

PROS
  • Simple to configure
  • Provides backup for all pages
  • Primary can sleep if secondary is awake
CONS
  • Both engineers are technically on-call
  • Secondary may develop shadow-fatigue
  • Does not reduce page volume

Week-on Week-off
Best for: Low-noise teams (under 10 pages/week)

Engineer is fully on-call for one week, then fully off for the next rotation. Simple rotation scheduling.

PROS
  • Maximum recovery time between rotations
  • Simple to plan holidays around
  • Clear ownership per week
CONS
  • One bad week can cause severe burnout
  • Knowledge concentration in primary
  • Rotation frequency grows painful below 6 engineers
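The weekly rotation above is simple modular arithmetic: who is on call depends only on how many whole weeks have elapsed since the rotation began. A sketch, assuming the roster order and start date are configured elsewhere:

```python
from datetime import date

def on_call(roster: list[str], rotation_start: date, day: date) -> str:
    """Return the engineer on call on `day`, for a weekly rotation
    that began with roster[0] on `rotation_start`."""
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]
```

With six engineers in the roster, each engineer is on call one week in six, which is why the rotation "grows painful" below that size.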

12-Hour Shifts
Best for: High-volume operations teams (50+ pages/week)

On-call is split into day shift (08:00-20:00) and night shift (20:00-08:00). Two separate engineers per day.

PROS
  • No single engineer carries 24-hour responsibility
  • Night shift engineer is awake
  • Better rest during non-shift hours
CONS
  • Requires large rotation (minimum 8 engineers)
  • Handoffs at shift boundaries require care
  • Scheduling complexity

Shadowing Rotation
Best for: Teams onboarding new engineers or expanding rotation size

New engineers shadow an experienced on-call engineer for 2-4 weeks before taking primary responsibility.

PROS
  • Accelerates on-call readiness for new joiners
  • Reduces MTTA from inexperience
  • Builds runbook culture organically
CONS
  • Two engineers per shift during shadow period
  • Experienced engineers carry more load temporarily
  • Requires structured runbooks to shadow effectively

Blameless Post-Mortem Template

Post-mortems close the feedback loop between incidents and alert quality. Without them, the same alerts fire for the same reasons indefinitely.

postmortem-template.md
# Post-Mortem: [INCIDENT TITLE]

**Date:** [YYYY-MM-DD]
**Duration:** [start time] to [end time] ([N] hours)
**Severity:** P1 / P2
**Impact:** [Who was affected and how]

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Incident begins |
| HH:MM | Alert fires |
| HH:MM | On-call acknowledges |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |
| HH:MM | Incident closed |

## Root cause
[One clear sentence. Avoid blame.]

## Contributing factors
- [Factor 1]
- [Factor 2]

## What went well
- [Item]

## What could be improved
- [Item]

## Action items
| Action | Owner | Due |
|--------|-------|-----|
| [Fix] | [Name] | [Date] |
| [Alert update] | [Name] | [Date] |

## Detection
How was this detected? (Monitoring alert / customer report / internal notice)
Was the alert sufficient? YES / NO -- if NO, what change is needed?
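The timeline table also yields the numbers worth trending across post-mortems, such as time to acknowledge and time to restore. A minimal sketch for turning HH:MM timeline entries into minutes (the field names and times are hypothetical, and it assumes all entries fall on the same day):

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timeline entries."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Hypothetical timeline entries pulled from the table above.
timeline = {"alert_fires": "02:04", "acknowledged": "02:09", "restored": "02:41"}
tta = minutes_between(timeline["alert_fires"], timeline["acknowledged"])
ttr = minutes_between(timeline["alert_fires"], timeline["restored"])
```

Tracking `tta` and `ttr` per incident makes it obvious when an alert-quality action item actually moved the needle.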