Primary + Escalation On-Call: 2-Tier Cost vs Single-Tier 2026
Updated May 2026. Sources: Google SRE Book Chapter 11 (Being On-Call), public PagerDuty and incident.io case studies, incident.io 2024 State of On-Call.
The Division of Labour
Two-tier on-call splits incident response across two distinct rotations with different responsibilities. Tier-1, the primary on-call, takes every page first. The primary's job is to assess the page, execute documented runbook steps for known incident classes, and mitigate the routine 80 percent of incidents without escalation. Tier-2, variously called the escalation rotation, secondary on-call, or senior on-call, is paged only when the primary cannot mitigate within a defined window (typically 15 to 30 minutes for Sev-1 incidents).
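The hand-off rule can be sketched as a small function. This is an illustration only, not any paging vendor's API: the `Page` dataclass, the `should_page_tier2` name, and the 15-minute default (the low end of the window quoted above) are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Page:
    """A single page as seen by the escalation logic (hypothetical shape)."""
    opened_at: datetime
    mitigated_at: Optional[datetime] = None  # None while still unmitigated


def should_page_tier2(page: Page, now: datetime,
                      window: timedelta = timedelta(minutes=15)) -> bool:
    """Return True once the primary has had the full window without mitigating."""
    if page.mitigated_at is not None:
        return False  # the primary handled it, so tier-2 is never paged
    return now - page.opened_at >= window
```

A team using the longer end of the range would pass `window=timedelta(minutes=30)`. Real paging tools express the same rule as an escalation policy rather than application code.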
The model is described in less detail in Google SRE Book Chapter 11 than the follow-the-sun pattern, but is widely adopted in practice. The structural value comes from concentrating the deep-knowledge cognitive burden on a smaller group of engineers (the tier-2 pool) while preserving fast first response from a wider rotation (the tier-1 pool). It is also the natural pattern when an SRE team has a mix of senior and intermediate engineers and the seniors are too few to staff a primary rotation alone.
A common variation is a primary plus named-escalation model, where the escalation engineer is not on a rotation but is a single named person who is always reachable. This is structurally fragile (the named person becomes a permanent solo on-call from the escalation side) and should be treated as an interim arrangement, not a target state.
Page Propagation Rates
The fraction of tier-1 pages that escalate to tier-2 is the key operational metric for evaluating two-tier on-call. From public engineering blog write-ups and a small sample of operations data shared in conference talks (SREcon, DevOps Enterprise Summit), realistic ranges are as follows. Mature operations with strong runbook coverage and well-tuned alerts: 5 to 15 percent escalation rate. Typical operations: 15 to 30 percent escalation rate. Less mature operations with weak runbook coverage or noisy alerting: 30 to 60 percent escalation rate.
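A minimal sketch of how the metric and the bands above might be computed from page counts. The function names and band labels are hypothetical. Because the quoted ranges share their endpoints (15 and 30 percent), the sketch assigns each boundary value to the lower band, which is an assumption rather than anything the sources specify.

```python
def escalation_rate(tier1_pages: int, escalated_pages: int) -> float:
    """Fraction of tier-1 pages that were escalated to tier-2."""
    if tier1_pages <= 0:
        raise ValueError("tier1_pages must be positive")
    return escalated_pages / tier1_pages


def maturity_band(rate: float) -> str:
    """Map an escalation rate onto the ranges quoted in the text."""
    if rate < 0.05:
        return "below quoted ranges"
    if rate <= 0.15:
        return "mature"        # 5 to 15 percent
    if rate <= 0.30:
        return "typical"       # 15 to 30 percent
    if rate <= 0.60:
        return "less mature"   # 30 to 60 percent
    return "above quoted ranges"


# Example: 200 tier-1 pages in a quarter, 20 of them escalated.
print(maturity_band(escalation_rate(200, 20)))
```

Measuring over a quarter rather than a week smooths out the small-sample noise that a handful of escalations would otherwise introduce.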
A high escalation rate is a leading indicator of a runbook coverage problem, not a deep-complexity problem. Most pages can in principle be handled by tier-1 with a written runbook; the reason they escalate is that the runbook does not exist, is out of date, or assumes context the tier-1 engineer lacks. If your escalation rate is above 30 percent, the highest-leverage move is to invest in runbook documentation for the top-10 alert classes rather than to restructure rotations. The two-tier structure becomes much more valuable once runbook coverage is strong because tier-2 is then genuinely handling the residual deep-knowledge incidents rather than absorbing routine mitigation.
For planning purposes, model the tier-2 load as the tier-1 page volume multiplied by the escalation rate. Take a 6-person tier-1 rotation that receives 42 pages over a six-week cycle: each engineer handles roughly 7 pages during their primary week. At a 20 percent escalation rate, the tier-2 engineer on duty in a 4-person rotation sees roughly 1.4 escalated pages per on-call week. The tier-2 burden is real but materially lighter than the tier-1 burden in volume, while heavier in cognitive intensity per page.
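The burden arithmetic can be sketched as follows, using the 7 pages per primary week, the 20 percent escalation rate, and the rotation sizes from the paragraph above. The function names are hypothetical, and the annualised figures assume an evenly shared weekly rotation with a flat page rate all year, which the text does not state.

```python
def escalated_pages_per_week(tier1_pages_per_week: float,
                             escalation_rate: float) -> float:
    """Expected tier-2 pages in a week, given tier-1 volume and escalation rate."""
    return tier1_pages_per_week * escalation_rate


def on_call_weeks_per_year(rotation_size: int, weeks_per_year: int = 52) -> float:
    """Weeks per year each engineer holds the pager in an evenly shared weekly rotation."""
    return weeks_per_year / rotation_size


# Figures from the text: 7 pages per primary week, 20 percent escalation,
# 6 engineers on tier-1 and 4 on tier-2.
tier1_weekly = 7.0
tier2_weekly = escalated_pages_per_week(tier1_weekly, 0.20)

# Annualised per-engineer page counts (assumes a flat page rate all year).
tier1_yearly = tier1_weekly * on_call_weeks_per_year(6)
tier2_yearly = tier2_weekly * on_call_weeks_per_year(4)

print(f"tier-2 pages per on-call week: {tier2_weekly:.1f}")
print(f"tier-1 pages per engineer per year: {tier1_yearly:.0f}")
print(f"tier-2 pages per engineer per year: {tier2_yearly:.0f}")
```

Under these assumptions a tier-2 engineer holds the pager more often (13 weeks a year against roughly 9 for tier-1) but takes far fewer pages in total.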
Compensation Premium and Total Cost
There is no industry-standard compensation premium for tier-2 on-call. Three patterns are common. Pattern one: no premium, on the reasoning that tier-2 is on-call less frequently and the deep-knowledge expectation is already priced into the senior salary. Pattern two: 10 to 15 percent on-call stipend for tier-2, matching the tier-1 stipend, on the reasoning that fairness across tiers matters more than precise burden matching. Pattern three: 20 to 30 percent premium for tier-2, on the reasoning that the per-page cognitive burden is higher and that senior engineers in this role have stronger outside options.
The cost comparison to single-tier rotation, for a 10-engineer team, is illustrative. Single-tier with no formal escalation: 10 engineers on weekly rotation, $180,000 fully-loaded each, with the cost of the rotation being the slice of engineering time consumed by pages (per /on-call-cost math, roughly $30,000 to $80,000 per engineer per year depending on volume, which is $300,000 to $800,000 across ten engineers at the extremes and around $400,000 to $600,000 for a typical team). Two-tier with 6 on tier-1 and 4 on tier-2, with a 15 percent tier-2 premium: the visible salary premium is roughly 4 × (0.15 × $180,000) = 4 × $27,000 = $108,000 per year. In return you typically get 20 to 40 percent faster MTTR on Sev-1 incidents (per PagerDuty and incident.io case studies), which translates to materially less revenue impact per incident. For revenue-sensitive teams the trade is favourable; for incidents with no revenue impact the trade is closer to break-even.
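The premium arithmetic is simple enough to parameterise. The sketch below uses the figures from the example (4 tier-2 engineers, $180,000 fully-loaded, 15 percent premium) and then sweeps the 20 to 30 percent range mentioned earlier. The function name is hypothetical, and treating the premium as a flat percentage of fully-loaded cost is an assumption carried over from the example.

```python
def tier2_premium_cost(tier2_engineers: int, loaded_salary: float,
                       premium_rate: float) -> float:
    """Annual visible cost of paying a tier-2 premium on top of salary."""
    return tier2_engineers * loaded_salary * premium_rate


# Example from the text, then the higher premium range for comparison.
for rate in (0.15, 0.20, 0.30):
    cost = tier2_premium_cost(4, 180_000, rate)
    print(f"{rate:.0%} premium: ${cost:,.0f} per year")
```

Swapping in your own headcount and salary figures gives the number to weigh against the MTTR benefit.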
One number to anchor the MTTR improvement value: at /outage-cost ranges (typical mid-market B2B SaaS Sev-1 cost is $50,000 to $500,000 per hour of impact), a 30 percent MTTR reduction on Sev-1 typically saves $20,000 to $150,000 per incident. Against the $108,000 tier-2 premium in the example above, that is break-even at roughly one Sev-1 per year at the top of the range and five or six at the bottom; for teams whose outages sit at the expensive end, a single Sev-1 more than pays for the premium.
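A rough way to run this calculation for your own numbers is sketched below. The function names are hypothetical. The `baseline_hours` parameter (how long a Sev-1 lasts before any MTTR improvement) is an assumption: the text gives a per-hour cost and a per-incident saving but not the incident duration that links them.

```python
def saving_per_incident(cost_per_hour: float, baseline_hours: float,
                        mttr_reduction: float) -> float:
    """Impact cost avoided on one incident when MTTR shrinks by mttr_reduction."""
    return cost_per_hour * baseline_hours * mttr_reduction


def incidents_to_break_even(annual_premium: float, saving: float) -> float:
    """Number of Sev-1 incidents per year needed to cover the annual premium."""
    return annual_premium / saving


# Assumed one-hour baseline at the low end of the per-hour range.
low_end = saving_per_incident(50_000, baseline_hours=1.0, mttr_reduction=0.30)
print(f"saving at $50,000/hour, 1h baseline: ${low_end:,.0f}")

# Break-even against the $108,000 premium, using the per-incident range from the text.
ANNUAL_PREMIUM = 108_000
for saving in (20_000, 150_000):
    n = incidents_to_break_even(ANNUAL_PREMIUM, saving)
    print(f"${saving:,} saved per incident: break-even at {n:.2f} incidents per year")
```

The break-even count is sensitive to where a team sits in the per-hour range, so it is worth running with measured outage costs rather than the generic band.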
Two-Tier vs Single-Tier: Quick Reference
| Dimension | Single-tier | Two-tier |
|---|---|---|
| First-response latency (MTTA) | Set by whoever holds the pager | Unchanged (the primary still responds first) |
| Sev-1 mitigation MTTR | Slower for non-routine incidents | 20 to 40 percent faster typically |
| Engineer cognitive burden | Concentrated on whoever holds pager | Distributed: tier-1 routine, tier-2 deep |
| Runbook coverage incentive | Lower (everyone builds deep knowledge over time) | Higher (runbooks needed for tier-1) |
| Compensation cost | Single stipend | Tier-1 stipend + tier-2 premium |
| Minimum team size | 5 to 6 engineers | 8 to 10 engineers (rotation per tier) |
| Rotation experience | Same every week | Lighter tier-1 weeks, intense tier-2 weeks |
| Retention impact | Burden uniform across team | Risk concentrates in tier-2 if escalation too frequent |
When to Add a Tier-2
Three signals indicate the rotation has outgrown single-tier and would benefit from a two-tier structure. First signal: ad-hoc escalation rate above 15 percent. If the primary regularly Slacks senior engineers off-rotation for incident help, you already have a two-tier rotation; it is just informal and unmeasured. Formalising the escalation rotation captures the same value while making the burden visible and distributing it deliberately rather than arbitrarily.
Second signal: Sev-1 MTTR is dominated by escalation delays (the time it takes to reach an engineer with the right context) rather than by mitigation complexity. If your incident postmortems repeatedly note "delayed escalation to deep-knowledge engineer" as a contributing cause, the structural fix is a tier-2 escalation rotation that is always reachable. Third signal: the team has clear seniority differentiation where a small set of engineers carries most of the deep system context. Two-tier formalises this; trying to staff a single-tier rotation evenly when half the rotation lacks the context to mitigate complex incidents creates either slow MTTR or constant ad-hoc escalation.
When not to add tier-2: when the team is below 8 engineers (rotation cadence breaks down), when runbook coverage is weak (the underlying problem is documentation, not rotation structure), or when the senior pool is too small to staff a tier-2 rotation sustainably (you risk creating a permanent solo tier-2 with all the burnout risk of solo on-call). Fix the underlying issues first; structure the tiers second.
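The signals and blockers in this section can be turned into a rough checklist. The sketch below is one possible encoding, not a definitive rule. The `TeamSnapshot` fields and function names are hypothetical, the minimum tier-2 pool of 4 is borrowed from the worked example rather than stated as a threshold, and treating any single signal as sufficient is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamSnapshot:
    """Inputs to the checklist (hypothetical shape)."""
    team_size: int
    senior_pool_size: int
    ad_hoc_escalation_rate: float  # fraction of pages where off-rotation help is pulled in
    postmortems_cite_delayed_escalation: bool
    clear_seniority_split: bool
    runbook_coverage_strong: bool


def tier2_blockers(team: TeamSnapshot, min_team_size: int = 8,
                   min_tier2_pool: int = 4) -> List[str]:
    """Reasons to fix underlying issues before adding a tier-2 rotation."""
    blockers = []
    if team.team_size < min_team_size:
        blockers.append("team too small for two rotations")
    if not team.runbook_coverage_strong:
        blockers.append("weak runbook coverage")
    if team.senior_pool_size < min_tier2_pool:  # assumed threshold
        blockers.append("senior pool too small to rotate sustainably")
    return blockers


def tier2_signals(team: TeamSnapshot, escalation_threshold: float = 0.15) -> List[str]:
    """Signals that the rotation has outgrown single-tier."""
    signals = []
    if team.ad_hoc_escalation_rate > escalation_threshold:
        signals.append("ad-hoc escalation rate above threshold")
    if team.postmortems_cite_delayed_escalation:
        signals.append("postmortems cite delayed escalation")
    if team.clear_seniority_split:
        signals.append("deep context held by a small senior group")
    return signals


def should_add_tier2(team: TeamSnapshot) -> bool:
    """True when at least one signal is present and nothing blocks the change."""
    return not tier2_blockers(team) and bool(tier2_signals(team))
```

Returning the lists of blockers and signals, rather than a bare yes or no, keeps the reasoning visible when the result is discussed with the team.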