ROTATION ECONOMICS

Primary + Escalation On-Call: 2-Tier Cost vs Single-Tier 2026

Updated June 2026. Sources: Google SRE Book Chapter 11 (Being On-Call), public PagerDuty and incident.io case studies.

The Division of Labour

Two-tier on-call splits incident response across two distinct rotations with different responsibilities. Tier-1, the primary on-call, takes every page first. The primary's job is to assess the page, execute documented runbook steps for known classes, and mitigate the routine 80 percent of incidents without escalation. Tier-2, variously named escalation rotation, secondary on-call, or senior on-call, is paged only when the primary cannot mitigate within a defined window (typically 15 to 30 minutes for Sev-1 incidents).
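The hand-off rule described above can be sketched as a small timing check. This is a minimal illustration, not any paging vendor's API: the function name `should_escalate` and the constant `SEV1_ESCALATION_WINDOW` are hypothetical, and the 15-minute default is simply the low end of the 15-to-30-minute Sev-1 range quoted in the text.

```python
from datetime import datetime, timedelta

# Assumed default for illustration; the text gives 15 to 30 minutes for Sev-1.
SEV1_ESCALATION_WINDOW = timedelta(minutes=15)


def should_escalate(paged_at: datetime, now: datetime, mitigated: bool,
                    window: timedelta = SEV1_ESCALATION_WINDOW) -> bool:
    """Return True when tier-2 should be paged.

    Tier-2 is paged only if the primary has not mitigated the incident
    and the escalation window has fully elapsed since the first page.
    """
    if mitigated:
        return False
    return now - paged_at >= window


paged = datetime(2026, 6, 1, 2, 0)
print(should_escalate(paged, paged + timedelta(minutes=10), mitigated=False))  # False
print(should_escalate(paged, paged + timedelta(minutes=15), mitigated=False))  # True
print(should_escalate(paged, paged + timedelta(minutes=40), mitigated=True))   # False
```

Keeping the window as a parameter makes it easy to try both ends of the quoted range against historical incident timelines before committing to a value.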

The model is described in less detail in Google SRE Book Chapter 11 than the follow-the-sun pattern, but is widely adopted in practice. The structural value comes from concentrating the deep-knowledge cognitive burden in a smaller pool of engineers (the tier-2 pool) while preserving fast first response from a wider engineer rotation (the tier-1 pool). It is also the natural pattern when an SRE team has a mix of senior and intermediate engineers and the seniors are too few to staff a primary rotation alone.

A common variation is a primary plus named-escalation model, where the escalation engineer is not on a rotation but is a single named person who is always reachable. This is structurally fragile (the named person becomes a permanent solo on-call from the escalation side) and should be treated as an interim arrangement, not a target state.

Page Propagation Rates

The fraction of tier-1 pages that escalate to tier-2 is the key operational metric for evaluating two-tier on-call. From public engineering blog write-ups and a small sample of operations data shared in conference talks (SREcon, DevOps Enterprise Summit), realistic ranges are as follows. Mature operations with strong runbook coverage and well-tuned alerts: 5 to 15 percent escalation rate. Typical operations: 15 to 30 percent escalation rate. Less mature operations with weak runbook coverage or noisy alerting: 30 to 60 percent escalation rate.
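To place a team against the ranges above, the escalation rate can be computed from two counts and mapped onto the quoted bands. The sketch below is illustrative only: `escalation_band` is a hypothetical helper, and because the quoted ranges share their boundary values (15 and 30 percent), it assumes a rate exactly on a boundary belongs to the lower band.

```python
def escalation_band(escalated_pages: int, total_pages: int) -> str:
    """Map an observed escalation rate onto the ranges quoted in the text."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    rate = escalated_pages / total_pages
    # Assumption: a rate exactly on a shared boundary (15 or 30 percent)
    # is assigned to the lower band.
    if rate < 0.05:
        return "below quoted ranges"
    if rate <= 0.15:
        return "mature"
    if rate <= 0.30:
        return "typical"
    if rate <= 0.60:
        return "less mature"
    return "above quoted ranges"


for escalated in (3, 8, 20):
    print(escalated, escalation_band(escalated, 42))
# 3 mature
# 8 typical
# 20 less mature
```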

A high escalation rate is a leading indicator of a runbook coverage problem, not a deep-complexity problem. Most pages can in principle be handled by tier-1 with a written runbook; the reason they escalate is that the runbook does not exist, is out of date, or assumes context the tier-1 engineer lacks. If your escalation rate is above 30 percent, the highest-leverage move is to invest in runbook documentation for the top-10 alert classes rather than to restructure rotations. The two-tier structure becomes much more valuable once runbook coverage is strong because tier-2 is then genuinely handling the residual deep-knowledge incidents rather than absorbing routine mitigation.

For planning purposes, model tier-2 load as tier-1 page volume multiplied by the escalation rate. Take a 6-person tier-1 rotation that receives 42 pages per week: the on-duty primary takes all 42 during their week, which averages out to 7 pages per engineer per week across the rotation cycle. At a 20 percent escalation rate, roughly 8 of those pages reach the on-duty tier-2 engineer, or about 2 per engineer per week averaged across a 4-person tier-2 rotation. The tier-2 burden is real but materially lighter than the tier-1 burden in volume, while heavier in cognitive intensity per page.
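The planning arithmetic can be written out explicitly. The sketch below assumes that the 42 pages per week is the volume arriving at the whole tier-1 rotation and that exactly one engineer per tier is on duty in any given week; `weekly_page_load` is a hypothetical helper name. It reports both the load in an on-duty week and the load averaged over the rotation cycle, since the two are easy to confuse.

```python
def weekly_page_load(pages_per_week: float, tier1_size: int,
                     tier2_size: int, escalation_rate: float) -> dict:
    """Pages per week, on duty and averaged per engineer, for each tier."""
    escalated = pages_per_week * escalation_rate
    return {
        "tier1_on_duty": pages_per_week,                # primary sees every page
        "tier1_averaged": pages_per_week / tier1_size,  # spread over the cycle
        "tier2_on_duty": escalated,                     # only escalated pages
        "tier2_averaged": escalated / tier2_size,
    }


load = weekly_page_load(42, tier1_size=6, tier2_size=4, escalation_rate=0.20)
for name, pages in load.items():
    print(f"{name}: {pages:.1f}")
# tier1_on_duty: 42.0
# tier1_averaged: 7.0
# tier2_on_duty: 8.4
# tier2_averaged: 2.1
```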

Compensation Premium and Total Cost

There is no industry-standard compensation premium for tier-2 on-call. Three patterns are common. Pattern one: no premium, on the reasoning that tier-2 is on-call less frequently and the deep-knowledge expectation is already priced into the senior salary. Pattern two: 10 to 15 percent on-call stipend for tier-2, matching the tier-1 stipend, on the reasoning that fairness across tiers matters more than precise burden matching. Pattern three: 20 to 30 percent premium for tier-2, on the reasoning that the per-page cognitive burden is higher and that senior engineers in this role have stronger outside options.

The cost comparison to single-tier rotation, for a 10-engineer team, is illustrative. Single-tier with no formal escalation: 10 engineers on weekly rotation at $180,000 fully loaded each, where the cost of the rotation is the slice of engineering time consumed by pages (per the /on-call-cost math, roughly $30,000 to $80,000 per engineer per year depending on volume, around $400,000 to $600,000 for the team). Two-tier with 6 on tier-1 and 4 on tier-2, with a 15 percent tier-2 premium: the premium is 15 percent of $180,000, or $27,000 per tier-2 engineer, so the visible added cost is roughly 4 * $27,000 = $108,000 per year. In return you typically get 20 to 40 percent faster MTTR on Sev-1 incidents (per PagerDuty and incident.io case studies), which means less revenue impact per incident. For revenue-sensitive teams the trade is favourable; for incidents with no revenue impact it is closer to break-even.
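The premium figure above follows from three inputs, which makes it easy to rerun with a team's own numbers. This is a minimal sketch: `tier2_premium_cost` is a hypothetical helper, and like the worked example it applies the premium to the fully loaded cost rather than to base salary.

```python
def tier2_premium_cost(tier2_size: int, fully_loaded_cost: int,
                       premium_percent: int) -> float:
    """Annual cost of paying a percentage premium to every tier-2 engineer."""
    return tier2_size * fully_loaded_cost * premium_percent / 100


premium = tier2_premium_cost(tier2_size=4, fully_loaded_cost=180_000,
                             premium_percent=15)
print(f"${premium:,.0f}")  # $108,000

# The premium as a share of the quoted team-wide on-call cost range.
for team_cost in (400_000, 600_000):
    print(f"{premium / team_cost:.0%} of ${team_cost:,}")
# 27% of $400,000
# 18% of $600,000
```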

One number to anchor the value of the MTTR improvement: at /outage-cost ranges (typical mid-market B2B SaaS Sev-1 cost is $50,000 to $500,000 per hour of impact), a 30 percent MTTR reduction on a Sev-1 that would otherwise last about an hour saves roughly $15,000 to $150,000. At the upper end, a single shortened Sev-1 per year covers the $108,000 tier-2 premium; at the lower end it takes about seven, and longer incidents shift the balance further toward two-tier.
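The savings estimate depends on how long the incident would have lasted without the improvement, which the text does not state. The sketch below assumes a one-hour baseline (an assumption for illustration only) and compares the result with the $108,000 premium from the 4-engineer example; `mttr_savings` is a hypothetical helper.

```python
def mttr_savings(cost_per_hour: int, baseline_hours: float,
                 reduction_percent: int) -> float:
    """Revenue impact avoided per incident when MTTR drops by a percentage."""
    return cost_per_hour * baseline_hours * reduction_percent / 100


PREMIUM = 108_000   # annual tier-2 premium from the worked example
BASELINE_HOURS = 1.0  # assumed incident length; not given in the text

for cost_per_hour in (50_000, 500_000):
    saved = mttr_savings(cost_per_hour, BASELINE_HOURS, 30)
    print(f"${cost_per_hour:,}/h: saves ${saved:,.0f} per incident, "
          f"break-even at {PREMIUM / saved:.1f} incidents per year")
# $50,000/h: saves $15,000 per incident, break-even at 7.2 incidents per year
# $500,000/h: saves $150,000 per incident, break-even at 0.7 incidents per year
```

Longer baseline incidents scale the saving linearly, so a three-hour Sev-1 at the low end of the cost range breaks even in roughly a third as many incidents.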

Two-Tier vs Single-Tier: Quick Reference

Dimension	Single-tier	Two-tier
First-response latency (MTTA)	Same engineer always responds	Same (primary always responds)
Sev-1 mitigation MTTR	Slower for non-routine incidents	20 to 40 percent faster typically
Engineer cognitive burden	Concentrated on whoever holds pager	Distributed: tier-1 routine, tier-2 deep
Runbook coverage incentive	Lower (everyone learns deep over time)	Higher (runbooks needed for tier-1)
Compensation cost	Single stipend	Tier-1 stipend + tier-2 premium
Minimum team size	5 to 6 engineers	8 to 10 engineers (rotation per tier)
Rotation experience	Same every week	Lighter tier-1 weeks, intense tier-2 weeks
Retention impact	Burden uniform across team	Risk concentrates in tier-2 if escalation too frequent

When to Add a Tier-2

Three signals indicate the rotation has outgrown single-tier and would benefit from a two-tier structure. First signal: ad-hoc escalation rate above 15 percent. If the primary regularly Slacks senior engineers off-rotation for incident help, you already have a two-tier rotation; it is just informal and unmeasured. Formalising the escalation rotation keeps that value while making the burden visible and deliberately distributed rather than arbitrary.

Second signal: Sev-1 MTTR is dominated by delays in reaching the right engineer rather than by mitigation complexity. If your incident postmortems repeatedly note "delayed escalation to deep-knowledge engineer" as a contributing cause, the structural fix is a tier-2 escalation rotation that is always reachable. Third signal: the team has clear seniority differentiation where a small set of engineers carries most of the deep system context. Two-tier formalises this; trying to staff a single-tier rotation evenly when half the rotation lacks the context to mitigate complex incidents creates either slow MTTR or constant ad-hoc escalation.

When not to add tier-2: when the team is below 8 engineers (rotation cadence breaks down), when runbook coverage is weak (the underlying problem is documentation, not rotation structure), or when the senior pool is too small to staff a tier-2 rotation sustainably (you risk creating a permanent solo tier-2 with all the burnout risk of solo on-call). Fix the underlying issues first; structure the tiers second.
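The signals and blockers in this section can be combined into a simple checklist. The sketch below is one possible encoding, not a standard: `TeamState` and `tier2_recommendation` are hypothetical names, it assumes any single signal is enough to recommend a tier-2, and the `min_tier2_pool` default of 4 is borrowed from the worked cost example because the text gives no threshold for a sustainable senior pool.

```python
from dataclasses import dataclass


@dataclass
class TeamState:
    engineers: int
    senior_engineers: int
    adhoc_escalation_rate: float  # fraction of pages escalated informally
    runbook_coverage_weak: bool
    postmortems_cite_delayed_escalation: bool
    context_held_by_few: bool


def tier2_recommendation(team: TeamState, min_team: int = 8,
                         min_tier2_pool: int = 4,
                         adhoc_threshold: float = 0.15) -> str:
    """Check blockers first, then the three signals from the text."""
    blockers = []
    if team.engineers < min_team:
        blockers.append("team below minimum size")
    if team.runbook_coverage_weak:
        blockers.append("weak runbook coverage")
    if team.senior_engineers < min_tier2_pool:
        blockers.append("senior pool too small")
    if blockers:
        return "fix first: " + ", ".join(blockers)

    # Assumption: any one signal is sufficient.
    signals = [
        team.adhoc_escalation_rate > adhoc_threshold,
        team.postmortems_cite_delayed_escalation,
        team.context_held_by_few,
    ]
    return "add tier-2" if any(signals) else "stay single-tier"


team = TeamState(engineers=10, senior_engineers=4, adhoc_escalation_rate=0.22,
                 runbook_coverage_weak=False,
                 postmortems_cite_delayed_escalation=True,
                 context_held_by_few=True)
print(tier2_recommendation(team))  # add tier-2
```

Checking blockers before signals mirrors the advice to fix the underlying issues first and structure the tiers second.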

Frequently Asked Questions

What is two-tier on-call?

Two-tier on-call splits responsibility between a primary on-call engineer (first responder for all pages) and a secondary or escalation tier (deep-knowledge engineers paged only when the primary cannot mitigate within a defined window). Common configurations: primary plus one named escalation, primary plus an escalation rotation, or primary plus a senior-on-call rotation. The goal is to distribute load and reduce the deep-knowledge burden across the team while preserving fast first response.

How often does a page escalate to tier-2?

Realistic ranges from public engineering blog write-ups and conference talks: 5 to 15 percent of pages escalate in mature operations, 15 to 30 percent in typical operations, and 30 to 60 percent in less mature operations. The escalation rate is a leading indicator of runbook coverage quality: a high escalation rate suggests the primary cannot mitigate most pages with documented procedures, which usually points to missing runbooks rather than legitimate complexity.

What is the compensation premium for tier-2?

Varies widely. Some organisations pay no tier-2 premium, on the reasoning that the deep-knowledge expectation is already priced into senior salaries. Others pay tier-2 the same 10 to 15 percent stipend as tier-1, and others pay a 20 to 30 percent premium because the role requires more senior engineers and because each off-hours interruption, while less frequent, is more demanding. There is no industry-standard formula; the answer is whatever your local market and total-comp policy support.

Does two-tier improve MTTR?

Yes for incidents that exceed primary mitigation capability; no for incidents the primary can handle. The structural improvement is the elimination of the I-do-not-know-who-to-call dead time when an incident exceeds tier-1 capability. Public PagerDuty and incident.io case studies show MTTR improvements of 20 to 40 percent for Sev-1 incidents in two-tier configurations compared to single-tier rotations with ad-hoc escalation.

When should you add a tier-2?

When more than roughly 15 to 20 percent of pages currently escalate ad-hoc, when you have at least 8 engineers (so each tier has a workable rotation), and when the cost of slow Sev-1 mitigation is material. Below 8 engineers it is hard to staff both tiers sustainably; above that scale a two-tier rotation usually improves both incident outcomes and rotation experience.

How does two-tier interact with follow-the-sun?

It composes well. A common large-organisation pattern is follow-the-sun for tier-1 primary responders (3 regions, daytime coverage in each) plus a centrally located tier-2 escalation rotation that absorbs the small fraction of pages requiring deep system knowledge. The escalation tier is paged less frequently and the night pages it does take are tolerable because of the lower volume.

Does tier-2 reduce alert fatigue?

Partially. It reduces the deep-knowledge cognitive burden on tier-1, who can focus on triage and routine mitigation. It does not reduce overall pager volume; the primary still sees every page. To reduce volume, run the alert hygiene work documented in /alert-tuning, /correlation-dedup, and /slo-vs-threshold in parallel with tier structure changes.