Alert Fatigue at a 5-Engineer Startup: Realistic Page Volume + Cost

Updated June 2026. Sources: Catchpoint 2024 SRE Report, Google SRE Book Chapter 11.

What 5-Engineer On-Call Actually Looks Like

At 5 engineers, on-call is everyone's job. There is no dedicated SRE function, no platform team, and no clear separation between "the people who write features" and "the people who keep the system running". Each engineer takes a week of primary pager duty roughly once every five weeks. The rotation is short enough that everyone keeps recent system context, an advantage that disappears at larger scale.

The system itself is usually small: a single primary application, a few supporting services, one or two databases, a single monitoring stack (often a Datadog free tier or a managed Prometheus). The alert surface area is correspondingly small: realistically 30 to 80 distinct alert rules, mostly threshold-based, mostly inherited from earlier days when the founding engineer wrote them quickly to cover gaps. Most of these rules have not been audited since they were authored.

The result is that at 5-engineer scale, alert fatigue almost always shows up as repeated noisy weeks driven by a small set of poorly-tuned rules that nobody owns. The fix is not buying better tooling; it is doing the audit. A senior engineer spending a half-day on the alert ruleset typically removes 30 to 60 percent of noise within one cycle. The opportunity cost is half a day of feature work, and the return is one engineer-week per quarter of recovered focus across the team.

Expected Page Volume Benchmarks

State	Pages / engineer / week	False-positive rate	Sustainable for	Action
Healthy	Under 8	Under 30%	Indefinitely	Maintain; monthly review
Stressed	8 to 20	30 to 60%	6 to 12 months	Quarterly audit, runbook top-10 alerts
Critical	20 to 40	60 to 80%	3 to 6 months	Immediate audit, kill rules, migrate to SLO
Untenable	Above 40	Above 80%	Weeks	Emergency response, suspend non-Sev-1 alerts

The healthy threshold is materially lower than the larger-team target because of rotation cadence. At 42 pages per engineer per week (an illustrative figure for larger orgs), a 5-engineer team would page a single primary 42 times during their week, and that engineer is primary again only four weeks later, which is structurally untenable. The Google SRE Book Chapter 11 healthy threshold of roughly 14 per week applies more cleanly to larger rotations with longer recovery between primary weeks; at 5-engineer scale the threshold should be tightened further to keep each primary week tolerable.
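The page-volume bands in the table can be encoded as a small lookup, for example to label each primary week in a retrospective. This is a minimal sketch in Python; the function name is hypothetical, and how the boundary values (8, 20, 40) are assigned is an assumption, since the table does not say which band they belong to.

```python
def classify_page_load(pages_per_week: float) -> str:
    """Map one primary week's page count to a state band from the table."""
    if pages_per_week < 0:
        raise ValueError("page count cannot be negative")
    if pages_per_week < 8:
        return "Healthy"
    # Assumption: boundary values fall into the lower of the two bands they touch.
    if pages_per_week <= 20:
        return "Stressed"
    if pages_per_week <= 40:
        return "Critical"
    return "Untenable"


if __name__ == "__main__":
    for count in (5, 14, 42):
        print(count, classify_page_load(count))
```

The sketch uses page volume only; the false-positive column is better measured per rule from alert history than inferred from volume.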

The Fix Sequence at 5-Engineer Scale

The fix sequence at small scale differs from the larger-team playbook because the constraints are different. There is no budget for AIOps, no team to dedicate to a multi-month SLO migration, and no engineering capacity to maintain a complex alert ruleset. The right sequence prioritises moves that compound across a small team rather than moves that compound across many alert rules.

Move one (first week): audit and kill. List every alert rule. For each, answer: when this fires, what is the human response? If the answer is "nothing, it usually self-heals" or "I check Datadog and confirm it is fine", the rule should be killed or downgraded to a non-paging notification. Expect to remove 20 to 40 percent of rules outright. This is the single highest-leverage move at any scale and is especially valuable at startup scale because the small ruleset can be reviewed thoroughly in a half day.
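The audit question above can be applied mechanically once each rule has an honest answer recorded against it. The sketch below is one possible way to tabulate the result in Python; the class, the response labels, and the rule names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the answer to "when this fires, what is the human response?"
NO_ACTION = "none"           # "nothing, it usually self-heals"
CONFIRM_ONLY = "confirm_ok"  # "I check the dashboard and confirm it is fine"
ACTION = "action"            # a human actually changes something


@dataclass
class AlertRule:
    name: str
    human_response: str  # one of NO_ACTION, CONFIRM_ONLY, ACTION


def triage(rule: AlertRule) -> str:
    """Return 'keep' for actionable rules, 'kill_or_downgrade' for the rest."""
    if rule.human_response == ACTION:
        return "keep"
    if rule.human_response in (NO_ACTION, CONFIRM_ONLY):
        return "kill_or_downgrade"
    raise ValueError(f"unknown response for {rule.name}: {rule.human_response}")


if __name__ == "__main__":
    # Illustrative rules only; a real audit would list every rule in the monitoring stack.
    rules = [
        AlertRule("api_5xx_rate", ACTION),
        AlertRule("worker_restart", NO_ACTION),
        AlertRule("cpu_above_80", CONFIRM_ONLY),
        AlertRule("db_connections_exhausted", ACTION),
    ]
    removed = [r.name for r in rules if triage(r) == "kill_or_downgrade"]
    print(f"kill or downgrade {len(removed)} of {len(rules)} rules: {removed}")
```

Whether a flagged rule is deleted or downgraded to a non-paging notification remains a judgment call; the sketch only separates the two piles.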

Move two (weeks 2 to 4): runbook the survivors. For each remaining paging rule, write a short runbook (8 to 12 lines is fine) covering: what this means, first diagnostic step, common cause, mitigation, when to escalate. The investment is roughly 30 minutes per rule. Total time for 20 surviving rules: 10 engineer-hours. The benefit is that any engineer in the rotation can mitigate without deep context, which is essential at small scale where context distribution is uneven.

Move three (months 2 to 3): migrate user-facing paths to SLO-based alerting. Identify the 3 to 5 most critical user-facing flows (signup, key feature, checkout). For each, define an SLO (e.g. 99.5 percent successful response within 2 seconds), compute the error budget, and configure a burn-rate alert. Replace the previous threshold-based alerts on those paths. This is the structural fix that prevents new noisy rules from accumulating; future engineers asking "should this be a paging alert" have a clearer test (does it threaten the SLO?). Read /slo-vs-threshold for the implementation pattern.
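To make the error-budget and burn-rate arithmetic concrete, here is a minimal Python sketch using the 99.5 percent example. The function names are hypothetical. The paging threshold of 14.4 is an assumption, not something this page prescribes: it is a commonly used value that corresponds to spending 2 percent of a 30-day budget in one hour. A "bad" request here means one that failed or took longer than 2 seconds.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to be bad, e.g. 0.005 for a 99.5% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return 1.0 - slo_target


def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the window is burning."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    observed_bad_ratio = bad_requests / total_requests
    return observed_bad_ratio / error_budget(slo_target)


def should_page(bad_requests: int, total_requests: int,
                slo_target: float = 0.995, threshold: float = 14.4) -> bool:
    """Assumed policy: page when the window's burn rate reaches the threshold."""
    return burn_rate(bad_requests, total_requests, slo_target) >= threshold


if __name__ == "__main__":
    # Illustrative one-hour window: 10,000 checkout requests, 800 of them bad.
    print(burn_rate(800, 10_000, 0.995), should_page(800, 10_000))
```

In this illustrative window 8 percent of requests are bad against a 0.5 percent budget, a burn rate of about 16, so the sketch would page. A window with 10 bad requests out of 10,000 burns at about 0.2 and would not.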

The 12-Month Roadmap

Quarter 1: complete the audit-and-kill plus runbook-the-survivors moves above. Expected outcome: 40 to 60 percent reduction in pager volume, runbook coverage on every paging alert. Effort: 15 to 20 engineer-hours total across the team. Investment in tooling: zero.

Quarter 2: SLO migration on user-facing paths. Expected outcome: structurally prevent re-accumulation of noise on critical flows, plus modest further volume reduction (10 to 20 percent on top of Q1). Effort: 20 to 40 engineer-hours, primarily by the engineer with the strongest production-systems instinct. Investment in tooling: zero if your existing monitoring supports SLO queries (Datadog, Grafana, New Relic all do); modest if you need to migrate to a tool that does.

Quarter 3: institute monthly alert hygiene reviews. The investment is 30 minutes per month of senior engineer time. The output is a list of any new noisy rules to tune, kill, or runbook. This is the practice that prevents drift over time and is the cheapest discipline available. Expected outcome: alert volume stays flat or trends downward even as the system grows.
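The monthly review amounts to a per-rule tally over the last month's pages. The sketch below shows one way to produce the list of noisy rules in Python. The Page record, the rule names, and both cut-offs (at least 3 pages, more than 50 percent non-actionable) are assumptions chosen for illustration; a real review would read pages from the paging tool's export.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Page(NamedTuple):
    rule: str
    actionable: bool  # False when the page auto-resolved or needed no action


def noisy_rules(pages: Iterable[Page], min_pages: int = 3,
                max_false_positive: float = 0.5) -> List[Tuple[str, int, float]]:
    """Return (rule, page_count, false_positive_rate) for rules worth tuning."""
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for page in pages:
        totals[page.rule] += 1
        if not page.actionable:
            false_positives[page.rule] += 1

    flagged = []
    for rule, total in totals.items():
        fp_rate = false_positives[rule] / total
        # Assumed cut-offs: ignore rules with too few pages to judge.
        if total >= min_pages and fp_rate > max_false_positive:
            flagged.append((rule, total, fp_rate))
    # Highest-volume offenders first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged


if __name__ == "__main__":
    last_month = (
        [Page("disk_usage_high", False)] * 4
        + [Page("checkout_5xx", True)] * 3
        + [Page("cpu_spike", False)] * 2
    )
    for rule, count, rate in noisy_rules(last_month):
        print(f"{rule}: {count} pages, {rate:.0%} false positive")
```

Each flagged rule still needs the human decision described above: tune, kill, or runbook.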

Quarter 4: consider whether the next hire should be a dedicated SRE or platform engineer. By this point the team is typically growing past 8 engineers and the operational scope is starting to drag on focused product work. A dedicated SRE at this stage is high-leverage if the system has reached enough complexity to justify the specialised attention. If the system is still small and well-tuned, a generalist engineer is the more flexible hire. The decision should be driven by data from the alert hygiene reviews, not by outside pressure to "hire SREs because we are growing".

What Not to Do at 5-Engineer Scale

Avoid AIOps vendor purchases. The pricing model assumes scale that 5-engineer teams do not have, and the value proposition (correlation across many monitoring tools, ML-based grouping at very high event volume) is irrelevant to small homogeneous stacks. The right correlation at this scale is the basic dedup that comes free with an entry tier such as PagerDuty Professional or Jira Service Management.

Avoid follow-the-sun rotations. At 5 total engineers you do not have the regional staffing minimums for a credible three-region rotation. If night pages are dominant, the right move is to tune the alerts to reduce night-page volume (most night pages at small scale come from the same false-positive rules that go untuned during the day) rather than to spread the rotation across regions you cannot staff.

Avoid implementing two-tier on-call. At 5 engineers each tier would have at most 3 people on rotation, so a single absence breaks it. Stick with single-tier rotation across all engineers and rely on ad-hoc escalation in Slack when a primary genuinely needs help. The volume is low enough that ad-hoc works; formalising the escalation tier adds complexity without proportional benefit.

Frequently Asked Questions

How many pages per week is realistic at a 5-engineer startup?

Healthy: under 8 pages per engineer per week. Typical untuned: 15 to 30 pages per engineer per week. Above 30 is structural overload at 5-engineer scale because the rotation gives each engineer the pager every fifth week, and a high-volume week as solo primary becomes unsustainable. A commonly cited illustrative figure of 42 pages per week applies to larger teams; small teams should target lower because each engineer returns to primary sooner.

What false-positive rate is realistic at 5 engineers?

Treat false-positive rate as something you measure from your own 90-day alert history rather than an assumed figure; untuned small systems are frequently noisy. Smaller systems do not have systematically lower false-positive rates; the noise is just smaller in absolute volume. The advantage of being small is that fewer alert rules exist and audit is faster: a thorough quarterly review of 30 to 80 alert rules is a one-engineer-day exercise at startup scale and is the highest-leverage move available.

Should a 5-engineer startup buy PagerDuty?

Probably not at the Business tier. An entry tier such as PagerDuty Professional (or Jira Service Management if you are on the Atlassian stack) is sufficient at small scale because the noise-reduction features in higher tiers do not yet pay for themselves. Note that Opsgenie, long the cheap default here, is closed to new customers and shuts down in April 2027, so it is no longer an option for a new team. Spend the saving on one senior-engineer hour per week dedicated to alert audit and runbook authoring; the impact outweighs that of the tier upgrade.

What is the realistic on-call cost at 5 engineers?

At the typical untuned level (20 pages per engineer per week, 70 percent false positive, $180,000 fully-loaded cost), the direct alert-handling time across the team is roughly $80,000 to $150,000 per year. That excludes the night-page premium and the contingent attrition cost. After alert hygiene reduces volume to under 8 pages per engineer per week, the direct cost drops to roughly $30,000 to $60,000 per year, materially less burdensome at this scale.
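The ranges above can be sanity-checked with a few lines of arithmetic. The Python sketch below takes the page volume and the $180,000 fully-loaded cost from the answer, and treats "pages per engineer per week" as the load on whoever is primary that week. Two inputs are assumptions that this page does not state: a 2,080-hour working year, and 1.0 to 1.5 hours of engineer time per page including context-switch recovery. The function name is hypothetical.

```python
WORK_HOURS_PER_YEAR = 2080  # assumption: 52 weeks x 40 hours


def annual_alert_cost(pages_per_primary_week: float, hours_per_page: float,
                      loaded_cost: float = 180_000, weeks: int = 52) -> float:
    """Direct alert-handling cost per year for the whole rotation, in dollars."""
    hourly_cost = loaded_cost / WORK_HOURS_PER_YEAR
    pages_per_year = pages_per_primary_week * weeks  # one primary on the pager each week
    return pages_per_year * hours_per_page * hourly_cost


if __name__ == "__main__":
    # hours_per_page values are assumed, not measured.
    for label, pages in (("untuned", 20), ("after hygiene", 8)):
        low = annual_alert_cost(pages, hours_per_page=1.0)
        high = annual_alert_cost(pages, hours_per_page=1.5)
        print(f"{label}: ${low:,.0f} to ${high:,.0f} per year")
```

Under these assumptions the untuned case comes to about $90,000 to $135,000 per year and the tuned case to about $36,000 to $54,000, both inside the ranges quoted above. Replacing the assumed hours per page with a figure measured from your own incident history changes the result proportionally.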

What is the right rotation pattern at 5 engineers?

Weekly primary rotation across all 5 engineers, with a clear secondary drawn from the same rotation (the prior week's primary, who has the most recent context). Avoid solo on-call (see /single-engineer-on-call-cost for why). Avoid trying to do follow-the-sun at 5 total engineers (see /follow-the-sun-on-call-cost for staffing minimums). A formal two-tier structure also does not work at 5 engineers because splitting the team leaves each tier with too few people to absorb an absence.
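The pattern described here (weekly primary, with the previous week's primary as secondary) is simple enough to generate rather than maintain by hand. A minimal Python sketch follows; the function name, engineer names, and start date are hypothetical, and for the very first week the secondary is assumed to be the last person in the list, as if the rotation had already been running.

```python
from datetime import date, timedelta
from typing import List, Tuple


def build_rotation(engineers: List[str], start: date,
                   weeks: int) -> List[Tuple[date, str, str]]:
    """Return (week_start, primary, secondary) for each week of the schedule."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Secondary is whoever was primary the week before.
        secondary = engineers[(week - 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


if __name__ == "__main__":
    team = ["ana", "ben", "chi", "dev", "eli"]  # hypothetical names
    for week_start, primary, secondary in build_rotation(team, date(2026, 6, 1), 6):
        print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

With five names the schedule repeats every five weeks, so each engineer is primary one week, secondary the next, and off the pager for the following three.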

When should a 5-engineer startup hire dedicated SRE?

Usually not at 5 engineers. The right move at 5 is to maintain shared operational ownership across all engineers and to invest in alert hygiene to keep the rotation tolerable. Dedicated SRE typically becomes the right hire at 15 to 30 engineers, when the operational burden starts to drag on focused product work and when there is enough system complexity to justify specialised attention.

Should pages be tracked in retrospective?

Yes, even at 5-engineer scale. A monthly 30-minute review of the last month's pages (volume per rule, false-positive rate per rule, what was actually mitigated vs auto-resolved) is the cheapest alert hygiene practice available. It typically surfaces 2 to 5 noisy rules per month that can be tuned, killed, or runbooked, which materially moves the needle over a year.