Alert Fatigue at a 5-Engineer Startup: Realistic Page Volume + Cost
Updated May 2026. Sources: incident.io 2024 State of On-Call (extrapolated to 5-engineer scale), Catchpoint 2024 SRE Report, Google SRE Book Chapter 11.
What 5-Engineer On-Call Actually Looks Like
At 5 engineers, on-call is everyone's job. There is no dedicated SRE function, no platform team, and no clear separation between "the people who write features" and "the people who keep the system running". Each engineer takes a week of primary pager duty roughly once every five weeks. The rotation is short enough that everyone has recent system context, an advantage that disappears at larger scale: nobody is off the pager long enough to forget how the system behaves.
The system itself is usually small: a single primary application, a few supporting services, one or two databases, a single monitoring stack (often a Datadog free tier or a managed Prometheus). The alert surface area is correspondingly small: realistically 30 to 80 distinct alert rules, mostly threshold-based, mostly inherited from earlier days when the founding engineer wrote them quickly to cover gaps. Most of these rules have not been audited since they were authored.
The result is that at 5-engineer scale, alert fatigue almost always shows up as repeated noisy weeks driven by a small set of poorly-tuned rules that nobody owns. The fix is not buying better tooling; it is doing the audit. A senior engineer spending a half-day on the alert ruleset typically removes 30 to 60 percent of noise within one cycle. The opportunity cost is half a day of feature work, and the return is one engineer-week per quarter of recovered focus across the team.
Expected Page Volume Benchmarks
| State | Pages / engineer / week | False-positive rate | Sustainable for | Action |
|---|---|---|---|---|
| Healthy | Under 8 | Under 30% | Indefinitely | Maintain; monthly review |
| Stressed | 8 to 20 | 30 to 60% | 6 to 12 months | Quarterly audit; write runbooks for the top 10 alerts |
| Critical | 20 to 40 | 60 to 80% | 3 to 6 months | Immediate audit; kill rules; migrate to SLO-based alerting |
| Untenable | Above 40 | Above 80% | Weeks | Emergency response, suspend non-Sev-1 alerts |
The healthy threshold is materially lower than the larger-team target because of rotation cadence. At 42 pages per engineer per week (the incident.io 2024 industry median for larger orgs), the single primary on a 5-engineer team would absorb all 42 pages in their week, which is untenable when that week comes around every five weeks. The Google SRE Book Chapter 11 healthy threshold of roughly 14 per week applies more cleanly to larger rotations with longer recovery between primary weeks; at 5-engineer scale the threshold should be tightened further to keep each primary week tolerable.
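As a rough illustration, the benchmark table can be turned into a small classifier for a single primary week. This is a sketch under assumptions, not a standard: the function names are made up, a value sitting exactly on a boundary is pushed into the higher band (the table leaves that open), and the two signals are combined by taking the worse of the two.

```python
# Upper limits for the Healthy, Stressed, and Critical bands, taken from the
# benchmark table. Anything at or above the last limit is Untenable.
# Assumption: a value exactly on a boundary (8, 20, 40) goes to the higher band.
STATES = ["Healthy", "Stressed", "Critical", "Untenable"]
PAGE_LIMITS = [8, 20, 40]            # pages per engineer per week
FP_RATE_LIMITS = [0.30, 0.60, 0.80]  # fraction of pages that were false positives


def band_index(value, limits):
    """Return the index of the first band whose upper limit is above value."""
    for i, limit in enumerate(limits):
        if value < limit:
            return i
    return len(limits)


def classify_week(pages, false_positives):
    """Classify one primary week; the worse of the two signals wins (assumption)."""
    fp_rate = false_positives / pages if pages else 0.0
    by_volume = band_index(pages, PAGE_LIMITS)
    by_noise = band_index(fp_rate, FP_RATE_LIMITS)
    return STATES[max(by_volume, by_noise)]


if __name__ == "__main__":
    print(classify_week(pages=12, false_positives=3))
    print(classify_week(pages=42, false_positives=30))
```

Because the worse signal wins, a quiet week made up mostly of false positives is still flagged. That is a deliberate choice in this sketch; a team could reasonably classify on volume alone and report the false-positive rate separately.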
The Fix Sequence at 5-Engineer Scale
The fix sequence at small scale differs from the larger-team playbook because the constraints are different. There is no budget for AIOps, no team to dedicate to a multi-month SLO migration, and no engineering capacity to maintain a complex alert ruleset. The right sequence prioritises moves that compound across a small team rather than moves that compound across many alert rules.
Move one (first week): audit and kill. List every alert rule. For each, answer: when this fires, what is the human response? If the answer is "nothing, it usually self-heals" or "I check Datadog and confirm it is fine", the rule should be killed or downgraded to a non-paging notification. Expect to remove 20 to 40 percent of rules outright. This is the single highest-leverage move at any scale and is especially valuable at startup scale because the small ruleset can be reviewed thoroughly in a half day.
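The audit question can be captured as a simple triage pass over the rule list. The sketch below is illustrative only: the `AlertRule` fields, the response labels, and the rule names are hypothetical, and the choice between killing and downgrading (here, whether the signal is still worth seeing as a non-paging notification) is an assumption, since the text leaves that to judgment.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    human_response: str      # answer to "when this fires, what is the human response?"
    useful_for_trends: bool  # assumption: still worth seeing as a non-paging notification


# Hypothetical labels for answers that mean "no real action is taken".
NO_ACTION_RESPONSES = {"nothing-self-heals", "check-dashboard-confirm-fine"}


def triage(rule):
    """Return 'keep', 'downgrade', or 'kill' for one alert rule."""
    if rule.human_response not in NO_ACTION_RESPONSES:
        return "keep"  # someone actually does something when it fires
    return "downgrade" if rule.useful_for_trends else "kill"


def audit(rules):
    """Group rule names by triage decision."""
    decisions = {"keep": [], "downgrade": [], "kill": []}
    for rule in rules:
        decisions[triage(rule)].append(rule.name)
    return decisions


if __name__ == "__main__":
    rules = [
        AlertRule("api_5xx_rate", "restart-or-rollback", True),
        AlertRule("disk_usage_70pct", "check-dashboard-confirm-fine", True),
        AlertRule("worker_restart", "nothing-self-heals", False),
    ]
    print(audit(rules))
```

Keeping the decisions in a plain dictionary makes it easy to paste the result into the audit notes and to count what share of rules was removed.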
Move two (weeks 2 to 4): runbook the survivors. For each remaining paging rule, write a short runbook (8 to 12 lines is fine) covering: what this means, first diagnostic step, common cause, mitigation, when to escalate. The investment is roughly 30 minutes per rule. Total time for 20 surviving rules: 10 engineer-hours. The benefit is that any engineer in the rotation can mitigate without deep context, which is essential at small scale where context distribution is uneven.
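A minimal sketch of the runbook skeleton and the time estimate follows. The five section names come from the list above; the function names, the plain-text layout, and the example alert name are assumptions made for illustration.

```python
# The five sections named in the text, in the order given.
RUNBOOK_SECTIONS = [
    "What this means",
    "First diagnostic step",
    "Common cause",
    "Mitigation",
    "When to escalate",
]


def runbook_skeleton(alert_name):
    """Return an empty plain-text runbook for one paging alert."""
    lines = [f"Runbook: {alert_name}"]
    for section in RUNBOOK_SECTIONS:
        lines.append(f"{section}: TODO")
    return "\n".join(lines)


def runbook_hours(rule_count, minutes_per_rule=30):
    """Estimate total writing time in engineer-hours."""
    return rule_count * minutes_per_rule / 60


if __name__ == "__main__":
    print(runbook_skeleton("api_5xx_rate"))  # hypothetical alert name
    print(runbook_hours(20))                 # 20 rules at 30 minutes each
```

With 20 surviving rules at 30 minutes each, the estimate works out to the 10 engineer-hours quoted above.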
Move three (months 2 to 3): migrate user-facing paths to SLO-based alerting. Identify the 3 to 5 most critical user-facing flows (signup, key feature, checkout). For each, define an SLO (e.g. 99.5 percent successful response within 2 seconds), compute the error budget, and configure a burn-rate alert. Replace the previous threshold-based alerts on those paths. This is the structural fix that prevents new noisy rules from accumulating; future engineers asking "should this be a paging alert" have a clearer test (does it threaten the SLO?). Read /slo-vs-threshold for the implementation pattern.
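The error-budget and burn-rate arithmetic behind that move can be sketched as follows. The 99.5 percent target is the example from the text; the 30-day SLO period, the "2 percent of budget in 1 hour" paging rule, and the two-window check are assumptions chosen to illustrate a common multi-window pattern, and the function names are hypothetical. A "bad" event here means a request that failed or took longer than 2 seconds.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to be bad, e.g. 0.005 for a 99.5% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)


def paging_threshold(budget_fraction, window_hours, slo_period_hours=30 * 24):
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * slo_period_hours / window_hours


def should_page(long_window_rate, short_window_rate, threshold):
    """Page only if both windows are burning fast (filters out brief spikes)."""
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    slo = 0.995
    threshold = paging_threshold(budget_fraction=0.02, window_hours=1)
    last_hour = burn_rate(bad_events=80, total_events=1000, slo_target=slo)
    last_5_min = burn_rate(bad_events=9, total_events=100, slo_target=slo)
    print(threshold, last_hour, last_5_min)
    print(should_page(last_hour, last_5_min, threshold))
```

Requiring both the long and the short window to exceed the threshold means the alert stops firing soon after the problem is mitigated, which keeps the pager quiet during recovery.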
The 12-Month Roadmap
Quarter 1: complete the audit-and-kill plus runbook-the-survivors moves above. Expected outcome: 40 to 60 percent reduction in pager volume, runbook coverage on every paging alert. Effort: 15 to 20 engineer-hours total across the team. Investment in tooling: zero.
Quarter 2: SLO migration on user-facing paths. Expected outcome: structurally prevent re-accumulation of noise on critical flows, plus modest further volume reduction (10 to 20 percent on top of Q1). Effort: 20 to 40 engineer-hours, primarily by the engineer with the strongest production-systems instinct. Investment in tooling: zero if your existing monitoring supports SLO queries (Datadog, Grafana, New Relic all do); modest if you need to migrate to a tool that does.
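To see how the Q1 and Q2 ranges combine, here is a small worked calculation. It assumes the Q2 reduction applies to the volume that remains after Q1 (one reading of "on top of Q1"), and the baseline of 20 pages per week is a hypothetical figure.

```python
def apply_reduction(pages_per_week, low, high):
    """Return (best_case, worst_case) volume after a reduction between low and high."""
    return pages_per_week * (1 - high), pages_per_week * (1 - low)


def roadmap_volume(baseline):
    """Project weekly page volume after Q1 (40-60%) and Q2 (a further 10-20%)."""
    q1_best, q1_worst = apply_reduction(baseline, 0.40, 0.60)
    q2_best = apply_reduction(q1_best, 0.10, 0.20)[0]
    q2_worst = apply_reduction(q1_worst, 0.10, 0.20)[1]
    return {"q1": (q1_best, q1_worst), "q2": (q2_best, q2_worst)}


if __name__ == "__main__":
    projection = roadmap_volume(20.0)  # hypothetical baseline: 20 pages per week
    for quarter, (best, worst) in projection.items():
        print(f"{quarter}: {best:.1f} to {worst:.1f} pages per week")
```

Under these assumptions, a team starting at 20 pages per week lands at about 8 to 12 after Q1 and about 6.4 to 10.8 after Q2, which would put it near or inside the healthy band of the benchmark table.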
Quarter 3: institute monthly alert hygiene reviews. The investment is 30 minutes per month of senior engineer time. The output is a list of any new noisy rules to tune, kill, or runbook. This is the practice that prevents drift over time and is the cheapest discipline available. Expected outcome: alert volume stays flat or trends downward even as the system grows.
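The monthly review can start from a short script over the month's page log. This is a sketch under assumptions: the log format (one `(rule_name, was_actionable)` pair per page), the function name, and the minimum-page cutoff are hypothetical, and the 30 percent false-positive limit is borrowed from the healthy band in the benchmark table.

```python
from collections import Counter


def noisy_rules(page_log, min_pages=3, max_fp_rate=0.30):
    """Return (rule, pages, false_positive_rate) for rules worth reviewing.

    page_log is an iterable of (rule_name, was_actionable) pairs, one per page.
    """
    totals = Counter()
    false_positives = Counter()
    for rule, was_actionable in page_log:
        totals[rule] += 1
        if not was_actionable:
            false_positives[rule] += 1

    flagged = []
    for rule, pages in totals.items():
        fp_rate = false_positives[rule] / pages
        # Skip rules that fired too rarely to judge (assumed cutoff).
        if pages >= min_pages and fp_rate > max_fp_rate:
            flagged.append((rule, pages, fp_rate))

    # Loudest rules first, ties broken by name for a stable report.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    log = [
        ("disk_usage_70pct", False), ("disk_usage_70pct", False),
        ("disk_usage_70pct", True), ("disk_usage_70pct", False),
        ("api_5xx_rate", True), ("api_5xx_rate", True), ("api_5xx_rate", True),
        ("cron_late", False),
    ]
    for rule, pages, fp_rate in noisy_rules(log):
        print(f"{rule}: {pages} pages, {fp_rate:.0%} false positives")
```

The output is the list the review needs: each flagged rule gets one of the three decisions named above (tune, kill, or runbook).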
Quarter 4: consider whether the next hire should be a dedicated SRE or platform engineer. By this point the team is typically growing past 8 engineers and the operational scope is starting to drag on focused product work. A dedicated SRE at this stage is high-leverage if the system has reached enough complexity to justify the specialised attention. If the system is still small and well-tuned, a more flexible generalist hire is the right move. The decision should be driven by data from the alert hygiene reviews, not by outside pressure to "hire SREs because we are growing".
What Not to Do at 5-Engineer Scale
Avoid AIOps vendor purchases. The pricing model assumes scale that 5-engineer teams do not have, and the value proposition (correlation across many monitoring tools, ML-based grouping at very high event volume) is irrelevant to small homogeneous stacks. The right correlation at this scale is the basic dedup that comes free with PagerDuty Professional or Opsgenie Essentials.
Avoid follow-the-sun rotations. At 5 total engineers you do not have the regional staffing minimums for a credible three-region rotation. If night pages dominate, the right move is to tune the alerts to reduce night-page volume (most night pages at small scale are false positives that went untuned during the day) rather than to spread the rotation across regions you cannot staff.
Avoid implementing two-tier on-call. At 5 engineers, splitting the team into two tiers leaves at least one tier with fewer than 3 people on rotation, which breaks down as soon as someone is absent. Stick with single-tier rotation across all engineers and rely on ad-hoc escalation in Slack when a primary genuinely needs help. The volume is low enough that ad-hoc works; formalising the escalation tier adds complexity without proportional benefit.