The 3am page that shouldn't have happened: tuning alert thresholds
Alert fatigue is the single most common reason teams ignore real incidents. A practical guide to threshold design, quorum, and severity routing that respects on-call sleep.
Every engineer who's been on-call for more than six months has the same story. A 3am page. A sleepy drive to the laptop. An investigation that ends in "everything is fine, the probe had a hiccup." The next time the pager goes off, response time is slower. The time after that, slower still. By the end of the quarter, the team is ignoring pages that turn out to be real.
This is the most common failure mode of monitoring programs. It is also the most preventable. The cure is threshold discipline.
Why defaults are usually wrong
Most vendors ship with defaults like "alert after two consecutive failures from a single probe." This is noise-generating by design. A single probe's network path can flap. A CDN edge near the probe can degrade. A routing change upstream of the probe can cause transient failures. None of these are your problem, and all of them page you.
The fix is two-dimensional: require quorum across probes, and require duration across time.
Quorum
A quorum requirement says "at least N of M probes must see the failure before we declare an incident." The default we recommend is 2-of-3: run a minimum of three probes across at least two regions, and require at least two of them to agree within the same window.
This single change eliminates perhaps 80% of single-probe noise. It costs almost nothing — the third probe is cheap, and the latency hit is minimal. Any modern monitoring platform should let you configure this per-monitor.
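As a concrete illustration, here is a minimal sketch of what a per-monitor quorum rule can look like when written down as code. The structure, field names, and probe locations are hypothetical, not any particular vendor's configuration schema; most platforms express the same idea in their own syntax.

```python
from dataclasses import dataclass

# Hypothetical per-monitor quorum rule; field names are illustrative,
# not any specific vendor's configuration schema.
@dataclass
class QuorumRule:
    probes: tuple[str, ...]   # probe locations, spread across at least two regions
    required: int             # how many probes must agree before declaring an incident

def quorum_met(rule: QuorumRule, failing_probes: set[str]) -> bool:
    """True once at least `required` of the monitor's probes report the failure."""
    return len(failing_probes & set(rule.probes)) >= rule.required

# The recommended default: 2-of-3 across at least two regions.
checkout = QuorumRule(probes=("us-east", "eu-west", "ap-southeast"), required=2)
print(quorum_met(checkout, {"us-east"}))             # False: single-probe blip
print(quorum_met(checkout, {"us-east", "eu-west"}))  # True: two regions agree
```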
Duration
A duration requirement says "the failure condition must persist for at least N minutes before we alert." For critical user journeys, 2-3 minutes is usually right. For latency budgets, 10 minutes is more honest — short latency spikes are common and usually recover on their own.
Duration and quorum together are the two dials that cover most of what you want: "wake me up only when two probes have agreed something is wrong for three straight minutes." That's a real incident, not a blip.
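To make the two-dial combination concrete, here is a hedged sketch of an evaluator that only pages when quorum has held for a full duration window. It assumes one evaluation per minute and a simple count of failing probes per window; real platforms implement this internally.

```python
def should_page(failing_counts: list[int], quorum: int, duration_minutes: int) -> bool:
    """Page only if at least `quorum` probes were failing in every one of the
    last `duration_minutes` evaluation windows (assumed to be one per minute)."""
    if len(failing_counts) < duration_minutes:
        return False
    return all(count >= quorum for count in failing_counts[-duration_minutes:])

# "Wake me up only when two probes have agreed something is wrong
#  for three straight minutes."
print(should_page([1, 2, 2, 3], quorum=2, duration_minutes=3))  # True: quorum held for three windows
print(should_page([0, 2, 1, 2], quorum=2, duration_minutes=3))  # False: quorum broke mid-window
```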
Severity routing
Not every alert should wake someone up. A reasonable mapping:
- P1 — page on-call immediately, any time. Core product down for paying users. Login broken. Checkout broken. The things that cost money or trust by the minute.
- P2 — page on-call during business hours, Slack after hours. Feature is down but main product works. Status page shows degradation. Users are inconvenienced, not blocked.
- P3 — Slack only. Background jobs failing, non-critical integrations broken, latency elevated but within SLA.
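To show how little logic this mapping actually requires, here is a small sketch of a paging router. The business-hours window, channel names, and function are assumptions for illustration, not a real integration.

```python
from datetime import datetime

# Illustrative routing decision: severity plus time of day decides whether
# anyone gets paged. Hours, weekdays, and channel names are assumptions.
BUSINESS_HOURS = range(9, 18)  # 09:00-17:59 local time

def route(severity: str, now: datetime) -> str:
    in_hours = now.weekday() < 5 and now.hour in BUSINESS_HOURS
    if severity == "P1":
        return "page-oncall"                       # any time, day or night
    if severity == "P2":
        return "page-oncall" if in_hours else "slack-#alerts"
    return "slack-#alerts"                         # P3 never pages anyone

print(route("P2", datetime(2024, 6, 3, 3, 0)))   # slack-#alerts: it's 3am on a Monday
print(route("P1", datetime(2024, 6, 3, 3, 0)))   # page-oncall
```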
The discipline here is: whenever you configure a new monitor, pick its severity before configuring the threshold. If you don't know what severity it should be, you don't understand what you're monitoring.
The quarterly review
Once a quarter, pull your alert history and ask three questions per alert:
- Was this a real incident? If yes, was the response proportional? If no, this alert is a candidate for tuning.
- How long did the real incidents last, and did we alert early enough or too late? Early-but-noisy is often worse than later-but-accurate.
- Are there alerts we never got but should have? Post-mortems from this quarter usually reveal gaps.
This is a 90-minute exercise and the highest-leverage operational work a team can do on monitoring. Most teams skip it because it's not urgent. It compounds spectacularly when you do it.
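The first question can be partially pre-answered before the meeting if your platform can export page history with a real/noise label per page. A minimal sketch, assuming a hypothetical export of (alert name, was it real) pairs:

```python
from collections import Counter

# Hypothetical export: (alert_name, was_real_incident) for every page last quarter.
pages = [
    ("checkout-p1", True), ("checkout-p1", True),
    ("search-latency", False), ("search-latency", False),
    ("search-latency", True), ("login-probe", False),
]

fired = Counter(name for name, _ in pages)
real = Counter(name for name, was_real in pages if was_real)

for name, count in fired.items():
    ratio = real[name] / count
    flag = "candidate for tuning" if ratio < 0.5 else "ok"
    print(f"{name}: {real[name]}/{count} real ({flag})")
```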
Alerting hygiene
Some principles that consistently improve alert quality:
- Every alert has an owner. Not a team. A person. If the owner leaves, the alert needs a new owner or needs to be deleted.
- Every alert has a runbook URL. The person paged at 3am should not be expected to intuit the response. A one-page runbook with diagnostic commands and common fixes dramatically shortens incidents.
- Every alert has a severity. If you can't assign a severity, you haven't thought about the alert.
- Alerts that haven't fired in 90 days are candidates for deletion. Either they're perfectly tuned or they're checking something that doesn't matter. Be honest.
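All four checks are mechanical enough to script against whatever metadata export your platform offers. A sketch under assumed field names (owner, runbook, severity, last-fired date); the format is illustrative, not a real API:

```python
from datetime import date, timedelta

# Hypothetical alert metadata; real platforms expose equivalents via API or export.
alerts = [
    {"name": "checkout-p1", "owner": "mara", "runbook": "https://wiki.example/checkout",
     "severity": "P1", "last_fired": date(2024, 5, 20)},
    {"name": "legacy-ftp-sync", "owner": None, "runbook": None,
     "severity": None, "last_fired": date(2023, 11, 2)},
]

today = date(2024, 6, 10)
for alert in alerts:
    problems = [field for field in ("owner", "runbook", "severity") if not alert[field]]
    if today - alert["last_fired"] > timedelta(days=90):
        problems.append("silent for 90+ days, candidate for deletion")
    if problems:
        print(f"{alert['name']}: {', '.join(problems)}")
```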
The honest failure mode
The most painful threshold tuning conversation isn't about technology — it's about risk tolerance. Tighter thresholds catch more real issues at the cost of more noise. Looser thresholds are quieter but miss edge cases. There is no universally correct setting.
What works is being explicit about the tradeoff with the team. "We're going to miss some 90-second partial outages. The cost of being paged for every 90-second blip is higher than the cost of missing them." That's a reasonable stance, written down, reviewed quarterly. What doesn't work is pretending there's a configuration that's both sensitive and quiet.
The thing to actually measure
If you want one metric for your alerting program's health, use noise ratio: the percentage of pages in the last 30 days that were real incidents versus flapping / noise / false alarms.
Above 80%, you're doing well. Between 50% and 80%, you have room to tune. Below 50%, you're training your team to ignore the pager, and that will cost you catastrophically one day. Fix it before that day.
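Computing the ratio is the easy part once pages are labeled; the labeling is the real work. A minimal sketch, assuming a list of per-page real/noise labels from the last 30 days:

```python
# Noise-ratio check: share of pages in the last 30 days that were real incidents.
# The labeled list is a hypothetical export; the labeling itself is manual work.
page_was_real = [True, True, False, True, False, False, True, True]

real_ratio = sum(page_was_real) / len(page_was_real)
print(f"{real_ratio:.0%} of pages were real incidents")
# >= 80%: doing well.  50-80%: room to tune.  < 50%: fix it now.
```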