Beyond the green check: what a meaningful uptime monitor validates
A 200 OK is the weakest signal an uptime monitor can give you. Here's a practical guide to monitors that actually tell you whether your product is working.
Every uptime monitoring vendor demos the same scenario: your site goes down, their tool pages you, you fix it, you're a hero. It's compelling in the sales demo and almost irrelevant to the 95% of outages that actually happen in production.
Real outages in 2026 look like this: the homepage returns a 200, the API returns a 200, the login form submits successfully, and something behind the login wall is broken for 40% of paid users. A monitor checking whether the homepage returns 200 tells you nothing. Everything is fine, except what your users are actually doing.
The reason 200 OK is misleading
HTTP 200 means the server returned a response. It says nothing about what's in the response. A catastrophic backend failure that causes your framework's error page to render — still a 200. A cached stale response served by CDN while the origin is on fire — still a 200. A login page that loads perfectly but rejects every credential — 200.
This is why assertions matter more than status codes. The cheap version is a substring check: assert that the response body contains a known string ("Log out" on an authenticated page, a specific CTA on the homepage). The better version is schema validation on API responses. The best version is a scripted check that does something — logs in, loads a page behind auth, asserts on the DOM.
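The cheap version above can be sketched in a few lines. This is a minimal stdlib-only sketch, not any vendor's implementation; the marker string and URL are placeholders you'd swap for your own:

```python
import urllib.request

def body_ok(status: int, body: str, marker: str) -> bool:
    # A 200 alone proves nothing: also require a marker string that only
    # renders when the page is genuinely healthy (e.g. "Log out").
    return status == 200 and marker in body

def check(url: str, marker: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return body_ok(resp.status, resp.read().decode("utf-8", "replace"), marker)
    except OSError:
        # Covers timeouts, connection errors, and non-2xx responses
        # (urlopen raises HTTPError, a subclass of OSError, on those).
        return False
```

The split into `body_ok` and `check` is deliberate: the assertion logic is pure and testable without a network, which matters once you have dozens of these.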
The monitor pyramid
For most products, here's the order in which I'd build monitors, by priority:
- One synthetic check per revenue-critical user journey. If it's a SaaS product: sign up, log in, core action. If it's e-commerce: browse, add to cart, checkout.
- One check per critical API endpoint with response schema assertion and latency budget.
- TLS certificate expiry. Cheap to monitor, catastrophic to miss.
- Heartbeat monitors on scheduled jobs. Every backup, every nightly aggregation, every webhook consumer should check in on a cadence. Silence means failure.
- DNS / CNAME health for customer-facing domains. Especially if you offer status pages or custom domains to your users.
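The heartbeat item deserves a concrete shape. On the receiving side, "silence means failure" reduces to one comparison; this is a sketch with illustrative names, where `cadence` is how often the job should check in and `grace` absorbs normal jitter:

```python
from datetime import datetime, timedelta, timezone

def heartbeat_overdue(last_ping: datetime, cadence: timedelta,
                      grace: timedelta, now: datetime) -> bool:
    # Silence means failure: the job is considered down once it has
    # missed its expected check-in by more than the grace period.
    return now - last_ping > cadence + grace
```

The job itself only pings the heartbeat URL *after* it succeeds, so a crash, a hang, or a cron misconfiguration all produce the same observable signal: no ping.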
Multi-region isn't a luxury
A monitor that runs from one region tells you whether the monitor's network link to your server is working. That's mostly uninteresting. Multi-region monitoring — or at minimum, two-region — lets you distinguish "this one probe's ISP is flaky" from "the service is actually down".
The practical rule we recommend: never declare an incident on a single probe's failure. Require at least two regions to agree within the same window. On Brily this is called quorum, and it's configurable per monitor; the default is 2-of-3.
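The quorum decision itself is simple enough to state in code. This is a sketch of the rule as described, not Brily's actual implementation; region names are placeholders:

```python
def should_alert(region_failed: dict[str, bool], quorum: int = 2) -> bool:
    # Never page on one probe. Require at least `quorum` regions to
    # report failure within the same evaluation window.
    return sum(region_failed.values()) >= quorum

# One flaky probe: no incident. Two agreeing regions: incident.
should_alert({"us-east": True, "eu-west": False, "ap-south": False})
```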
Latency budgets: worth setting, painful to tune
A latency budget is an assertion like "p95 response time must be under 800ms". It's one of the highest-value signals for a healthy product, and it's also the most common source of alert noise, because your actual p95 is not a straight line — it wobbles with traffic, CDN warmth, background jobs.
Practical guidance:
- Set the budget comfortably above your last-30-days p95 baseline, with headroom for normal wobble. If your actual p95 is 600ms, the alerting threshold should be 900ms or higher.
- Always combine latency thresholds with duration windows. "p95 over 900ms for 10 minutes" is signal. "one slow request" is noise.
- Review latency budgets every quarter. If you've been alerting on them, tune them. If you haven't been alerting on them, tighten them.
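The "threshold plus duration window" rule above can be sketched as a check over consecutive p95 samples. This assumes a sample per evaluation interval (e.g. one per minute, so `required=10` approximates "for 10 minutes"); names are illustrative:

```python
def sustained_breach(p95_samples: list[float], threshold_ms: float,
                     required: int) -> bool:
    # "p95 over threshold for N consecutive samples" is signal;
    # a single slow sample is noise.
    recent = p95_samples[-required:]
    return len(recent) == required and all(s > threshold_ms for s in recent)
```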
The thing you should monitor that nobody does
Password reset. Every SaaS I've worked with has had at least one 4-hour outage caused by a broken password reset flow that nobody noticed until support tickets piled up, because password reset is not a common path — so it doesn't show up in your error dashboard as anomalous.
Add a synthetic monitor that triggers password reset to a dedicated test inbox (Mailosaur, or an internal email account), clicks the link, and confirms the new password works. Run it every 15 minutes. The first time it catches a broken reset flow, you'll be grateful.
Monitoring-as-code
Past a certain team size (four engineers, ish), creating monitors by clicking around a UI stops scaling. You want your monitors in source control: diff-able, reviewable, and redeployable onto a new environment. Brily's API ships with a simple declarative format that you can keep in the same repo as your application code.
This also solves the "who owns this monitor" problem, which comes up reliably around month nine of any team's monitoring journey. Monitors in git have an owner; monitors in the UI have a mystery.
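To make "monitors in git" concrete, here is the rough shape such a file might take. This is an illustrative sketch, not Brily's actual schema; every field name, path, and URL below is a placeholder:

```yaml
# monitors.yaml — illustrative only, not a real Brily schema.
# Lives in the app repo; reviewed and owned like any other change.
monitors:
  - name: checkout-journey
    type: browser
    script: ./monitors/checkout.spec.ts
    regions: [us-east, eu-west, ap-south]
    quorum: 2
    owner: team-payments
  - name: api-orders-health
    type: http
    url: https://api.example.com/v1/orders/health
    assert:
      status: 200
      json_schema: ./schemas/orders-health.json
      p95_ms: 900
    owner: team-payments
```

Note the `owner` field: making ownership a required, reviewable attribute is what turns "whose monitor is this?" from archaeology into a one-line lookup.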
What to cut
You should have fewer monitors than you think. Every monitor is a maintenance burden: it can break, flap, or drift out of date as the product changes. A team with 40 monitors and no tuning discipline has worse signal than a team with 10 monitors and good thresholds.
Once a month, prune. Monitors that haven't alerted in 90 days are either perfectly tuned or checking something uninteresting. Be honest about which.