Error budgets that teams actually use

Field note · 9 min read

Most orgs I meet have a slide called “SLOs.” Fewer have a room where, when a graph turns red, people change what they are doing the next week without waiting for a quarterly re-plan. The gap is not the math; it is governance, narrative, and consequence. This note is a practical map from “we wrote down three nines” to an error budget you deliberately use as a spending mechanism.

1. If your SLI is not measured, the SLO is a wish

Pick user-journey–level symptoms: checkout completion, time-to-read for an API, or a mobile screen’s time-to-interactive. Instrument them the way a user would notice pain, not the way your most optimistic microservice would like to report. If you can’t graph it in one well-understood way, the error budget is abstract and disputable.
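To make the budget concrete, here is a minimal sketch in Python of turning one measured SLI into a remaining-budget figure. It assumes you can count good and total events for a journey over the SLO window; the function name and the sample numbers are illustrative and not taken from any particular tool.

```python
def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    slo_target is the promised success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers: 1,000,000 checkout attempts, 600 failures, 99.9% target.
remaining = error_budget_remaining(999_400, 1_000_000, 0.999)
print(f"{remaining:.0%} of the budget is left")
```

With these numbers the SLO tolerates 1,000 failures and 600 have happened, so roughly 40% of the budget remains. The point is that the figure is only as trustworthy as the good/total counts behind it.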

2. Policy before panic

While you are in the green (before a launch or a peak season), agree in writing:

What a partly consumed budget (yellow) stops: certain feature categories, high-risk refactors, parallel launches.
What an exhausted budget (red) forces: a freeze, a rollback posture, on-call headcount, or an explicit “we accept the risk and tell customers” with leadership signoff.
How reliability work and compliance/audit evidence (see evidence bundles) can draw from the same runbooks, so there are fewer duplicate meetings.
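One way to keep that agreement from drifting is to write it down as data. The sketch below is an assumption-heavy illustration: the 50% yellow threshold is hypothetical (the policy above does not fix a number), and the consequence lists are paraphrased from the items above.

```python
from enum import Enum


class BudgetState(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Hypothetical thresholds: pick your own while you are still in the green.
YELLOW_AT = 0.5  # half of the budget consumed
RED_AT = 1.0     # budget exhausted

# Consequences paraphrased from the list above; adapt to your own agreement.
POLICY = {
    BudgetState.GREEN: ["ship as planned"],
    BudgetState.YELLOW: ["pause high-risk refactors", "no parallel launches"],
    BudgetState.RED: ["feature freeze", "rollback posture", "leadership signoff to accept risk"],
}


def budget_state(consumed_fraction: float) -> BudgetState:
    """Map the consumed share of the budget (0.0 = untouched) to a policy state."""
    if consumed_fraction >= RED_AT:
        return BudgetState.RED
    if consumed_fraction >= YELLOW_AT:
        return BudgetState.YELLOW
    return BudgetState.GREEN


state = budget_state(0.72)
print(state.value, "->", "; ".join(POLICY[state]))
```

Keeping thresholds and consequences in one reviewed place means the argument about what yellow means happens once, in advance, rather than during an incident.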

3. Product management is in the same channel

Reliability is a feature cost. A PM who only hears “SRE says no” will route around the program. Put error budget in the same roadmap review: “this release consumes an estimated 12% of quarterly budget; here is the rollback and monitoring plan; here is the revenue hypothesis.” The conversation becomes economic instead of political.
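For the roadmap review, a figure like “12% of quarterly budget” can come from simple arithmetic. The sketch below assumes a request-based SLO and that someone has estimated the extra failed requests a rollout is likely to cause; the function name and all numbers are made up for illustration.

```python
def release_budget_share(expected_extra_failures: float,
                         quarterly_requests: int,
                         slo_target: float) -> float:
    """Share of the quarterly error budget a release is expected to consume."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    quarterly_budget = (1.0 - slo_target) * quarterly_requests  # tolerated failures
    if quarterly_budget <= 0:
        raise ValueError("quarterly_requests must be positive")
    return expected_extra_failures / quarterly_budget


# Illustrative: 90M requests per quarter at 99.9% tolerates about 90,000 failures.
# A rollout expected to add 10,800 failures would use about 12% of that.
share = release_budget_share(10_800, 90_000_000, 0.999)
print(f"estimated consumption: {share:.0%}")
```

The estimate will be rough, but a rough number next to a revenue hypothesis is what makes the trade-off discussable.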

4. Stop hiding tail latency in averages

Mean latency is a polite fiction. A budget against percentile latency and error rate by critical journey is harder to run, but that is the pain users remember, especially in mobile and cross-region hybrid setups. Document which percentiles the board sees and which ones engineers fix first.
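A small illustration of how an average hides the tail, using synthetic latencies and a nearest-rank percentile. The sample data is invented; in practice your metrics backend computes percentiles, possibly with a different interpolation method.

```python
import math
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic journey: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [100] * 95 + [3000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

Here the mean is 245 ms and the median is 100 ms, while one user in twenty waits three seconds. A budget defined on the mean would never notice them.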

5. Tooling should not require a new religion

Whether you use a vendor, OpenTelemetry, or a humble HAProxy + logs pipeline, the rule is the same: the same definitions feed both dashboards and the budget. If the weekly email uses a different denominator than the incident call, the program dies by confusion.
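One way to enforce “same definitions” regardless of tooling is to keep the numerator and denominator rules in a single shared object that both the weekly report and the incident view import. This is a sketch under assumptions: the event fields (route, synthetic, status) and the checkout example are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Mapping, Optional

Event = Mapping[str, Any]


@dataclass(frozen=True)
class SliDefinition:
    name: str
    is_eligible: Callable[[Event], bool]  # which events count in the denominator
    is_good: Callable[[Event], bool]      # which eligible events count as good


def compute_sli(definition: SliDefinition, events: Iterable[Event]) -> Optional[float]:
    """Good/eligible ratio, or None when no event is eligible."""
    eligible = [e for e in events if definition.is_eligible(e)]
    if not eligible:
        return None
    good = sum(1 for e in eligible if definition.is_good(e))
    return good / len(eligible)


# Hypothetical definition: real checkout traffic only, server errors are bad.
CHECKOUT = SliDefinition(
    name="checkout_completion",
    is_eligible=lambda e: e["route"] == "/checkout" and not e["synthetic"],
    is_good=lambda e: e["status"] < 500,
)

events = [
    {"route": "/checkout", "synthetic": False, "status": 200},
    {"route": "/checkout", "synthetic": False, "status": 503},
    {"route": "/checkout", "synthetic": True, "status": 500},   # probe, excluded
    {"route": "/health", "synthetic": False, "status": 200},    # other route, excluded
]
print(CHECKOUT.name, compute_sli(CHECKOUT, events))
```

If the weekly email and the incident call both go through the same definition, a disagreement about the number becomes a disagreement about the data, which is much easier to settle.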

“A budget you never spend is a budget you never used to learn.” Use it for controlled experiments and debt paydown, not only as a no-launch sign.

6. Executive reporting in plain language

One number for leadership: remaining budget this quarter, plus a story about one concrete reliability investment that earned a future launch slot. The narrative beats twenty slides of uptime charts. Tie it to assurance where audits ask for monitoring and change control: you are already doing the work; package it as evidence, not a separate theater track.

Takeaway

Living error budgets are contracts between product, engineering, and operations, backed by metrics people trust and policies people rehearsed while calm. If you only build the dashboard when a customer complains, you are not running a budget; you are running a blame lottery.
