From grimoire
Defines SLOs and error budgets for service reliability, enabling data-driven trade-offs between feature velocity and system stability.
Slash command
/grimoire:design-slo-sla-framework
Define Service Level Objectives (SLOs) that balance reliability with feature velocity, and Service Level Agreements (SLAs) that set customer expectations with contractual consequences.
Adopted by: Google (where SRE originated), Atlassian, Spotify, Cloudflare, and AWS, which publishes SLAs for its managed services. The SRE discipline is built on this foundation.
Impact: Citing the Google SRE book, teams using error budgets are reported to ship 2x more features with half as many incidents. SLOs replace "keep the lights on" with measurable, negotiable reliability targets, and error budgets make reliability vs. velocity trade-offs explicit and data-driven.
Why best: Without SLOs, reliability is subjective. Compare "the system is unreliable" with "we missed our 99.9% availability SLO by 0.2 percentage points this month": the latter drives precise conversations and improvement.
Sources: Beyer et al., "Site Reliability Engineering", O'Reilly (2016); Beyer et al., "The Site Reliability Workbook", O'Reilly (2018); Treynor et al., "The Calculus of Service Availability", ACM Queue (2017)
Define SLIs (Service Level Indicators): An SLI is a quantitative measure of the service's behavior from the user's perspective. Common SLIs: availability (successful requests / total requests), latency (percentage of requests completing under a threshold), throughput (requests processed per second), and error rate (failed requests / total requests). Choose SLIs that reflect what users care about, not what is easy to measure.
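The two ratio-style SLIs above can be sketched in a few lines. This is a minimal illustration, not a production collector: the `Request` record, its field names, and the choice to treat an empty sample as fully healthy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record; field names are illustrative."""
    ok: bool            # True if the request succeeded from the user's point of view
    latency_ms: float   # end-to-end latency observed for the request


def availability_sli(requests: list[Request]) -> float:
    """Successful requests / total requests."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as no failures
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests completing faster than threshold_ms."""
    if not requests:
        return 1.0  # same assumption as above
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


sample = [
    Request(True, 120.0),
    Request(True, 250.0),
    Request(False, 90.0),
    Request(True, 180.0),
]
print(availability_sli(sample))     # 0.75
print(latency_sli(sample, 200.0))   # 0.75
```

Both indicators are expressed as "good events / total events", which keeps them directly comparable to a percentage target.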
Set SLO targets: An SLO is a target value for an SLI over a time window, for example "99.9% of requests succeed over a rolling 30-day window." The target should be aspirational but achievable. Start from historical performance: compute the SLI for each recent 30-day window and set the SLO at the 10th percentile of those values, so the service would have met the target in roughly 90% of past windows. Tighten the target over time as reliability improves.
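One way to derive a starting target from history is sketched below. It assumes a nearest-rank percentile (other percentile definitions give slightly different values), and the ten availability figures are made-up sample data.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical availability measured over ten recent 30-day windows.
window_availability = [
    0.9995, 0.9991, 0.9998, 0.9987, 0.9993,
    0.9996, 0.9990, 0.9994, 0.9997, 0.9992,
]

proposed_slo = percentile_nearest_rank(window_availability, 10)
print(proposed_slo)  # 0.9987
```

With only ten windows the 10th percentile is simply the worst window, so a longer history gives a more meaningful starting point.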
Calculate error budgets: Error budget = 1 - SLO. For a 99.9% availability SLO, the error budget is 0.1% of requests; in time terms that is 0.001 × 30 × 24 × 60 = 43.2 minutes of full downtime per 30-day window. The error budget is the allowable unreliability. Track budget consumption daily. When the error budget is exhausted, reliability work takes priority over feature development.
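The arithmetic above can be checked with a short sketch. The function names and the request counts in the usage lines are illustrative assumptions, not part of the skill.

```python
def error_budget(slo: float) -> float:
    """Allowable unreliability as a fraction: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total_requests <= 0:
        return 1.0  # assumption: no traffic means nothing has been spent
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(round(allowed_downtime_minutes(0.999), 1))          # 43.2
print(round(budget_remaining(0.999, 1_000_000, 400), 3))  # 0.6
```

In the second line, 1,000,000 requests at a 99.9% SLO allow 1,000 failures, so 400 failures leave 60% of the budget.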
Define the SLO measurement window: Rolling windows (for example, the last 30 days) are more stable than calendar windows and avoid the "reset on the 1st" gaming problem. Use a 28-day or 30-day rolling window for most SLOs; a 28-day window always contains the same number of each weekday. Weekly SLOs give fast feedback but are noisy; use them as leading indicators, not commitments.
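A rolling window can be sketched as a running sum over the most recent days. The daily `(good, total)` tuples and the 3-day window in the usage lines are assumptions chosen to keep the example small; a real SLO would use 28 or 30 days.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_availability(
    daily_counts: Iterable[tuple[int, int]], window_days: int = 30
) -> Iterator[float]:
    """Yield the availability over the trailing window after each day.

    daily_counts holds (good_requests, total_requests) per day, oldest first.
    """
    window: deque[tuple[int, int]] = deque()
    good_sum = total_sum = 0
    for good, total in daily_counts:
        window.append((good, total))
        good_sum += good
        total_sum += total
        if len(window) > window_days:
            # The oldest day ages out of the window.
            old_good, old_total = window.popleft()
            good_sum -= old_good
            total_sum -= old_total
        yield good_sum / total_sum if total_sum else 1.0


days = [(100, 100), (90, 100), (100, 100), (100, 100), (100, 100)]
for value in rolling_availability(days, window_days=3):
    print(round(value, 4))
```

The bad second day keeps depressing the figure until it ages out of the window on the fifth day; there is no calendar boundary at which the budget suddenly resets.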
Choose SLO tiers for service criticality: Tier 1 (user-facing critical): 99.9% or 99.95% availability, p99 < 200 ms. Tier 2 (internal business logic): 99.5%, p99 < 500 ms. Tier 3 (batch, background): 99.0%, no latency SLO. Not all services need the same reliability target; higher SLOs cost disproportionately more to maintain, since each additional nine allows ten times less downtime.
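The tiers can be captured as data so that services are checked against them uniformly. This is a sketch: the `SloTier` type and `meets_tier` helper are hypothetical, and Tier 1 is pinned to 99.9% here even though the text also allows 99.95%.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SloTier:
    name: str
    availability: float              # minimum fraction of successful requests
    p99_latency_ms: Optional[float]  # None means the tier has no latency SLO


TIERS = {
    1: SloTier("user-facing critical", 0.999, 200.0),  # assumption: 99.9%, not 99.95%
    2: SloTier("internal business logic", 0.995, 500.0),
    3: SloTier("batch, background", 0.990, None),
}


def meets_tier(tier: int, availability: float, p99_latency_ms: float) -> bool:
    """Check measured values against the tier's availability and latency targets."""
    target = TIERS[tier]
    if availability < target.availability:
        return False
    if target.p99_latency_ms is not None and p99_latency_ms >= target.p99_latency_ms:
        return False
    return True


print(meets_tier(1, 0.9993, 180.0))   # True
print(meets_tier(1, 0.9993, 250.0))   # False: p99 is over 200 ms
print(meets_tier(3, 0.9920, 5000.0))  # True: tier 3 has no latency SLO
```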
Implement SLO monitoring: Measure SLIs continuously, using Prometheus + Grafana, Datadog SLO tracking, or Google Cloud Monitoring. Alert on burn rate (how fast the budget is being consumed), not on remaining budget: for example, alert when 2% of the 30-day error budget is consumed in 1 hour or 5% in 6 hours, the thresholds suggested in the Site Reliability Workbook. Burn rate alerting provides time to act before the budget is exhausted.
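Burn rate is the observed error rate divided by the error budget, so a burn rate of 1 spends exactly the whole budget over the SLO period. The sketch below assumes a 30-day period and pairs 2% of the budget with a 1-hour window and 5% with a 6-hour window, following the Site Reliability Workbook; the function names are illustrative, and real monitoring tools express this in their own query languages.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_days: int = 30
) -> float:
    """Burn rate that spends budget_fraction of the period's budget within window_hours."""
    return budget_fraction * period_days * 24 / window_hours


def should_alert(
    error_rate: float, slo: float, budget_fraction: float, window_hours: float
) -> bool:
    """True if the error rate seen over the window burns budget at or above the threshold."""
    return burn_rate(error_rate, slo) >= burn_rate_threshold(budget_fraction, window_hours)


print(round(burn_rate_threshold(0.02, 1), 1))  # 14.4
print(round(burn_rate_threshold(0.05, 6), 1))  # 6.0

# A 0.8% error rate against a 99.9% SLO is a burn rate of about 8.
print(should_alert(0.008, 0.999, 0.02, 1))  # False: below the 1-hour threshold
print(should_alert(0.008, 0.999, 0.05, 6))  # True: above the 6-hour threshold
```

The `error_rate` argument is meant to be the rate measured over the same window as `window_hours`; the sketch leaves that measurement to the monitoring system.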
Define SLAs (Service Level Agreements): An SLA is a commercial agreement with defined consequences (service credits, refunds) for missing the committed service level. SLAs must be looser than SLOs: if your SLO is 99.9%, your SLA might commit to 99.5%. The gap is your margin for SLO misses before SLA consequences trigger. Involve legal and product in SLA definition.
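The margin between the two commitments can be made concrete. This sketch uses the 99.9% SLO and 99.5% SLA from the example above as defaults; the status labels are hypothetical and no credit schedule is modeled.

```python
def margin_minutes(slo: float, sla: float, window_days: int = 30) -> float:
    """Extra downtime, beyond the SLO's budget, before the SLA is breached."""
    return (slo - sla) * window_days * 24 * 60


def classify(measured: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Place a measured availability relative to the SLO and the SLA."""
    if sla >= slo:
        raise ValueError("the SLA must be looser than the SLO")
    if measured >= slo:
        return "within SLO"
    if measured >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


print(round(margin_minutes(0.999, 0.995), 1))  # 172.8
print(classify(0.9995))  # within SLO
print(classify(0.9970))  # SLO missed, SLA intact
print(classify(0.9900))  # SLA breached
```

With these defaults the SLO allows 43.2 minutes of downtime per 30 days and the SLA allows 216 minutes, leaving 172.8 minutes of margin.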
Establish error budget policy: Write and publish a policy keyed to the remaining error budget. More than 50% remaining: normal development velocity. 25-50% remaining: reliability review required before major deployments. Less than 25% remaining: freeze feature deployments and focus on reliability. Exhausted: full reliability incident review, with no feature work until the budget is replenished.
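The policy thresholds translate directly into a lookup. This is a minimal sketch: the function name and the returned strings are illustrative, and the boundary values of exactly 25% and 50% are assigned to the review band, following the "25-50%" wording.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the remaining error budget (as a fraction, 0.0 to 1.0) to the policy step."""
    if budget_remaining <= 0.0:
        return "exhausted: reliability incident review, no feature work"
    if budget_remaining < 0.25:
        return "freeze feature deployments, focus on reliability"
    if budget_remaining <= 0.50:
        return "reliability review before major deployments"
    return "normal development velocity"


for remaining in (0.80, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%}: {policy_action(remaining)}")
```

Encoding the policy this way lets a deployment pipeline or dashboard apply the same rule the team has published, rather than relying on ad hoc judgment.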
Conduct SLO reviews: Monthly, review SLO compliance, error budget consumption, and trends. Quarterly, review whether SLO targets are appropriate: targets that are too easy provide false confidence, and targets that are too hard paralyze teams. Annually, renegotiate SLAs based on the current SLO track record.
Make SLOs visible: Publish SLO status on an internal dashboard accessible to all engineers and stakeholders; this transparency creates shared accountability. Consider also publishing SLO status externally on a status page (for example Statuspage or Cachet) to set user expectations proactively.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire