Guides an interactive workshop that defines SLIs from user journeys, sets SLO targets based on baselines, and establishes error budgets and alerting for services.
Slash command: /systems-design:slo-workshop
This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.
Guide teams through the complete SLO definition process:
First, understand the service context:
If a service name or file is provided:
Gather context through questions:
Guide through selecting meaningful SLIs:
Present SLI categories:
Common SLI Types:
1. Availability
"Can users access the service?"
Measurement: Successful requests / Total requests
2. Latency
"How fast does the service respond?"
Measurement: Request duration at percentile (p50, p90, p99)
3. Correctness
"Does the service return correct results?"
Measurement: Correct responses / Total responses
4. Throughput
"Can the service handle the load?"
Measurement: Requests processed per time unit
5. Freshness
"How current is the data?"
Measurement: Age of data served to users
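As a rough illustration of the first two measurements above, the following Python sketch computes an availability ratio and a latency percentile from a list of request records. The `Request` type, its field names, and the rule that only 5xx responses count as failures are assumptions made for this example, not part of the workshop.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int          # HTTP status code
    duration_ms: float   # request duration in milliseconds


def availability(requests: list[Request]) -> float:
    """Successful requests / total requests (assumes 5xx means failure)."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of request duration, e.g. pct=99 for p99."""
    if not requests:
        raise ValueError("no requests to measure")
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return durations[rank - 1]


sample = [Request(200, 40), Request(200, 55), Request(500, 900), Request(200, 70)]
print(availability(sample))            # 0.75
print(latency_percentile(sample, 50))  # 55
```

In practice these values usually come from a metrics backend rather than raw records; the sketch only makes the "good events / valid events" shape of an SLI concrete.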
For each relevant SLI type, define:
Help set appropriate targets:
Consider factors:
Provide guidance:
SLO Target Guidance:
Starting Point Recommendations:
- Availability: Start 0.1 percentage points below the current baseline
- Latency: Start at current p99 + 20% buffer
Common Targets:
- 99.9% = 43 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.2 hours downtime/month
Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative (looser) targets are easier to achieve
- You can tighten targets over time
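The starting-point recommendations can be sketched as two small helpers. This is a minimal sketch that reads "baseline - 0.1%" as 0.1 percentage points; the function names and the rounding are assumptions for illustration.

```python
def starting_availability_target(baseline_pct: float) -> float:
    """Current availability baseline minus 0.1 percentage points."""
    return round(baseline_pct - 0.1, 3)


def starting_latency_target(baseline_p99_ms: float) -> float:
    """Current p99 latency plus a 20% buffer."""
    return round(baseline_p99_ms * 1.2, 1)


# A service measured at 99.95% availability with a 250 ms p99:
print(starting_availability_target(99.95))  # 99.85
print(starting_latency_target(250))         # 300.0
```

Starting slightly below the measured baseline leaves room for normal variation, and the target can be tightened once the service has met it for a few review cycles.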
Define what happens when the error budget is consumed:
Error budget calculation:
Error Budget = 100% - SLO Target
Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes/month
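The same calculation can be sketched in Python, assuming a 30-day month (which is what yields the 43.2 minutes above). The function names are illustrative, and the request-based variant is an added convenience rather than something the workshop prescribes.

```python
def error_budget_fraction(slo_pct: float) -> float:
    """Error budget = 100% - SLO target, returned as a fraction."""
    return round((100 - slo_pct) / 100, 10)


def budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Downtime the budget allows over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return round(error_budget_fraction(slo_pct) * window_minutes, 1)


def budget_requests(slo_pct: float, total_requests: int) -> int:
    """Failed requests the budget allows for a given traffic volume."""
    return int(round(error_budget_fraction(slo_pct) * total_requests))


print(budget_minutes(99.9))              # 43.2
print(budget_minutes(99.5))              # 216.0 (3.6 hours)
print(budget_requests(99.9, 1_000_000))  # 1000
```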
Policy framework:
Error Budget Policy Template:
Budget > 50%:
- Normal development velocity
- Standard change process
Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements
Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements
Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents
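One way to make the policy template checkable is to map the remaining budget to its tier. The sketch below copies the actions from the template; the boundary handling (exactly 25% and exactly 50% both fall in the 25-50% tier) and all names are assumptions.

```python
# Actions copied from the policy template above.
POLICY = {
    "budget > 50%": ["Normal development velocity", "Standard change process"],
    "budget 25-50%": ["Increased review for risky changes",
                      "Prioritize reliability improvements"],
    "budget < 25%": ["Pause non-critical feature work",
                     "Focus on reliability improvements"],
    "budget exhausted": ["Stop all non-critical deployments",
                         "All hands on reliability",
                         "Postmortem for budget-burning incidents"],
}


def policy_tier(remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = untouched) to a tier."""
    if remaining <= 0:
        return "budget exhausted"
    if remaining < 0.25:
        return "budget < 25%"
    if remaining <= 0.50:
        return "budget 25-50%"
    return "budget > 50%"


tier = policy_tier(0.4)
print(tier)             # budget 25-50%
print(POLICY[tier][0])  # Increased review for risky changes
```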
Design multi-window burn rate alerting:
Explain burn rate concept:
Burn Rate Alerting:
Burn rate = How fast the error budget is being consumed, relative to the rate that would spend it exactly over the SLO window
1x burn rate = Budget lasts exactly the 30-day window
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days
Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page)
- Slow burn: 3x rate over 3 days (ticket)
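The burn rate arithmetic and the two alert thresholds can be sketched as follows. This is a minimal sketch assuming a 30-day SLO window; the rule table mirrors the two alerts listed above, and the function names and the `burn_by_window` input shape are hypothetical.

```python
# Alert rules from the text: (name, burn-rate threshold, window, severity).
ALERT_RULES = [
    ("fast burn", 14.4, "1h", "page"),
    ("slow burn", 3.0, "3d", "ticket"),
]


def burn_rate(error_rate: float, slo_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = (100 - slo_pct) / 100
    return error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    return window_days / rate


def firing_alerts(burn_by_window: dict[str, float]) -> list[str]:
    """Return 'name: severity' for each rule whose threshold is met."""
    return [
        f"{name}: {severity}"
        for name, threshold, window, severity in ALERT_RULES
        if burn_by_window.get(window, 0.0) >= threshold
    ]


print(days_to_exhaustion(2))    # 15.0
print(days_to_exhaustion(10))   # 3.0
# 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
observed = {"1h": burn_rate(0.02, 99.9), "3d": 2.0}
print(firing_alerts(observed))  # ['fast burn: page']
```

A 14.4x burn sustained for one hour spends 2% of a 30-day budget (14.4 / 720 hours), which is why it is treated as page-worthy, while the slower threshold catches sustained degradation that can wait for a ticket.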
Generate SLO documentation:
# [Service Name] SLO Definition
## Service Overview
[Description from workshop]
## Critical User Journeys
1. [Journey 1]
2. [Journey 2]
## SLIs
### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]
## SLO Targets
| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |
## Error Budget Policy
### Budget > 50%
[Actions]
### Budget 25-50%
[Actions]
### Budget < 25%
[Actions]
### Budget Exhausted
[Actions]
## Alerting
| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |
## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents
# Start workshop for a specific service
/sd:slo-workshop order-service
# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md
# Start general workshop
/sd:slo-workshop
Throughout the workshop, use AskUserQuestion to:
The workshop produces:
This command leverages:
slo-sli-error-budget - SLO methodology details
observability-patterns - Measurement approaches
distributed-tracing - Trace-based SLIs
For SLO consultation without interactive workshop:
observability-consultant - General observability guidance
Install the plugin:
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design