Skill

slo-workshop

Guides an interactive workshop to define SLIs from user journeys, set SLO targets based on baselines, and establish error budgets and alerting for services.

monitoring

devops

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/systems-design:slo-workshop

User invocable

Model invocable

Tool Access

This skill is limited to the following tools:

Read, Glob, Grep, Task, AskUserQuestion

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.

SKILL.md

265 lines · ~1.4k tokens

Stats

Language: Python

Parent stars: 67

Parent forks: 10

Maintenance: Good

Last Commit: Feb 15, 2026

SLO Workshop Command

This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.

Purpose

Guide teams through the complete SLO definition process:

Identifying critical user journeys
Selecting appropriate SLIs (Service Level Indicators)
Setting realistic SLO targets
Establishing error budget policies
Designing alerting strategies

Workflow

Phase 1: Service Understanding

First, understand the service context:

If a service name or file is provided:

Search the codebase for the service
Identify endpoints, dependencies, and user-facing functionality
Look for existing metrics, SLOs, or monitoring configuration

Gather context through questions:

What does this service do for users?
Who are the primary users (internal/external)?
What are the critical user journeys?
What does "working correctly" mean for users?

Phase 2: SLI Selection

Guide through selecting meaningful SLIs:

Present SLI categories:

Common SLI Types:

1. Availability
   "Can users access the service?"
   Measurement: Successful requests / Total requests

2. Latency
   "How fast does the service respond?"
   Measurement: Request duration at percentile (p50, p90, p99)

3. Correctness
   "Does the service return correct results?"
   Measurement: Correct responses / Total responses

4. Throughput
   "Can the service handle the load?"
   Measurement: Requests processed per time unit

5. Freshness
   "How current is the data?"
   Measurement: Age of data served to users

For each relevant SLI type, define:

What counts as a "good" event
What counts as a "valid" event (denominator)
How it will be measured (metrics, logs, synthetic)
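The good/valid split above can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: the `Request` record, its field names, and the choice to exclude health checks and 4xx responses from the denominator are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int                    # HTTP status code
    duration_ms: float             # server-side latency
    is_health_check: bool = False  # synthetic probe traffic


def is_valid(req: Request) -> bool:
    # Valid events form the denominator. Assumption: synthetic health
    # checks and client errors (4xx) are excluded.
    return not req.is_health_check and not (400 <= req.status < 500)


def availability_sli(requests: list[Request]) -> float:
    """Good events (non-5xx responses) divided by valid events."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the SLI
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of valid requests answered within threshold_ms."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0
    good = sum(1 for r in valid if r.duration_ms <= threshold_ms)
    return good / len(valid)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(404, 15.0),                       # excluded: client error
    Request(200, 5.0, is_health_check=True),  # excluded: synthetic probe
    Request(201, 90.0),
]
print(f"availability: {availability_sli(sample):.2%}")
print(f"latency <= 300 ms: {latency_sli(sample, 300.0):.2%}")
```

Writing the "valid" filter as its own function keeps the denominator identical across SLIs, so availability and latency figures for the same journey stay comparable.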

Phase 3: SLO Target Setting

Help set appropriate targets:

Consider factors:

Current baseline (what are we achieving today?)
User expectations (what do users need?)
Engineering capacity (what can we sustain?)
Business requirements (what's contractually required?)

Provide guidance:

SLO Target Guidance:

Starting Point Recommendations:
- Availability: Start 0.1 percentage points below the current baseline
- Latency: Start at the current p99 plus a 20% buffer

Common Targets (30-day month):
- 99.9% = 43.2 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.2 hours downtime/month

Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative targets are easier to achieve
- You can tighten targets over time
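As a rough sketch of the starting-point recommendations, the snippet below derives initial targets from a measured baseline. The function names, the nearest-rank percentile method, and the sample data are assumptions for illustration; the 0.1 point margin and 20% buffer come from the guidance above.

```python
import math


def nearest_rank_percentile(samples: list[float], pct: float) -> float:
    """Percentile by the nearest-rank method (one of several conventions)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def starting_availability_target(baseline_pct: float,
                                 margin_pct: float = 0.1) -> float:
    # Start slightly below what the service already achieves.
    return round(baseline_pct - margin_pct, 3)


def starting_latency_target(p99_ms: float, buffer: float = 0.20) -> float:
    # Start at the observed p99 plus a relative buffer.
    return p99_ms * (1 + buffer)


# Hypothetical baseline: latencies of 1..100 ms and 99.95% availability.
latencies_ms = [float(v) for v in range(1, 101)]
p99 = nearest_rank_percentile(latencies_ms, 99)

print("availability target (%):", starting_availability_target(99.95))
print("latency target (ms):", round(starting_latency_target(p99), 1))
```

Both helpers only produce a first proposal; the user-expectation, capacity, and contractual factors listed earlier can still move the target in either direction.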

Phase 4: Error Budget Policy

Define what happens when the error budget is consumed:

Error budget calculation:

Error Budget = 100% - SLO Target

Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes/month
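The calculation above translates directly into code. This sketch assumes a 30-day window by default; the function names are illustrative.

```python
def error_budget_fraction(slo_pct: float) -> float:
    """Error budget as a fraction, e.g. 99.9 -> roughly 0.001."""
    if not 0 < slo_pct <= 100:
        raise ValueError("slo_pct must be in (0, 100]")
    return (100.0 - slo_pct) / 100.0


def allowed_downtime_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget permits over the window."""
    return error_budget_fraction(slo_pct) * window_days * 24 * 60


for target in (99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min ({minutes / 60:.1f} h) per 30 days")
```

Passing a different `window_days` shows how the same percentage buys more or less absolute downtime as the SLO window changes.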

Policy framework:

Error Budget Policy Template:

Budget > 50%:
- Normal development velocity
- Standard change process

Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements

Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements

Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents
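One way to make the policy template actionable is to compute the remaining budget from event counts and map it to a tier. The sketch below is an assumption-laden illustration: the tier labels (`normal`, `cautious`, `restricted`, `exhausted`) are invented names for the four bands above, and the event-count inputs are hypothetical.

```python
def budget_remaining(slo_pct: float, good_events: int,
                     valid_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (100.0 - slo_pct) / 100.0 * valid_events
    actual_bad = valid_events - good_events
    if allowed_bad == 0:
        # No budget at all: any bad event exhausts it.
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


def policy_tier(remaining: float) -> str:
    """Map remaining budget to one of the four policy bands."""
    if remaining <= 0:
        return "exhausted"   # stop non-critical deployments
    if remaining < 0.25:
        return "restricted"  # pause non-critical feature work
    if remaining <= 0.50:
        return "cautious"    # increased review for risky changes
    return "normal"          # standard change process


# Hypothetical month: 1,000,000 valid requests against a 99.9% SLO,
# which allows about 1,000 bad requests.
for bad in (400, 600, 800, 1200):
    remaining = budget_remaining(99.9, 1_000_000 - bad, 1_000_000)
    print(f"{bad} bad requests -> {remaining:.0%} left -> {policy_tier(remaining)}")
```

Where exactly the 25% and 50% boundaries fall (inclusive or exclusive) is a choice made here for the example; the template above does not specify it.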

Phase 5: Alerting Strategy

Design multi-window burn rate alerting:

Explain burn rate concept:

Burn Rate Alerting:

Burn rate = How fast the error budget is being consumed, relative to the rate that would use it up exactly at the end of the SLO window

1x burn rate = Budget lasts exactly the 30-day window
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days

Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page)
- Slow burn: 3x rate over 3 days (ticket)

Define alert thresholds based on SLO targets
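The burn rate arithmetic above can be checked with a short sketch. It assumes a 30-day SLO window; the function names are illustrative, and the error-rate threshold is simply the burn rate multiplied by the error budget.

```python
def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the budget spent by the time the alert window has elapsed."""
    return burn_rate * alert_window_hours / (window_days * 24)


def error_rate_threshold(slo_pct: float, burn_rate: float) -> float:
    """Observed error rate (as a fraction) that corresponds to burn_rate."""
    return burn_rate * (100.0 - slo_pct) / 100.0


alerts = [
    ("fast burn (page)", 14.4, 1),     # 14.4x over 1 hour
    ("slow burn (ticket)", 3.0, 72),   # 3x over 3 days
]
for name, rate, hours in alerts:
    print(
        f"{name}: fires above {error_rate_threshold(99.9, rate):.4%} errors, "
        f"after spending {budget_consumed(rate, hours):.0%} of the budget"
    )
```

Under these assumptions, the fast-burn alert fires once about 2% of the monthly budget is spent and the slow-burn alert once about 30% is spent, which is one way to sanity-check the chosen rates and windows against the error budget policy.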

Phase 6: Documentation

Generate SLO documentation:

# [Service Name] SLO Definition

## Service Overview
[Description from workshop]

## Critical User Journeys
1. [Journey 1]
2. [Journey 2]

## SLIs

### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]

## SLO Targets

| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |

## Error Budget Policy

### Budget > 50%
[Actions]

### Budget 25-50%
[Actions]

### Budget < 25%
[Actions]

### Budget Exhausted
[Actions]

## Alerting

| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |

## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents

Usage Examples

# Start workshop for a specific service
/sd:slo-workshop order-service

# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md

# Start general workshop
/sd:slo-workshop

Interactive Elements

Throughout the workshop, use AskUserQuestion to:

Gather service context
Validate SLI selections
Confirm target appropriateness
Review error budget policies

Output

The workshop produces:

SLO Definition Document - Complete SLO specification
Implementation Checklist - Steps to implement the SLOs
Review Schedule - When to revisit and adjust

Related Skills

This command leverages:

slo-sli-error-budget - SLO methodology details
observability-patterns - Measurement approaches
distributed-tracing - Trace-based SLIs

Related Agent

For SLO consultation without the interactive workshop:

observability-consultant - General observability guidance
