From qe-framework
Defines SLOs/SLIs, manages error budgets, designs incident response, automates toil, and runs chaos experiments for production systems.
Slash command: /qe-framework:Qsre-engineer
1. **Assess reliability** - Review architecture, SLOs, incidents, toil levels
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Toil reduction, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |
When implementing SRE practices, provide:
# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests
# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors are consumed in week 1 → 50% of the budget burned in about 23% of the window (7 of 30 days)
# → Trigger error budget policy: freeze non-critical releases
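The arithmetic in the comments above can be checked with a short Python sketch. The function name `request_error_budget` and its return fields are illustrative assumptions, not part of the skill itself.

```python
def request_error_budget(slo: float, total_requests: int, errors_so_far: int,
                         elapsed_days: float, window_days: float = 30.0) -> dict:
    """Summarize request-based error budget consumption for one SLO window."""
    budget_requests = (1 - slo) * total_requests   # failures the SLO tolerates
    budget_used = errors_so_far / budget_requests  # fraction of budget consumed
    window_used = elapsed_days / window_days       # fraction of window elapsed
    return {
        "budget_requests": budget_requests,
        "budget_used": budget_used,
        "window_used": window_used,
        # Consuming budget faster than the window elapses exhausts it early.
        "over_pace": budget_used > window_used,
    }


# Hypothetical inputs matching the example: 10M requests, 5,000 errors after 7 days.
report = request_error_budget(slo=0.999, total_requests=10_000_000,
                              errors_so_far=5_000, elapsed_days=7)
print(report)
```

With these inputs roughly half the budget is gone before a quarter of the window has elapsed, which is the condition that would trigger the release freeze.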
groups:
  - name: slo_availability
    rules:
      # Fast burn: 2% of the 30-day budget in 1h (14.4x burn rate)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.014400
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.014400
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate detected"
          runbook: "https://wiki.internal/runbooks/high-error-burn"
      # Slow burn: 1x burn rate sustained over 6h (about 0.83% of the 30-day budget)
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.001
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
          runbook: "https://wiki.internal/runbooks/slow-error-burn"
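The fast-burn threshold of 0.0144 in the rules above follows from the 99.9% SLO and a 30-day window. A minimal sketch of that derivation, assuming hypothetical helper names (`burn_rate`, `error_ratio_threshold`, `budget_consumed`) that do not appear in the skill:

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as assumed throughout this example


def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def error_ratio_threshold(slo: float, rate: float) -> float:
    """Error ratio at which budget burns at `rate` times the sustainable pace."""
    return (1 - slo) * rate


def budget_consumed(rate: float, hours: float) -> float:
    """Fraction of the window's budget consumed by burning at `rate` for `hours`."""
    return rate * hours / SLO_WINDOW_HOURS


fast = burn_rate(budget_fraction=0.02, alert_window_hours=1)  # about 14.4
threshold = error_ratio_threshold(slo=0.999, rate=fast)       # about 0.0144
print(fast, threshold, budget_consumed(fast, 1))
```

The same helpers can be used to sanity-check any other window and threshold pair before it goes into an alert rule.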
# Latency — 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Traffic — requests per second by service
sum(rate(http_requests_total[5m])) by (service)
# Errors — error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Saturation — CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
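The latency query above relies on `histogram_quantile`, which estimates a percentile from cumulative buckets instead of raw samples. The following is a simplified Python sketch of that interpolation idea, not Prometheus's actual implementation. It assumes buckets sorted by upper bound, cumulative counts, and a final `+Inf` bucket. The bucket values are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile `q` from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds (cumulative counts).
example = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 999), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example)
print(round(p99, 4))
```

Because the estimate is interpolated within one bucket, its accuracy depends on how finely the bucket boundaries are spaced around the latency target.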
service: api-gateway
slo:
  availability: 99.9%
  latency_p99: 500ms
  error_rate: 0.1%
error_budget:
  monthly_budget: 43.2 minutes downtime
  burn_rate: (errors / total_requests) / (1 - availability)
  threshold_alert: 2% burned in 1 hour
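def calc_error_budget(slo: float, window_hours: int) -> dict:
    """Return allowed downtime in minutes and the fast-burn (14.4x) error-ratio threshold."""
    allowed_error = 1 - slo  # fraction of the window (or of requests) allowed to fail
    minutes = window_hours * 60
    return {
        "budget_minutes": allowed_error * minutes,
        "burn_rate_alert": allowed_error * 14.4,
    }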
# Incident: [Service] High Error Rate
**Severity:** P1 | **Detection:** [Alert name] | **Duration:** 5 min
**Root Cause:** [To be filled] | **Impact:** [N users, $X revenue]
**Action Plan:**
1. Page on-call engineer
2. Check [service] logs for errors
3. If DB slow: scale replicas / query optimization
4. Verify error budget impact
# SLO config for [service-name]
# Last reviewed: [DATE] | Owner: [team]
service:
  name: api-gateway
  # Availability SLI: share of requests that did not return a 5xx status
  sli_query: 1 - (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
slo:
  availability: 99.9% # Justification: user-facing API
  latency_p99: 500ms # Justification: customer expectation
error_budget_policy:
  - if burn_rate > 14.4x for 1h: freeze non-critical releases
  - if burn_rate > 1x for 6h: declare SEV2, page on-call
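The two policy rules above can be evaluated mechanically once burn rates per window are known. A minimal sketch, assuming a hypothetical `PolicyRule` type and `actions_for` helper. How the burn rates are measured is outside this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    min_burn_rate: float  # rule fires when the burn rate exceeds this multiple
    window_hours: int     # window over which the burn rate was measured
    action: str


# Mirrors the two rules in the error budget policy above.
POLICY = [
    PolicyRule(14.4, 1, "freeze non-critical releases"),
    PolicyRule(1.0, 6, "declare SEV2, page on-call"),
]


def actions_for(burn_rates: dict[int, float],
                policy: list[PolicyRule] = POLICY) -> list[str]:
    """Return the actions whose rule is exceeded.

    `burn_rates` maps a window length in hours to the burn rate measured over it;
    windows with no measurement are treated as not burning.
    """
    return [
        rule.action
        for rule in policy
        if burn_rates.get(rule.window_hours, 0.0) > rule.min_burn_rate
    ]


# Hypothetical measurements: 20x over the last hour, 3x over the last six hours.
print(actions_for({1: 20.0, 6: 3.0}))
```

Keeping the rules as data makes it straightforward to review thresholds alongside the SLO config instead of burying them in alerting code.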
npx claudepluginhub inho-team/qe-framework --plugin qe-framework