From godmode
Guides site reliability engineering: defines SLO/SLI/SLA, manages error budgets, reduces toil, sets up on-call rotations, creates runbooks, and handles incidents.
Slash command: /godmode:reliability
- `/godmode:reliability`, "SRE", "SLO", "error budget"
grep -r "healthcheck\|health-check\|/health" \
  --include="*.ts" --include="*.py" -l . 2>/dev/null
grep -r "pagerduty\|opsgenie\|alertmanager" \
  --include="*.yaml" --include="*.yml" -l . 2>/dev/null
Service: <name> | Criticality: Tier 1/2/3
Current state: monitoring, alerting, on-call, runbooks
Dependencies: <upstream and downstream>
Hierarchy: SLA (external) -> SLO (internal, stricter) -> SLI (measured metric).
SLI categories: availability (success/total), latency (requests < threshold / total), throughput, correctness, freshness, durability.
Error budget = 1 - SLO.
Errors: HTTP 5xx, timeouts, circuit breaker rejections. NOT 4xx or 429.
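As a rough illustration of the error definition above, the following sketch classifies requests and computes an availability SLI as success/total. The `Request` fields and the function names are hypothetical and not part of godmode; real systems would read these signals from logs or metrics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int                     # HTTP status code (0 if none was received)
    timed_out: bool = False         # request exceeded its deadline
    circuit_rejected: bool = False  # rejected by an open circuit breaker


def counts_against_slo(req: Request) -> bool:
    """5xx, timeouts and circuit breaker rejections count; 4xx and 429 do not."""
    if req.timed_out or req.circuit_rejected:
        return True
    return 500 <= req.status <= 599


def availability_sli(requests: Iterable[Request]) -> float:
    """Availability SLI = good requests / total requests."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in reqs if not counts_against_slo(r))
    return good / len(reqs)
```

With six requests (200, 404, 429, 503, one timeout, one circuit breaker rejection), three count as errors, so the SLI is 0.5.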
IF error rate >0.1%: investigate top 3 error classes. WHEN SLO budget <10% remaining: freeze deploys.
Policy:
>50% remaining: normal operations
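To make the budget arithmetic concrete, here is a minimal sketch of `Error budget = 1 - SLO` and of the deploy freeze at under 10% budget remaining. Function names and the 30-day window are assumptions; only the freeze threshold is taken from the text, and intermediate policy tiers are not modelled.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail under the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of full downtime (window is an assumption)."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, total: int, bad: int) -> float:
    """Unspent share of the budget; goes negative once overspent."""
    if total == 0:
        return 1.0  # assumption: no traffic spends no budget
    allowed_bad = error_budget(slo) * total
    return 1.0 - bad / allowed_bad


def should_freeze_deploys(remaining: float) -> bool:
    """Freeze rule from the text: less than 10% of the budget remaining."""
    return remaining < 0.10
```

For a 99.9% SLO this gives roughly 43.2 minutes of downtime per 30 days. With 1,000,000 requests and 250 errors, about 75% of the budget remains, so deploys continue.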
Multi-window burn rate alerts:
Both windows must trigger (reduces false positives).
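The two-window condition can be sketched as below. Burn rate is the observed error rate divided by the error budget, and a page fires only when the long and the short window both exceed the threshold. The window sizes and thresholds are not given in this document. The 14.4x value used in the example is the commonly cited pairing for 1h/5m windows on a 30-day SLO and is an assumption here.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1.0 - slo)


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo: float,
                threshold: float) -> bool:
    """Fire only if BOTH windows burn at or above the threshold."""
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)


# Assumed example values: 99.9% SLO, 14.4x threshold (1h long / 5m short window).
SLO = 0.999
FAST_BURN = 14.4

sustained = should_page(0.02, 0.02, SLO, FAST_BURN)   # both windows hot
recovered = should_page(0.02, 0.001, SLO, FAST_BURN)  # short window has cooled
```

The second case shows why the short window helps: once the error rate has already dropped, the long window alone would still page.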
Toil = manual, repetitive, automatable, tactical, no enduring value. Inventory with frequency + hours. Target: <50% of team capacity. Automate top 3.
IF toil > 50%: stop feature work, automate.
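A toil inventory with frequency and hours can be ranked mechanically. The sketch below is one possible shape. The `ToilItem` fields, the sample tasks and the 200 hours/week team capacity are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToilItem:
    """Hypothetical inventory row: frequency plus hours per occurrence."""
    name: str
    runs_per_week: float
    hours_per_run: float

    @property
    def hours_per_week(self) -> float:
        return self.runs_per_week * self.hours_per_run


def toil_share(items: List[ToilItem], team_hours_per_week: float) -> float:
    """Fraction of team capacity spent on toil."""
    return sum(i.hours_per_week for i in items) / team_hours_per_week


def automation_candidates(items: List[ToilItem], n: int = 3) -> List[ToilItem]:
    """Top-n items by weekly hours, i.e. the 'automate top 3' list."""
    return sorted(items, key=lambda i: i.hours_per_week, reverse=True)[:n]


def must_stop_feature_work(share: float) -> bool:
    """Rule from the text: toil above 50% of capacity halts feature work."""
    return share > 0.50


inventory = [
    ToilItem("cert rotation", 1, 2.0),
    ToilItem("manual deploy", 10, 0.5),
    ToilItem("log cleanup", 5, 0.5),
    ToilItem("access requests", 24, 0.25),
]
share = toil_share(inventory, team_hours_per_week=200)  # 15.5 / 200
top3 = [i.name for i in automation_candidates(inventory)]
```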
Minimum 5 engineers. Primary + secondary. Escalation: L1(0m) -> L2(15m) -> L3(30m) -> L4(60m). Health: <5 pages/shift, <1 during sleep, MTTA <5min, MTTR <30min, false positive <20%. Max 1 week in 5. Day off after off-hours SEV1.
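The escalation ladder above maps directly to a lookup on minutes since the page went unacknowledged. This is a minimal sketch with hypothetical names. The on-call tool itself (for example PagerDuty or Opsgenie) would normally enforce this.

```python
import bisect

# (minutes unacknowledged, level) taken from the ladder in the text.
ESCALATION = [(0, "L1"), (15, "L2"), (30, "L3"), (60, "L4")]
_STARTS = [minutes for minutes, _ in ESCALATION]


def escalation_level(minutes_unacked: float) -> str:
    """Return the level that should currently hold the page."""
    if minutes_unacked < 0:
        raise ValueError("minutes_unacked must be non-negative")
    idx = bisect.bisect_right(_STARTS, minutes_unacked) - 1
    return ESCALATION[idx][1]


def rotation_is_sustainable(engineers: int) -> bool:
    """'Max 1 week in 5' implies at least 5 engineers in the rotation."""
    return engineers >= 5
```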
Every pageable alert needs: what is happening, user impact, diagnostic steps (commands), mitigation options (commands), escalation, post-incident actions. Levels: L0 Manual -> L1 Assisted -> L2 Semi-auto -> L3 Full auto.
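One way to keep runbooks honest is a lint step that reports which required parts are missing. The sketch below assumes runbooks are plain text or markdown and that each part appears under a heading containing the phrase. That matching convention is an assumption, not something the skill specifies.

```python
from typing import List

# Required parts of a runbook, as listed in the text.
REQUIRED_SECTIONS = (
    "what is happening",
    "user impact",
    "diagnostic steps",
    "mitigation options",
    "escalation",
    "post-incident actions",
)


def missing_sections(runbook_text: str) -> List[str]:
    """Return required sections whose phrase does not appear (case-insensitive)."""
    text = runbook_text.lower()
    return [s for s in REQUIRED_SECTIONS if s not in text]


draft = """
## What is happening
Checkout error rate is above the SLO threshold.
## User impact
Some payments fail.
## Diagnostic steps
kubectl logs deploy/checkout --since=10m
"""
gaps = missing_sections(draft)
```

Here `gaps` lists the mitigation, escalation and post-incident parts, so the draft would not yet qualify for a pageable alert.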
Lifecycle: Detection -> Triage -> Mitigation -> Resolution -> Post-mortem -> Prevention. Severity: SEV1 (<15min), SEV2 (<30min), SEV3 (<2h), SEV4 (next day). Roles: IC, Tech Lead, Comms, Scribe.
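The severity targets can be turned into deadlines for tracking. In this sketch the targets are read as time limits counted from detection, and SEV4's "next day" is approximated as 24 hours. Both readings are assumptions, as are the function names.

```python
from datetime import datetime, timedelta

# Targets from the text; SEV4 "next day" approximated as 24h (assumption).
TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": timedelta(days=1),
}


def target_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which the severity's target should be met."""
    return detected_at + TARGETS[severity]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True once the target window has elapsed."""
    return now > target_deadline(severity, detected_at)
```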
Readiness checklist: SLOs, error budget alerts, dashboards, logging, tracing, alerts, runbooks, on-call, circuit breakers, timeouts, auto-scaling, canary deploy, rollback.
Append .godmode/reliability-results.tsv:
timestamp service slo_count budget_remaining_pct alerts runbooks status
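Appending a row with those columns might look like the sketch below. Writing a header row on first use, the UTC ISO timestamp format and the `KEEP`/`DISCARD` status values are assumptions. The text only names the file and its columns.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

COLUMNS = ["timestamp", "service", "slo_count",
           "budget_remaining_pct", "alerts", "runbooks", "status"]


def append_result(path, service, slo_count, budget_remaining_pct,
                  alerts, runbooks, status):
    """Append one tab-separated row, creating the file (and header) if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if needs_header:
            writer.writerow(COLUMNS)  # assumption: file carries a header row
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        writer.writerow([timestamp, service, slo_count,
                         budget_remaining_pct, alerts, runbooks, status])
```

A call such as `append_result(".godmode/reliability-results.tsv", "checkout", 3, 72.5, 4, 4, "KEEP")` adds one line per iteration.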
KEEP if: SLO measurement works AND alerts fire correctly AND runbook is actionable.
DISCARD if: false positives OR measurement broken OR runbook is vague.
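The KEEP/DISCARD rule reduces to a small boolean check. In this sketch the field names are hypothetical. A change that meets neither rule (for example, alerts that never fire but also produce no false positives) is treated as DISCARD, which is an assumption about the intended default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationCheck:
    """Hypothetical outcome of verifying one reliability change."""
    slo_measurement_works: bool
    alerts_fire_correctly: bool
    runbook_actionable: bool
    false_positives: bool = False


def decide(check: IterationCheck) -> str:
    """Apply the KEEP and DISCARD rules; anything not clearly KEEP is DISCARD."""
    discard = (check.false_positives
               or not check.slo_measurement_works
               or not check.runbook_actionable)
    keep = (check.slo_measurement_works
            and check.alerts_fire_correctly
            and check.runbook_actionable)
    return "KEEP" if keep and not discard else "DISCARD"
```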
STOP when ALL of:
- SLOs defined and measurable (tier-1 services)
- Burn rate alerts configured and tested
- On-call rotation active with escalation
- Runbooks exist for all critical alerts
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| SLO always breached | Set at current p95, improve later |
| Alert fatigue | Increase threshold, multi-signal alerts |
| Runbook outdated | Add to deploy checklist, test quarterly |
| Budget depleted fast | Freeze deploys, fix top errors |
npx claudepluginhub arbazkhan971/godmode