From hatch3r
Verifies a service is production-ready via 9 reliability gates: SLO definition, kill switch, timeouts, retries, probes, runbook, staged rollout, ambiguity detection, and fan-out discipline. Runs before declaring a feature complete.
Slash command
/hatch3r:hatch3r-reliability-verify
This skill defines what "done" means for any feature shipping a service to production. Run before declaring a feature complete. The 9 gates below are machine-checkable on the manifest, the source, and the alert configuration. Skipping any gate = the feature is not done. Functional tests passing alone do not satisfy this bar — a service that lacks an SLO, a kill switch, or a runbook will fail in production before its first alert reaches the on-call.
Inputs the skill expects:
src/ and k8s/ (or equivalent manifest path).
docs/runbooks/ directory.
slo/ directory or inline SLO definitions in the alert manifest (Prometheus rules, Datadog monitors, OpenSLO YAML).
Outputs the skill produces: a 9-line verdict block written to the PR conversation, plus a JSON artifact at .audit-workspace/reliability-verify-<sha>.json for downstream consumption by hatch3r-release (or any downstream release-prep skill).
Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per agents/shared/user-question-protocol.md. Do not proceed on a silent assumption. Asking is the default path, not an exception. Triggers for THIS skill: service scope, SLO target values and window, rollout strategy (canary stages, hold durations), kill-switch authority and provider, and blast-radius rollback drill cadence.
Fan-out scales with task size; token cost never justifies serializing independent work (rules/hatch3r-fan-out-discipline.md P8 B2; agents/shared/efficiency-patterns.md). Emit sub_agents_spawned: { count, rationale } in your output.
This skill is the verification HARNESS — it declares HOW each reliability gate is checked. The DISPATCHER that decides WHEN to run it is the CQ specialist agent:
agents/hatch3r-reliability.md: invokes this skill as a closing reliability gate (CQ4), alongside skills/hatch3r-observability-verify for the telemetry sub-vector. The agent contributes the review trigger and Phase-4 dispatch; this skill contributes the 9-gate procedure (SLO, kill switch, timeouts, retries, probes, runbook, staged rollout).
No duplication: the agent decides WHEN, this skill defines HOW. The agent body cites this skill (agents/hatch3r-reliability.md: "cite skills/hatch3r-reliability-verify ... as the closing gates"); this subsection is the symmetric upstream citation per rules/hatch3r-agent-orchestration.md (Phase-4 dispatch).
availability >= 99.9% over rolling 28d or p95 latency <= 300ms over rolling 28d.
slo/<service>.yaml or a Sloth / OpenSLO file).
slo: or objectives: in the service manifest; reject if absent.
rules/hatch3r-observability-metrics.md.
docs/runbooks/<service>.md next to the alert that would trigger its use.
rules/hatch3r-operability.md §Feature Flags.
context.WithDeadline (Go), chained AbortSignal (Web/Node), Deadline metadata (gRPC), or TimeLimiter (JVM).
rules/hatch3r-resilience-patterns.md §Timeouts.
opossum (Node), resilience4j (JVM), Polly (.NET), gobreaker + cenkalti/backoff (Go), or pybreaker + tenacity (Python).
sleep = min(cap, random_between(base, prev_sleep * 3)) with base 100ms, cap 30s, max 3 retries.
Idempotency-Key header present on retried non-idempotent operations (POST, PATCH).
rules/hatch3r-resilience-patterns.md §Retry.
livenessProbe, readinessProbe, and (for slow-starting services) startupProbe.
/health/live, /health/ready, /health/startup, not a single shared /health.
livenessProbe.httpGet.path != readinessProbe.httpGet.path (shared endpoints fail this gate).
rules/hatch3r-operability.md §Probes.
/health/ready to 503, then drains in-flight requests.
preStop hook delays 1–3s before SIGTERM to handle the endpoint-propagation race.
terminationGracePeriodSeconds >= 45.
rules/hatch3r-operability.md §Graceful Shutdown.
runbook_url annotation linking to docs/runbooks/<alert-name>.md.
runbook_url or with a 404 link.
rules/hatch3r-operability.md §Runbook URL.
Rollout or Canary resource in the deploy directory; reject if missing or if steps: skips the 1% stage.
rules/hatch3r-progressive-delivery.md.
"## Blast radius" section; reject if absent or if any required field is empty.
rules/hatch3r-progressive-delivery.md §Blast-Radius Reasoning.
All 9 gates pass = the service is "done" enough to ship to production. Anything less = not done; the missing gates are findings against this skill.
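The retry gate's backoff formula, sleep = min(cap, random_between(base, prev_sleep * 3)) with base 100ms, cap 30s, and max 3 retries, can be sketched in Python as follows. This is an illustrative sketch, not the skill's own checker: the names next_sleep and backoff_schedule are hypothetical, and seeding the first prev_sleep with base is an assumption the source does not state.

```python
import random

BASE_S = 0.1      # base 100ms, from the gate text
CAP_S = 30.0      # cap 30s, from the gate text
MAX_RETRIES = 3   # max 3 retries, from the gate text


def next_sleep(prev_sleep, base=BASE_S, cap=CAP_S, rng=random):
    """Decorrelated jitter: min(cap, random_between(base, prev_sleep * 3))."""
    return min(cap, rng.uniform(base, prev_sleep * 3))


def backoff_schedule(base=BASE_S, cap=CAP_S, max_retries=MAX_RETRIES, rng=random):
    """Return the sleep (in seconds) to take before each retry attempt."""
    sleeps = []
    prev = base  # assumption: the first draw uses base as prev_sleep
    for _ in range(max_retries):
        prev = next_sleep(prev, base, cap, rng)
        sleeps.append(prev)
    return sleeps


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_schedule(), start=1):
        print(f"retry {attempt}: sleep {delay:.3f}s")
```

Each delay is drawn from a range that depends on the previous delay rather than on the attempt number, which keeps concurrent clients from retrying in lockstep while the cap bounds the worst case.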
The orchestrator running this skill emits a single-line verdict per gate (GATE_N: PASS|FAIL <evidence-path>) and aggregates them. One FAIL on a required gate blocks the merge regardless of functional-test status.
Evidence path format: path/to/file.yaml:LN or commit-sha. The verdict is auditable — a downstream review or release-gate skill can replay the same checks against the same evidence paths and reproduce the verdict bit-for-bit.
Gates run independently — a FAIL on Gate 3 does not short-circuit the remaining gates; the run produces the full 9-line verdict so the developer fixes everything in one pass rather than serializing on rerun cycles.
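The verdict behaviour described above (one GATE_N: PASS|FAIL <evidence-path> line per gate, no short-circuit on failure, any FAIL blocks the merge) might look like the sketch below. The GateResult type, run_gates, and merge_blocked are hypothetical names, the evidence strings in the usage example are invented placeholders, and the sketch treats every gate as required.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check returns (passed, evidence), where evidence is a file path with a
# line reference or a commit SHA, per the evidence path format.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class GateResult:
    number: int
    passed: bool
    evidence: str

    def line(self) -> str:
        """Render the single-line verdict for this gate."""
        status = "PASS" if self.passed else "FAIL"
        return f"GATE_{self.number}: {status} {self.evidence}"


def run_gates(checks: List[Check]) -> List[GateResult]:
    """Run every gate; a FAIL never short-circuits the remaining gates."""
    results = []
    for number, check in enumerate(checks, start=1):
        passed, evidence = check()
        results.append(GateResult(number, passed, evidence))
    return results


def merge_blocked(results: List[GateResult]) -> bool:
    """Assumes all gates are required: any single FAIL blocks the merge."""
    return any(not r.passed for r in results)


if __name__ == "__main__":
    # Hypothetical checks with placeholder evidence paths.
    demo = [
        lambda: (True, "slo/checkout.yaml:L3"),
        lambda: (False, "k8s/deployment.yaml:L12"),
    ]
    results = run_gates(demo)
    print("\n".join(r.line() for r in results))
    print("merge blocked:", merge_blocked(results))
```

Because each result carries its evidence path, a downstream release-gate skill could re-run the same checks against the same paths and compare the rendered lines.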
hatch3r-implementer finishes service code and before hatch3r-qa-validation runs.
src/services/, src/handlers/, src/clients/, k8s/, manifests/, or the alert / SLO configuration.
rules/hatch3r-resilience-patterns.md: circuit breakers, retries with decorrelated jitter, idempotency keys.
rules/hatch3r-operability.md: probes, graceful shutdown, kill switches, runbooks.
rules/hatch3r-progressive-delivery.md: canary, blue-green, auto-rollback on SLO burn.
rules/hatch3r-observability-metrics.md: SLOs, RED metrics, burn-rate alerts.
sre.google/workbook
kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes
argoproj.github.io/argo-rollouts
flagger.app
openfeature.dev
github.com/nodeshift/opossum
resilience4j.readme.io
pollydocs.org
sloth.dev
openslo.com
npx claudepluginhub hatch3r/hatch3r --plugin hatch3r