From hatch3r
Responds to production incidents using a structured workflow: classify severity, triage impact, mitigate, root-cause, and write a blameless post-mortem. Use for outages, production issues, or security incidents.
Slash command
/hatch3r:hatch3r-incident-response
commands/hatch3r-incident-response.md (Decision 13 handoff)
This skill shares the `id: hatch3r-incident-response` with the orchestrator command `commands/hatch3r-incident-response.md`. The two are NOT duplicates; they split the incident workflow by execution model per CONSTITUTION §6 Decision 13:
- commands/hatch3r-incident-response.md (orchestrator entry): the delegated live-incident pipeline. A hatch3r-incident-responder specialist drives triage → bounded-autonomy mitigation → communication → blameless post-mortem, and a hatch3r-reliability specialist runs the post-incident telemetry/SLO reconstruction in parallel once stabilized (agentPipeline: [hatch3r-incident-responder, hatch3r-reliability]). Use the command for a live production incident that warrants specialist fan-out (parallel mitigation + reliability reconstruction); a security-suspected incident adds hatch3r-security.

The merge-candidate review (F16.3-H3) flagged the shared id; this handoff documentation is the explicit workflow-split declaration that disambiguates the pair, enforced by the Decision-13 command↔skill gate in src/cli/commands/validate.ts. A future collapse into a single command appendix requires coordinated edits to the command body, the bundled content inventory (skills count), and that gate.
Task Progress:
- [ ] Step 0: Detect ambiguity (P8 B1)
- [ ] Step 1: Classify severity (P0-P3) based on impact
- [ ] Step 1b: Capture topology context — impacted service graph + upstream/downstream deps
- [ ] Step 2: Triage — identify affected systems, user impact, blast radius
- [ ] Step 3: Mitigate — apply hotfix or rollback, verify mitigation works
- [ ] Step 4: Root cause analysis — trace the failure chain
- [ ] Step 5: Write post-mortem with timeline, root cause, action items
- [ ] Step 6: Create follow-up issues for permanent fixes and preventive measures
Two policies bound this workflow before any remediation action runs: the Bounded Autonomy & Escalation matrix (which actions auto-execute vs require a human gate, by severity) and the Telemetry Sources adapter (where to read signal, by capability class). Both are defined below; read them before Step 3.
Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per agents/shared/user-question-protocol.md. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: user-facing impact vs internal-only, blast radius known (single tenant vs all users), rollback safety verified, stakeholder notification scope (engineering vs exec vs public), and whether mitigation requires data write (irreversible) vs config flip (reversible).
Fan-out scales with task size; token cost never justifies serializing independent work (rules/hatch3r-fan-out-discipline.md P8 B2; agents/shared/efficiency-patterns.md). Emit sub_agents_spawned: { count, rationale } in your output.
An agent acting in a live incident operates under bounded autonomy: actions are bounded, reversible-first, and gated on a human-in-the-loop for high-blast-radius severities. This matrix sets the auto-action vs escalation threshold by severity. Match the row to the severity assigned in Step 1.
| Severity | Autonomy bound | Required gate before action |
|---|---|---|
| P0 | No autonomous mutation. Investigate, build the timeline, propose a diff — do not apply. | Human approval required before any mitigation. Page on-call; do not self-execute. |
| P1 | Confidence-routed. High-confidence reversible action (flag flip, scale-up, documented rollback) may auto-apply WITH a diff preview emitted first; low-confidence or irreversible action escalates. | Human gate when confidence is medium/low OR the action writes data / is irreversible. |
| P2 | Auto-remediation acceptable for reversible actions with a diff preview emitted before apply. | Human gate only for irreversible (data-write, schema, destructive) actions. |
| P3 | Auto-remediation acceptable. | None for reversible actions; flag irreversible actions for review. |
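The matrix above can be read as a small decision function. The following is a minimal sketch, assuming each proposed action carries a reversibility flag and a confidence label; the names `Decision` and `autonomy_decision` are hypothetical and not part of hatch3r.

```python
from enum import Enum


class Decision(Enum):
    HUMAN_GATE = "human gate before any mitigation"
    AUTO_WITH_DIFF_PREVIEW = "auto-apply after emitting a diff preview"
    AUTO = "auto-apply"
    FLAG_FOR_REVIEW = "flag for review"


def autonomy_decision(severity: str, reversible: bool, confidence: str = "high") -> Decision:
    """Map one proposed action to the matrix row for its severity.

    `confidence` ("high", "medium", "low") only changes the outcome for P1.
    """
    if severity == "P0":
        # No autonomous mutation: propose a diff, never apply it.
        return Decision.HUMAN_GATE
    if severity == "P1":
        # Confidence-routed: only high-confidence reversible actions auto-apply.
        if reversible and confidence == "high":
            return Decision.AUTO_WITH_DIFF_PREVIEW
        return Decision.HUMAN_GATE
    if severity == "P2":
        return Decision.AUTO_WITH_DIFF_PREVIEW if reversible else Decision.HUMAN_GATE
    if severity == "P3":
        return Decision.AUTO if reversible else Decision.FLAG_FOR_REVIEW
    raise ValueError(f"unknown severity: {severity!r}")


print(autonomy_decision("P1", reversible=True, confidence="medium").name)  # HUMAN_GATE
print(autonomy_decision("P2", reversible=True).name)  # AUTO_WITH_DIFF_PREVIEW
```

Raising on an unknown severity is a deliberate choice in this sketch: an unclassified incident should stop at Step 1 rather than fall through to an auto-apply branch.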
Rules that hold across all rows:
agents/shared/quality-charter.md section 1) on every proposed mitigation. Medium/low confidence on P1 routes to a human gate.
Capture signal from the project's observability stack before declaring blast radius (Step 1b) and when verifying mitigation (Step 3). Read the configured stack from project conventions; do not assume a vendor. Adapter patterns by capability class:
| Capability class | What to read | Common providers |
|---|---|---|
| Distributed traces | Request path spans, trace_id/span_id correlation, latency percentiles | OpenTelemetry-compatible backend, Datadog APM, Grafana Tempo |
| Metrics (RED/USE) | Rate, Errors, Duration per route; Utilization, Saturation, Errors per resource | Prometheus/Grafana, CloudWatch, Datadog |
| Logs | Structured JSON with trace_id, service+version+environment, error-level stack traces | Splunk, CloudWatch Logs, Loki |
| Error tracking | Grouped exceptions with release + environment tags | Sentry-class tracker |
| Deploy/change history | Recent deploys, config changes, dependency bumps correlated to incident start | Platform CLI (gh/az/glab), CI deploy log |
Reference rules/hatch3r-observability-tracing.md and rules/hatch3r-observability-logging.md for the end-user instrumentation floor these sources read from. When a capability class is not instrumented in the project, record the gap as a post-mortem action item rather than assuming data exists.
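The gap-recording rule can be sketched as a check of the configured stack against the five capability classes in the table. This is an illustration only: the class keys, the `telemetry_gaps` helper, and the shape of the `configured` mapping are assumptions, not a hatch3r configuration format.

```python
# Capability classes from the table above, as hypothetical config keys.
CAPABILITY_CLASSES = (
    "distributed_traces",
    "metrics",
    "logs",
    "error_tracking",
    "deploy_history",
)


def telemetry_gaps(configured: dict) -> list:
    """Return one post-mortem action item per capability class with no source."""
    return [
        f"Instrument {name}: no source was available during the incident"
        for name in CAPABILITY_CLASSES
        if not configured.get(name)
    ]


# Example stack where only metrics and logs are instrumented.
stack = {"metrics": "Prometheus/Grafana", "logs": "Loki"}
for item in telemetry_gaps(stack):
    print(item)
```

Running the check before Step 1b makes the missing classes explicit, so a blast-radius estimate is not silently based on data that does not exist.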
| Severity | Definition | Examples |
|---|---|---|
| P0 | Complete outage, data loss, security breach | App unusable, auth down, data exposed |
| P1 | Major degradation, significant user impact | Sync failing, billing broken, >1% error rate |
| P2 | Partial degradation, limited impact | Single flow broken, slow performance |
| P3 | Minor issue, workaround available | Cosmetic bug, edge case |
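As a rough illustration of the table above, severity can be assigned by checking the most severe condition first. The flag names are hypothetical, and treating an error rate above 1% as P1 is borrowed from the table's example column, not from a stated rule.

```python
def classify_severity(
    *,
    complete_outage: bool = False,
    data_loss: bool = False,
    security_breach: bool = False,
    major_degradation: bool = False,
    error_rate: float = 0.0,
    partial_degradation: bool = False,
) -> str:
    """Return P0-P3, testing the highest severity first."""
    if complete_outage or data_loss or security_breach:
        return "P0"
    # Assumption: >1% error rate counts as major degradation (table example).
    if major_degradation or error_rate > 0.01:
        return "P1"
    if partial_degradation:
        return "P2"
    return "P3"


print(classify_severity(error_rate=0.03))  # P1
print(classify_severity(partial_degradation=True))  # P2
```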
platform in .hatch3r/hatch.json):
- issue_read, search_issues) or gh issue list --search "..."
- az boards query --wiql "SELECT [System.Id] FROM WorkItems WHERE [System.Title] CONTAINS '...'" or az boards work-item show --id N
- glab issue list --search "..." or glab issue view N
Before declaring blast radius in Step 2, map the impacted service graph. A correct blast-radius estimate depends on knowing what the failing component talks to: upstream callers amplify user impact, and downstream dependencies are candidate root causes.
Output a one-line topology summary: impacted: {node} | upstream callers: {list} | downstream deps: {list}. Step 2 blast-radius estimation consumes this directly.
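The one-line summary can be produced from a list of call edges. This is a minimal sketch, assuming the service graph is available as (caller, callee) pairs; the `topology_summary` helper and the service names are hypothetical, and only direct neighbors are listed.

```python
def topology_summary(node: str, edges: list) -> str:
    """Format the Step 1b summary from (caller, callee) edge pairs."""
    upstream = sorted({caller for caller, callee in edges if callee == node})
    downstream = sorted({callee for caller, callee in edges if caller == node})

    def fmt(names: list) -> str:
        return ", ".join(names) if names else "none"

    return (
        f"impacted: {node} | upstream callers: {fmt(upstream)}"
        f" | downstream deps: {fmt(downstream)}"
    )


edges = [
    ("web", "checkout"),
    ("mobile-api", "checkout"),
    ("checkout", "payments-db"),
    ("checkout", "billing"),
]
print(topology_summary("checkout", edges))
# impacted: checkout | upstream callers: mobile-api, web | downstream deps: billing, payments-db
```

Transitive callers and dependencies are left out on purpose; a deeper traversal would widen the blast-radius estimate and belongs in Step 2.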
Write a structured post-mortem document:
Store in project incident docs or as an issue/wiki page on the platform. Follow project conventions.
platform in .hatch3r/hatch.json):
- gh issue create --title "..." --body "..." --label "incident-follow-up" (or use GitHub MCP issue_create)
- az boards work-item create --type "Bug" --title "..." --description "..." --fields "System.Tags=incident-follow-up"
- glab issue create --title "..." --description "..." --label "incident-follow-up"
incident-follow-up, P0, P1).
npx claudepluginhub hatch3r/hatch3r --plugin hatch3r