From ultraship
Diagnoses and recovers from production incidents with structured triage, severity classification, and root cause analysis. Use when service is down, errors spike, or users report critical bugs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ultraship:rescue <url-or-error-description><url-or-error-description>The summary Claude sees in its skill listing — used to decide when to auto-load this skill
When production is down, every minute costs trust. This skill runs an incident like a principal SRE — fast triage, clear decision-making, structured recovery, and prevention so it never happens again.
When production is down, every minute costs trust. This skill runs an incident like a principal SRE — fast triage, clear decision-making, structured recovery, and prevention so it never happens again.
Before doing anything, classify the incident:
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Service completely down, all users affected | Immediately | Site returns 500, database unreachable |
| SEV-2 | Major feature broken, many users affected | Within 15 min | Auth broken, payments failing, data loss |
| SEV-3 | Minor feature broken, some users affected | Within 1 hour | One API endpoint slow, email not sending |
| SEV-4 | Cosmetic or edge case | Next business day | UI glitch on one browser, non-critical error log |
Severity determines urgency. SEV-1/2: restore first, investigate later. SEV-3/4: investigate first, then fix.
Ask for:
If the user is panicking, skip questions and use whatever info is available. Speed > completeness for SEV-1.
node ${CLAUDE_PLUGIN_ROOT}/tools/incident-commander.mjs <project-directory> --url=<production-url>
Parse the JSON output.
If a Sentry MCP server is connected (check your available tools — search for sentry tools), pull live production errors before guessing at code: list the most recent / most frequent issues since the incident window, read the top stack traces, and map each frame back to a file and line in this repo. A real stack trace from production beats inferring the culprit from recent commits. Use the actual error signature to narrow the suspect commit. If no Sentry server is connected, continue with the static diagnostics above (and mention that connecting Sentry would sharpen this step).
Present findings in order of urgency:
Site Status:
Likely Culprit:
Error Patterns Found:
Resource Issues:
Present in order of speed — for SEV-1/2, always recommend Option 1 first:
Option 1: Rollback (fastest — 2-5 min)
git revert <culprit-hash> --no-edit && git push
This is almost always the right first move. Restore service, then investigate.
When NOT to rollback:
Option 2: Hot Fix (5-15 min) If the error pattern is clear and the fix is small:
fix: [what was broken] — incident [date]Option 3: Traffic Management (immediate) If the issue is load-related:
Option 4: Investigate Further If the cause isn't clear:
railway logs, Vercel: function logs)After applying a fix:
node ${CLAUDE_PLUGIN_ROOT}/tools/health-check.mjs <production-url>
Confirm the site is back to healthy status. Check:
For SEV-1/2, the user needs to communicate with their users:
Status page update template:
[Investigating] We're aware of [issue description] and are actively working on a fix.
[Identified] We've identified the cause and are deploying a fix.
[Resolved] The issue has been resolved. [Brief explanation]. We apologize for the disruption.
If the user has a status page: help them post the update. If they don't: suggest setting up a simple one (Instatus, Betteruptime, or a static page).
Generate a post-mortem document from the incident-commander output:
# Incident Post-Mortem — [Date]
## Summary
- **What happened:** [One sentence]
- **Severity:** SEV-[N]
- **Duration:** [start time] to [end time] ([N] minutes)
- **Impact:** [Who was affected, what they experienced]
- **Root cause:** [One sentence]
## Timeline
| Time | Event |
|---|---|
| HH:MM | Issue detected (how: monitoring/user report/deploy) |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |
## Root Cause Analysis
[Detailed explanation of what went wrong and why]
## What Went Well
- [Fast detection, quick recovery, etc.]
## What Went Wrong
- [Missed in review, no test coverage, no monitoring, etc.]
## Action Items
| Action | Priority | Owner | Deadline |
|---|---|---|---|
| Add test for this failure case | High | [user] | This week |
| Add monitoring for [pattern] | High | [user] | This week |
| Add pre-deploy check that would have caught this | Medium | [user] | This sprint |
| [Update runbook/docs] | Low | [user] | This month |
Save to docs/incidents/YYYY-MM-DD-incident.md.
Based on the incident, suggest concrete preventive measures:
Immediate (today):
/ship pre-deploy auditThis week:
/health or /api/health)This month:
npx claudepluginhub houseofmvps/ultraship --plugin ultrashipIncident response — diagnose production issues, find root cause, propose fix with rollback. Use when asked about "something is broken", "production issue", "why is this down", "incident", or "debug production".
Diagnoses production incidents by detecting environment, gathering symptoms, reading logs with Grep/Bash, checking metrics, tracing requests to find root causes and propose fixes with rollbacks.
Executes structured production incident response: triages P1-P3 severity, contains blast radius (rollback, mitigation), root-causes after stabilization, logs timeline, generates postmortem. Triggers on outages or 'incident'.