From swd
Conduct a root-cause analysis on a bug, incident, or regression — reproduce the failure, reconstruct the timeline, run a 5-whys chain, distinguish symptom from proximate cause from root cause, sweep for siblings, and propose a fix that addresses the cause (not the symptom). Use when the user says "/rca", "root cause", "5 whys", "why is this failing", "investigate this regression", or after any failure the team wants to learn from rather than just patch.
Slash command: /swd:rca
The point of this skill is **not** to fix a bug. It is to understand *why* the bug happened deeply enough that the fix kills the cause, not just the symptom — and to catch the class of failure next time, not just this instance.
/blueprint plans a change you want to make. /rca investigates a failure you didn't want. The shapes are different: planning starts from a goal, RCA starts from a fact (something broke).
/blueprint's validation reveals that the load-bearing assumption was already wrong in prod
Do not invoke for: trivial bugs with obvious causes (typo, missing null check on a fresh edit, lint error), or for future failures you're trying to design around; that's /blueprint territory.
Before any theorising, write down:
If any of this is ambiguous, ask the user. RCA on a fuzzy symptom is a fishing expedition.
You need one of:
Without either, every subsequent step is speculation. If a repro is elusive, that itself is the finding — surface it and discuss instrumentation before continuing.
When did it start? What changed?
git log --since="<earliest known-good>" --until="<earliest known-bad>" -- <area>
Also check: deploys, dependency bumps, infra changes, feature-flag toggles, data migrations, traffic shifts. The cause is almost always in the delta.
If the bug pre-dates any plausible change, the cause is likely a latent condition that something new started exercising. Look for what's new in the inputs, not the code.
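As a minimal sketch of working the delta, the Python below merges changes from several sources and keeps only those that fall between the last known-good and first known-bad times. The `Change` record, the `changes_in_delta` helper, and every event in the sample list are hypothetical, invented for illustration; real entries would come from git, the deploy history, and the flag service.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    when: datetime
    kind: str      # e.g. "commit", "deploy", "flag", "migration"
    summary: str


def changes_in_delta(changes, known_good, known_bad):
    """Return changes made after the last known-good time and up to the
    first known-bad time, oldest first."""
    in_window = [c for c in changes if known_good < c.when <= known_bad]
    return sorted(in_window, key=lambda c: c.when)


# Hypothetical events; none of these come from a real system.
events = [
    Change(datetime(2024, 5, 1, 9, 0), "commit", "refactor auth hooks"),
    Change(datetime(2024, 5, 3, 14, 0), "flag", "enable magic-link signup"),
    Change(datetime(2024, 5, 2, 11, 0), "deploy", "release api"),
    Change(datetime(2024, 5, 6, 8, 0), "commit", "update README"),
]

suspects = changes_in_delta(
    events,
    known_good=datetime(2024, 5, 1, 12, 0),
    known_bad=datetime(2024, 5, 4, 0, 0),
)
for change in suspects:
    print(change.when.isoformat(), change.kind, change.summary)
# 2024-05-02T11:00:00 deploy release api
# 2024-05-03T14:00:00 flag enable magic-link signup
```

If the filtered list comes back empty, that is consistent with the latent-condition case: the next place to look is what changed in the inputs.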
Start from the symptom. At each step, the answer becomes the next "why." Record the chain as a numbered list, with each item carrying the same three fields — narrative prose makes it easy to chain unverified claims fluently; explicit per-item fields force evidence per link.
1. Why did the request return 500?
   - Answer: The customers query returned 0 rows for an authenticated user.
   - Evidence: api/customers.ts:42 reads if (rows.length === 0) throw 500; the log for req-id=abc123 shows an empty result set.
2. Why did it return 0 rows?
   - Answer: The query filters by org_id, and the user's org_id was null.
   - Evidence: the query at api/customers.ts:38; psql> SELECT org_id FROM users WHERE id='…' → null.
3. Why was org_id null?
   - Answer: Signup does not backfill org_id for users created via the magic-link path.
   - Evidence: auth/magic-link.ts:71-95 has no call to assignOrg(); cf. auth/password.ts:60, which does call it.
4. Why doesn't it backfill?
   - Answer: The backfill hook is registered only for password signup.
   - Evidence: auth/hooks.ts:12 registers only on('password-signup', …).
5. Why is it wired only there?
   - Answer: The magic-link path was added separately (commit a1b2c3d) and the author didn't know the hook existed; no contract documented it.
   - Evidence: git log -S "magic-link" auth/; docs/auth.md has no mention of the hook contract.

Rules:
- Every Evidence field must cite a file:line, query result, log excerpt, or commit SHA. A blank or hand-wavy cell ("looks like…", "probably because…", "the author must have…") is itself a finding: mark the row UNVERIFIED, and either go get the evidence or stop the chain there. An unverified link cannot support the links below it.

Restate the chain, labelling each link as symptom, proximate cause, or root cause.
A fix at the proximate cause stops this failure. A fix at the root cause stops the class.
Both are legitimate — but the user should choose with eyes open. Present both options.
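The per-link evidence rule for the 5-whys chain can be sketched in Python as follows. This is an illustration under assumptions, not part of the skill: the `Link` record, the `HAND_WAVY` phrase list, and the `UNSUPPORTED` label for links sitting below an unverified one are all invented here, and the second link's evidence is deliberately weakened to show the effect.

```python
from dataclasses import dataclass

# Assumed markers of hand-wavy evidence; extend to taste.
HAND_WAVY = ("looks like", "probably", "must have")


@dataclass
class Link:
    why: str
    answer: str
    evidence: str = ""


def link_statuses(chain):
    """Label each link VERIFIED, UNVERIFIED, or UNSUPPORTED.

    A link is UNVERIFIED when its evidence is blank or hand-wavy. Every link
    below an UNVERIFIED one is UNSUPPORTED, whatever its own evidence says.
    """
    statuses = []
    broken = False
    for link in chain:
        text = link.evidence.strip().lower()
        weak = not text or any(phrase in text for phrase in HAND_WAVY)
        if broken:
            statuses.append("UNSUPPORTED")
        elif weak:
            statuses.append("UNVERIFIED")
            broken = True
        else:
            statuses.append("VERIFIED")
    return statuses


chain = [
    Link("Why did the request return 500?",
         "customers query returned 0 rows",
         "api/customers.ts:42; log req-id=abc123"),
    Link("Why did it return 0 rows?",
         "the user's org_id was null",
         "probably a bad migration"),  # hypothetical hand-wavy cell
    Link("Why was org_id null?",
         "magic-link signup never sets it",
         "auth/magic-link.ts:71-95"),
]

print(link_statuses(chain))
# ['VERIFIED', 'UNVERIFIED', 'UNSUPPORTED']
```

The third link cites a real-looking file range, yet it still cannot be treated as proven, because the link above it has not been verified.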
Then falsify the root cause. Per-link evidence proves each step; it does not prove the synthesis ("therefore X is the root cause"). A chain can be link-by-link verified and still synthesise to the wrong root — e.g. you found a real defect, but not the one that produced this symptom. Write down:
If a cheap disconfirming experiment exists, run it. If not, surface the falsifier as a known limitation — "we believe X is the root cause; we have not ruled out Y." A hypothesis you can't even imagine disproving is religion, not analysis.
If the root cause is real, where else does it manifest?
grep -rn "<the function/condition>" .
git log -S"<distinctive token>"
Found siblings get reported alongside the primary. A "root cause" with zero siblings is suspicious — re-examine whether you stopped at a proximate cause.
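When the sweep results need to be reported as citations, a small script can produce them in file:line form. The sketch below is a rough Python equivalent of the grep above, not a replacement for it; the `sweep` helper and its default suffix list are assumptions made for illustration.

```python
import re
from pathlib import Path


def sweep(root, pattern, suffixes=(".ts", ".py")):
    """Return 'path:line: text' citations for every line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files outside the chosen source suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                relative = path.relative_to(root).as_posix()
                hits.append(f"{relative}:{number}: {line.strip()}")
    return hits
```

Calling `sweep(".", r"org_id\s*=")` from the repository root would list every assignment to org_id in the matching files, each in a form that can be pasted straight into the report next to the primary finding.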
Before proposing any fix, write down the assumptions the root-cause hypothesis and the eventual fix rest on, as a numbered list with the same three fields per item. Validating-as-you-go produces a list of "things that happen to be true" — listing first, then validating, surfaces the assumptions you actually depend on.
1. Assumption: assignOrg() is the only mechanism that sets org_id.
   - Evidence: grep -rn "org_id\s*=" . shows that only assignOrg() and the seed script write it.
2. Assumption: All magic-link users have org_id = null (not just the reported one).
   - Evidence: psql> SELECT count(*) FROM users WHERE signup_method='magic_link' AND org_id IS NULL → 1,247.
3. Assumption: Adding assignOrg() to the magic-link path won't double-assign for users who already have an org.
   - Evidence: assignOrg() is idempotent; it early-returns when org_id is already set.
4. Assumption: The hook contract is the right enforcement layer (vs. a DB constraint).

Rules:
- If the fix requires an UNVERIFIED assumption to be true, it's a hypothesis dressed as a fix; say so explicitly when proposing it.

Two distinct proposals, separated: a symptom fix and a root-cause fix.
For each: files, approach, test coverage, blast radius, rollback story.
Recommend one. The recommendation depends on urgency, risk, and whether the symptom fix would mask the cause from future detection.
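Before a recommendation is made, the assumptions listed ahead of the fix can be checked mechanically. The Python below is one possible sketch of "list first, then validate": the `Assumption` record, the `FALSIFIED` status, and the stand-in check are hypothetical, and a real check would run the grep or query cited as evidence.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assumption:
    claim: str
    check: Optional[Callable[[], bool]] = None  # None: no check exists yet
    status: str = "UNVERIFIED"


def validate(ledger):
    """Run each assumption's check, in listed order, and record the outcome."""
    for item in ledger:
        if item.check is None:
            item.status = "UNVERIFIED"
        else:
            item.status = "VERIFIED" if item.check() else "FALSIFIED"
    return ledger


def fix_is_hypothesis(ledger):
    """A fix resting on any assumption that is not VERIFIED is still a hypothesis."""
    return any(item.status != "VERIFIED" for item in ledger)


# Stand-in for a grep result; hypothetical data.
writers_found = {"assignOrg", "seed script"}

ledger = [
    Assumption(
        "assignOrg() is the only mechanism that sets org_id",
        check=lambda: writers_found <= {"assignOrg", "seed script"},
    ),
    Assumption("The hook contract is the right enforcement layer"),
]

validate(ledger)
print([item.status for item in ledger])  # ['VERIFIED', 'UNVERIFIED']
print(fix_is_hypothesis(ledger))         # True
```

Writing the full ledger before running any check is the point of the design: an assumption with no check stays visible as UNVERIFIED instead of silently dropping out.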
Close the loop between cause and fix. For each proposed fix (especially the root-cause one — the symptom fix is trivially counterfactual-true by construction), walk it through the captured failure and answer:
Land on one of:
A fix that can't survive its own counterfactual is a hypothesis dressed as a fix. Say so when proposing it, or pick a different fix.
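One way to run the counterfactual is to replay the captured failing input through the code path before and after the fix. The sketch below does this in Python with stand-ins for the magic-link example; `assign_org`, both signup functions, `symptom_occurs`, and the captured input are all hypothetical simplifications, not the real code.

```python
def signup_magic_link_before(user):
    """Hypothetical pre-fix path: never assigns an org."""
    return dict(user)


def assign_org(user, default_org="org-default"):
    """Hypothetical idempotent helper: leaves an existing org_id untouched."""
    if user.get("org_id") is None:
        user = {**user, "org_id": default_org}
    return user


def signup_magic_link_after(user):
    """Hypothetical post-fix path: the old path plus the org assignment."""
    return assign_org(signup_magic_link_before(user))


def symptom_occurs(user):
    """Stand-in for the captured failure: a null org_id makes the customers
    query return no rows, which the endpoint turned into a 500."""
    return user.get("org_id") is None


# The input captured from the failing request (hypothetical values).
captured_input = {"id": "u1", "signup_method": "magic_link", "org_id": None}

# The failure must reproduce without the fix...
assert symptom_occurs(signup_magic_link_before(captured_input))
# ...and must not occur with it.
assert not symptom_occurs(signup_magic_link_after(captured_input))
```

If the first assertion fails, the replay does not reproduce the failure and proves nothing about the fix; if the second fails, the fix does not survive its own counterfactual.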
For the root-cause fix, also propose:
- If AGENTS.md missed the constraint that would have flagged this, update it.

"Add a test for this exact bug" is the weakest prevention. "Make this class of bug impossible to express" is the strongest. Aim higher than the floor.
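To make "impossible to express" concrete, here is a hedged Python sketch in which a user record cannot be constructed without an org. The `OrgId` and `User` types are hypothetical; in a statically typed codebase the same idea would be a non-nullable field checked at compile time, whereas this version enforces it at construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgId:
    value: str

    def __post_init__(self):
        if not self.value:
            raise ValueError("org id must be non-empty")


@dataclass(frozen=True)
class User:
    id: str
    org_id: OrgId

    def __post_init__(self):
        # Every signup path has to build a User, so none can skip the org.
        if not isinstance(self.org_id, OrgId):
            raise TypeError("User requires an OrgId")


user = User("u2", OrgId("org-1"))
print(user.org_id.value)  # org-1

try:
    User("u1", None)  # the shape that caused the original failure
except TypeError as error:
    print("rejected:", error)  # rejected: User requires an OrgId
```

A regression test would catch the magic-link path specifically; a constructor like this one rejects any future signup path that forgets the org, which is the class-level guardrail the paragraph above asks for.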
Inverted pyramid:
When tempted to skip a step, check whether your reasoning appears below. If it does, the answer is: do the step.
| Rationalization | Why it fails here |
|---|---|
| "The fix is obvious, skip the RCA." | If the fix were truly obvious, the user wouldn't have invoked /rca. Obvious fixes go via /plan or a one-line edit. The invocation is the signal that depth is required. |
| "I can't reproduce it, but I'm pretty sure I know why." | Pretty-sure-without-repro is the modal cause of "the same bug came back next week." If you can't reproduce, the deliverable is "we need a repro / better instrumentation," not a guess dressed as a finding. |
| "Three whys is enough." | Three whys usually lands on a proximate cause. The whole point of five is to push past the comfortable stopping point. |
| "The symptom is the cause — fix and ship." | The symptom is evidence of the cause. Fix the symptom and the cause moves to its next manifestation. |
| "No siblings found, so the cause is local." | Or you stopped too early. Re-examine. A genuinely local root cause is possible but rare. |
| "Adding a test for this exact case is enough prevention." | That catches the regression, not the class. Ask whether a guardrail, type, or invariant could prevent the class from being expressible at all. |
| "The 5-whys chain reads fine, I don't need to verify each link." | A coherent narrative is not the same as a correct one. Verify each link against code, logs, or data. |
| "The cause is 'the original author didn't anticipate this' — done." | That's a no-op finding. The actionable root cause is: what process / type / test / doc would have made them anticipate it? |
| "RCA can wait, let's ship the fix first." | Fine, if the RCA has a deadline before the on-call rotation forgets. RCAs deferred indefinitely become RCAs never done. Set the deadline now. |
| "I'll just git bisect and call the offending commit the root cause." | The commit is where the cause was introduced, not what the cause is. Bisect locates the change; the 5-whys explains why the change was wrong and what allowed it through. |
The RCA is complete when all of these are true. Each item is answerable with evidence — a citation, a log excerpt, a commit SHA — not a vibe.
- [ ] Every UNVERIFIED link is explicitly flagged, and no link below an UNVERIFIED one is treated as proven.
- [ ] Every UNVERIFIED assumption is surfaced as an open question; the recommended fix does not silently depend on one.

If a checkbox cannot be ticked honestly, the RCA is not done; return to the step that produces it.
- /blueprint: for designing a change. Use after RCA when the root-cause fix is non-trivial enough to warrant deep planning.
- /plan: for the symptom fix when it's small and obvious.
- /rebase: unrelated; named here only so future-you doesn't conflate "regression after rebase" with "rebase failure." Regression after rebase → /rca; merge-conflict failure → /rebase.
- docs/guidelines.md (per the repo-docs skill) is this skill, applied to a regression specifically.

npx claudepluginhub korya/swd-skills --plugin swd