From dev-skills
One iteration of speculative bug-hunting — audit codebase against project invariants and risk patterns via parallel agents, form hypothesis, prove with failing test, fix at root cause, hand off to /ship, run /retro, merge, declare the outcome. Use when invoked as `/bughunt` (runs one iteration); use `/loop bughunt` to run it as an autonomous loop. Pass `--auto` to merge automatically once CI is green; default is to wait for the user to merge in the GitHub UI. Project-agnostic; reads risk dimensions from CLAUDE.md key-invariants and project memory.
How this skill is triggered — by the user, by Claude, or both
Slash command
/dev-skills:bughuntThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You run **one iteration**: scan the codebase for risk patterns, form a specific bug hypothesis, prove it with a red test, fix at root cause, ship, retro, merge (or hand to the user), and declare the outcome. `/loop bughunt` runs this on an autonomous loop.
You run one iteration: scan the codebase for risk patterns, form a specific bug hypothesis, prove it with a red test, fix at root cause, ship, retro, merge (or hand to the user), and declare the outcome. /loop bughunt runs this on an autonomous loop.
This skill is a thin specialization over the shared build-cycle skeleton. Read dev-skills:build-cycle for the shared iteration phases (branching, /ship handoff, /retro, merge, declaring the outcome). Looping — cadence, idle escalation, re-arming — is owned by /loop. This file owns:
Everything else — flags, branching, /ship invocation, /retro pass, merge handling, and the closing LOOP-OUTCOME — lives in build-cycle; the loop that re-runs this iteration lives in /loop.
(See build-cycle for the shared invariants. These are additional to those.)
CLAUDE.md (especially "Key invariants" / "How to approach changes") and project memory before scoping risk patterns. Project-stated invariants are usually the highest-yield audit dimensions because the project has already named them.Dispatch multiple agents in parallel (single message, multiple Agent tool calls), each scanning a distinct risk dimension. Each agent returns ranked candidate hypotheses; you synthesize.
Read CLAUDE.md for "Key invariants", "Invariants", "Correctness rules", or similar. Read project memory for feedback_* and project_* files mentioning past bugs (e.g., "phase 2 lessons", "incident postmortem"). Past bugs cluster — the same shape often recurs in adjacent code. Pass these excerpts into each agent's prompt.
The standard set (pick what fits the codebase):
gather() losing partial results on one failure.try without finally; tasks created without registration in the cleanup registry.Any boundaries — places where mypy infers Any (dict accesses on JSON, raw fetches, untyped third-party returns) and downstream code assumes a specific shape.except Exception: blocks that don't re-raise; try/except around code where a failure should propagate; broad excepts in projects with a "fail hard" stance.now() called inside a transaction returning a different value than elsewhere.git log the last ~10–15 feat: commits and audit each newly-added code path for feature-logic errors: wrong field defaults, validation gaps, account-scoping omitted on a new endpoint, request/response-model drift, a partial / "stage N" migration leaving a mixed encode-decode state, a stale assumption a later commit silently falsified. Have the finder git show the actual merge commit, not just the file. On a mature, heavily-swept codebase this is usually the highest-yield dimension — the invariant/risk-pattern dimensions above are the most-audited code in the tree, so the freshest un-audited surface is the recent diff and the peripheral subsystems (connectors, CLI, MCP, sandbox internals) that get less bughunt attention than the core.attach but not on a sibling bind. Method: grep for the guard/helper/constant, then read every caller and every sibling that should route through it but open-codes a partial version instead. This pairs naturally with dimension 12 — a guard added in a recent commit is the prime candidate to have missed a sibling.CLAUDE.md invariants and feedback_* memory excerpts.(reachability × severity × provability) ÷ fix-scope scores, file:line references, and the test sketch per finding.Use subagent_type: general-purpose. Run all agents parallel via a single message with multiple Agent tool calls. If invoked with --model=<value> (see build-cycle flags), include model: "<value>" on each Agent call so every audit subagent runs on the chosen Claude generation; otherwise omit model: and let them inherit the session's model.
Collect ranked lists. Cross-agent corroboration amplifies score. Provability is a hard filter at this stage — anything below ~0.6 doesn't make the synthesis cut. Speculation without a path to a red test is wasted iteration.
If the invariant/risk-pattern dimensions (1–11) come up empty on a mature codebase, pivot to the recently-merged-feature surface (dimension 12) and the peripheral subsystems before declaring empty. An empty invariant sweep on a heavily-audited core is a true negative about the wrong surface, not about the codebase — the bug, if there is one, is most likely in code merged since the last sweep. Run a second audit round targeted there (have finders surface their single strongest lead even when sub-threshold, so the synthesis sees near-misses rather than a bare empty set) before you conclude there's nothing to ship.
If the audit — after that pivot — produces no hypothesis with provability ≥ 0.6, declare LOOP-OUTCOME: empty (build-cycle Phase 7) and end the iteration.
Apply build-cycle's quality-gate principle. For bughunt, the bar is:
A candidate clears the gate only if all are true:
If a candidate is borderline ("input feels reachable but I'd need to confirm"), present 2–3 candidates to the user via AskUserQuestion with reachability evidence for each.
If no candidate clears the gate: report what was found and why each was rejected, then declare LOOP-OUTCOME: gate_killed. Don't ship a speculative fix to a theoretically-reachable but practically-impossible bug.
Write a test that reaches the suspect code path with the input that triggers the hypothesis, and asserts the correct outcome. Run it. It must fail.
Confirm the failure mode is the predicted one. A test that fails for the wrong reason is worse than no test:
ImportError → test is broken; fix the test.KeyError upstream of the bug → input shape is wrong; adjust until it reaches the suspect line.If the test passes — the bug doesn't exist (or the hypothesis was wrong). Discard the candidate. Skip silently and pick the next one from the synthesis. Don't write a tracking issue, don't surface to the user.
If you've burned through three candidates this iteration without a red test, the audit is mis-calibrated. Surface via AskUserQuestion: "Three candidates this iteration didn't produce red tests — keep going at deeper audit, switch dimensions, or stop?"
With the test red for the right reason, find the smallest fix that flips it green without rewriting unrelated logic.
/kaizen candidates.Re-run the failing test. Confirm it passes. Run the full test suite to confirm no other tests went red.
If the fix breaks other tests, investigate before adjusting them. A test relying on the buggy behavior is itself a bug; fix it properly.
/ship defaultsWhen invoking dev-skills:ship (build-cycle Phase 4):
--commit-type=fix (almost always; refactor only if the fix is structural with no observable behavior delta).--issue=<N> if the bug was already filed (search first: gh issue list --search "<key symptom>").(Additive to build-cycle's shared boundaries.)
AskUserQuestion when (in addition to build-cycle's shared triggers):
/ship. (Possible deeper root cause; revisit Phase 3 fix step.)When in doubt, ask. A wrong fix is worse than a missed bug.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub eumemic/claude-plugins --plugin dev-skills