From devenv-workflow
Autonomous test-driven debugging loop with hypothesis-test-revert discipline. Captures baseline, categorizes failures, and iterates with structured comparison.
Slash command
/devenv-workflow:debug-loop
A structured debugging protocol that replaces ad-hoc test fixing with a disciplined baseline-compare-iterate loop. Uses test suites as ground truth to prevent destructive shortcuts.
The debug loop addresses the "wrong assumptions leading to wasted work" pattern by enforcing discipline through three mechanisms:
| Mechanism | Role | Location |
|---|---|---|
| Formula | Step ordering, iteration budget | .beads/formulas/debug-loop.formula.toml |
| Skill scripts | Data capture, comparison, reporting | .claude/skills/debug-loop/scripts/ |
| Hook | Real-time regression detection | .claude/hooks/debug-regression-guard.ts |
.claude/skills/debug-loop/
SKILL.md # This file
scripts/
baseline.ts # Capture test baseline
categorize.ts # Classify failures by category
compare.ts # Compare results, determine verdict
report.ts # Generate markdown report
types/
test-results.ts # TypeScript interfaces
schemas/
baseline.schema.json # JSON Schema for baseline data
tests/
categorize.test.ts # Categorization tests
compare.test.ts # Comparison tests
.beads/formulas/
debug-loop.formula.toml # Formula definition
.debug/ # Runtime data (gitignored)
baseline.json # Current baseline
bd mol wisp debug-loop --var test_command="bun test" --var max_iterations=3
| Step | Description | Script |
|---|---|---|
| capture-baseline | Run tests, parse output, write .debug/baseline.json | baseline.ts |
| categorize-failures | Group failures by type, plan fix order | categorize.ts |
| fix-iteration-N | Hypothesis-test-compare-decide cycle | compare.ts |
| final-report | Generate before/after markdown report | report.ts |
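
The step ordering and iteration budget in the table can be illustrated with a small control-flow sketch. The real ordering is enforced by the formula, not by code like this. `runDebugLoop`, `AttemptResult`, `IterationLog`, and the `needs-review` action are hypothetical names introduced only for illustration.

```typescript
type Verdict = "improved" | "same" | "worse";

// Hypothetical result of one fix-iteration-N step.
interface AttemptResult {
  verdict: Verdict;
  remainingFailures: number;
}

interface IterationLog {
  iteration: number;
  verdict: Verdict;
  action: "kept" | "reverted" | "needs-review";
}

// Runs at most `maxIterations` attempts and stops early once no failures remain.
// `attempt` is a hypothetical callback standing in for one hypothesis-test-compare cycle.
function runDebugLoop(
  initialFailures: number,
  maxIterations: number,
  attempt: (iteration: number) => AttemptResult,
): IterationLog[] {
  const log: IterationLog[] = [];
  let failures = initialFailures;
  for (let i = 1; i <= maxIterations && failures > 0; i++) {
    const { verdict, remainingFailures } = attempt(i);
    if (verdict === "worse") {
      // A revert restores the previous state, so the failure count is unchanged.
      log.push({ iteration: i, verdict, action: "reverted" });
    } else if (verdict === "improved") {
      failures = remainingFailures;
      log.push({ iteration: i, verdict, action: "kept" });
    } else {
      // "same": the change is kept only if it is judged correct on its own merits.
      log.push({ iteration: i, verdict, action: "needs-review" });
    }
  }
  return log;
}
```

With `max_iterations=3`, a loop like this would make at most three attempts, and fewer if the failure count reaches zero first.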
# Capture a baseline
bun run .claude/skills/debug-loop/scripts/baseline.ts --test-command "bun test"
# Categorize failures (imported as library)
import { categorizeAll, formatPrioritizedPlan } from './scripts/categorize';
# Compare against baseline (imported as library)
import { readBaseline, compareResults, formatComparison } from './scripts/compare';
# Generate report (imported as library)
import { generateReport, buildReport } from './scripts/report';
| Category | Priority | Signals | Action |
|---|---|---|---|
| compile | 1 (fix first) | error CS####, SyntaxError, Cannot find module | Fix build errors first -- they cascade |
| runtime | 2 | NullReferenceException, TypeError | Fix unexpected crashes |
| assertion | 3 | Assert.Equal, expect().toBe() | Fix wrong outputs |
| infrastructure | 4 (fix last) | ECONNREFUSED, ETIMEDOUT | Environmental -- may not be fixable in code |
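
One way to read the table above is as an ordered list of patterns, checked from highest to lowest priority. The following is a minimal sketch under that assumption. `classifyMessage` and `SIGNALS` are hypothetical names, and the actual logic in categorize.ts may use different patterns or data structures.

```typescript
type Category = "compile" | "runtime" | "assertion" | "infrastructure";

// Signals copied from the category table, listed in priority order (1 = fix first).
const SIGNALS: Array<{ category: Category; priority: number; pattern: RegExp }> = [
  { category: "compile", priority: 1, pattern: /error CS\d{4}|SyntaxError|Cannot find module/ },
  { category: "runtime", priority: 2, pattern: /NullReferenceException|TypeError/ },
  { category: "assertion", priority: 3, pattern: /Assert\.Equal|expect\(.*\)\.toBe\(/ },
  { category: "infrastructure", priority: 4, pattern: /ECONNREFUSED|ETIMEDOUT/ },
];

// Returns the first (highest-priority) category whose pattern matches,
// or undefined when the message carries none of the known signals.
function classifyMessage(message: string): Category | undefined {
  return SIGNALS.find((s) => s.pattern.test(message))?.category;
}
```

Checking in priority order means a message that carries both a compile signal and a runtime signal is treated as a compile failure, which matches the "fix build errors first" guidance.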
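Each fix iteration follows a hypothesis-test-compare-decide protocol: state a hypothesis, apply the change, rerun the tests, compare the results against the baseline, and then either keep the change or revert it (git checkout -- .). The comparison verdict decides the action:
| Condition | Verdict | Action |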
|---|---|---|
| New failures appeared | worse | MANDATORY revert |
| Failure count increased | worse | MANDATORY revert |
| Some tests fixed, none regressed | improved | Keep and commit |
| Failure count decreased, none new | improved | Keep and commit |
| No change in failure set | same | Keep only if correct |
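
The verdict rules can be expressed as a set comparison between the baseline failures and the current failures. This is a sketch only: the field names `newFailures` and `fixedTests` follow the TestDelta description later in this document, while `computeDelta`, `decideVerdict`, and the assumption that tests are identified by unique name strings are illustrative and may not match compare.ts.

```typescript
type Verdict = "improved" | "same" | "worse";

interface Delta {
  newFailures: string[]; // failing now, not failing in the baseline
  fixedTests: string[];  // failing in the baseline, not failing now
}

// Compares two lists of failing test names (assumed unique).
function computeDelta(baselineFailures: string[], currentFailures: string[]): Delta {
  const before = new Set(baselineFailures);
  const after = new Set(currentFailures);
  return {
    newFailures: currentFailures.filter((name) => !before.has(name)),
    fixedTests: baselineFailures.filter((name) => !after.has(name)),
  };
}

// Any new failure makes the verdict "worse", even if other tests were fixed.
function decideVerdict(delta: Delta): Verdict {
  if (delta.newFailures.length > 0) return "worse";
  if (delta.fixedTests.length > 0) return "improved";
  return "same";
}

const delta = computeDelta(["a", "b"], ["b", "c"]);
console.log(decideVerdict(delta)); // prints "worse": "a" was fixed, but "c" is a new failure
```

Checking for new failures first is what makes the revert mandatory: a change that fixes one test while breaking another never counts as an improvement.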
## Debug Loop Report
### Baseline
- Total: 185 | Passed: 120 | Failed: 65
- Categories: compile(10) runtime(8) assertion(45) infrastructure(2)
### Iterations
| # | Hypothesis | Change | Result | Action |
|---|-----------|--------|--------|--------|
| 1 | Missing null check | Added guard | improved (+3 passing) | kept |
| 2 | Wrong date format | Fixed format | worse (-2 passing) | reverted |
| 3 | Stale DI registration | Updated reg | improved (+5 passing) | kept |
### Final State
- Total: 185 | Passed: 128 | Failed: 57
- Net improvement: +8 passing tests
### Unresolved
- assertion(39): OrderServiceTests...
- infrastructure(2): DbConnectionTests...
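
The numbers in the example report follow from counting only the kept iterations, since a reverted change leaves the test results where they were. A short worked check, using a hypothetical `IterationRow` shape that is not necessarily what report.ts uses:

```typescript
interface IterationRow {
  hypothesis: string;
  passingDelta: number; // change in passing tests observed for this iteration
  action: "kept" | "reverted";
}

// Only kept iterations contribute; reverted ones are undone before the next attempt.
function finalPassed(baselinePassed: number, rows: IterationRow[]): number {
  return rows
    .filter((row) => row.action === "kept")
    .reduce((sum, row) => sum + row.passingDelta, baselinePassed);
}

const rows: IterationRow[] = [
  { hypothesis: "Missing null check", passingDelta: 3, action: "kept" },
  { hypothesis: "Wrong date format", passingDelta: -2, action: "reverted" },
  { hypothesis: "Stale DI registration", passingDelta: 5, action: "kept" },
];

const total = 185;
const passed = finalPassed(120, rows);
console.log(passed);         // 128
console.log(total - passed); // 57
console.log(passed - 120);   // 8
```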
The close-task formula's verify-tests step suggests using the debug loop when tests fail. Pour the debug-loop wisp, complete it, then return to close-task.
Debug loop state (.debug/baseline.json) persists on disk and survives compaction. Active wisp steps are tracked by beads.
Each successful iteration is committed (debug: <hypothesis>). Failed iterations are reverted (git checkout -- .). This creates a clean history where each debug commit is a verified improvement.
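
The keep-or-revert rule can be sketched as a mapping from verdict to git commands. `gitCommandsFor` is a hypothetical helper, and staging with `git add -A` before the commit is an assumption, since the source only specifies the commit message format and the revert command.

```typescript
type Verdict = "improved" | "same" | "worse";

// Returns the git invocations (as argv arrays) for a given verdict.
// Nothing is executed here; a caller would pass each array to a process runner.
function gitCommandsFor(verdict: Verdict, hypothesis: string): string[][] {
  switch (verdict) {
    case "worse":
      // Mandatory revert of tracked working-tree changes.
      return [["git", "checkout", "--", "."]];
    case "improved":
      // Assumption: stage everything, then commit with the documented message format.
      return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", `debug: ${hypothesis}`],
      ];
    case "same":
      // Kept only if correct, which needs a judgment call, so no automatic command.
      return [];
  }
}
```

Note that `git checkout -- .` restores tracked files only. Files newly created during a failed iteration stay in the working tree and would need separate cleanup.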
import { captureBaseline } from './scripts/baseline';
const baseline = await captureBaseline("bun test", projectRoot);
// baseline: TestBaseline with total, passed, failed, categories, failures
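
For orientation, the baseline fields named in the comment above could look like the following. This is an assumed shape, not the interface defined in types/test-results.ts. `TestBaselineSketch` and `checkBaseline` are hypothetical, and the consistency checks assume every test either passes or fails, with no skipped tests.

```typescript
type Category = "compile" | "runtime" | "assertion" | "infrastructure";

// Assumed shape only; field types are guesses based on the field names.
interface TestBaselineSketch {
  total: number;
  passed: number;
  failed: number;
  categories: Record<Category, number>;
  failures: Array<{ name: string; category: Category }>;
}

// Returns a list of internal inconsistencies (empty when the counts agree).
function checkBaseline(b: TestBaselineSketch): string[] {
  const problems: string[] = [];
  if (b.passed + b.failed !== b.total) {
    problems.push("passed + failed does not equal total");
  }
  const categorized = Object.values(b.categories).reduce((sum, n) => sum + n, 0);
  if (categorized !== b.failed) {
    problems.push("category counts do not sum to failed");
  }
  return problems;
}
```

Applied to the example report's baseline (total 185, passed 120, failed 65, categories 10 + 8 + 45 + 2), both checks pass.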
import { categorizeAll, groupByCategory, formatPrioritizedPlan } from './scripts/categorize';
const failures = categorizeAll(testResults);
const groups = groupByCategory(failures); // Map in priority order
const plan = formatPrioritizedPlan(failures); // Human-readable plan
import { readBaseline, compareResults, formatComparison } from './scripts/compare';
const baseline = await readBaseline(projectRoot);
const result = compareResults(baseline, currentSummary, currentFailures);
// result.verdict: "improved" | "same" | "worse"
// result.delta: TestDelta with newFailures, fixedTests
const output = formatComparison(baseline, currentSummary, result);
import { generateReport, buildReport } from './scripts/report';
const report = buildReport(baseline, iterations, finalSummary, finalDelta, remaining);
const markdown = generateReport(report);
npx claudepluginhub jesposito/ai-nme-marketplace --plugin devenv-workflow