Skill

debug-loop

Autonomous test-driven debugging loop with hypothesis-test-revert discipline. Captures baseline, categorizes failures, and iterates with structured comparison.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/devenv-workflow:debug-loop

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

A structured debugging protocol that replaces ad-hoc test fixing with a disciplined baseline-compare-iterate loop. Uses test suites as ground truth to prevent destructive shortcuts.

Supporting Files

schemas/baseline.schema.json
scripts/baseline.ts
scripts/categorize.ts
scripts/compare.ts
scripts/report.ts
tests/categorize.test.ts
tests/compare.test.ts
tests/report.test.ts
types/test-results.ts

SKILL.md

196 lines · ~1.7k tokens

Stats

Language: TypeScript

Stars: 0

Maintenance: Fair

Last Commit: Feb 18, 2026

Debug Loop Skill

A structured debugging protocol that replaces ad-hoc test fixing with a disciplined baseline-compare-iterate loop. Uses test suites as ground truth to prevent destructive shortcuts.

Overview

The debug loop addresses the "wrong assumptions leading to wasted work" pattern by enforcing:

Baseline Capture -- Record test state before any changes
Failure Categorization -- Classify failures by type and priority
Bounded Iterations -- Fix with hypothesis tracking and revert-if-worse
Before/After Reporting -- Proof that debugging actually improved things

Architecture

Mechanism	Role	Location
Formula	Step ordering, iteration budget	`.beads/formulas/debug-loop.formula.toml`
Skill scripts	Data capture, comparison, reporting	`.claude/skills/debug-loop/scripts/`
Hook	Real-time regression detection	`.claude/hooks/debug-regression-guard.ts`

Directory Structure

.claude/skills/debug-loop/
  SKILL.md                          # This file
  scripts/
    baseline.ts                     # Capture test baseline
    categorize.ts                   # Classify failures by category
    compare.ts                      # Compare results, determine verdict
    report.ts                       # Generate markdown report
  types/
    test-results.ts                 # TypeScript interfaces
  schemas/
    baseline.schema.json            # JSON Schema for baseline data
  tests/
    categorize.test.ts              # Categorization tests
    compare.test.ts                 # Comparison tests
    report.test.ts                  # Report tests

.beads/formulas/
  debug-loop.formula.toml           # Formula definition

.debug/                             # Runtime data (gitignored)
  baseline.json                     # Current baseline

Usage

Starting a Debug Loop

bd mol wisp debug-loop --var test_command="bun test" --var max_iterations=3

Formula Steps

Step	Description	Script
`capture-baseline`	Run tests, parse output, write `.debug/baseline.json`	`baseline.ts`
`categorize-failures`	Group failures by type, plan fix order	`categorize.ts`
`fix-iteration-N`	Hypothesis-test-compare-decide cycle	`compare.ts`
`final-report`	Generate before/after markdown report	`report.ts`

Manual Script Usage

# Capture a baseline (shell)
bun run .claude/skills/debug-loop/scripts/baseline.ts --test-command "bun test"

The remaining scripts are imported as libraries from TypeScript:

// Categorize failures
import { categorizeAll, formatPrioritizedPlan } from './scripts/categorize';

// Compare against baseline
import { readBaseline, compareResults, formatComparison } from './scripts/compare';

// Generate report
import { generateReport, buildReport } from './scripts/report';

Failure Categories

Category	Priority	Signals	Action
compile	1 (fix first)	`error CS####`, `SyntaxError`, `Cannot find module`	Fix build errors first -- they cascade
runtime	2	`NullReferenceException`, `TypeError`	Fix unexpected crashes
assertion	3	`Assert.Equal`, `expect().toBe()`	Fix wrong outputs
infrastructure	4 (fix last)	`ECONNREFUSED`, `ETIMEDOUT`	Environmental -- may not be fixable in code
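The table above can be read as an ordered list of pattern rules. The following is a minimal sketch of that idea, not the contents of `categorize.ts`: the names `CategoryRule`, `RULES`, and `categorize` are hypothetical, the regular expressions are one possible encoding of the listed signals, and returning `null` for unmatched messages is an assumption.

```typescript
// Hypothetical names; the skill's real categorize.ts may be structured differently.
type FailureCategory = "compile" | "runtime" | "assertion" | "infrastructure";

interface CategoryRule {
  category: FailureCategory;
  priority: number; // 1 = fix first, 4 = fix last
  signals: RegExp[];
}

// Signals taken from the table above, listed in priority order.
const RULES: CategoryRule[] = [
  { category: "compile", priority: 1, signals: [/error CS\d{4}/, /SyntaxError/, /Cannot find module/] },
  { category: "runtime", priority: 2, signals: [/NullReferenceException/, /TypeError/] },
  { category: "assertion", priority: 3, signals: [/Assert\.Equal/, /expect\(.*\)\.toBe\(/] },
  { category: "infrastructure", priority: 4, signals: [/ECONNREFUSED/, /ETIMEDOUT/] },
];

// Return the first rule with a signal that appears in the failure message.
// Unmatched messages yield null (an assumption of this sketch).
function categorize(message: string): CategoryRule | null {
  for (const rule of RULES) {
    if (rule.signals.some((signal) => signal.test(message))) {
      return rule;
    }
  }
  return null;
}
```

Because the rules are checked in priority order, a message that carries both a compile signal and a runtime signal is filed under compile, which matches the "fix build errors first" guidance.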

Iteration Protocol

Each fix iteration follows this protocol:

HYPOTHESIZE -- State what you believe causes the failure
SCOPE -- Identify the minimal files to change
CHANGE -- Make the smallest change that tests the hypothesis
VERIFY -- Run the test command
COMPARE -- Check if results improved, stayed same, or worsened
DECIDE -- Keep changes (commit) or revert (git checkout -- .)
RECORD -- Document hypothesis and outcome
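The protocol above can be sketched as a bounded loop with injected side effects. This is an illustration under stated assumptions rather than the formula's actual behavior: `Attempt`, `LoopDeps`, `compareFailures`, and `runDebugLoop` are hypothetical names, each kept change becomes the reference for the next comparison, and a "same" verdict is reverted for simplicity.

```typescript
// Hypothetical names throughout; this is not the skill's actual implementation.
type Verdict = "improved" | "same" | "worse";

interface Attempt {
  hypothesis: string; // HYPOTHESIZE
  apply: () => void;  // SCOPE + CHANGE: the smallest edit that tests the hypothesis
}

interface LoopDeps {
  runTests: () => string[];          // VERIFY: names of currently failing tests
  commit: (message: string) => void; // DECIDE: keep
  revert: () => void;                // DECIDE: discard
}

interface IterationRecord {
  hypothesis: string;
  verdict: Verdict;
  action: "kept" | "reverted";
}

// COMPARE: any new failure is "worse"; otherwise fewer failures is "improved".
function compareFailures(before: string[], after: string[]): Verdict {
  const known = new Set(before);
  if (after.some((name) => !known.has(name))) return "worse";
  return after.length < before.length ? "improved" : "same";
}

function runDebugLoop(attempts: Attempt[], maxIterations: number, deps: LoopDeps): IterationRecord[] {
  const records: IterationRecord[] = [];
  let reference = deps.runTests(); // failing set before any change
  for (const attempt of attempts.slice(0, maxIterations)) {
    attempt.apply();
    const current = deps.runTests();
    const verdict = compareFailures(reference, current);
    if (verdict === "improved") {
      deps.commit(`debug: ${attempt.hypothesis}`);
      reference = current; // later attempts are judged against the kept state
      records.push({ hypothesis: attempt.hypothesis, verdict, action: "kept" }); // RECORD
    } else {
      deps.revert(); // "worse" must revert; reverting "same" is this sketch's choice
      records.push({ hypothesis: attempt.hypothesis, verdict, action: "reverted" }); // RECORD
    }
  }
  return records;
}
```

Passing `runTests`, `commit`, and `revert` in as functions keeps the loop testable without a real test runner or repository.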

Verdict Logic

Condition	Verdict	Action
New failures appeared	worse	MANDATORY revert
Failure count increased	worse	MANDATORY revert
Some tests fixed, none regressed	improved	Keep and commit
Failure count decreased, none new	improved	Keep and commit
No change in failure set	same	Keep only if the change is correct on its own merits; otherwise revert
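As a rough illustration of the table above, the verdict can be derived from two lists of failing test names. The field names `newFailures` and `fixedTests` follow the `TestDelta` fields mentioned in the Scripts API; `computeDelta` and `decideVerdict` are hypothetical helpers, not the signature of `compareResults`.

```typescript
// Hypothetical helpers; compare.ts may compute its verdict differently.
type Verdict = "improved" | "same" | "worse";

interface Delta {
  newFailures: string[]; // failing now, not failing in the baseline
  fixedTests: string[];  // failing in the baseline, not failing now
}

function computeDelta(baselineFailures: string[], currentFailures: string[]): Delta {
  const before = new Set(baselineFailures);
  const after = new Set(currentFailures);
  return {
    newFailures: currentFailures.filter((name) => !before.has(name)),
    fixedTests: baselineFailures.filter((name) => !after.has(name)),
  };
}

// Rows of the table are checked top to bottom, so "worse" always wins.
// With unique test names the two count checks are implied by the set checks;
// they are kept only to mirror the table row by row.
function decideVerdict(baselineFailures: string[], currentFailures: string[]): Verdict {
  const delta = computeDelta(baselineFailures, currentFailures);
  if (delta.newFailures.length > 0) return "worse";
  if (currentFailures.length > baselineFailures.length) return "worse";
  if (delta.fixedTests.length > 0) return "improved";
  if (currentFailures.length < baselineFailures.length) return "improved";
  return "same";
}
```

Note that a change which fixes one test and breaks another is still "worse" here: trading failures is treated as a regression, not as progress.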

Report Format

## Debug Loop Report

### Baseline
- Total: 185 | Passed: 120 | Failed: 65
- Categories: compile(10) runtime(8) assertion(45) infrastructure(2)

### Iterations
| # | Hypothesis | Change | Result | Action |
|---|-----------|--------|--------|--------|
| 1 | Missing null check | Added guard | improved (+3 passing) | kept |
| 2 | Wrong date format | Fixed format | worse (-2 passing) | reverted |
| 3 | Stale DI registration | Updated reg | improved (+5 passing) | kept |

### Final State
- Total: 185 | Passed: 128 | Failed: 57
- Net improvement: +8 passing tests

### Unresolved
- assertion(39): OrderServiceTests...
- infrastructure(2): DbConnectionTests...
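A report in the format above could be assembled along these lines. This is a simplified sketch that covers only the Baseline, Iterations, and Final State sections; `Summary`, `IterationRow`, and `renderReport` are hypothetical names and do not reflect the real signatures of `buildReport` or `generateReport`.

```typescript
// Hypothetical types; report.ts defines its own report structure.
interface Summary {
  total: number;
  passed: number;
  failed: number;
}

interface IterationRow {
  hypothesis: string;
  change: string;
  result: string;
  action: "kept" | "reverted";
}

function summaryLine(s: Summary): string {
  return `- Total: ${s.total} | Passed: ${s.passed} | Failed: ${s.failed}`;
}

// Render the before/after sections as markdown, numbering iterations from 1.
function renderReport(baseline: Summary, rows: IterationRow[], final: Summary): string {
  const net = final.passed - baseline.passed;
  const signedNet = net >= 0 ? `+${net}` : `${net}`;
  const lines: string[] = [
    "## Debug Loop Report",
    "",
    "### Baseline",
    summaryLine(baseline),
    "",
    "### Iterations",
    "| # | Hypothesis | Change | Result | Action |",
    "|---|-----------|--------|--------|--------|",
    ...rows.map((r, i) => `| ${i + 1} | ${r.hypothesis} | ${r.change} | ${r.result} | ${r.action} |`),
    "",
    "### Final State",
    summaryLine(final),
    `- Net improvement: ${signedNet} passing tests`,
  ];
  return lines.join("\n");
}
```

The net improvement is computed from the two summaries rather than summed from the rows, so reverted iterations cannot skew it.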

Integration

With close-task Formula

The close-task formula's verify-tests step suggests using the debug loop when tests fail. Pour the debug-loop wisp, complete it, then return to close-task.

With Session State

Debug loop state (.debug/baseline.json) persists on disk, so it survives context compaction. Active wisp steps are tracked by beads.

With Git

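Each successful iteration is committed with the message debug: <hypothesis>. Failed iterations are reverted with git checkout -- ., which restores tracked files only; files created during the iteration remain untracked and must be deleted separately. The result is a clean history in which each debug commit is a verified improvement.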
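The commit-or-revert decision can be expressed as a mapping from verdict to git commands. The sketch below only builds argument lists and does not run git. `gitCommandsFor` is a hypothetical helper, staging with `git add -A` is an assumption (the skill may stage files more selectively), and a "same" verdict returns no commands because that case is left to judgment.

```typescript
// Hypothetical helper; it builds commands but leaves execution to the caller.
type Verdict = "improved" | "same" | "worse";

function gitCommandsFor(verdict: Verdict, hypothesis: string): string[][] {
  if (verdict === "improved") {
    return [
      ["git", "add", "-A"], // assumption: stage everything the iteration touched
      ["git", "commit", "-m", `debug: ${hypothesis}`],
    ];
  }
  if (verdict === "worse") {
    return [["git", "checkout", "--", "."]]; // mandatory revert of tracked files
  }
  return []; // "same": keep or revert is decided by review, not automatically
}
```

Keeping each command as an argument array means it can be passed to Node's `execFileSync(cmd[0], cmd.slice(1))` from `node:child_process` without shell quoting problems in the hypothesis text.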

Scripts API

baseline.ts

import { captureBaseline } from './scripts/baseline';

const baseline = await captureBaseline("bun test", projectRoot);
// baseline: TestBaseline with total, passed, failed, categories, failures
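For orientation, the baseline described in the comment above might have a shape like the following. This is a guess based only on the field names listed there; the authoritative definitions are in `types/test-results.ts` and `schemas/baseline.schema.json`. `TestFailure`, `countByCategory`, and `makeBaseline` are hypothetical, and deriving `passed` as `total - failed` assumes there are no skipped tests.

```typescript
// Hypothetical shape inferred from the field names; see types/test-results.ts for the real one.
type FailureCategory = "compile" | "runtime" | "assertion" | "infrastructure";

interface TestFailure {
  name: string;
  category: FailureCategory;
  message: string;
}

interface TestBaseline {
  total: number;
  passed: number;
  failed: number;
  categories: Record<FailureCategory, number>;
  failures: TestFailure[];
}

// Derive per-category counts from the failure list so the two cannot drift apart.
function countByCategory(failures: TestFailure[]): Record<FailureCategory, number> {
  const counts: Record<FailureCategory, number> = {
    compile: 0,
    runtime: 0,
    assertion: 0,
    infrastructure: 0,
  };
  for (const failure of failures) {
    counts[failure.category] += 1;
  }
  return counts;
}

// Assumption: every test either passes or fails (no skipped tests).
function makeBaseline(total: number, failures: TestFailure[]): TestBaseline {
  return {
    total,
    passed: total - failures.length,
    failed: failures.length,
    categories: countByCategory(failures),
    failures,
  };
}
```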

categorize.ts

import { categorizeAll, groupByCategory, formatPrioritizedPlan } from './scripts/categorize';

const failures = categorizeAll(testResults);
const groups = groupByCategory(failures);  // Map in priority order
const plan = formatPrioritizedPlan(failures);  // Human-readable plan

compare.ts

import { readBaseline, compareResults, formatComparison } from './scripts/compare';

const baseline = await readBaseline(projectRoot);
const result = compareResults(baseline, currentSummary, currentFailures);
// result.verdict: "improved" | "same" | "worse"
// result.delta: TestDelta with newFailures, fixedTests
const output = formatComparison(baseline, currentSummary, result);

report.ts

import { generateReport, buildReport } from './scripts/report';

const report = buildReport(baseline, iterations, finalSummary, finalDelta, remaining);
const markdown = generateReport(report);
