From flightplan
Construct a deterministic, agent-runnable pass/fail signal for a bug, regression, or feature — fast, sharp, and reproducible without a human in the loop. Use when the user says "build a test harness", "make this reproducible", "what's the cheapest reproducer for this bug", "the dev loop is too slow", or "we need a deterministic check for this". This is the reference skill the rest of the Valesco engineering chain leans on: `/diagnose` Phase 1 invokes it to find a bug's cause, `/tdd` invokes it to drive new code red→green. Distinct from `/diagnose` (which *uses* loops to locate causes) and `/tdd` (which *uses* loops to drive incremental implementation) — this skill is about *constructing* the loop itself. Whenever you find yourself debugging by re-reading code or running ad-hoc commands without a deterministic pass/fail, stop and run this.
Slash command: `/flightplan:feedback-loop`
Construct a fast, sharp, deterministic, agent-runnable pass/fail signal for the behavior under question. Pick from a catalog of ten patterns, harden the loop against flakes, and hand it off to whatever skill called this one.
This skill is advisory. It writes test files, scripts, harnesses, and fixtures — never tracker labels.
Two things in the Valesco chain depend on the loop existing and behaving predictably:
- `/diagnose` Phase 1. Without a loop, hypothesis-testing degrades to hand-waving. Phase 1 is the entire skill; the rest is mechanical once the signal exists.
- `/tdd` red→green. A test that takes 30 seconds to fail breaks the tracer-bullet rhythm; the agent stops listening to its own feedback.

The loop also tends to graduate into a regression test that lives in the repo permanently, so the next person hitting the same bug (including Claude Code running inside runway) has a deterministic check available.
If you don't have a loop, you don't have engineering — you have storytelling. Build the loop.
- `/diagnose` Phase 1, when you've reproduced the bug by hand but need an automatable signal.
- `/tdd`, when picking the seam for the first failing test in a new feature slice.
- `/triage` to determine whether you have legitimate access to a reproducer; this skill cannot manufacture one out of nothing.

Every loop, regardless of pattern, is judged on three axes. A 30-second flaky loop is barely better than no loop at all; a 2-second deterministic loop is a debugging superpower.
How to get fast: cache setup steps (don't re-bootstrap the DB on every iteration), narrow the scope (run one test, not the whole file), skip unrelated init, swap heavy deps for thin fakes only at the outer boundary.
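As a rough sketch of the "cache setup steps" advice, the expensive bootstrap can be wrapped so it runs once and later iterations reuse the result. The names `bootstrapFixture` and `once` are hypothetical, and the fixture here is a stand-in for something like a seeded database handle. This only suits fixtures the loop does not mutate.

```typescript
// Hypothetical expensive setup; the counter records how often it really runs.
let bootstrapCount = 0;
function bootstrapFixture(): { users: string[] } {
  bootstrapCount += 1;
  return { users: ["alice", "bob"] };
}

// Memoize a zero-argument builder so repeated calls reuse the first result.
function once<T>(build: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) cached = { value: build() };
    return cached.value;
  };
}

const getFixture = once(bootstrapFixture);

// Five loop iterations, one real bootstrap.
for (let i = 0; i < 5; i++) getFixture();
console.log(bootstrapCount); // 1
```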
A sharp loop asserts on the specific symptom, not "didn't crash". If the bug is "wrong total when cart contains a 0-priced item", the assertion is `assert total == 0`, not `assert order.completed_ok`.

Sharp loops survive refactors because they describe behavior, not structure. A loop that fails when you rename an internal function was testing implementation, not behavior; see `/tdd`.
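To make the contrast concrete, here is a minimal sketch using the 0-priced-item example. The `checkout` function, its deliberate bug, and the `Order` shape are all invented for illustration; only the two styles of assertion come from the text above.

```typescript
// Hypothetical shapes and function, for illustration only.
interface Item { label: string; price: number }
interface Order { completedOk: boolean; total: number }

// Buggy on purpose: a 0-priced item is treated as "price missing"
// and billed at a default of 1.
function checkout(items: Item[]): Order {
  const total = items.reduce((sum, item) => sum + (item.price || 1), 0);
  return { completedOk: true, total };
}

const order = checkout([{ label: "free sample", price: 0 }]);

// Vague signal: passes even though the total is wrong.
const vaguePasses = order.completedOk === true;

// Sharp signal: checks the symptom itself, so it fails while the bug exists.
const sharpPasses = order.total === 0;

console.log({ vaguePasses, sharpPasses });
```

The vague check stays green on the buggy build, which is exactly why it cannot drive a fix.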
Same input, same output, every run, on every machine. The four sources of non-determinism to neutralize:

- Time (`vi.useFakeTimers()`, `freezegun`, or inject a clock dependency).
- Randomness (`Math.random` is rarely the issue; UUIDs and test data factories are).
- Filesystem and database state (`tmp.mkdtempSync()`, ephemeral Supabase branch via `mcp__a80053c7…__create_branch`).
- Network (MSW, VCR, or nock at the route level).

If you cannot neutralize one of the four, document which and why in a comment alongside the loop.
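One way to neutralize time and randomness without a test framework is to pass both in as parameters. The sketch below assumes a hypothetical `makeOrderId` function; `Clock`, `Rng`, and `seededRng` are illustrative names, and the generator is a plain linear congruential generator chosen for brevity, not for statistical quality.

```typescript
type Clock = () => number; // returns epoch milliseconds
type Rng = () => number;   // returns a float in [0, 1)

// Small seeded generator: the same seed always yields the same sequence.
function seededRng(seed: number): Rng {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Hypothetical code under test: takes its clock and RNG as arguments
// instead of reading Date.now() and Math.random() directly.
function makeOrderId(now: Clock, rng: Rng): string {
  const day = new Date(now()).toISOString().slice(0, 10);
  const suffix = String(1e6 + Math.floor(rng() * 1e6)).slice(1); // 6 digits
  return day + "-" + suffix;
}

const frozenClock: Clock = () => Date.UTC(2026, 3, 30); // 2026-04-30T00:00:00Z
const first = makeOrderId(frozenClock, seededRng(42));
const second = makeOrderId(frozenClock, seededRng(42));

console.log(first === second); // true: same clock and seed, same ID
```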
Pick the cheapest pattern that produces a sharp signal. The patterns are ordered by general preference — when in doubt, try them in this order.
A test at the seam closest to the bug — unit when behavior is local, integration when the bug only appears across module boundaries, e2e when the bug is in wiring.
```ts
test("reproduces <symptom>", () => {
  setup();
  const result = systemUnderTest(triggeringInput);
  expect(result).toBe(expectedSymptom); // initially the bug, then the fix
});
```
Run only the one test (`vitest <file> -t <name>`), skip unrelated setup hooks, switch from full DB to in-memory adapter only if the bug isn't in the DB layer.

A shell script that hits a running dev server with a recorded request and asserts on the response shape, status, or specific field.
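Assert on specific fields (`jq`, not full body diff), and do not forget to seed the DB to a known state between runs.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Replay the recorded request against the running dev server.
RESPONSE=$(curl -sS -X POST http://localhost:3000/api/orders \
  -H "Content-Type: application/json" \
  -d @fixtures/order.json)

# jq -e exits non-zero when the expression is false or null,
# so the script fails unless the specific field matches.
echo "$RESPONSE" | jq -e '.total == 0' > /dev/null
```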
Invoke the CLI with a fixture input file and diff stdout against a recorded snapshot.
Strip ANSI color codes before diffing (`sed -r 's/\x1b\[[0-9;]*m//g'`).

```bash
./bin/mytool < fixtures/input.txt > /tmp/actual.txt
diff -u fixtures/expected.txt /tmp/actual.txt
```
Use a snapshot tool (`vitest -u`, `insta`) instead of hand-maintained expected files; canonicalize output (sort lines if order doesn't matter).

Playwright or Puppeteer drives the real UI; the script asserts on DOM state, console messages, or network calls.
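Wait on conditions, not the clock (`waitForSelector`, not `setTimeout`); watch for zombie browser processes between runs.

```ts
import { test, expect } from "@playwright/test";

test("cart total updates", async ({ page }) => {
  await page.goto("/cart"); // relative URL: assumes baseURL is set in the Playwright config
  await page.getByRole("button", { name: /add/i }).click();
  // toHaveText retries until it matches or times out, so no manual sleep is needed.
  await expect(page.getByTestId("total")).toHaveText("$0.00");
});
```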
Use `--ui` mode for first-build, drop to headless for the production loop; record video on failure only; pin viewport size to make selectors stable.

Save a real network request / payload / event log to disk; replay it through the code path in isolation.
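```ts
import { readFileSync } from "node:fs";

test("processes captured webhook payload", async () => {
  // Read as UTF-8 text so JSON.parse receives a string rather than a Buffer.
  const payload = JSON.parse(readFileSync("fixtures/webhook-2026-04-30.json", "utf8"));
  const result = await processWebhook(payload);
  expect(result.status).toBe("rejected");
});
```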
A minimal subset of the system — one service, faked deps — that exercises
the bug code path with a single function call. Lives in scripts/debug/,
not in the test suite.
… `/diagnose` deletes it).

```ts
// scripts/debug/repro-VA-142.ts
import { internalFunction } from "../../src/feature/private";

const result = internalFunction(buildInput());
// A non-zero exit code signals that the bug reproduced.
if (result !== "expected") process.exit(1);
```
Add `console.time` blocks to find the slow part; delete the harness as soon as a real test seam exists.

Run 1000+ random inputs through the system and look for the failure mode.
… `numRuns`); shrinking failures the framework can't simplify (write a custom shrinker); flakes from network or time inside the property (don't; keep the property pure).

```ts
import fc from "fast-check";

fc.assert(
  fc.property(fc.array(fc.integer()), (arr) => {
    const sorted = mySort(arr);
    return isAscending(sorted);
  }),
  { numRuns: 1000, seed: 42 } // pinned seed: every run generates the same inputs
);
```
Pin the seed (`seed: 42`); save the smallest counterexample as a regression test; raise `numRuns` only after the shrinker is reliable.

`git bisect run` over a script that boots state X, checks the bug, exits 0 (good) or 1 (bad).
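Commits that cannot be tested (`git bisect skip`); the bug requires a re-install of dependencies (script must run `pnpm i`); bisecting through a refactor that renamed everything (history-aware bisection, `--first-parent`).

```bash
#!/usr/bin/env bash
set -e
# Exit 125 tells `git bisect run` to skip a commit that cannot be tested,
# instead of misreporting an install failure as "bad".
pnpm install --frozen-lockfile > /dev/null || exit 125
./bin/repro || exit 1  # bug present
exit 0                 # bug absent
```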
Cache `node_modules` per commit hash to skip the reinstall; pre-build once if the build is deterministic; bisect over a package-lock range, not source range, when the suspect is a dep.

Run the same input through two configurations (old version vs new, two deployments, two implementations) and diff outputs.
Canonicalize JSON before diffing (`jq -S`).

```bash
./bin/old-tool < input.json | jq -S . > /tmp/old.json
./bin/new-tool < input.json | jq -S . > /tmp/new.json
diff -u /tmp/old.json /tmp/new.json
```
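When both implementations are callable in-process, the same canonicalize-then-compare idea can be sketched in TypeScript. The `canonicalize` helper and the `oldImpl` / `newImpl` functions below are hypothetical stand-ins, not part of the skill; the point is that sorting keys before comparing keeps key order from showing up as a false difference.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so key order never
// registers as a difference between the two outputs.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  }
  const obj: { [key: string]: Json } = value;
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k])).join(",") + "}";
}

// Hypothetical old and new implementations that agree on content
// but emit their keys in a different order.
function oldImpl(x: number): Json { return { total: x * 2, currency: "USD" }; }
function newImpl(x: number): Json { return { currency: "USD", total: x + x }; }

const same = canonicalize(oldImpl(21)) === canonicalize(newImpl(21));
console.log(same); // true
```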
A bash script that prompts a human for the action, captures the result, and feeds it back to the agent. Last resort — humans aren't part of an autonomous run, so this disqualifies the loop as a regression test that runway / CI can re-run later.
```bash
#!/usr/bin/env bash
read -p "Click the buy button. Did the modal close? (y/n) " ok
[[ "$ok" == "y" ]] && exit 0 || exit 1
```
Walk this tree top-to-bottom. Stop at the first "yes."
… `needs-human` rather than `ready-for-agent`.

If you fall off the bottom of the tree without a fit, the bug isn't loopable yet: return to `/triage` and request reproducer access, trace capture, or production instrumentation as `needs-info`.
Apply across every pattern. Climb until the loop is fast, sharp, and deterministic enough to defend.
| Step | Cheap | Expensive |
|---|---|---|
| Cache setup | Memoize fixture loads, keep dev server warm. | Snapshot the entire DB and restore between runs. |
| Narrow scope | Filter to one test (`-t name`). | Compile a per-test bundle that excludes unrelated code. |
| Pin time | Inject a clock; freeze in the test. | Run inside a container with the system clock pinned. |
| Seed RNG | Pass an explicit seed to test data factories. | Replace global RNG with a deterministic-by-default wrapper. |
| Isolate filesystem | `tmp.mkdtempSync()` per test. | Per-test ephemeral Supabase branch. |
| Freeze network | MSW / VCR / nock at the route level. | Container with no egress; recorded HAR replay. |
| Raise repro rate | Loop the trigger 100×; tighten timing windows. | Inject targeted sleeps to widen the race window; run under TSAN. |
For non-deterministic bugs the goal is not a clean repro but a higher reproduction rate. A 50%-flake bug is debuggable; 1% is not. Climb the ladder until the rate is workable, then build the loop on top of that.
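A minimal sketch of measuring that rate, so each rung of the ladder can be judged by whether the number went up. `reproRate` and `flakyTrigger` are hypothetical names; the trigger here is a deterministic stand-in that "reproduces" on every fourth call, where a real one would exercise the racy code path and return whether the symptom appeared.

```typescript
// Run a trigger many times and report the fraction of attempts that
// reproduced the bug. `trigger` returns true when the symptom appeared.
function reproRate(trigger: () => boolean, attempts: number): number {
  let hits = 0;
  for (let i = 0; i < attempts; i++) {
    if (trigger()) hits += 1;
  }
  return hits / attempts;
}

// Stand-in for a flaky bug: reproduces on every 4th attempt.
let calls = 0;
const flakyTrigger = (): boolean => {
  calls += 1;
  return calls % 4 === 0;
};

const rate = reproRate(flakyTrigger, 100);
console.log(rate); // 0.25
```

Re-running the measurement after each change (tighter timing window, injected sleep) gives a before-and-after number instead of an impression.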
The loop you build here usually doesn't stay in scripts/debug/. It
graduates into the project's permanent infrastructure:
| Caller | What the loop becomes |
|---|---|
| `/diagnose` Phase 1 | The active feedback loop for hypothesis testing. The Phase 5 regression test typically condenses it into a permanent test. |
| `/tdd` | The first failing test in the red→green cycle. Each subsequent slice extends or copies it. |
| Issue body for runway pickup | Once a loop graduates into a permanent test, reference it by path in the issue body so Claude Code (running inside runway) sees both the spec and a deterministic check. |
If any check fails, fix the loop before returning. A loop that "mostly works" is the same as no loop — it pollutes downstream skills with noise.
- … `/diagnose` Phase 5 or `/tdd` green.
- … `scripts/debug/`, never in `src/`. Production probes need explicit user approval.
- `../diagnose/SKILL.md`: the primary caller; Phase 1 is "build a loop using this skill."
- `/tdd`: the red→green caller; this skill is the "build the red" half.
- `/diagnose`: Phase 1 of his version is the doctrinal source for the ten-pattern
catalog; this skill is the expanded reference.
npx claudepluginhub valescoagency/flightplan --plugin flightplan