From gilfoyle
Replaces executing-plans. Executes slices one at a time. After each slice, runs the slice's stress fixture AND the prove-it-prototype oracle against the binary AND the budget check. If any of those fail, STOPS and surfaces the drift to the user. Treats the plan as a hypothesis, not a contract.
Slash command
/gilfoyle:checkpointed-build
You are not typing the plan. You are advancing the design hypothesis by one slice. After each slice, you go back to reality and check whether the design is still standing.
For each slice:
1. Implement.
2. Run unit tests.
3. Run the stress fixture.
4. Run the prove-it-prototype oracle against the binary.
5. Check the budget.
6. If anything in 2-5 fails → STOP. Surface to user. Do not proceed.
7. Else → commit. Next slice.
Stopping is the most important step. The standard convention is to push through drift and "fix it later." We do not. Drift caught at slice N is cheap. Drift caught at slice N+8 is the entire feature.
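The stop-on-first-failure loop above can be sketched in shell. This is a minimal sketch under stated assumptions, not the skill's own implementation: `run_gates` is a hypothetical helper, and each argument is assumed to be a command string that exits non-zero when its gate fails.

```shell
# Hypothetical helper: run each gate command in order and halt at the first failure.
# Assumption: every argument is one gate, passed as a command string.
run_gates() {
    for gate in "$@"; do
        if ! sh -c "$gate" >/dev/null 2>&1; then
            # Report the failing gate and stop; later gates are not run.
            echo "HALT: gate failed: $gate"
            return 1
        fi
    done
    echo "all gates green"
}
```

Only a zero return from `run_gates` would be followed by a commit. On a non-zero return, the next action is the report to the user, not a fix attempt.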
After budgeted-plan has produced a plan with all gates passing. Not before. If there's no plan, you're improvising. We do not improvise.
Read the plan once, top to bottom. Flag any slice where:
- O(n) over files × symbols is not O(n).)
- debug_assert!.
Raise concerns with the user before writing any code. The plan is allowed to be wrong. Catching it now is free.
If the slice changes the signature, name, or semantics of an existing function — public or private — list the callers BEFORE writing the change. The list bounds the change's blast radius and tells you what else this slice must update.
Tooling, in preference order:
- Static analysis (tethys callers <Type::method>, IDE "find usages," rust-analyzer references). Cheap, fast, and catches most callers. Use the qualified path when bare names are ambiguous.
- grep for the function name across the codebase. Catches what static analysis missed (string-built dynamic dispatch, doc references, etc.).
- grep is the safety net.
Caveats on the static-analysis tool you're using:
Output of this step: a short list (caller path + line) committed in the slice's plan or commit message. Future readers see what the change touched.
If the slice is adding a brand-new function with no existing callers, this step is a no-op — note that explicitly so the absence is intentional, not forgetful.
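The grep fallback for listing callers might look like the sketch below. `list_callers` is a hypothetical helper, and it assumes Rust-style source where call sites contain `name(` and the definition contains `fn name(`. It is a text match, so it will also report mentions in comments and strings, which is the point of a safety net.

```shell
# Hypothetical helper: print "path:line" for every textual call site of a function.
# Assumptions: call sites look like `name(`, definitions look like `fn name(`.
list_callers() {
    name=$1
    root=$2
    grep -rn -- "${name}(" "$root" |   # every line mentioning name(
        grep -v "fn ${name}(" |        # drop the definition itself
        cut -d: -f1,2 |                # keep path and line number only
        sort
}
```

The output is already in the caller path + line shape that the slice's plan or commit message needs.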
Before writing any utility code (path manipulation, string normalization, error wrapping, retry logic, common SQL fragments, percent-encoding, hex formatting, ...), check two places — both are part of the codebase's vocabulary:
In-source helpers — grep the codebase for existing functions with matching or close semantics. If one exists with matching semantics, reuse it. If close, widen it or wrap it — do not duplicate.
Already-imported dependencies — grep the project's dependency manifests (Cargo.toml, package.json, pyproject.toml, go.mod, etc.) for crates whose API might already cover the utility. An already-imported dep is functionally part of the codebase's vocabulary — using it has zero cost (no new audit, no new dep-evaluation), while reimplementing what it provides is duplication just like reimplementing an in-source helper.
Concrete checks for each:
- normalize_*, to_str, canonical; dep grep for url, percent-encoding, path-clean.
- sqlx, diesel, rusqlite.
- From/map_err patterns; dep check for thiserror, anyhow, snafu.
- url, percent-encoding, urlencoding.
- backoff, tokio-retry.
- hex, base64, data-encoding.
The cost of both checks is 60 seconds of grep. The cost of duplicating:
The exception is net-new deps: if the utility crate isn't already imported, adding it is a separate decision (does it clear the project's dep-evaluation bar?). This rule is specifically about already-imported crates whose use cost is zero. A net-new dep for a 10-line function is rarely worth it; using an already-imported one for the same purpose almost always is.
Edge case — partial fit: if the in-source helper or imported crate has close but not matching semantics (e.g., a lossy-conversion path-normalizer when you need strict UTF-8 erroring), document the deviation inline before duplicating. A divergent reimplementation with a comment explaining why is honest; silent duplication is debt.
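The manifest half of this check can be sketched as a one-line grep. `dep_imported` is a hypothetical helper, and it assumes a Cargo.toml-style manifest where a dependency appears as `name = ...` at the start of a line. Table-style entries such as `[dependencies.name]` and other manifest formats would need their own pattern.

```shell
# Hypothetical helper: succeed if a crate is declared as `name = ...` in a manifest.
# Assumption: Cargo.toml-style key/value dependency lines.
dep_imported() {
    crate=$1
    manifest=$2
    grep -Eq "^[[:space:]]*${crate}[[:space:]]*=" "$manifest"
}
```

A zero exit status means the crate is already part of the codebase's vocabulary. A non-zero status means that using it would be a net-new dep, which is a separate decision.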
Write the code. TDD discipline applies for unit tests inside the slice (use tdd-scoped). The pre-typed code in the plan is advisory. You are permitted, encouraged even, to deviate if you spot a better algorithm or a missing edge case. The slice's contract is its claim, fixture, oracle, and budget — not its code blocks.
If the slice introduces a new branch that parallels an existing one — e.g., a same-crate lookup alongside an existing unscoped lookup, a new retry path alongside an existing one, a new validation that mirrors a check elsewhere — list the existing path's behaviors and confirm the new path matches OR has a written justification for the divergence.
Specific checklist when adding a parallel path:
- warn!/debug!/info! at the same severity as the old one for analogous events?
Asymmetry is allowed. Unintentional asymmetry is the bug class this audit catches.
cargo nextest run or the equivalent for your language. Output must be pristine: no errors, no warnings, no skips outside the documented skip list.
The fixture from the plan, with the expected outcome from the plan. Compare actual to expected. Exact match required.
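An exact-match comparison can be as small as a `diff` wrapper. This is a sketch: `check_fixture` is a hypothetical helper, and it assumes the expected outcome and the actual output have both been written to files.

```shell
# Hypothetical helper: exact comparison of expected vs actual fixture output.
# Assumption: $1 is the expected-output file, $2 is the actual-output file.
check_fixture() {
    if diff -u "$1" "$2"; then
        echo "stress fixture: pass"
    else
        # diff has already printed the unified diff; keep it for the halt report.
        echo "stress fixture: fail (diff above)"
        return 1
    fi
}
```

Any difference fails the gate, including whitespace, since the requirement is an exact match.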
This is the step you will be most tempted to skip. Do not skip it.
If the binary's output drifts from the oracle's, this slice broke something earlier slices established. That is the highest-priority bug class. Stop here, regardless of whether unit tests pass.
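One way to get the exact items where the binary and the oracle disagree is a set difference over their outputs. This is a sketch: `oracle_drift` is a hypothetical helper, and it assumes both outputs are line-oriented (one item per line) and that ordering does not matter.

```shell
# Hypothetical helper: list items present in only one of the two outputs.
# Assumption: $1 is the oracle's output file, $2 is the binary's, one item per line.
oracle_drift() {
    o=$(mktemp)
    b=$(mktemp)
    sort "$1" > "$o"    # comm requires sorted input
    sort "$2" > "$b"
    comm -23 "$o" "$b" | sed 's/^/oracle only: /'
    comm -13 "$o" "$b" | sed 's/^/binary only: /'
    rm -f "$o" "$b"
}
```

Empty output means no drift. Any printed line goes into the halt report as-is.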
For every loop the slice introduced, confirm the production-scale cost is what the plan budgeted. For an always-on phase, measure wall-clock against the budget. Use time, hyperfine, or a benchmark harness. Eyeballing it does not count.
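A wall-clock check against a budget can be sketched as follows. `within_budget` is a hypothetical helper, and it measures in whole seconds via `date +%s`, so it only suits budgets of several seconds or more. For tighter budgets, hyperfine or a benchmark harness is the better tool. The sketch checks elapsed time only, not the command's own exit status.

```shell
# Hypothetical helper: run a command and compare elapsed wall-clock time to a budget.
# Assumption: budget is given in whole seconds as $1; the command follows.
within_budget() {
    budget_s=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    elapsed=$((end - start))
    # Print the measurement so it can be pasted into the halt report.
    echo "actual=${elapsed}s budget=${budget_s}s"
    [ "$elapsed" -le "$budget_s" ]
}
```

The printed line is the "actual vs planned, with measurement" evidence. The exit status is the gate.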
If the slice corresponds to a claim with a Regression fence entry in the falsifiable-design table, run that test specifically and confirm it passes. The fence is the permanent form of the falsifier — it should fail when the bug class re-appears and pass otherwise.
If the slice's claim has Regression fence: manual in the design, this step is a one-time manual check (do it now, document the result in the slice's commit message).
If the slice has no associated regression fence, this step is a no-op — but flag it: a claim shipping with no regression fence is a future-regression risk the design accepted explicitly. Note the absence in the slice's commit message so the next reader sees the gap.
Do not commit. Do not advance to the next slice. Surface to the user:
Slice N halted.
- Unit tests: [pass | fail with diff]
- Stress fixture: [pass | fail with diff]
- Oracle drift: [exact items where binary and oracle disagree]
- Budget: [actual vs planned, with measurement]
- Regression fence: [pass | fail with diff | none associated with this slice]
The implementation, the oracle, or the design is wrong. Which is it?
The user picks. Possible outcomes:
- prove-it-prototype to confirm the revised oracle still agrees with the probe. Stay on slice N.
- falsifiable-design. The plan may need to be rewritten. Possibly the probe needs to be re-run.
Your job is to surface accurately, not to decide. Decisions about which thing is wrong are not yours to make from inside the executor.
The slice you just landed may have made earlier slice comments out of date. Scan every file modified by this slice — plus the files the slice depends on — for:
- // slice N hardens this or // to be implemented in step M. If the future slice has now landed, rewrite the comment to describe current behavior in present-tense / past-tense factual terms.
- search_symbol_by_name that now refuses ambiguity), rename in this commit. The longer a misleading name persists, the more callers depend on the wrong mental model.
- falsifiable-design: any TODO(rivets-XXX), // see rivets-YYY, // deferred to rivets-ZZZ in code, and any "tracked at rivets-N" / "deferred to follow-up" / "out of scope" phrase in the commit message you're about to write, must point at an existing issue whose content covers the deferred work. Run rivets show <id> (or tracker equivalent) to verify. If the ID doesn't exist, the issue's description doesn't match the deferral, or the comment has no ID at all, file the issue now and update the reference. Anonymous TODOs and phantom tracker IDs rot.
Run grep for the slice number, for keywords like "TODO", "later", "future", "deferred", "tracked", and for any comment phrase that anticipated work that's now done. Fix what you find before committing.
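The anonymous-TODO part of this sweep can be sketched as a pair of greps. `anonymous_todos` is a hypothetical helper, and the ID pattern `rivets-` followed by letters or digits is an assumption based on the examples in this document. It does not verify that a referenced ID exists, so rivets show <id> is still needed for that.

```shell
# Hypothetical helper: list TODO/FIXME lines that carry no tracker ID.
# Assumption: a tracked TODO is written as TODO(rivets-<alphanumeric id>).
anonymous_todos() {
    grep -rnE 'TODO|FIXME' "$1" |
        grep -vE 'TODO\(rivets-[A-Za-z0-9]+\)'
}
```

Each printed line is a candidate to either fix now or turn into a filed issue with its ID referenced.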
One commit per slice. Commit message references the design claim this slice implements.
Generated and structured files (e.g., .rivets/issues.jsonl, Cargo.lock, schema dumps, generated client code) silently accumulate divergence from main on long-lived branches. Each tool invocation that mutates them is invisible at commit time; the conflict only surfaces at merge time.
Catch this every slice, not at PR-open time:
git fetch origin main
git diff origin/main --name-only | grep -E '\.(jsonl|lock)$|package-lock\.json$|schema'
Flag any of:
- Cargo.lock / package-lock.json / poetry.lock change
If anything flagged: merge or rebase from main before starting the next slice. Prefer a merge commit (preserves slice history) over a rebase (rewrites it and breaks already-pushed commit hashes if other reviewers are reading them).
Symptom that means you missed this step: PR opened, gh pr checks <N> reports "no checks reported on the branch," zero workflow runs in the queue. GitHub Actions silently refuses to run pull_request workflows on PRs with conflicts and gives no surfaced signal. If you see zero CI activity within ~2 minutes of pushing a non-trivial PR, run git diff origin/main --stat — almost always a merge conflict.
After every slice has passed: rebuild the binary. Run every oracle from prove-it-prototype, every falsifier from falsifiable-design, and every regression fence in the Falsification table. They must all still pass.
The oracle / falsifier / fence are related but distinct:
If the fence is the same artifact as the falsifier (when both are deterministic tests), run it once and count it as both. If the falsifier is a one-shot measurement and the fence is a separate CI test, run the fence here — the falsifier's job ended in the design phase.
If any fail, you have a regression introduced somewhere in the slice chain. Bisect to find which slice. Stop. Surface.
- gh pr checks reports zero checks running within 2 minutes. That's almost certainly a merge conflict: run git diff origin/main --stat immediately, before assuming a CI queue delay.
- git diff origin/main --stat first.
- // TODO: figure out later / // FIXME: needs work / // deferred to a follow-up: anonymous future-work pointers without tracker IDs. Either file the issue and reference its ID, or fix the thing now. No TODO(someone someday); every TODO gets a TODO(rivets-XXX).
- rivets-abc1 but didn't verify it exists." Phantom tracker references and silent deferrals fail in the same way: future contributors looking at the tracker can't find the deferred work. Run rivets show <id> before writing the reference, not after.
- Cargo.toml / package.json / pyproject.toml? An already-imported dependency's API is functionally part of the codebase's vocabulary. Hand-rolling percent_encode_path when percent-encoding = "2.3" is two lines down in the manifest is the same class of duplication as hand-rolling a function that already exists in db/files.rs.
This is not "follow the plan." This is "advance the design hypothesis by one slice and re-test it against reality." The plan is the current best guess. Reality is the authority. If they disagree, reality wins, and you stop until the user decides whether to revise the implementation, the oracle, or the design.
For each slice, one commit with all gates green. After the final slice, a clean run of every oracle and every falsifier against the assembled binary. If any of those is missing, the skill didn't finish. Finish it.
npx claudepluginhub dwalleck/gilfoyle --plugin gilfoyle