From crucible
Runs k iterations of stage → collect-behavior → score for calibration sweeps, replacing bash for-loops. Supports idempotent resume via sentinel files per iteration.
How this skill is triggered — by the user, by Claude, or both
Slash command
/crucible:temper-eval-calibrateopusThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
<!-- CANONICAL: shared/dispatch-convention.md -->
Invocation:
/temper-eval-calibrate <run-id-prefix> [--k N] [--source SOURCE] [--fixture <id>]
[--write-baseline] [--compare-baseline]
[--trials-override N] [--max-parallel N]
[--timeout N]
Pre-conditions:
<run-id-prefix> matches ^[A-Za-z0-9_][A-Za-z0-9_-]{0,28}$ (I-9; aligned with _runid.py's _PREFIX_RE; reserves 3 chars for -<i> suffix)Post-conditions:
i in 1..k, skills/temper/evals/.calibrate-state/<prefix>-<i>-complete exists AND skills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json exists with run_id == <prefix>-<i>Task 13 (per-iter wiring) lands BEFORE this skill is invoked — verified by sequential task numbering. Per F-R4-1, Task 13 lands BOTH the --per-iter argparse declaration AND the main() wiring in a SINGLE atomic commit (Task 4 intentionally omits the argparse declaration to eliminate any silent-drop window). Because Task 13 is numbered ahead of Task 14, operators following the plan sequentially will have --per-iter wired before this skill's SKILL.md exists to be invoked. Defense-in-depth: if Task 13 is somehow skipped, score --per-iter will raise an argparse "unrecognized arguments" error (no silent drop) — the calibrate skill's Step 3e shell-out will fail loud rather than clobbering shared state.
<run-id-prefix> must match ^[A-Za-z0-9_][A-Za-z0-9_-]{0,28}$ (M-FE-3 R3); refuse with explicit error if not.--k (default 3) must be in [1, 99] inclusive; refuse if out-of-range. STATE_DIR = skills/temper/evals/.calibrate-state/. If it does not exist, create it (mkdir -p).
i in 1..kRUN_ID = "<prefix>-<i>".
If BOTH of the following are true, skip iteration entirely (no stage, no collect, no score, no billing):
skills/temper/evals/.calibrate-state/<prefix>-<i>-complete existsskills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json exists AND its run_id field equals <prefix>-<i>Print: [skip] iteration <i>: already complete.
Shell out via Bash tool: python -m skills.temper.evals.run_evals stage "$RUN_ID" [--source ...] [--fixture ...] [--trials-override ...] [--timeout ...]
Pass through whatever flags the user supplied.
Execute the full procedure documented in skills/temper-eval-collect/SKILL.md Steps 1–8 against RUN_ID.
Version pin (M1 R8): this inline-execution reference is pinned to the Steps-1–8 shape at calibrate-skill author time. If skills/temper-eval-collect/SKILL.md is later edited (e.g. a Step 4.5 is added, or steps are renumbered), this calibrate skill MUST be re-validated against the new step set — there is NO automatic inheritance. Treat collect SKILL.md edits as breaking changes for the calibrate skill until re-verified.
This includes:
.tmp cleanup--timeout overridemanifest.jsonl per-dispatch appends.collect-status write with fsync (I-12)The inlined behavior satisfies I-7, I-8, I-10, I-12 identically to standalone /temper-eval-collect.
Shell out: python -m skills.temper.evals.run_evals score "$RUN_ID" --per-iter [--write-baseline] [--compare-baseline]
The --per-iter flag is REQUIRED here (F1): it tells score to write to skills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json instead of the shared last_run.json. Without --per-iter, iterations would clobber each other's output AND collide with operator-owned last_run.json artifacts. There is NO run-id-shape heuristic; the routing is explicit.
If score returns 2 (fatal), abort all remaining iterations and surface the error.
On successful score, write skills/temper/evals/.calibrate-state/<prefix>-<i>-complete with content complete-<ISO-8601>.
[iter <i>/<k>] RUN_ID=<RUN_ID> score=<rc>
After ALL k iterations succeed (every iteration's .calibrate-state/<prefix>-<i>-complete sentinel is present), copy skills/temper/evals/.calibrate-state/last_run-<prefix>-<k>.json to skills/temper/evals/last_run.json (the shared output the --compare-baseline workflow expects). This is unconditional on success — not optional.
Pre-copy verification (mirrors the sentinel-pair discipline in Step 3b):
last_run-<prefix>-<k>.json exists. If absent, refuse to clobber last_run.json and exit with the fatal error "iteration k completion sentinel present but per-iter file missing at <path>; refusing to consolidate.""per-iter file at <path> is malformed JSON; refusing to consolidate."run_id field equals <prefix>-<k>. If it does not, refuse with "per-iter file at <path> carries run_id=<actual>, expected <prefix>-<k>; refusing to consolidate."If ANY iteration failed (sentinel missing, score returned 2, or stage refused), do NOTHING in this step: leave last_run.json untouched so the operator can inspect the per-iter files manually without a clobbered shared state.
Rationale: the original "optionally copy" wording produced two equally-valid downstream states (shared file = last iter vs. shared file = stale) with no operator signal to disambiguate. The all-or-nothing rule eliminates the ambiguity. Revisit only if best-or-median canonicalization is genuinely needed (defer to a follow-up ticket).
This skill enforces: M-2 (k cap 1..99), I-9 (prefix regex), AC-12 (resume idempotency).
npx claudepluginhub raddue/crucibleBenchmarks Claude Code skill performance via multiple trials per eval, tracking pass rate, execution time, token usage, and variance. Aggregates to benchmark.json and generates version comparison reports. Use for 'benchmark skill' or performance tracking queries.
Generates synthetic problems with quasi-ground-truth outcomes to test agents and skills, measuring recall, precision, and confidence calibration. Use for validating routing accuracy, A/B testing changes.
Evaluates skill output quality via assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario. Use for benchmarking, comparing skill versions, or triggered by /skill-eval.