From crucible
Walks fix/hotfix branches to falsify gating-verdicts, computes per-skill Brier calibration scores, and appends a falsification log to the Crucible calibration ledger.
Slash command
/crucible:calibration-reconcile
<!-- CANONICAL: shared/ledger-reduce.md -->
Walks fix branches to falsify the gating-verdicts that should have caught
the bug, computes per-skill Brier calibration scores, and writes a
falsification log to the central ledger ~/.claude/crucible/ledger/
(override CRUCIBLE_LEDGER_DIR). The SKILL.md is a thin prompt wrapper; the
source of truth is the testable Python core at scripts/reconcile_ledger.py.
Skill type: Utility — direct execution, no subagent dispatch.
Manual invocation: /calibration-reconcile [--lookback-days N] (default N=30).
If CRUCIBLE_CALIBRATION_DISABLED=1, the CLI skips entirely (graceful
no-op exit 0) before any git or filesystem work. The forward-capture path
(scripts/ledger_append.py) honors the same switch, so the whole calibration
subsystem can be disabled with one env var.
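A minimal sketch of how such a kill switch might be checked at the top of a CLI entry point. The names calibration_disabled and main are hypothetical, and treating only the literal value "1" as disabling is an assumption drawn from the text above, not from the scripts themselves.

```python
import os


def calibration_disabled(environ=None):
    """Return True when CRUCIBLE_CALIBRATION_DISABLED is set to "1".

    Assumption: only the literal value "1" disables the subsystem.
    """
    env = os.environ if environ is None else environ
    return env.get("CRUCIBLE_CALIBRATION_DISABLED") == "1"


def main(argv=None, environ=None):
    # Check the switch first, before any git or filesystem work.
    if calibration_disabled(environ):
        return 0  # graceful no-op
    # ... reconciliation work would start here (omitted in this sketch) ...
    return 0
```

Because the check is the first statement, a disabled run never touches git or the ledger directory.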
Invoke the reconciler directly. Resolve scripts/reconcile_ledger.py by
absolute path from the plugin root (it self-locates its own modules via
__file__, so no PYTHONPATH is needed and it runs from any cwd):
python3 <plugin_root>/scripts/reconcile_ledger.py --lookback-days 30
It reads runs.jsonl and manual-attribution.jsonl from the central store and
writes falsification.jsonl (append-only, L-1/L-9) + brier-rolling.json
(gitignored) into that SAME dir.
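The store lookup and JSONL reads described above could be sketched roughly as follows. The helper names ledger_dir and read_jsonl are hypothetical. Returning an empty list for a missing file is stated in this document only for manual-attribution.jsonl, so applying it to every file is an assumption.

```python
import json
import os
from pathlib import Path


def ledger_dir(environ=None):
    """Resolve the central ledger directory.

    CRUCIBLE_LEDGER_DIR overrides the default ~/.claude/crucible/ledger/.
    """
    env = os.environ if environ is None else environ
    override = env.get("CRUCIBLE_LEDGER_DIR")
    if override:
        return Path(override)
    return Path.home() / ".claude" / "crucible" / "ledger"


def read_jsonl(path):
    """Read one JSON object per line; a missing file yields an empty list."""
    path = Path(path)
    if not path.exists():
        return []  # assumption: absence is not an error
    entries = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries
```

With these helpers, read_jsonl(ledger_dir() / "runs.jsonl") would load the run ledger from whichever directory is in effect.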
Scope limit: this implements steps 1–6 ONLY. The predicted-falsifier predicate parser / "second pass" is Phase 7 and explicitly out of scope.
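As orientation for the reconciliation steps listed after this point, here is a rough sketch of three of them: the cross-cut threshold, the confidence tiers, and the earliest-match rule used by reconcile. It is not the code in scripts/reconcile_ledger.py. The function names, the "timestamp" key, the nearest-rank percentile, and the handling of the exact 14-day and 30-day boundaries are all assumptions.

```python
BOOTSTRAP_THRESHOLD = 20  # fixed value used until enough samples exist
MIN_SAMPLES = 30


def cross_cut_threshold(fix_branch_sizes):
    """p90 of fix-branch sizes, or the bootstrap value below 30 samples.

    Assumption: nearest-rank percentile; the real script may interpolate.
    """
    n = len(fix_branch_sizes)
    if n < MIN_SAMPLES:
        return BOOTSTRAP_THRESHOLD
    ordered = sorted(fix_branch_sizes)
    rank = (9 * n + 9) // 10  # ceil(0.9 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def confidence_tier(days_to_merge, n_touched, cross_cut):
    """Map a matched candidate to "high", "medium", or "low".

    Assumption: exactly 14 days counts as high and exactly 30 as medium.
    """
    if cross_cut or days_to_merge > 30 or n_touched > 5:
        return "low"
    if days_to_merge <= 14:
        return "high"
    return "medium"


def earliest_match(candidate, ledger_entries):
    """Return the earliest entry satisfying conditions (a)-(d), or None."""
    touched = set(candidate["touched_files"])
    matches = [
        e for e in ledger_entries
        if touched & set(e["gated_files"])            # (a) files overlap
        and e["timestamp"] < candidate["merge_time"]  # (b) verdict precedes merge
        and e["artifact_type"] == "code"              # (c) code artifacts only
        and not e["backfilled"]                       # (d) no backfilled entries
    ]
    return min(matches, key=lambda e: e["timestamp"], default=None)
```

A fix touching more files than the threshold is cross-cut and therefore always lands in the low tier, which is why confidence_tier does not need the threshold itself.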
1. discover_candidates: fix/* and hotfix/* branches merged in the last
   --lookback-days (default 30), plus regression-labelled closed issues with
   a referenced commit (best-effort; if gh is unavailable, skip silently).
   Each candidate carries touched_files (from
   git show --name-only <merge-sha>) and merge_time.
2. cross_cut_threshold_from: threshold = p90 of fix-branch sizes over the
   prior 90 days (a DISTINCT window from the 30-day candidate lookback);
   bootstrap to a fixed 20 until ≥30 samples exist. A candidate with
   len(touched_files) > threshold is tagged cross_cut: true. Cross-cut
   candidates may still get a confidence: low attribution but are EXCLUDED
   from the Brier denominator.
3. reconcile: for each candidate, find the EARLIEST ledger entry where ALL
   of: (a) gated_files ∩ touched_files non-empty, (b) entry timestamp <
   candidate merge_time, (c) artifact_type == "code", (d)
   backfilled == false. If found, record a falsification entry marking it
   falsified: true with falsified_by: {commit, reason, confidence, cross_cut}.
4. Confidence is high when intersection ≥1 file AND merge within 14 days of
   the verdict AND cross_cut == false; medium when 14–30 days AND
   cross_cut == false; low when >30 days OR cross_cut == true OR
   multi-file fixes (>5 files but ≤ the cross-cut threshold).
5. read_manual_attribution: read manual-attribution.jsonl first; entries
   there (keyed by ledger_entry_hash) override the algorithm's attribution.
   Missing file ⇒ empty overrides (no error). An entry may carry an optional
   signal_type (manual_override (default) | bad_implementation); see
   Non-code bad_implementation signal.
6. CRUCIBLE_CALIBRATION_DISABLED=1 ⇒ no-op exit.

Per-skill calibration is the Brier score over each skill's falsifiable sample:
brier = mean((confidence - actual)^2)
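A small worked example of the formula, taking actual as 1 for a verdict that held and 0 for a PASS that was later falsified. The sample values are invented for illustration, and returning None for an empty sample is an assumption.

```python
def brier(pairs):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if not pairs:
        return None  # assumption: no falsifiable sample means no score
    return sum((conf - actual) ** 2 for conf, actual in pairs) / len(pairs)


# Three verdicts from one hypothetical skill, as (confidence, actual).
sample = [(0.9, 1), (0.8, 1), (0.9, 0)]
score = brier(sample)
print(round(score, 4))  # 0.2867
```

The single confident PASS that turned out wrong contributes 0.81 of the 0.86 total, so one overconfident miss dominates a small sample.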
- actual = 1: a PASS NOT marked falsified, OR any FAIL (at v1 every FAIL ⇒
  actual=1, since predicted-falsifier predicates are Phase 7).
- actual = 0: a PASS that WAS marked falsified: true (looked up in the L-9
  latest-entry-wins reduction over falsification.jsonl via
  scripts.ledger_reduce.reduce, the canonical helper cited above).
- Falsifiable sample: artifact_type == "code" OR a non-code verdict carrying
  a bad_implementation falsification (see below); backfilled == false;
  verdict ∈ {PASS, FAIL} with confidence >= 0.5; entry older than the 30-day
  grace period; AND if a matching falsification entry exists, its cross_cut
  must be false.
- Only PASS/FAIL count. STAGNATION, ESCALATED, ARCHITECTURAL,
  SUSTAINED_REGRESSION are excluded from both numerator AND denominator.
- (… n; just don't gate). The advisory threshold is brier > 0.25 (consumed in
  Phase 6; here we only compute and store).

Non-code bad_implementation signal
Auto-falsification (walkback + predicted_falsifier) is code-artifact-centric: a
verdict is only falsified when a later fix/revert touches a gated path or fires a
predicate. That can never reach a non-code verdict (a design-doc or plan
PASS), and it misses the case where a PASS was accepted but the resulting
implementation later proved bad for reasons no path-touching fix captures: a
design-level wrong call, downstream rework, or an abandoned approach.
To score those, a human adds a manual-attribution.jsonl entry with
signal_type: "bad_implementation":
{"ledger_entry_hash": "<sha256(run_id:skill)>", "falsified": true,
"confidence": "high", "signal_type": "bad_implementation",
"reasoning": "design PASS led to a full rework of the persistence layer"}
- On a PASS it sets actual = 0 (the verdict was overconfident); on a FAIL it
  is out-of-contract and silently ignored.
- Non-code verdict (design/plan/…): such a verdict is admitted into the Brier
  sample only when it carries this signal; absent it, a non-code verdict's
  outcome is unknown and stays excluded (we never assume a non-code PASS was
  correct just because nothing falsified it). The usual filters still apply
  (PASS/FAIL, confidence >= 0.5, outside grace, non-cross-cut), so a Tier-B
  confidence: null verdict is render-counted but not Brier-scored.
- … actual like any other PASS falsification.
- /ledger breaks the falsified count down by source (walkback / predicate /
  manual_override / bad_implementation).
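Putting the eligibility rules and the bad_implementation signal together, sample selection might be sketched as below. This is one reading of the rules in this document, not the shipped implementation. latest_wins is a stand-in for scripts.ledger_reduce.reduce, whose real signature is not shown here, and the verdict record keys ("verdict", "timestamp", and so on) are assumed names.

```python
from datetime import timedelta

GRACE = timedelta(days=30)


def latest_wins(falsifications):
    """Reduce an append-only log to the newest entry per ledger_entry_hash.

    Assumption: entries appear in append order, so later ones overwrite.
    """
    latest = {}
    for entry in falsifications:
        latest[entry["ledger_entry_hash"]] = entry
    return latest


def brier_pair(verdict, latest, now):
    """Return (confidence, actual) if the verdict is Brier-eligible, else None."""
    if verdict["verdict"] not in ("PASS", "FAIL"):
        return None  # STAGNATION, ESCALATED, etc. are never scored
    conf = verdict.get("confidence")
    if conf is None or conf < 0.5:
        return None
    if verdict["backfilled"]:
        return None
    if now - verdict["timestamp"] <= GRACE:
        return None  # still inside the 30-day grace period

    fals = latest.get(verdict["ledger_entry_hash"])
    if fals and fals.get("cross_cut"):
        return None
    falsified = bool(fals and fals.get("falsified"))
    # The signal only counts on a PASS; on a FAIL it is ignored.
    bad_impl = (
        falsified
        and fals.get("signal_type") == "bad_implementation"
        and verdict["verdict"] == "PASS"
    )
    # Non-code verdicts enter only through a bad_implementation signal.
    if verdict["artifact_type"] != "code" and not bad_impl:
        return None

    if verdict["verdict"] == "FAIL":
        return (conf, 1)  # at v1 every FAIL scores actual = 1
    return (conf, 0 if falsified else 1)
```

Returning None for ineligible verdicts keeps them out of both the numerator and the denominator when the pairs are later averaged.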
- falsification.jsonl: APPEND-ONLY (L-1/L-9). Each entry:
  {ledger_entry_hash, falsified, falsified_by, confidence, reasoning, cross_cut}
  plus an optional signal_type (manual_override | bad_implementation) on
  manual-attribution-sourced entries.
- brier-rolling.json (gitignored):
  {"<skill>": {"n": int, "brier": float, "last_updated": "<iso>"}, ...}
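The two output files could be written along these lines. This is a hedged sketch: the function names are hypothetical, and rewriting brier-rolling.json in full on every run is an assumption, since this document only states that falsification.jsonl is append-only.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_falsification(ledger_dir, entry):
    """Append one entry to falsification.jsonl; earlier lines are never rewritten."""
    path = Path(ledger_dir) / "falsification.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def write_brier_rolling(ledger_dir, scores):
    """Write brier-rolling.json from a {skill: (n, brier)} mapping."""
    now = datetime.now(timezone.utc).isoformat()
    payload = {
        skill: {"n": n, "brier": brier, "last_updated": now}
        for skill, (n, brier) in scores.items()
    }
    path = Path(ledger_dir) / "brier-rolling.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Opening the log in append mode means a corrected attribution is recorded as a new line, which is what lets a latest-entry-wins reduction pick the current state without editing history.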