From epistemic-skills
Enforces fresh, repo-backed verification before any publication claim (results, comparisons, conclusions). Blocks claims that rely on recollection rather than current evidence.
Slash command: `/epistemic-skills:verification-before-publication`
> **Related skills:** `/skill:research-question`, `/skill:preregistration`, `/skill:baseline-reproduction`, `/skill:experiment-execution`, `/skill:falsification-review`, `/skill:surprise-triage`, `/skill:kill-or-ship`
Publication is where local sloppiness becomes durable falsehood.
A result becomes publishable only when the entire evidence chain is fresh, current, and repo-backed. The headline number is the last link. Do not inspect the last link and assume the chain holds.
Core principle: if the repository cannot prove the claim now, the claim is not ready now.
Violating the letter of this rule is violating the spirit of this rule.
NO PUBLICATION CLAIMS WITHOUT FRESH, REPO-BACKED VERIFICATION EVIDENCE
Fresh means the current hypothesis entry, the current preregistration, the current judge lock, the current environment lock when compute is containerized, the current dataset revisions, the current baselines, the current falsifier verdicts, the current cost ledger, the current surprise-triage state, the current results files, and the current cross-run lessons.
If you did not run the publication gate in this work session, you do not have publication evidence. You have recollection. Recollection is not verification.
Use this skill when you are about to:
- write or update `RESULTS.md`
- write or update `experiments/{id}/RESULTS.md`
- promote numbers from `smokes/` or `experiments/{id}/smokes/`
- mark a hypothesis `CONFIRMED`
- call `updateHypothesisStatus(cwd, id, "CONFIRMED")`
- say that a result `beats`, `outperforms`, `matches`, `regresses`, or `fails to beat` something

Do not use this skill in place of:

- `/skill:research-question`
- `/skill:preregistration`
- `/skill:baseline-reproduction`
- `/skill:experiment-execution`
- `/skill:falsification-review`
- `/skill:surprise-triage`
- `/skill:kill-or-ship`

BEFORE any publication claim:
1. IDENTIFY the exact hypothesis and exact sentence.
2. LOAD the authoritative repo state.
3. VERIFY every dependency under the claim.
4. READ the actual files and outputs.
5. DECIDE: publishable or blocked.
6. ONLY THEN make the claim.
Skip any step = not verified.
Run a comfortable subset = not verified.
Reuse old output = not verified.
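The gate above can be read as a pure decision over what was actually done in the current session. The sketch below is illustrative only: `GateStep`, `StepRecord`, and `decideGate` are hypothetical names, not part of the skill's API, and the function itself plays the role of step 5 (DECIDE) over steps 1 to 4.

```typescript
// Hypothetical sketch: none of these names come from the skill's API.
type GateStep = "identify" | "load" | "verify" | "read";

interface StepRecord {
  step: GateStep;
  ranThisSession: boolean; // false means the output was reused from earlier work
  passed: boolean;
}

const REQUIRED_STEPS: GateStep[] = ["identify", "load", "verify", "read"];

type GateDecision =
  | { status: "publishable" }
  | { status: "blocked"; reasons: string[] };

// A skipped step, a reused output, or a failed step each block the claim.
function decideGate(records: StepRecord[]): GateDecision {
  const reasons: string[] = [];
  for (const step of REQUIRED_STEPS) {
    const record = records.find((r) => r.step === step);
    if (!record) {
      reasons.push(`${step}: skipped`);
    } else if (!record.ranThisSession) {
      reasons.push(`${step}: reused old output`);
    } else if (!record.passed) {
      reasons.push(`${step}: failed`);
    }
  }
  return reasons.length === 0
    ? { status: "publishable" }
    : { status: "blocked", reasons };
}
```

Returning the list of reasons, rather than a bare boolean, keeps the blocked state explainable when the claim is challenged later.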
| Need | API or file | Rule |
|---|---|---|
| Repo scaffold | loadRepoState(cwd) | Start from current repo state, not memory |
| Hypothesis registry | loadHypotheses(cwd), parseHypotheses(content), getActiveHypothesis(entries) | Identify the exact live record |
| Hypothesis persistence | hypothesisToMarkdown(h), saveHypotheses(cwd, entries), updateHypothesisStatus(cwd, id, status) | Status follows evidence, never aspiration |
| Baseline metadata | loadBaselines(cwd), getBaselineAgeDays(b) | Stale baselines cannot support publication comparisons |
| Spend totals | getHypothesisSpend(cwd, id), getHypothesisSpendByCategory(cwd, id), getAllHypothesisSpends(cwd) | Compare total and category spend against the cap before claiming success |
| Ledger repair | appendCostRecord(cwd, record) | Only for real omitted events, never fiction |
| Judge lock | computeJudgeHash(judgeRef, id), getJudgeLock(cwd, id), writeJudgeLock(cwd, id, judgeRef) | Recompute and compare; existence alone proves nothing |
| Environment lock | getEnvironmentLock(cwd, id), computeEnvironmentHash(...) | Required when computeTarget is docker or modal |
| Cross-run lessons | loadLessons(cwd), summarizeLessons(lessons) | Review prior killed or pivoted patterns before finalizing |
| Artifact checks | fileExists(path) | Missing files block publication when they are required evidence |
| Adversarial review | runFalsificationAdversary({ claim, cwd, hypothesisId }) | Missing or stale verdicts must be refreshed before publication |
| Canonical files | HYPOTHESES.md, BASELINES.md, RESULTS.md, OVERRIDES.md, .epistemic/cost-ledger.jsonl, alternatives/, experiments/{id}/... | Publication is a repo-level decision, not a headline-only decision |
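The baseline rule in the table can be made concrete with a small freshness filter. This is a sketch under assumptions: `BaselineRecord`, `baselineAgeDays`, and `staleOrUnreproduced` are hypothetical, the real records returned by `loadBaselines(cwd)` may look different, and the 30-day limit is the one this skill states for baselines.

```typescript
// Hypothetical shapes: the real baseline record and getBaselineAgeDays may differ.
interface BaselineRecord {
  name: string;
  reproducedAt: string | null; // ISO timestamp of the last reproduction, null if only cited
}

const MAX_BASELINE_AGE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age in days since the last reproduction, or null when it was never reproduced.
function baselineAgeDays(b: BaselineRecord, now: Date): number | null {
  if (b.reproducedAt === null) return null;
  return (now.getTime() - new Date(b.reproducedAt).getTime()) / MS_PER_DAY;
}

// Names of baselines that cannot support a publication comparison.
function staleOrUnreproduced(baselines: BaselineRecord[], now: Date): string[] {
  return baselines
    .filter((b) => {
      const age = baselineAgeDays(b, now);
      return age === null || age >= MAX_BASELINE_AGE_DAYS;
    })
    .map((b) => b.name);
}
```

Passing `now` in explicitly keeps the check deterministic, which makes it easier to rerun the same decision against the same repo state.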
Before any publication claim, every box below MUST be checked:
- [ ] `experiments/{id}/prereg.md` exists and still matches the claim being published.
- [ ] `computeJudgeHash(judgeRef, id)` matches `getJudgeLock(cwd, id)` and `experiments/{id}/judge.lock` exists.
- [ ] `environment.lock` matches the current `Dockerfile` + `requirements.txt` when `computeTarget` is `docker` or `modal`.
- [ ] No `main`, `latest`, or implied moving snapshot survives into the claim.
- [ ] Baselines in `BASELINES.md` are fresh, reproduced, and younger than 30 days.
- [ ] Competing hypotheses are documented in `alternatives/`.
- [ ] Falsifier verdicts are recorded in `experiments/{id}/falsifiers/{model}.md` and actually evaluated.
- [ ] No claimed number comes straight from `smokes/` or `experiments/{id}/smokes/`.
- [ ] `.epistemic/cost-ledger.jsonl` is current and the spend still fits the governing decision.
- [ ] The ledger records `category: "compute"` whenever compute was used.
- [ ] Total spend is within `costCap`; if total spend is above 80% of the cap without confirmed results, that warning has been surfaced explicitly.
- [ ] If `summarizeLessons()` shows similar prior failure patterns, explicit justification explains why this hypothesis is materially different.
- [ ] `RESULTS.md` and `experiments/{id}/RESULTS.md` contain confirmed, falsification-passed results only.

Work through the process in order:

- Call `loadHypotheses(cwd)`.
- Call `getActiveHypothesis(entries)`.
- Otherwise, select the `id` explicitly.
- Read the `HypothesisEntry` and capture `claim`, `falsifier`, `bestCaseConclusion`, `judgeRef`, `baselineRef`, `costCap`, `computeTarget`, and `status`.
- Call `loadRepoState(cwd)`.
- Read `HYPOTHESES.md`, `BASELINES.md`, and root `RESULTS.md`.
- Read `experiments/{id}/RESULTS.md` when it exists.
- Read `OVERRIDES.md` when any gate was bypassed.
- Read `experiments/{id}/KILLED.md` when the hypothesis ever looked terminal.
- Inspect `alternatives/` and confirm the competing explanations from question formation were preserved instead of silently forgotten.
- If `HYPOTHESES.md` looks manually edited or suspicious, rerun `parseHypotheses(content)` on the raw markdown.
- Read `experiments/{id}/prereg.md`.
- Confirm required artifacts with `fileExists(path)`.
- Take `judgeRef` from the hypothesis entry.
- Call `getJudgeLock(cwd, id)`.
- Recompute `computeJudgeHash(judgeRef, id)`.
- If `computeTarget` is `docker` or `modal`, read `getEnvironmentLock(cwd, id)` and compare it against the current environment hash derived from the active `Dockerfile` and `requirements.txt`.
- Reject `main`, `latest`, and implicit current-state pulls.
- Call `loadBaselines(cwd)`.
- Check `getBaselineAgeDays(entry)`.
- Read `experiments/{id}/baselines/{name}.md` when present.
- If the claim says `beats X`, verify X was reproduced, not merely cited.
- Inspect `alternatives/` and confirm the competing hypotheses from the research-question phase are documented there.
- Inspect `experiments/{id}/falsifiers/`.
- Read each `experiments/{id}/falsifiers/{model}.md` file.
- If verdicts are missing or stale, run `runFalsificationAdversary({ claim, cwd, hypothesisId: id })`.
- Read the `AdversaryVerdict` records before relying on them.
- If a verdict is `falsified-or-unreproducible`, publication is blocked.
- If a verdict is `cannot-audit`, publication is blocked until the audit gap is fixed or explicitly overridden.
- If a verdict is `caveat-required`, the caveat must travel with the claim.
- Inspect `smokes/` and `experiments/{id}/smokes/`.
- […] `/skill:surprise-triage` first.
- Read `.epistemic/cost-ledger.jsonl`.
- Call `getHypothesisSpend(cwd, id)` for the total.
- Call `getHypothesisSpendByCategory(cwd, id)` for the split between `llm` and `compute`.
- Call `getAllHypothesisSpends(cwd)`.
- Confirm compute events are recorded with `category: "compute"`.
- Compare `totalSpend = llm + compute` against `costCap`.
- If spend is above 80% of the cap with no confirmed results in `experiments/{id}/RESULTS.md`, surface the warning explicitly: `This hypothesis has already consumed more than 80% of its cap without confirmed results. Are you resolving uncertainty, or defending sunk cost?`
- Use `appendCostRecord(...)` only to record a real omitted event.
- Call `loadLessons(cwd)` from `src/state/repo.ts`.
- Call `summarizeLessons(lessons)`.
- Surface warnings such as: `Your last 3 hypotheses failed at cost overrun — are you confident this one is different?`
- Treat `COST_OVERRUN`, `UNREPRODUCIBLE_BASELINE`, judge drift, stale environment, or clearly similar root causes as a pattern match.
- If `summarizeLessons()` shows similar failure patterns, require explicit justification before finalizing.
- […] `won/lost` language.
- Read `RESULTS.md`.
- Read `experiments/{id}/RESULTS.md` when it exists.
- Only after the full suite passes may `updateHypothesisStatus(cwd, id, "CONFIRMED")` be called when warranted.
- If the claim does not survive, use `/skill:kill-or-ship` and write `experiments/{id}/KILLED.md`.

Run the whole suite every time. Do not skip the checks that feel administrative. Those are the checks that stop embarrassing claims.
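The three verdict outcomes named above can be folded into a single decision. The following is a minimal sketch, not the skill's implementation: `VerdictRecord`, `PublicationDecision`, and `decideFromVerdicts` are hypothetical, the `"pass"` label is invented for illustration, and the real `AdversaryVerdict` shape may differ.

```typescript
// The first three verdict labels come from the text; "pass" and all other names are hypothetical.
type Verdict = "falsified-or-unreproducible" | "cannot-audit" | "caveat-required" | "pass";

interface VerdictRecord {
  model: string;
  verdict: Verdict;
  caveat?: string;
  overridden?: boolean; // assumed to mirror an explicit entry in OVERRIDES.md
}

interface PublicationDecision {
  blocked: boolean;
  reasons: string[];
  caveats: string[]; // caveats that must travel with the claim
}

function decideFromVerdicts(records: VerdictRecord[]): PublicationDecision {
  const reasons: string[] = [];
  const caveats: string[] = [];
  // Missing verdicts are treated as a block, not as silent approval.
  if (records.length === 0) reasons.push("no falsifier verdicts on record");
  for (const r of records) {
    if (r.verdict === "falsified-or-unreproducible") {
      reasons.push(`${r.model}: falsified or unreproducible`);
    } else if (r.verdict === "cannot-audit" && !r.overridden) {
      reasons.push(`${r.model}: audit gap not fixed or overridden`);
    } else if (r.verdict === "caveat-required") {
      caveats.push(r.caveat ?? `${r.model}: caveat required (text missing)`);
    }
  }
  return { blocked: reasons.length > 0, reasons, caveats };
}
```

Keeping caveats separate from blocking reasons lets an unblocked claim still carry the wording it owes the reader.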
1. loadHypotheses(cwd)
2. getActiveHypothesis(entries) or select explicit id
3. loadRepoState(cwd)
4. read HYPOTHESES.md, BASELINES.md, RESULTS.md, OVERRIDES.md when relevant
5. inspect alternatives/
6. fileExists(`experiments/${id}/prereg.md`)
7. read `experiments/${id}/prereg.md`
8. getJudgeLock(cwd, id)
9. computeJudgeHash(judgeRef, id)
10. if docker/modal: getEnvironmentLock(cwd, id) + compare against current Dockerfile/requirements hash
11. verify HF dataset revisions are pinned and match the run artifacts
12. loadBaselines(cwd)
13. getBaselineAgeDays(b) for each required baseline
14. read `experiments/${id}/baselines/{name}.md` when present
15. read `experiments/${id}/falsifiers/{model}.md`
16. runFalsificationAdversary(...) if verdicts are missing or stale
17. inspect `smokes/` and `experiments/${id}/smokes/`
18. read `.epistemic/cost-ledger.jsonl`
19. getHypothesisSpend(cwd, id)
20. getHypothesisSpendByCategory(cwd, id)
21. call loadLessons(cwd)
22. call summarizeLessons(lessons) when lessons exist
23. verify statistical rigor checks
24. read `experiments/${id}/RESULTS.md` when present
25. update status only after the full suite passes
26. only then make the claim
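Steps 19 and 20 feed a budget decision that depends on both the cap and whether confirmed results exist. A minimal sketch of that decision follows, assuming a spend object with `llm` and `compute` fields as the text describes; `SpendByCategory`, `BudgetState`, and `budgetState` are hypothetical names, not the skill's API.

```typescript
// Hypothetical sketch; field names mirror the text, the function does not exist in the skill.
interface SpendByCategory {
  llm: number;
  compute: number;
}

type BudgetState = "ok" | "warn" | "over-cap";

const WARN_FRACTION = 0.8; // the 80% threshold stated by the skill

function budgetState(
  spend: SpendByCategory,
  costCap: number,
  hasConfirmedResults: boolean,
): BudgetState {
  // Compute spend counts toward the total exactly like LLM spend.
  const total = spend.llm + spend.compute;
  if (total > costCap) return "over-cap";
  if (total > costCap * WARN_FRACTION && !hasConfirmedResults) return "warn";
  return "ok";
}
```

A `"warn"` result is meant to be surfaced explicitly to the person making the claim, not logged and forgotten.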
| Failure | What actually happened | Correct response |
|---|---|---|
| Claimed from RESULTS.md alone | Verified the headline, not the chain | Run the full suite |
| Verified judge.lock and stopped | Checked one lock and ignored the rest of the contract | Continue through environment, datasets, baselines, costs, and falsifiers |
| Environment drifted under Docker or Modal | Same code ran in a different dependency world | Recompute the environment lock and block publication |
| Used an unpinned HF dataset revision | Compared against a moving target | Pin and verify the exact dataset revision |
| Used a cited baseline as a reproduced baseline | Collapsed sourcing into reproduction | Reproduce or drop the comparison |
| Logged only LLM spend | Hid the real bill by ignoring compute | Record compute rows with category: "compute" |
| Burned >80% of cap with no confirmed result and kept pushing | Let sunk cost masquerade as confidence | Surface the warning and reassess the claim |
| Ignored cross-run lessons | Repeated an old failure with cleaner prose | Review loadLessons() and require justification |
| Wrote no competing hypotheses to alternatives/ | Pretended the winning story had no rivals | Document the rivals or narrow the claim |
| Reported significance without effect sizes or corrections | Turned statistics into decoration | Repair the analysis before publication |
| Copied a smoke number into a final file | Promoted provisional evidence | Triage first |
| Marked status CONFIRMED early | Used state as aspiration | Let evidence determine status |
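The unpinned-dataset failure in the table can be caught mechanically. The sketch below assumes, as a deliberately strict policy, that a revision counts as pinned only when it is a full 40-character hexadecimal commit hash; branch names such as `main` and tags can move. `isPinnedRevision` and `unpinnedDatasets` are hypothetical helpers, not part of the skill.

```typescript
// Assumption: only a full 40-character hex commit hash counts as pinned.
const FULL_COMMIT_HASH = /^[0-9a-f]{40}$/;

function isPinnedRevision(revision: string | undefined): boolean {
  return revision !== undefined && FULL_COMMIT_HASH.test(revision);
}

// Maps dataset name -> recorded revision and returns the names that are not pinned.
function unpinnedDatasets(revisions: Record<string, string | undefined>): string[] {
  return Object.keys(revisions).filter((name) => !isPinnedRevision(revisions[name]));
}
```

Any name returned here would block the claim until the exact revision is recorded and checked against the run artifacts.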
| Excuse | Reality |
|---|---|
| The result is already written down. | Files can be wrong. Verify the evidence chain. |
| The judge only changed a little. | Small drift is still drift. |
| The container only changed a little. | Different dependencies are different method conditions. |
| The dataset revision is probably the same. | Probably is not reproducibility. Pin it. |
| The baseline is famous. | Fame is not reproduction. |
| Compute is infrastructure, not experiment cost. | If it burned money to produce evidence, it is experiment cost. |
| We are still under cap, technically. | 81% burned with no confirmed result is already a warning sign. |
| The old failures were about different wording. | Pattern repetition matters more than wording repetition. |
| Those lessons do not really apply. | Then write the explicit justification for why this run is different. |
| The alternatives were obvious. | If they are not in alternatives/, they are not part of the record. |
| The p-value is enough. | Without effect size discipline and correction policy, it is not enough. |
| The smoke run looks stable. | Smoke evidence is still provisional. |
| This is just an internal summary. | Internal falsehood becomes external fast. |
| I only changed wording. | Wording can widen the claim faster than evidence can support it. |
Stop immediately if any of these thoughts show up:
- Probably ready.
- Good enough.
- Close enough.
- The environment drift is cosmetic.
- The dataset pin can wait until release.
- Compute spend does not need to be itemized.
- We have enough left in the cap, probably.
- The old lessons are not relevant this time.
- I do not need to show the alternatives.
- The p-value is all anyone will ask for.
- The smoke number is basically confirmed.
- I can fix the ledger afterward.
- Nobody will ask how this was verified.
- I need the conclusion now.

All of these mean the same thing: you are trying to publish faster than you are verifying.
```typescript
// Assumes the skill's helpers (loadHypotheses, getJudgeLock, ...) and cwd are already in scope.
const entries = await loadHypotheses(cwd);
const h = getActiveHypothesis(entries);
if (!h) throw new Error("No active hypothesis");

// Recompute the judge hash and compare it; a lock that merely exists proves nothing.
const judgeLock = await getJudgeLock(cwd, h.id);
const expectedJudge = computeJudgeHash(h.judgeRef, h.id);
if (!judgeLock || judgeLock !== expectedJudge) {
  throw new Error(`Judge drift for ${h.id}`);
}

// Containerized compute targets also need an environment lock.
if (h.computeTarget === "docker" || h.computeTarget === "modal") {
  const envLock = await getEnvironmentLock(cwd, h.id);
  if (!envLock) throw new Error(`Missing environment.lock for ${h.id}`);
}

// Budget pressure counts LLM and compute spend together.
const spend = await getHypothesisSpendByCategory(cwd, h.id);
const total = spend.llm + spend.compute;
if (total > h.costCap * 0.8) {
  console.warn("High budget pressure before publication review.");
}
```
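```typescript
// Bad: checks only that a lock exists, then guesses about the budget and the dataset.
if (await getJudgeLock(cwd, id)) console.log("lock exists");
console.log("budget should still be fine");
console.log("the dataset is probably unchanged");
```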
```typescript
const lessons = await loadLessons(cwd);
const lessonSummary = summarizeLessons(lessons);
console.log(lessonSummary);
// If the pattern rhymes with this run, require explicit justification before publishing.
```
Good because: publication is being tested against prior repo memory, not just the current run's optimism.
This result looks stronger than the last few, so I do not need to review prior failures.
Bad because: that is how repeated failure patterns get renamed into progress.
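One way to make "the pattern rhymes" less subjective is to count how often each failure mode recurs in the lesson log. This is only a sketch under assumptions: the `Lesson` shape, the `recurringFailureModes` helper, and the default threshold of two occurrences are hypothetical, and the records returned by `loadLessons(cwd)` may be structured differently.

```typescript
// Hypothetical lesson shape; the real records from loadLessons(cwd) may differ.
interface Lesson {
  hypothesisId: string;
  failureMode: string; // e.g. "COST_OVERRUN" or "UNREPRODUCIBLE_BASELINE"
}

// Returns failure modes seen at least minCount times, sorted by name.
// The default of 2 is an illustrative assumption, not a rule from the skill.
function recurringFailureModes(lessons: Lesson[], minCount = 2): string[] {
  const counts: Record<string, number> = {};
  for (const lesson of lessons) {
    counts[lesson.failureMode] = (counts[lesson.failureMode] ?? 0) + 1;
  }
  return Object.keys(counts)
    .filter((mode) => (counts[mode] ?? 0) >= minCount)
    .sort();
}
```

If the current hypothesis is exposed to any mode this returns, that is the point at which the explicit justification would be required.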
Publication failures rarely start with a giant bug. They start with one skipped check, then another, then a sentence that sounded safe because the number looked clean.
A judge mismatch turns the result into a different experiment. An environment mismatch turns reproducibility into theater. A drifting dataset revision turns a benchmark into a moving target. A stale baseline turns comparison into fiction. An ignored falsifier turns review into ceremony. A missing compute row turns cost governance into storytelling. An unreviewed lesson log turns recurring failure into institutional amnesia. An undocumented alternative turns one surviving story into an overclaimed conclusion. Weak statistical discipline turns a number into decoration.
This gate exists to stop last-mile dishonesty.
Not dramatic dishonesty.
Ordinary rushed dishonesty.
The kind that says we basically verified it.
Do not do that.
A claim is ready when it survives the full suite. Not when it feels ready. Not when the graph looks nice. Not when the cap is almost gone. Not when the previous failures are inconvenient.
Read the repo. Run the gate. Surface the warnings. Require the justification. Then publish.
npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills