From epistemic-skills
Freezes the experiment contract, locks the judge and compute environment, and scaffolds the run before any experiment-shaped command executes. Use after a hypothesis is concrete enough to preregister.
Slash command: /epistemic-skills:preregistration
> **Related skills:** `/skill:research-question`, `/skill:baseline-reproduction`
Preregistration is where ambition loses the right to improvise.
In this repo, prereg is not paperwork. It is the point where you freeze the contract, the evaluator, and the execution surface.
src/gates/prereg.ts currently blocks experiment-shaped bash calls when experiments/{id}/prereg.md is missing.
Treat that as the floor, not the standard.
A real prereg is incomplete until the repository contains the artifacts required by the active computeTarget.
The current state surface in src/state/repo.ts already supports this:
- HypothesisEntry persists bestCaseConclusion
- HypothesisEntry persists computeTarget
- ComputeTarget is local | docker | modal
- writeJudgeLock(...) writes experiments/{id}/judge.lock
- getEnvironmentLock(...) reads experiments/{id}/environment.lock
- computeEnvironmentHash(...) computes the environment lock hash you must write when the target is docker

If the compute target is docker, prereg includes the runtime scaffold:

- experiments/{id}/Dockerfile
- experiments/{id}/requirements.txt
- experiments/{id}/environment.lock

If the compute target is modal, prereg includes:

- experiments/{id}/modal-app.py

If the compute target is local, prereg is still required.
Local is not a loophole.
It just means the environment freeze is descriptive instead of containerized.
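The per-target artifact rules above can be sketched as a small lookup. This is a minimal illustration in Python, not the repo's gate: `required_artifacts` and `missing_artifacts` are hypothetical helpers, and treating `prereg.md` and `judge.lock` as required for every target follows the requirements this skill states for the local case.

```python
from pathlib import Path

# Hypothetical helper names; the repo's real checks live in TypeScript and are not shown here.
ALWAYS_REQUIRED = ("prereg.md", "judge.lock")
TARGET_ARTIFACTS = {
    "local": (),
    "docker": ("Dockerfile", "requirements.txt", "environment.lock"),
    "modal": ("modal-app.py",),
}


def required_artifacts(experiment_id: str, compute_target: str) -> list:
    """List the repo-relative files a prereg needs for the given compute target."""
    if compute_target not in TARGET_ARTIFACTS:
        raise ValueError(f"unsupported computeTarget: {compute_target!r}")
    names = ALWAYS_REQUIRED + TARGET_ARTIFACTS[compute_target]
    return [f"experiments/{experiment_id}/{name}" for name in names]


def missing_artifacts(repo_root: str, experiment_id: str, compute_target: str) -> list:
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [
        path
        for path in required_artifacts(experiment_id, compute_target)
        if not (root / path).is_file()
    ]
```

A non-empty result from `missing_artifacts` means the prereg is incomplete for that target, including for local, where only the two always-required files apply.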
This skill replaces the old split between “write the prereg,” “lock the judge,” and “decide the runtime later.” Do it in one pass. Freeze the contract. Freeze the evaluator. Freeze the execution surface. Then code can run.
NO EXPERIMENT-SHAPED CODE BEFORE THE CONTRACT, LOCKS, AND SCAFFOLD EXIST
No benchmark code. No eval code. No training code. No smoke script that emits evidence you may later quote. No “just checking the docker image.” No “just making sure Modal boots.” No helper that changes what story you can tell later.
If the command can influence what gets claimed, it comes after preregistration.
Use this skill:

- /skill:research-question
- OPEN hypothesis is concrete enough to freeze
- .epistemic/cost-ledger.jsonl
- computeTarget and must be scaffolded cleanly
- experiments/{id}/RESULTS.md

Do not use this skill:

- /skill:research-question
- /skill:baseline-reproduction
- local as permission to skip environment thinking
- runFalsificationAdversary({ claim, cwd, hypothesisId }) from src/adversary/dispatch.ts; that is later

| Surface | API or file | Use at this phase |
|---|---|---|
| Repo sanity check | loadRepoState(cwd) | Confirm the scaffold exists before you assume normal flow |
| Canonical hypothesis registry | HYPOTHESES.md | Stores the active compact contract |
| Read hypotheses | loadHypotheses(cwd) | Load the current experiment registry |
| Parse raw registry text | parseHypotheses(content) | Use only if markdown is already loaded |
| Detect live work | getActiveHypothesis(entries) | Identify the active OPEN or RUNNING hypothesis |
| Render one hypothesis | hypothesisToMarkdown(entry) | Keep the registry parseable |
| Persist registry | saveHypotheses(cwd, entries) | Write the updated hypothesis entry |
| Artifact existence | fileExists(path) | Detect broken or partial prereg state |
| Judge hash | computeJudgeHash(judgeRef, hypothesisId) | Recompute the locked judge hash |
| Judge lock read | getJudgeLock(cwd, hypothesisId) | Detect judge drift before writing |
| Judge lock write | writeJudgeLock(cwd, hypothesisId, judgeRef) | Create experiments/{id}/judge.lock |
| Environment hash | computeEnvironmentHash(...) | Compute the Docker environment lock hash |
| Environment lock read | getEnvironmentLock(cwd, hypothesisId) | Detect environment drift before writing |
| Hypothesis spend | getHypothesisSpend(cwd, id) | Check whether the run already burned money |
| Repo-wide spend | getAllHypothesisSpends(cwd) | Useful when several live ideas compete for budget |
| Cost ledger | .epistemic/cost-ledger.jsonl | Spend policy anchor |
| Docker scaffold | experiments/{id}/Dockerfile, experiments/{id}/requirements.txt | Required when computeTarget = docker |
| Docker environment lock | experiments/{id}/environment.lock | Required when computeTarget = docker |
| Modal scaffold | experiments/{id}/modal-app.py | Required when computeTarget = modal |
| Prereg artifact | experiments/{id}/prereg.md | Narrative contract for the live experiment |
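
The lock rows in the table share one pattern: recompute the hash, compare it with the stored lock, and write only when no lock exists yet. The sketch below illustrates that pattern in Python under a loud assumption: it hashes canonical JSON with SHA-256, while the real computeJudgeHash in src/state/repo.ts is not shown in this document and may hash different inputs. `sketch_judge_hash` and `lock_action` are hypothetical names.

```python
import hashlib
import json
from typing import Optional


def sketch_judge_hash(judge_ref: dict, hypothesis_id: str) -> str:
    """Illustrative hash only: SHA-256 over key-sorted JSON (an assumption, not the repo's algorithm)."""
    payload = json.dumps(
        {"hypothesisId": hypothesis_id, "judgeRef": judge_ref},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lock_action(existing_lock: Optional[str], expected_hash: str) -> str:
    """Decide what to do with a lock file: 'write', 'unchanged', or 'drift'."""
    if existing_lock is None:
        return "write"
    if existing_lock.strip() == expected_hash:
        return "unchanged"
    return "drift"
```

A "drift" result is a stop condition in this sketch: the stored lock and the current judge disagree, so nothing should be overwritten silently.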
Current repo reality matters:
HypothesisEntry now round-trips both bestCaseConclusion and computeTarget.
Do not pretend those are prereg-only notes.
They belong in HYPOTHESES.md and in experiments/{id}/prereg.md.
The research contract still has seven epistemic fields:

- claim
- falsifier
- n
- judge
- baseline
- cost cap
- best-case conclusion

It also has one execution field:

- compute target
The research fields answer whether the claim is testable. The compute target answers what environment must be frozen before code runs.
If any one of these is weak, the prereg is weak. If any one of them is missing, the prereg is incomplete. If the prereg is incomplete, the experiment should not run.
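The completeness rule above can be sketched as a check over one registry entry. This is a hedged illustration in Python: the field names are the registry names this skill lists for HypothesisEntry, the grouping into epistemic and execution fields is an interpretation of the text, and `contract_problems` is a hypothetical helper rather than a repo API.

```python
# Field names follow the registry entry described in this skill; the check is a hypothetical sketch.
EPISTEMIC_FIELDS = (
    "claim", "falsifier", "n", "judgeRef", "baselineRef", "costCap", "bestCaseConclusion",
)
EXECUTION_FIELDS = ("computeTarget",)
COMPUTE_TARGETS = ("local", "docker", "modal")


def contract_problems(entry: dict) -> list:
    """Return reasons the contract is incomplete; an empty list means every field is present."""
    problems = []
    missing = set()
    for field in EPISTEMIC_FIELDS + EXECUTION_FIELDS:
        value = entry.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.add(field)
            problems.append(f"missing {field}")
    if "n" not in missing:
        n = entry["n"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
            problems.append("n must be a positive integer")
    if "computeTarget" not in missing and entry["computeTarget"] not in COMPUTE_TARGETS:
        problems.append("computeTarget must be local, docker, or modal")
    return problems
```

The check only covers presence and basic shape. Whether a falsifier is actually hard or a claim actually testable still needs a human read.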
- Use loadHypotheses(cwd) to read HYPOTHESES.md.
- If the markdown is already loaded, use parseHypotheses(content).
- Use getActiveHypothesis(entries) to identify the live experiment.
- /skill:research-question.
- If more than one OPEN or RUNNING hypothesis is sharing the same result stream, stop.
- experiments/{id}/.
- Use fileExists(path) to check experiments/{id}/prereg.md.
- Use fileExists(path) to check experiments/{id}/judge.lock.
- Read computeTarget from the active hypothesis before deciding what else must exist.
- If computeTarget is docker, also check experiments/{id}/Dockerfile, experiments/{id}/requirements.txt, and experiments/{id}/environment.lock.
- If computeTarget is modal, also check experiments/{id}/modal-app.py.
- If artifacts exist without prereg.md, treat that as broken state.

- Keep the status OPEN while preregistration is being created or repaired.
- Do not call updateHypothesisStatus(...) here.
- src/gates/prereg.ts flips OPEN to RUNNING on the first allowed experiment command.
- prereg.md.

- In src/state/repo.ts, the registry field is n.
- Do not invent a sampleSize field in HYPOTHESES.md.
- n must be a positive integer.
- n must match the actual unit of repetition.
- State whether n means prompts, seeds, tasks, or full runs.
- n = 1 is usually a confession.
- Tie n to the falsifier.
- Tie n to the budget.
- Tie n to the expected runtime.

- Pin model, prompt, temperature, and seed.
- Record temperature as a number.
- Record seed as a number.
- judgeRef.
- Use writeJudgeLock(cwd, hypothesisId, judgeRef) from src/state/repo.ts to write experiments/{id}/judge.lock.
- If getJudgeLock(cwd, hypothesisId) already returns a value, recompute the expected hash with computeJudgeHash(judgeRef, hypothesisId) and compare it.
- OVERRIDES.md before changing it.

- Use loadBaselines(cwd) to inspect known baselines.
- BaselineEntry fields are name, url, score, judge, version, and retrieved.
- Check retrieved with getBaselineAgeDays(entry).
- /skill:baseline-reproduction.

- .epistemic/cost-ledger.jsonl.
- Use getHypothesisSpend(cwd, hypothesisId) to inspect existing spend for this experiment.
- Use getAllHypothesisSpends(cwd) if you need repo-wide context.
- n.
- Do not accept $0, uncapped, or we’ll see.
- appendCostRecord(cwd, record).

- HypothesisEntry persists bestCaseConclusion.
- Record it in HYPOTHESES.md.
- Record it in experiments/{id}/prereg.md.

computeTarget and scaffold the right runtime

- Read computeTarget from the active HypothesisEntry.
- The only valid values are local, docker, and modal.
- Do not invent gpu, cluster, k8s, serverless, or whatever runs fastest.

computeTarget = local

- prereg.md is still required.
- judge.lock is still required.
- Record local in both HYPOTHESES.md and prereg.md.

computeTarget = docker

- Create experiments/{id}/Dockerfile.
- Create experiments/{id}/requirements.txt.
- Use an exact base image such as FROM python:3.11.9-slim.
- Do not use python:3.11, python:latest, or ubuntu:latest.
- In requirements.txt, pin every third-party dependency exactly with package==version.
- Do not use numpy, pandas>=2, or comment-only placeholders.
- An empty requirements.txt is acceptable only if the experiment truly uses the Python standard library only, and the prereg should say so plainly.

FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN python -m pip install --upgrade pip==24.2 \
&& python -m pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "experiments.{id}.run"]

requirements.txt shape is:

numpy==2.1.1
pydantic==2.9.2

- Use computeEnvironmentHash() from src/state/repo.ts.
- Write the hash to experiments/{id}/environment.lock.

- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/Dockerfile
- experiments/{id}/requirements.txt
- experiments/{id}/environment.lock

- If getEnvironmentLock(cwd, hypothesisId) already returns a value, recompute from the current scaffold and compare it.
- OVERRIDES.md first.

computeTarget = modal

- Create experiments/{id}/modal-app.py.
- Use the @modal.app() decorator, matching the chosen Modal execution surface.

import modal
@modal.app()
def app():
    """Experiment app scaffold for {id}."""

- experiments/{id}/.
- Record modal in both HYPOTHESES.md and prereg.md.
- Do not add an environment.lock rule for Modal unless the repo standard changes intentionally.

Update HYPOTHESES.md and write the prereg artifacts

- HYPOTHESES.md is the compact registry.
- Use loadHypotheses(cwd) to load existing entries.
- Update the HypothesisEntry in memory or create one if none exists.
- Set id, claim, falsifier, bestCaseConclusion, n, judgeRef, baselineRef, costCap, computeTarget, status, and timestamp.
- Set status to OPEN.
- Do not set RUNNING here.
- Do not write a format parseHypotheses cannot read.
- Use hypothesisToMarkdown(entry) and saveHypotheses(cwd, entries).
- Write experiments/{id}/prereg.md at the exact path experiments/{id}/prereg.md.
- Create experiments/{id}/ if needed.
- OPEN.
- If computeTarget = docker, include the scaffold file paths and the environment lock hash.
- If computeTarget = modal, include the path to modal-app.py.
- If computeTarget = local, include the local environment assumptions plainly.

# Pre-registration: {id}
- Date: 2026-05-31
- Status: OPEN
- Claim: {claim}
- Falsifier: {falsifier}
- N: {n}
- Baseline reference: {baselineRef}
- Cost cap: ${costCap}
- Best-case conclusion: {bestCaseConclusion}
- Compute target: {computeTarget}
## Judge
- Model: {model}
- Prompt: {prompt}
- Temperature: {temperature}
- Seed: {seed}
## Environment
- Dockerfile: experiments/{id}/Dockerfile
- Requirements: experiments/{id}/requirements.txt
- Environment lock: {environmentHash}
## Notes
- Baseline status: pending local reproduction
- Sample rationale: {why n is enough}

Stage:

- HYPOTHESES.md
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock

If computeTarget = docker, also stage:

- experiments/{id}/Dockerfile
- experiments/{id}/requirements.txt
- experiments/{id}/environment.lock

If computeTarget = modal, also stage:

- experiments/{id}/modal-app.py

- Commit before src/gates/prereg.ts can allow experiment-shaped bash calls.
- Do not call updateHypothesisStatus(cwd, id, "RUNNING").

git add HYPOTHESES.md experiments/{id}/prereg.md experiments/{id}/judge.lock
# docker only:
git add experiments/{id}/Dockerfile experiments/{id}/requirements.txt experiments/{id}/environment.lock
# modal only:
git add experiments/{id}/modal-app.py
git commit -m "epistemic: prereg {id}"

- Do not create experiments/{id}/RESULTS.md yet.
- Do not create experiments/{id}/smokes/ yet.

| Excuse | Reality |
|---|---|
| “I’ll write prereg.md after one smoke run.” | Then the smoke run already contaminated the design. |
| “I only need a quick script.” | Quick scripts still generate evidence. |
| “The falsifier is obvious.” | If it is not written, it will move. |
| “Temperature defaults to zero anyway.” | Defaults drift; pin it. |
| “Seed does not matter for this provider.” | Recording it is cheap and auditable. |
| “I know the baseline from memory.” | Memory is not a reproduced source. |
| “I’ll lock the judge later.” | Later means after outputs existed. |
| “Judge lock is enough; the environment can stay loose.” | Judge drift and environment drift are different failure modes. |
| “We’ll decide between local and Docker after the first run.” | Then computeTarget was never part of the contract. |
| “The Dockerfile can stay on latest for now.” | Floating base images are just environment drift with better branding. |
| “requirements.txt can stay loose; pip will figure it out.” | Loose dependencies are how unreproducible wins get born. |
| “Modal setup is just operational glue.” | Operational changes change what actually ran. |
| “We only spent a little before prereg.” | Reconcile it honestly or kill the run. |
| “Best-case conclusion feels restrictive.” | That is exactly why it matters. |
| “The gate only catches certain commands.” | Integrity is not defined by regex loopholes. |
| “I’ll commit prereg together with the experiment run.” | That destroys ordering and turns prereg into theater. |
| “The baseline is famous enough that we do not need a URL.” | Fame is not provenance. |
| “We can overwrite environment.lock; it’s only scaffolding.” | Scaffolding that changes results is part of the experiment. |
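
The Dockerfile and requirements.txt rows in the table come down to two mechanical checks: every dependency is pinned with package==version, and every base image carries a full version tag. A minimal sketch in Python follows. Both functions are hypothetical helpers, and the base-image test is a heuristic of this sketch (a tag containing a three-part version, or a sha256 digest), not a rule taken from the repo.

```python
import re

# Exact pin: name, optional extras, '==', then a concrete version with no wildcard.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[A-Za-z0-9_,.\s-]+\])?==[A-Za-z0-9.!+_-]+$")
# Heuristic (assumption of this sketch): a frozen tag contains a three-part version like 3.11.9.
FULL_VERSION = re.compile(r"\d+\.\d+\.\d+")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not exact package==version pins."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line and not PINNED.match(line):
            bad.append(line)
    return bad


def floating_base_images(dockerfile_text: str) -> list:
    """Return FROM images whose tag is missing, 'latest', or lacks a full version."""
    bad = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        images = [p for p in parts[1:] if not p.startswith("--")]  # skip flags like --platform
        if not images:
            continue
        image = images[0]
        if "@sha256:" in image:
            continue  # digest-pinned images are treated as frozen
        tag = image.rsplit(":", 1)[1] if ":" in image else ""
        if not FULL_VERSION.search(tag):
            bad.append(image)
    return bad
```

Because the heuristic is blunt, it also flags stage references in multi-stage builds and images whose versioning scheme is not three-part. Treat a hit as a prompt to look, not as a verdict.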

- getHypothesisSpend(cwd, id) is already non-zero before prereg exists.
- getJudgeLock(cwd, id) exists and does not match the current canonical judge.
- getEnvironmentLock(cwd, id) exists and does not match the current Docker scaffold.
- getBaselineAgeDays(entry) says the comparator is older than 30 days.
- n is TBD, until stable, or any other moving target.
- computeTarget is missing or not one of local, docker, modal.
- computeTarget = docker but Dockerfile, requirements.txt, or environment.lock is missing.
- computeTarget = modal but modal-app.py is missing.
- requirements.txt contains unpinned dependencies.
- experiments/{id}/prereg.md contains outputs, screenshots, or smoke numbers.
- Multiple OPEN or RUNNING hypotheses are sharing one result stream.
- You want to overwrite judge.lock because it is inconvenient.
- You want to overwrite environment.lock because the container changed “just a little.”
- You are about to rewrite HYPOTHESES.md into a format parseHypotheses cannot parse.

Claim:
Prompt A improves exact-match over the reproduced zero-shot baseline by at least 2 points on GSM8K under the locked judge.
Falsifier:
If mean exact-match improvement is less than 2 points across n=30 locked runs, the claim is falsified.
Why it is good:
Claim:
Prompt A is smarter and more robust.
Falsifier:
If the model does not truly understand the task or if the benchmark feels unfair.
Why it is bad:
{
"model": "gpt-4.1-mini-2026-04-14",
"prompt": "prompts/gsm8k-judge-v3.md@9f3e2c1",
"temperature": 0,
"seed": 17
}
Why it is good:
- judgeRef.
- writeJudgeLock(cwd, hypothesisId, judgeRef) can lock it cleanly.

{
"model": "latest",
"prompt": "current prompt",
"temperature": "default"
}
Why it is bad:
- latest drifts.
- temperature is not numeric.
- seed is missing.

FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN python -m pip install --upgrade pip==24.2 \
&& python -m pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "experiments.gsm8k_cot.run"]
numpy==2.1.1
pydantic==2.9.2
Why it is good:
- computeEnvironmentHash(...) has a stable scaffold to lock.

FROM python:latest
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "run.py"]
numpy
pydantic>=2
Why it is bad:
- An environment.lock built on this is theater.

Good prereg.md

# Pre-registration: gsm8k-cot-a-vs-zeroshot-2026-05-31
- Date: 2026-05-31
- Status: OPEN
- Claim: Prompt A improves exact-match over the reproduced zero-shot baseline by at least 2 points on GSM8K under the locked judge.
- Falsifier: If mean exact-match improvement is less than 2 points across n=30 locked runs, the claim is falsified.
- N: 30
- Baseline reference: GPT-4o zero-shot, https://example.com/report, score 84.1, judge exact-match, version 2026-05-10, pending reproduction
- Cost cap: $35
- Best-case conclusion: Under the locked judge on GSM8K, prompt A appears better than the reproduced zero-shot baseline.
- Compute target: docker
## Judge
- Model: gpt-4.1-mini-2026-04-14
- Prompt: prompts/gsm8k-judge-v3.md@9f3e2c1
- Temperature: 0
- Seed: 17
## Environment
- Dockerfile: experiments/gsm8k-cot-a-vs-zeroshot-2026-05-31/Dockerfile
- Requirements: experiments/gsm8k-cot-a-vs-zeroshot-2026-05-31/requirements.txt
- Environment lock: 9db0c4f7f5f0c2f57d6e1f5a0d1b4f8a8e4d9f8e4578a6d3e7c8a2b9d1f8c4aa
Why it is good:
Bad prereg.md

# Notes for experiment
- Claim: We think this prompt might be better.
- Falsifier: TBD
- N: start with 3
- Baseline: sota
- Cost cap: maybe $100?
- Compute target: maybe docker later
Why it is bad:
- N is still being negotiated.

Preregistration protects you from your future self. Not the cartoon villain version. The tired, clever, motivated version that can rationalize almost anything after seeing a graph.
A written claim prevents drift.
A hard falsifier prevents rhetoric from replacing evidence.
A declared sample prevents optional stopping.
A locked judge prevents judge-shopping.
A locked compute target prevents platform-shopping.
A locked Docker scaffold prevents “works on my machine” folklore from sneaking into the result.
A named baseline gives /skill:baseline-reproduction a real target.
A cost cap prevents ego from burning money.
A best-case conclusion caps what you are allowed to say even on a good day.
A committed prereg.md gives src/gates/prereg.ts permission to let code run.
If this phase is sloppy, every later phase inherits the slop. If this phase is tight, later disagreement becomes useful instead of political.
After this, use /skill:baseline-reproduction.
npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills