Skill

preregistration

Freezes the experiment contract, locks the judge and compute environment, and scaffolds the run before any experiment-shaped command executes. Use after a hypothesis is concrete enough to preregister.

developer-tools

Invocation

Slash command

/epistemic-skills:preregistration

User invocable

Model invocable

Context Preview

> **Related skills:** `/skill:research-question`, `/skill:baseline-reproduction`

SKILL.md

734 lines · ~7.8k tokens (exceeds 5k compaction limit)

Stats

Language: TypeScript
Stars: 7
Maintenance: Excellent
Last commit: Jun 4, 2026

Preregistration

Overview

Preregistration is where ambition loses the right to improvise.

In this repo, prereg is not paperwork. It is the point where you freeze:

the claim
the falsifier
the sample
the judge
the baseline target
the budget
the best-case conclusion
the compute target and, when needed, the compute environment

src/gates/prereg.ts currently blocks experiment-shaped bash calls when experiments/{id}/prereg.md is missing. Treat that as the floor, not the standard. A real prereg is incomplete until the repository contains the artifacts required by the active computeTarget.

The current state surface in src/state/repo.ts already supports this:

HypothesisEntry persists bestCaseConclusion
HypothesisEntry persists computeTarget
ComputeTarget is local | docker | modal
writeJudgeLock(...) writes experiments/{id}/judge.lock
getEnvironmentLock(...) reads experiments/{id}/environment.lock
computeEnvironmentHash(...) computes the environment lock hash you must write when the target is docker

If the compute target is docker, prereg includes the runtime scaffold:

experiments/{id}/Dockerfile
experiments/{id}/requirements.txt
experiments/{id}/environment.lock

If the compute target is modal, prereg includes:

experiments/{id}/modal-app.py

If the compute target is local, prereg is still required. Local is not a loophole. It just means the environment freeze is descriptive instead of containerized.
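
The routing above can be sketched as a lookup from compute target to required files. This is an illustration only: `requiredArtifacts` is a hypothetical helper, not an export of src/state/repo.ts, and the only things taken from this document are the paths and the rule that prereg.md and judge.lock are required for every target.

```typescript
// Hypothetical helper for illustration; not part of src/state/repo.ts.
type ComputeTarget = "local" | "docker" | "modal";

// Files that must exist under experiments/{id}/ before experiment-shaped code runs.
function requiredArtifacts(id: string, target: ComputeTarget): string[] {
  const root = `experiments/${id}`;
  // Every target needs the narrative contract and the judge lock.
  const always = [`${root}/prereg.md`, `${root}/judge.lock`];
  switch (target) {
    case "docker":
      return [
        ...always,
        `${root}/Dockerfile`,
        `${root}/requirements.txt`,
        `${root}/environment.lock`,
      ];
    case "modal":
      return [...always, `${root}/modal-app.py`];
    case "local":
      // Local adds no files; its environment freeze is descriptive and lives in prereg.md.
      return always;
  }
}
```

Keeping the list in one function means the pre-run check and the commit step can agree on the same set of paths.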

This skill replaces the old split between “write the prereg,” “lock the judge,” and “decide the runtime later.” Do it in one pass. Freeze the contract. Freeze the evaluator. Freeze the execution surface. Then code can run.

The Iron Law

NO EXPERIMENT-SHAPED CODE BEFORE THE CONTRACT, LOCKS, AND SCAFFOLD EXIST

No benchmark code. No eval code. No training code. No smoke script that emits evidence you may later quote. No “just checking the docker image.” No “just making sure Modal boots.” No helper that changes what story you can tell later.

If the command can influence what gets claimed, it comes after preregistration.

When to Use

Use this skill:

immediately after /skill:research-question
when an OPEN hypothesis is concrete enough to freeze
before the first benchmark run
before the first judge call
before the first training job
before the first scripted comparison
before any spend that will land in .epistemic/cost-ledger.jsonl
when reviving an old idea under an existing hypothesis ID
when the active hypothesis now has a real computeTarget and must be scaffolded cleanly
before writing anything that might later land in experiments/{id}/RESULTS.md

When NOT to Use

Do not use this skill:

for vague brainstorming; use /skill:research-question
after you already ran the experiment
to backfill paperwork for a dirty run
to rewrite the claim after seeing smoke results
to change the compute target midstream because one platform now looks friendlier
as a substitute for /skill:baseline-reproduction
as a substitute for falsification review
to sneak in a new judge, new environment, or new dependency after outputs exist
to treat local as permission to skip environment thinking
to call runFalsificationAdversary({ claim, cwd, hypothesisId }) from src/adversary/dispatch.ts; that is later

State Surface

| Surface | API or file | Use at this phase |
| --- | --- | --- |
| Repo sanity check | `loadRepoState(cwd)` | Confirm the scaffold exists before you assume normal flow |
| Canonical hypothesis registry | `HYPOTHESES.md` | Stores the active compact contract |
| Read hypotheses | `loadHypotheses(cwd)` | Load the current experiment registry |
| Parse raw registry text | `parseHypotheses(content)` | Use only if markdown is already loaded |
| Detect live work | `getActiveHypothesis(entries)` | Identify the active `OPEN` or `RUNNING` hypothesis |
| Render one hypothesis | `hypothesisToMarkdown(entry)` | Keep the registry parseable |
| Persist registry | `saveHypotheses(cwd, entries)` | Write the updated hypothesis entry |
| Artifact existence | `fileExists(path)` | Detect broken or partial prereg state |
| Judge hash | `computeJudgeHash(judgeRef, hypothesisId)` | Recompute the locked judge hash |
| Judge lock read | `getJudgeLock(cwd, hypothesisId)` | Detect judge drift before writing |
| Judge lock write | `writeJudgeLock(cwd, hypothesisId, judgeRef)` | Create `experiments/{id}/judge.lock` |
| Environment hash | `computeEnvironmentHash(...)` | Compute the Docker environment lock hash |
| Environment lock read | `getEnvironmentLock(cwd, hypothesisId)` | Detect environment drift before writing |
| Hypothesis spend | `getHypothesisSpend(cwd, id)` | Check whether the run already burned money |
| Repo-wide spend | `getAllHypothesisSpends(cwd)` | Useful when several live ideas compete for budget |
| Cost ledger | `.epistemic/cost-ledger.jsonl` | Spend policy anchor |
| Docker scaffold | `experiments/{id}/Dockerfile`, `experiments/{id}/requirements.txt` | Required when `computeTarget = docker` |
| Docker environment lock | `experiments/{id}/environment.lock` | Required when `computeTarget = docker` |
| Modal scaffold | `experiments/{id}/modal-app.py` | Required when `computeTarget = modal` |
| Prereg artifact | `experiments/{id}/prereg.md` | Narrative contract for the live experiment |

Current repo reality matters: HypothesisEntry now round-trips both bestCaseConclusion and computeTarget. Do not pretend those are prereg-only notes. They belong in HYPOTHESES.md and in experiments/{id}/prereg.md.

The Contract Shape

The research contract still has seven epistemic fields:

Claim
Falsifier
Sample size
Judge configuration
Baseline reference
Cost cap
Best-case conclusion

It also has one execution field:

Compute target

The research fields answer whether the claim is testable. The compute target answers what environment must be frozen before code runs.

If any one of these is weak, the prereg is weak. If any one of them is missing, the prereg is incomplete. If the prereg is incomplete, the experiment should not run.

The Process

1. Resolve repository state before you write anything

Start from the repo, not from memory.
Use loadHypotheses(cwd) to read HYPOTHESES.md.
If raw markdown is already loaded, use parseHypotheses(content).
Use getActiveHypothesis(entries) to identify the live experiment.
If there is no active hypothesis and the idea is still vague, go back to /skill:research-question.
If there is no active hypothesis but the idea is already concrete, create the entry here through the canonical helpers.
If there is more than one OPEN or RUNNING hypothesis sharing the same result stream, stop.
One prereg belongs to one experiment ID.
Confirm the directory root is experiments/{id}/.
Use fileExists(path) to check experiments/{id}/prereg.md.
Use fileExists(path) to check experiments/{id}/judge.lock.
Read computeTarget from the active hypothesis before deciding what else must exist.
If computeTarget is docker, also check experiments/{id}/Dockerfile, experiments/{id}/requirements.txt, and experiments/{id}/environment.lock.
If computeTarget is modal, also check experiments/{id}/modal-app.py.
If any lock or scaffold exists without prereg.md, treat that as broken state.
Keep status OPEN while preregistration is being created or repaired.
Do not call updateHypothesisStatus(...) here.
The gate in src/gates/prereg.ts flips OPEN to RUNNING on the first allowed experiment command.
The gate currently enforces only prereg.md.
Your process is stricter than the gate.
Methodology is not defined by the regex floor.
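
The broken-state rule in this step can be sketched as a pure classification over the file names found in experiments/{id}/. The function name, the state labels, and the idea of passing in a plain list of names are assumptions made for illustration; in the repo the existence checks would go through fileExists(path).

```typescript
// Hypothetical classification of a prereg directory; names are illustrative.
type PreregDirState = "empty" | "broken" | "prereg-present";

// Lock and scaffold files named in this skill.
const LOCKS_AND_SCAFFOLD = [
  "judge.lock",
  "environment.lock",
  "Dockerfile",
  "requirements.txt",
  "modal-app.py",
];

// `existing` holds the file names found directly under experiments/{id}/.
function classifyPreregDir(existing: string[]): PreregDirState {
  if (existing.indexOf("prereg.md") !== -1) {
    // The target-specific scaffold check still has to run after this.
    return "prereg-present";
  }
  const strays = existing.filter((name) => LOCKS_AND_SCAFFOLD.indexOf(name) !== -1);
  // A lock or scaffold without prereg.md means pieces were frozen without the contract.
  return strays.length > 0 ? "broken" : "empty";
}
```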

2. Field 1 of 7 — validate the claim

A claim is not a slogan.
It must name the intervention.
It must name the comparator.
It must name the metric.
It must name the task, benchmark, dataset, or slice.
It must name the direction of change.
If a threshold matters, write the threshold now.
Reject claims like “this is better.”
Reject claims like “more robust” with no metric.
Reject claims like “users will love it” with no observable criterion.
Reject bundled claims that require multiple experiments to test.
Good claims are narrow enough to fail cleanly.
Bad claims can only be defended with interpretation.
Write the claim so later review can attack it without reading your mind.
If the claim cannot later become a one-line result statement, it is still mush.
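
One way to apply the checklist above is to write the claim as structured parts first and only then render it as a sentence. The sketch below is hypothetical: `ClaimParts`, `missingClaimParts`, and `claimToLine` are illustrative names, and how the registry actually stores the claim is not shown in this document.

```typescript
// Illustrative shape only; not the repo's HypothesisEntry.
interface ClaimParts {
  intervention: string;
  comparator: string;
  metric: string;
  task: string; // task, benchmark, dataset, or slice
  direction: "increase" | "decrease";
  threshold?: string; // e.g. "at least 2 points"; needed only if a threshold matters
}

// Returns the names of required parts that are still blank.
function missingClaimParts(parts: ClaimParts): string[] {
  const required: Array<[string, string]> = [
    ["intervention", parts.intervention],
    ["comparator", parts.comparator],
    ["metric", parts.metric],
    ["task", parts.task],
  ];
  return required.filter(([, value]) => value.trim() === "").map(([name]) => name);
}

// Renders the one-line statement the claim must be able to become.
function claimToLine(parts: ClaimParts): string {
  const verb = parts.direction === "increase" ? "improves" : "reduces";
  const by = parts.threshold ? ` by ${parts.threshold}` : "";
  return `${parts.intervention} ${verb} ${parts.metric} over ${parts.comparator}${by} on ${parts.task}.`;
}
```

A claim like “this is better” fails immediately here, because there is nothing to put in the comparator, metric, or task slots.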

3. Field 2 of 7 — validate the falsifier

The falsifier is the condition that kills the claim.
If the claim cannot be killed, it is not a research claim.
The falsifier must be empirical.
The falsifier must be reachable by the planned experiment.
The falsifier must not depend on vibes, elegance, intent, or worldview.
Reject philosophical non-falsifiers.
Reject “if the model does not truly understand.”
Reject “if users do not spiritually resonate.”
Reject “if the approach is not elegant enough.”
Reject moving-goal clauses like “unless the seed was unlucky.”
A valid falsifier sounds like a stop condition.
Example: “If mean exact-match improvement is less than 2 points across n=30 runs, the claim is falsified.”
Example: “If pass@1 is not higher than baseline under the locked judge, the claim is falsified.”
Example: “If cost-normalized win rate does not exceed the named baseline by at least 5 percentage points, the claim is falsified.”
If two hostile reviewers would not agree on the falsifier, it is still weak.
Fix this before you write anything else.
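
The first example falsifier above is mechanical enough to write as code, which is a useful test of whether a falsifier is really a stop condition. This is a sketch under assumptions: the function name and the choice to pass per-run improvements as an array of point differences are illustrative, not part of the repo.

```typescript
// Sketch of "mean exact-match improvement is less than 2 points across n runs".
// `improvements` holds per-run (treatment minus baseline) differences in points.
function isFalsified(improvements: number[], n: number, minMeanImprovement: number): boolean {
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("n must be a positive integer");
  }
  if (improvements.length !== n) {
    // The falsifier is defined on the preregistered sample only: no early or extended stopping.
    throw new Error(`expected ${n} runs, got ${improvements.length}`);
  }
  const mean = improvements.reduce((sum, x) => sum + x, 0) / n;
  return mean < minMeanImprovement;
}
```

Two reviewers running this on the same inputs cannot disagree about the outcome, which is the property the falsifier needs.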

4. Field 3 of 7 — validate sample size

In src/state/repo.ts, the registry field is n.
Do not invent a parallel sampleSize field in HYPOTHESES.md.
n must be a positive integer.
n must match the actual unit of repetition.
Say whether n means prompts, seeds, tasks, or full runs.
If the system is stochastic, n = 1 is usually a confession.
Reject “we will run until it looks stable.”
Reject “we will stop when the chart looks convincing.”
Reject “start with 3 and decide later” unless the staged plan itself is preregistered.
Match n to the falsifier.
Match n to the budget.
Match n to the expected runtime.
If you cannot afford the declared sample, narrow the claim instead of lying about the design.
If you do not know what one unit of repetition means, you are not ready to preregister.

5. Field 4 of 7 — validate judge configuration and lock it

The judge has four required leaves.
They are model, prompt, temperature, and seed.
Missing any one of them means the judge is not locked.
“Default temperature” is not a value.
“Current prompt” is not a value.
“Latest model” is not a value.
Pin the exact model identifier.
Pin the exact prompt text or an immutable prompt reference.
If the prompt lives in a file, record the file path and immutable revision.
Record temperature as a number.
Record seed as a number.
If the provider ignores seeds, record the requested seed anyway.
Build a canonical object from exactly these four fields.
Serialize it in stable key order.
Turn that frozen judge payload into judgeRef.
Use writeJudgeLock(cwd, hypothesisId, judgeRef) from src/state/repo.ts to write experiments/{id}/judge.lock.
If getJudgeLock(cwd, hypothesisId) already returns a value, recompute the expected hash with computeJudgeHash(judgeRef, hypothesisId) and compare it.
If the hash differs, stop.
That is judge drift.
Do not overwrite drift casually.
If you must break the lock, record the reason in OVERRIDES.md before changing it.
The discipline is simple: no drift without override.
This lock belongs in preregistration, not after the first eval.
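
The canonical-object and stable-key-order steps above can be sketched as follows. The repo's real judgeRef format is not specified in this document, so treat the JSON shape, the alphabetical key order, and the name `canonicalJudgeRef` as assumptions.

```typescript
// Illustrative only; the actual judgeRef format used by writeJudgeLock is not shown here.
interface JudgeConfig {
  model: string;
  prompt: string; // exact prompt text, or a path plus immutable revision
  temperature: number;
  seed: number;
}

function canonicalJudgeRef(judge: JudgeConfig): string {
  if (judge.model.trim() === "" || judge.prompt.trim() === "") {
    throw new Error("model and prompt must be pinned to concrete values");
  }
  if (!Number.isFinite(judge.temperature) || !Number.isFinite(judge.seed)) {
    throw new Error("temperature and seed must be numbers");
  }
  // Rebuild the object with exactly four keys in a fixed order, so extra
  // properties and caller key order cannot change the serialized form.
  return JSON.stringify({
    model: judge.model,
    prompt: judge.prompt,
    seed: judge.seed,
    temperature: judge.temperature,
  });
}
```

Because the serialized string is identical for equivalent configs, any hash computed over it changes only when one of the four leaves changes.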

6. Field 5 of 7 — validate baseline reference

Comparative claims require a named baseline.
Use loadBaselines(cwd) to inspect known baselines.
Current BaselineEntry fields are name, url, score, judge, version, and retrieved.
If the target baseline exists locally, inspect retrieved with getBaselineAgeDays(entry).
If the baseline is older than 30 days, it is stale.
Stale baselines do not support fresh comparison claims.
If the baseline is external, record the URL anyway.
Record the quoted score.
Record the version or release identifier.
Record the evaluation method or judge when known.
Record whether local reproduction is still pending.
Do not use “SOTA” as a baseline name.
Do not use memory as a baseline source.
If you cannot name the baseline precisely, the claim is not ready.
Naming a baseline is not reproducing a baseline.
That comes next, under /skill:baseline-reproduction.
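
The 30-day staleness rule can be illustrated with a small date calculation. This is a hypothetical re-implementation for reading purposes: the real getBaselineAgeDays(entry) may behave differently, and the sketch assumes `retrieved` is an ISO date string such as 2026-05-10.

```typescript
// Hypothetical stand-in for illustration; not the repo's getBaselineAgeDays.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between the retrieval date and `now`.
function baselineAgeDays(retrieved: string, now: Date): number {
  const then = new Date(retrieved).getTime();
  if (Number.isNaN(then)) {
    throw new Error(`unparseable retrieved date: ${retrieved}`);
  }
  return Math.floor((now.getTime() - then) / MS_PER_DAY);
}

// A baseline is stale once it is older than the limit (30 days in this skill).
function isStale(retrieved: string, now: Date, maxAgeDays = 30): boolean {
  return baselineAgeDays(retrieved, now) > maxAgeDays;
}
```

Passing `now` in explicitly keeps the check reproducible when it is re-run during review.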

7. Field 6 of 7 — validate cost cap

The cost cap is part of the design.
It is not decoration.
The ledger lives at .epistemic/cost-ledger.jsonl.
Use getHypothesisSpend(cwd, hypothesisId) to inspect existing spend for this experiment.
Use getAllHypothesisSpends(cwd) if you need repo-wide context.
Before a clean prereg, spend should usually be zero.
Non-zero spend before prereg is a protocol breach or a resumed run.
Treat that as a red flag.
Set the cap in real USD.
Base it on expected calls, tokens, environment cost, and n.
Reject “$0,” “uncapped,” or “we’ll see.”
Reject caps that cannot fund the declared sample.
Reject caps that only work if every run succeeds on the first try.
Later, actual tool costs get recorded through appendCostRecord(cwd, record).
A real cap should feel slightly uncomfortable.
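
The rejection rules above amount to simple arithmetic over n, per-unit cost, and a retry margin. The sketch below is an assumption-laden illustration: `CostPlan`, `retryAllowance`, and both function names are hypothetical, and the per-unit cost is whatever estimate you derive from expected calls, tokens, and environment cost.

```typescript
// All names here are illustrative assumptions, not repo APIs.
interface CostPlan {
  n: number; // preregistered sample size
  costPerUnitUsd: number; // expected spend for one unit of repetition
  retryAllowance: number; // 0.2 budgets for 20% failed or repeated units
}

function requiredBudgetUsd(plan: CostPlan): number {
  return plan.n * plan.costPerUnitUsd * (1 + plan.retryAllowance);
}

// Returns the reasons a cap should be rejected; an empty list means it passes.
function validateCostCap(capUsd: number, plan: CostPlan): string[] {
  const problems: string[] = [];
  if (!Number.isFinite(capUsd) || capUsd <= 0) {
    problems.push("cap must be a positive, finite USD amount");
  }
  if (plan.retryAllowance <= 0) {
    problems.push("plan only works if every run succeeds on the first try");
  }
  if (capUsd < requiredBudgetUsd(plan)) {
    problems.push("cap cannot fund the declared sample");
  }
  return problems;
}
```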

8. Field 7 of 7 — validate best-case conclusion explicitly

This field is mandatory.
Ask the question directly:
What is the sober, low-expectations outcome if this works?
Write the answer before results exist.
Keep it to one sentence.
Keep it smaller than the story in your head.
Tie it to the named task, named baseline, and locked judge.
Good: “Under the locked judge on GSM8K, prompt A appears better than the reproduced zero-shot baseline.”
Bad: “We solved reasoning.”
Bad: “This proves general intelligence.”
Bad: “Users will love it everywhere.”
In the current repo, HypothesisEntry persists bestCaseConclusion.
Record it in HYPOTHESES.md.
Mirror it in experiments/{id}/prereg.md.
If the conclusion ceiling feels restrictive, that is proof it is doing its job.

9. Execution field — validate `computeTarget` and scaffold the right runtime

Read computeTarget from the active HypothesisEntry.
The allowed values are exactly local, docker, and modal.
Validate the value.
Reject blank values.
Reject unknown values like “gpu,” “cluster,” “k8s,” “serverless,” or “whatever runs fastest.”
Do not infer the target from the current laptop.
Do not let the target drift because infra friction changed.
The target is part of the contract.
It determines which scaffold must exist before code runs.
Route strictly by value.
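
Strict validation of the target can be sketched as a narrowing function over the three allowed strings. `parseComputeTarget` is a hypothetical name used for illustration; the document only states that the allowed values are local, docker, and modal.

```typescript
type ComputeTarget = "local" | "docker" | "modal";
const COMPUTE_TARGETS: readonly ComputeTarget[] = ["local", "docker", "modal"];

// Hypothetical validator: accepts only the exact allowed strings, with no
// case folding, trimming, or fallback to a default target.
function parseComputeTarget(value: unknown): ComputeTarget {
  const match = COMPUTE_TARGETS.find((target) => target === value);
  if (match === undefined) {
    throw new Error(`invalid computeTarget: ${JSON.stringify(value)}`);
  }
  return match;
}
```

Refusing to default is deliberate in this sketch: a missing target should stop the prereg rather than silently become local.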

If `computeTarget = local`

prereg.md is still required.
judge.lock is still required.
Record the target as local in both HYPOTHESES.md and prereg.md.
Write down any local-only assumptions in the prereg notes.
Do not pretend local execution is reproducible just because it is convenient.
Local is the lightest scaffold, not an exemption from discipline.

If `computeTarget = docker`

Create experiments/{id}/Dockerfile.
Create experiments/{id}/requirements.txt.
Pin the Python base image version exactly.
Example: FROM python:3.11.9-slim.
Do not use floating tags like python:3.11, python:latest, or ubuntu:latest.
In requirements.txt, pin every third-party dependency exactly with package==version.
Reject unpinned entries like numpy, pandas>=2, or comment-only placeholders.
An empty requirements.txt is acceptable only if the experiment truly uses the Python standard library only, and the prereg should say so plainly.
Keep the Dockerfile boring.
Good scaffolds privilege reproducibility over clever caching tricks.
A minimal shape is enough:

FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN python -m pip install --upgrade pip==24.2 \
    && python -m pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "experiments.{id}.run"]

A minimal requirements.txt shape is:

numpy==2.1.1
pydantic==2.9.2

Those versions are examples, not permission to guess your real dependencies.
The real file must pin the dependencies the experiment actually uses.
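
The exact-pin rule for requirements.txt can be checked mechanically before the environment is hashed. The lint below is a hypothetical sketch: the pattern is a deliberate simplification of real requirement syntax (no URLs, environment markers, or include directives), and `unpinnedRequirements` is not a repo function.

```typescript
// Simplified pattern for `package==version`, with optional extras like pkg[extra].
const PINNED = /^[A-Za-z0-9][A-Za-z0-9._-]*(\[[A-Za-z0-9._,-]+\])?==[A-Za-z0-9.!+_-]+$/;

// Returns every requirement line that is not an exact `==` pin.
function unpinnedRequirements(requirementsTxt: string): string[] {
  return requirementsTxt
    .split(/\r?\n/)
    .map((line) => line.replace(/#.*$/, "").trim()) // drop comments and whitespace
    .filter((line) => line !== "")
    .filter((line) => !PINNED.test(line));
}
```

A comment-only file passes this lint with an empty result, so the standard-library-only case still needs the plain statement in the prereg that the text above asks for.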
After the Dockerfile and requirements are frozen, compute the environment hash with computeEnvironmentHash(...) from src/state/repo.ts.
Use the exact frozen Dockerfile input and exact frozen requirements input as the lock basis.
Write the resulting SHA-256 to experiments/{id}/environment.lock.
The required layout is:
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/Dockerfile
- experiments/{id}/requirements.txt
- experiments/{id}/environment.lock
If getEnvironmentLock(cwd, hypothesisId) already returns a value, recompute from the current scaffold and compare it.
If the hash differs, stop.
That is environment drift.
Do not overwrite it casually.
If you must change the environment after prereg, record the reason in OVERRIDES.md first.
Judge drift and environment drift follow the same rule: no drift without override.
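
The shared drift rule can be sketched as a small decision function that works for either lock. The names are hypothetical, the hashes are passed in however they were computed, and a missing lock is modelled as `null`, which is an assumption about what the lock readers return.

```typescript
// Illustrative decision logic; not part of src/state/repo.ts.
type LockDecision = "write-new-lock" | "unchanged" | "drift";

// Compares the stored lock value with a freshly recomputed hash.
function compareLock(existing: string | null, recomputed: string): LockDecision {
  if (existing === null) {
    return "write-new-lock";
  }
  return existing.trim() === recomputed.trim() ? "unchanged" : "drift";
}

// On drift, refuse to proceed unless an override was recorded first.
function assertNoDrift(
  kind: "judge" | "environment",
  decision: LockDecision,
  overrideRecorded: boolean,
): void {
  if (decision === "drift" && !overrideRecorded) {
    throw new Error(`${kind} drift: record the reason in OVERRIDES.md before changing the lock`);
  }
}
```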

If `computeTarget = modal`

Create experiments/{id}/modal-app.py.
The stub must exist before the first remote run.
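Generate a minimal app stub that defines a modal.App object and registers one entry-point function with @app.function(), matching the chosen Modal execution surface.
Keep it boring and explicit.
The stub exists to freeze the entry point, not to show off framework fluency.
A minimal shape is:

import modal

# The App object names the frozen entry point; replace {id} with the experiment ID.
app = modal.App("experiment-{id}")


@app.function()
def run():
    """Experiment entry point scaffold for {id}."""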

If the real run needs additional functions, images, volumes, or secrets, add them deliberately after they are part of the prereg contract.
Do not quietly bootstrap Modal from an ad hoc scratch file outside experiments/{id}/.
Record the target as modal in both HYPOTHESES.md and prereg.md.
For now, the explicit environment lock discipline in this repo applies to Docker scaffolds.
Do not invent a parallel environment.lock rule for Modal unless the repo standard changes intentionally.

10. Update `HYPOTHESES.md` and write the prereg artifacts

HYPOTHESES.md is the compact registry.
Use loadHypotheses(cwd) to load existing entries.
Modify the active HypothesisEntry in memory or create one if none exists.
The persisted fields now include id, claim, falsifier, bestCaseConclusion, n, judgeRef, baselineRef, costCap, computeTarget, status, and timestamp.
Preserve valid existing metadata.
Keep status OPEN.
Do not set RUNNING here.
Do not hand-edit the file into a shape parseHypotheses cannot read.
Use hypothesisToMarkdown(entry) and saveHypotheses(cwd, entries).
Write the narrative contract at the exact path experiments/{id}/prereg.md.
Create experiments/{id}/ if needed.
Include all seven epistemic fields.
Include the compute target explicitly.
Include the experiment ID.
Include the current date.
Include status OPEN.
Include the raw judge fields under a dedicated judge section.
Include notes that justify the baseline and sample size.
If computeTarget = docker, include the scaffold file paths and the environment lock hash.
If computeTarget = modal, include the path to modal-app.py.
If computeTarget = local, include the local environment assumptions plainly.
Do not include outputs.
Do not include smoke numbers.
Do not include screenshots.
Do not include provisional claims.
Use a shape like this:

# Pre-registration: {id}
- Date: {date}
- Status: OPEN
- Claim: {claim}
- Falsifier: {falsifier}
- N: {n}
- Baseline reference: {baselineRef}
- Cost cap: ${costCap}
- Best-case conclusion: {bestCaseConclusion}
- Compute target: {computeTarget}

## Judge
- Model: {model}
- Prompt: {prompt}
- Temperature: {temperature}
- Seed: {seed}

## Environment
- Dockerfile: experiments/{id}/Dockerfile
- Requirements: experiments/{id}/requirements.txt
- Environment lock: {environmentHash}

## Notes
- Baseline status: pending local reproduction
- Sample rationale: {why n is enough}

If the target is not Docker, adapt the environment section honestly instead of copying boilerplate.
If a field is missing, the prereg is incomplete.
If the file reads like a diary, rewrite it until it reads like a contract.
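
Rendering prereg.md from structured values, rather than typing it by hand, is one way to keep the file shaped like a contract. The renderer below is a hypothetical sketch: `PreregInput` mirrors the fields listed in this step but is not the repo's HypothesisEntry type, and the `environment` and `notes` arrays are an assumed way to carry the target-specific lines.

```typescript
// Hypothetical input shape for illustration only.
interface PreregInput {
  id: string;
  date: string; // YYYY-MM-DD
  claim: string;
  falsifier: string;
  n: number;
  baselineRef: string;
  costCapUsd: number;
  bestCaseConclusion: string;
  computeTarget: "local" | "docker" | "modal";
  judge: { model: string; prompt: string; temperature: number; seed: number };
  environment: string[]; // target-specific lines, e.g. scaffold paths and lock hash
  notes: string[];
}

function renderPrereg(p: PreregInput): string {
  const bullets = (items: string[]): string[] => items.map((item) => `- ${item}`);
  const lines: string[] = [
    `# Pre-registration: ${p.id}`,
    ...bullets([
      `Date: ${p.date}`,
      "Status: OPEN", // the gate, not this file, moves the hypothesis to RUNNING
      `Claim: ${p.claim}`,
      `Falsifier: ${p.falsifier}`,
      `N: ${p.n}`,
      `Baseline reference: ${p.baselineRef}`,
      `Cost cap: $${p.costCapUsd}`,
      `Best-case conclusion: ${p.bestCaseConclusion}`,
      `Compute target: ${p.computeTarget}`,
    ]),
    "",
    "## Judge",
    ...bullets([
      `Model: ${p.judge.model}`,
      `Prompt: ${p.judge.prompt}`,
      `Temperature: ${p.judge.temperature}`,
      `Seed: ${p.judge.seed}`,
    ]),
    "",
    "## Environment",
    ...bullets(p.environment),
    "",
    "## Notes",
    ...bullets(p.notes),
  ];
  return lines.join("\n") + "\n";
}
```

The input type has no slot for outputs, smoke numbers, or provisional claims, so the rendered file cannot pick them up by accident.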

11. Commit the prereg before code runs

Writing the files is not enough.
The prereg must exist in version control before the experiment starts.
Stage the registry and experiment artifacts together.
The always-required set is:
- HYPOTHESES.md
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
If computeTarget = docker, also stage:
- experiments/{id}/Dockerfile
- experiments/{id}/requirements.txt
- experiments/{id}/environment.lock
If computeTarget = modal, also stage:
- experiments/{id}/modal-app.py
Use a clean prereg commit.
Do not batch this with result files.
Do not batch this with smoke artifacts.
Do not batch this with “one quick run.”
The whole point is temporal ordering.
Once the prereg exists, src/gates/prereg.ts can allow experiment-shaped bash calls.
On the first allowed run, that gate calls updateHypothesisStatus(cwd, id, "RUNNING").
Let the gate own that transition.
A clean commit looks like this:

git add HYPOTHESES.md experiments/{id}/prereg.md experiments/{id}/judge.lock
# docker only:
git add experiments/{id}/Dockerfile experiments/{id}/requirements.txt experiments/{id}/environment.lock
# modal only:
git add experiments/{id}/modal-app.py
git commit -m "epistemic: prereg {id}"

After the commit exists, hand off to baseline reproduction.
Do not call the adversary yet.
Do not write experiments/{id}/RESULTS.md yet.
Do not quote smoke artifacts from experiments/{id}/smokes/ yet.

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| “I’ll write `prereg.md` after one smoke run.” | Then the smoke run already contaminated the design. |
| “I only need a quick script.” | Quick scripts still generate evidence. |
| “The falsifier is obvious.” | If it is not written, it will move. |
| “Temperature defaults to zero anyway.” | Defaults drift; pin it. |
| “Seed does not matter for this provider.” | Recording it is cheap and auditable. |
| “I know the baseline from memory.” | Memory is not a reproduced source. |
| “I’ll lock the judge later.” | Later means after outputs existed. |
| “Judge lock is enough; the environment can stay loose.” | Judge drift and environment drift are different failure modes. |
| “We’ll decide between local and Docker after the first run.” | Then `computeTarget` was never part of the contract. |
| “The Dockerfile can stay on `latest` for now.” | Floating base images are just environment drift with better branding. |
| “`requirements.txt` can stay loose; pip will figure it out.” | Loose dependencies are how unreproducible wins get born. |
| “Modal setup is just operational glue.” | Operational changes change what actually ran. |
| “We only spent a little before prereg.” | Reconcile it honestly or kill the run. |
| “Best-case conclusion feels restrictive.” | That is exactly why it matters. |
| “The gate only catches certain commands.” | Integrity is not defined by regex loopholes. |
| “I’ll commit prereg together with the experiment run.” | That destroys ordering and turns prereg into theater. |
| “The baseline is famous enough that we do not need a URL.” | Fame is not provenance. |
| “We can overwrite `environment.lock`; it’s only scaffolding.” | Scaffolding that changes results is part of the experiment. |

Red Flags - STOP

getHypothesisSpend(cwd, id) is already non-zero before prereg exists.
getJudgeLock(cwd, id) exists and does not match the current canonical judge.
getEnvironmentLock(cwd, id) exists and does not match the current Docker scaffold.
The falsifier mentions philosophy, intent, vibes, elegance, or worldview.
The claim compares against a baseline you cannot name precisely.
getBaselineAgeDays(entry) says the comparator is older than 30 days.
The judge prompt points at a mutable scratch file with no immutable revision.
n is “TBD,” “until stable,” or any other moving target.
The cost cap cannot fund the declared sample.
The best-case conclusion is broader than the claim.
computeTarget is missing or not one of local, docker, modal.
computeTarget = docker but Dockerfile, requirements.txt, or environment.lock is missing.
computeTarget = modal but modal-app.py is missing.
The Docker base image tag floats.
requirements.txt contains unpinned dependencies.
experiments/{id}/prereg.md contains outputs, screenshots, or smoke numbers.
You feel pressure to run “just one command” before the prereg commit exists.
Multiple OPEN or RUNNING hypotheses are sharing one result stream.
You are about to overwrite judge.lock because it is inconvenient.
You are about to overwrite environment.lock because the container changed “just a little.”
You are about to hand-edit HYPOTHESES.md into a format parseHypotheses cannot parse.

Good vs Bad

Good claim and falsifier

Claim:
Prompt A improves exact-match over the reproduced zero-shot baseline by at least 2 points on GSM8K under the locked judge.
Falsifier:
If mean exact-match improvement is less than 2 points across n=30 locked runs, the claim is falsified.

Why it is good:

The comparator exists.
The metric exists.
The threshold exists.
The falsifier is empirical.
Another reviewer can execute it without reading your mind.

Bad claim and falsifier

Claim:
Prompt A is smarter and more robust.
Falsifier:
If the model does not truly understand the task or if the benchmark feels unfair.

Why it is bad:

“Smarter” is not a metric.
“More robust” is not tied to a measurement.
The falsifier is philosophical.
The benchmark complaint is a moving escape hatch.

Good judge config

{
  "model": "gpt-4.1-mini-2026-04-14",
  "prompt": "prompts/gsm8k-judge-v3.md@9f3e2c1",
  "temperature": 0,
  "seed": 17
}

Why it is good:

Every required leaf is present.
The prompt is pinned.
The values can be serialized into judgeRef.
writeJudgeLock(cwd, hypothesisId, judgeRef) can lock it cleanly.

Bad judge config

{
  "model": "latest",
  "prompt": "current prompt",
  "temperature": "default"
}

Why it is bad:

latest drifts.
The prompt is mutable.
temperature is not numeric.
seed is missing.

Good Docker scaffold

FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN python -m pip install --upgrade pip==24.2 \
    && python -m pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "experiments.gsm8k_cot.run"]

numpy==2.1.1
pydantic==2.9.2

Why it is good:

The Python version is pinned.
Dependencies are pinned.
The entrypoint is explicit.
computeEnvironmentHash(...) has a stable scaffold to lock.

Bad Docker scaffold

FROM python:latest
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "run.py"]

numpy
pydantic>=2

Why it is bad:

The base image drifts.
The dependency set drifts.
The entrypoint is vague.
Any later environment.lock built on this is theater.

Good `prereg.md`

# Pre-registration: gsm8k-cot-a-vs-zeroshot-2026-05-31
- Date: 2026-05-31
- Status: OPEN
- Claim: Prompt A improves exact-match over the reproduced zero-shot baseline by at least 2 points on GSM8K under the locked judge.
- Falsifier: If mean exact-match improvement is less than 2 points across n=30 locked runs, the claim is falsified.
- N: 30
- Baseline reference: GPT-4o zero-shot, https://example.com/report, score 84.1, judge exact-match, version 2026-05-10, pending reproduction
- Cost cap: $35
- Best-case conclusion: Under the locked judge on GSM8K, prompt A appears better than the reproduced zero-shot baseline.
- Compute target: docker

## Judge
- Model: gpt-4.1-mini-2026-04-14
- Prompt: prompts/gsm8k-judge-v3.md@9f3e2c1
- Temperature: 0
- Seed: 17

## Environment
- Dockerfile: experiments/gsm8k-cot-a-vs-zeroshot-2026-05-31/Dockerfile
- Requirements: experiments/gsm8k-cot-a-vs-zeroshot-2026-05-31/requirements.txt
- Environment lock: 9db0c4f7f5f0c2f57d6e1f5a0d1b4f8a8e4d9f8e4578a6d3e7c8a2b9d1f8c4aa

Why it is good:

All seven epistemic fields are present.
The compute target is explicit.
The judge is inspectable.
The baseline target is concrete.
The conclusion ceiling is explicit.
The environment artifact path is frozen.

Bad `prereg.md`

# Notes for experiment
- Claim: We think this prompt might be better.
- Falsifier: TBD
- N: start with 3
- Baseline: sota
- Cost cap: maybe $100?
- Compute target: maybe docker later

Why it is bad:

The judge section is missing.
The best-case conclusion is missing.
The falsifier is missing.
The baseline is fake.
N is still being negotiated.
The compute target is not locked.
There is no environment scaffold discipline.

Why This Matters

Preregistration protects you from your future self. Not the cartoon villain version. The tired, clever, motivated version that can rationalize almost anything after seeing a graph.

A written claim prevents drift. A hard falsifier prevents rhetoric from replacing evidence. A declared sample prevents optional stopping. A locked judge prevents judge-shopping. A locked compute target prevents platform-shopping. A locked Docker scaffold prevents “works on my machine” folklore from sneaking into the result. A named baseline gives /skill:baseline-reproduction a real target. A cost cap prevents ego from burning money. A best-case conclusion caps what you are allowed to say even on a good day. A committed prereg.md gives src/gates/prereg.ts permission to let code run.

If this phase is sloppy, every later phase inherits the slop. If this phase is tight, later disagreement becomes useful instead of political.

After this, use /skill:baseline-reproduction.
