From epistemic-skills
Transforms vague research ideas into falsifiable, prereg-ready hypotheses with rivals and disconfirming predictions. Use before any benchmark, eval, or training work begins.
Slash command: /epistemic-skills:research-question
A vague idea is not a hypothesis. A favorite explanation is not research.
This phase exists to stop the oldest failure mode in experimental work: seeing one interesting observation, falling in love with the first story that explains it, and then building the whole experiment around that story.
In this repo, the output of this phase is:
- an OPEN draft in HYPOTHESES.md
- the non-chosen explanations archived under experiments/{id}/alternatives/

Current repo reality matters. src/state/repo.ts already persists bestCaseConclusion and computeTarget on HypothesisEntry. Use the real shape. Do not invent a smaller checklist because an older note called it "seven fields."
One question at a time. One ambiguity killed at a time. No giant intake form. No favorite theory disguised as inevitability.
NO HYPOTHESIS WITHOUT A RIVAL AND A WAY TO LOSE
If the observation only has one explanation, you are not done. If the explanations do not have distinct disconfirming predictions, you are not done. If the claim still cannot be falsified mechanically, you are not done. If the compute target is still "we'll figure it out later," the cost and runtime story is still fiction.
Do not move to preregistration with a claim that only sounds disciplined because the wording got longer.
Use this skill when:
- … HYPOTHESES.md

Do not use this skill when:
- experiments/{id}/prereg.md already exists and the contract is being locked; use /skill:preregistration
- … /skill:baseline-reproduction
- HYPOTHESES.md does not exist yet

| Surface | API or file | Why it matters now |
|---|---|---|
| Repo sanity | loadRepoState(cwd) | Confirm the epistemic scaffold exists before assuming normal flow |
| Canonical registry | HYPOTHESES.md | Final output of this stage |
| Read live ideas | loadHypotheses(cwd) | Avoid overwriting active work |
| Parse registry text | parseHypotheses(content) | Use only if raw markdown is already loaded |
| Detect live work | getActiveHypothesis(entries) | Decide whether you are refining an existing idea or minting a new id |
| Render and persist | hypothesisToMarkdown(entry), saveHypotheses(cwd, entries) | Preserve the real repo format, including bestCaseConclusion and computeTarget |
| Baseline context | BASELINES.md, loadBaselines(cwd) | Name the comparator precisely instead of vaguely |
| Baseline freshness | getBaselineAgeDays(entry) | Detect stale references before you lean on them |
| Prior spend | getHypothesisSpend(cwd, id), getAllHypothesisSpends(cwd) | Ground the budget in actual burn instead of vibes |
| Cost ledger | .epistemic/cost-ledger.jsonl | Planned here, filled later |
| Alternative archive | experiments/{id}/alternatives/ | Preserve rejected explanations instead of letting them vanish |
| Later prereg artifact | experiments/{id}/prereg.md | Next phase only; do not write it here |
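
As a rough illustration of the baseline freshness row, the sketch below computes a comparator's age in days from a retrieval date. It does not call the repo's getBaselineAgeDays(entry), whose real signature and entry type are not shown here; the BaselineSketch shape, the retrieved field, and the 30-day threshold are all assumptions made for illustration.

```typescript
// Hypothetical shape: the real baseline entry type lives in the repo and may differ.
interface BaselineSketch {
  name: string;
  retrieved: string; // ISO date the comparator was last fetched or reproduced
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age of the baseline in whole days, relative to an explicit "now"
// so the result is deterministic.
function baselineAgeDaysSketch(entry: BaselineSketch, nowMs: number): number {
  const retrievedMs = Date.parse(entry.retrieved);
  if (isNaN(retrievedMs)) {
    throw new Error("Unparseable retrieved date for baseline " + entry.name);
  }
  return Math.floor((nowMs - retrievedMs) / MS_PER_DAY);
}

// Assumed staleness rule: older than maxAgeDays counts as stale.
function isStaleSketch(entry: BaselineSketch, nowMs: number, maxAgeDays: number): boolean {
  return baselineAgeDaysSketch(entry, nowMs) > maxAgeDays;
}

const promptA: BaselineSketch = { name: "prompt-a", retrieved: "2026-05-31T00:00:00Z" };
const checkedAt = Date.parse("2026-07-15T00:00:00Z");
console.log(baselineAgeDaysSketch(promptA, checkedAt), isStaleSketch(promptA, checkedAt, 30));
```

A stale result is a prompt to re-reproduce or re-source the comparator before naming it in baselineRef, not a reason to drop it silently.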
Legacy shorthand called this the "7-field" checklist. That shorthand is stale.
In this repo, a prereg-ready hypothesis has eight required research fields because bestCaseConclusion and computeTarget are first-class parts of the contract:
1. claim: one measurable sentence with an intervention, comparator, metric, and context.
2. falsifier: one empirical sentence answering "What would disprove this?"
3. bestCaseConclusion: a one-sentence, low-expectations framing, namely the strongest boring conclusion you are allowed to write if the experiment succeeds.
4. n: the sample size, counted in prompts, seeds, tasks, or runs.
5. judgeRef: the exact judge configuration you expect to lock later.
6. baselineRef: the comparator name plus enough provenance to reproduce it later.
7. costCap: the maximum USD the idea is allowed to consume before a real decision point.
8. computeTarget: where the experiment will run (local, docker, or modal).

Housekeeping fields still matter: id, status, and timestamp.
Do not confuse "the registry can store it" with "the hypothesis is scientifically complete." The checklist is the research contract. The markdown shape is only the transport.
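
A minimal sketch of what a mechanical pre-check over these fields could look like. The HypothesisDraft interface only mirrors the field names listed above; the authoritative HypothesisEntry lives in src/state/repo.ts and may differ. The lintDraft function, the token lists, and the error messages are hypothetical and not part of the repo.

```typescript
// Assumed shape mirroring the fields listed above (not the repo's real type).
interface HypothesisDraft {
  id: string;
  claim: string;
  falsifier: string;
  bestCaseConclusion: string;
  n: number;
  judgeRef: string;
  baselineRef: string;
  costCap: number; // USD
  computeTarget: string; // plain string so bad values can be reported
  status: string;
  timestamp: number;
}

// Hypothetical lists: drifting placeholder tokens and the three named targets.
const DRIFTING_TOKENS = ["latest", "default", "current", "tbd", "later"];
const COMPUTE_TARGETS = ["local", "docker", "modal"];

// True when any alphanumeric token of the value is a drifting placeholder.
function hasDriftingToken(value: string): boolean {
  const tokens = value.toLowerCase().split(/[^a-z0-9]+/);
  return tokens.some(function (token) {
    return DRIFTING_TOKENS.indexOf(token) !== -1;
  });
}

// Returns one message per mechanical problem. An empty array means the draft
// passed these checks, not that the hypothesis is scientifically sound.
function lintDraft(draft: HypothesisDraft): string[] {
  const problems: string[] = [];
  const required: Array<[string, string]> = [
    ["claim", draft.claim],
    ["falsifier", draft.falsifier],
    ["bestCaseConclusion", draft.bestCaseConclusion],
    ["judgeRef", draft.judgeRef],
    ["baselineRef", draft.baselineRef],
  ];
  required.forEach(function (pair) {
    if (pair[1].trim().length === 0) problems.push(pair[0] + " is empty");
  });
  if (!(isFinite(draft.n) && Math.floor(draft.n) === draft.n && draft.n >= 1)) {
    problems.push("n must be a positive integer");
  }
  if (hasDriftingToken(draft.judgeRef)) {
    problems.push("judgeRef uses a drifting placeholder");
  }
  if (!(isFinite(draft.costCap) && draft.costCap > 0)) {
    problems.push("costCap must be a positive USD amount");
  }
  if (COMPUTE_TARGETS.indexOf(draft.computeTarget) === -1) {
    problems.push("computeTarget must be local, docker, or modal");
  }
  return problems;
}
```

A check like this catches blanks and placeholders only. It cannot tell whether a claim names a real metric or whether the falsifier is actually distinct from the rivals, so it supplements the one-question-at-a-time review rather than replacing it.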
This phase comes before you settle on one claim.
Start from one observation. Generate 2-3 competing explanations for that same observation. Each explanation must have its own unique disconfirming prediction.
If two explanations die under the same killer test, they are not meaningfully separated yet. Tighten them or delete one.
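
The separation check above can be partly mechanized. The sketch below flags pairs of explanations whose disconfirming predictions are textually identical after normalization; the RivalExplanation type and function names are hypothetical. It is a necessary check, not a sufficient one, because two differently worded predictions can still describe the same test.

```typescript
// Hypothetical shape for one row of rival explanations.
interface RivalExplanation {
  label: string;
  disconfirmingPrediction: string;
}

// Normalize so differences in case, spacing, or punctuation do not hide a duplicate.
function normalizePrediction(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Returns label pairs whose predictions match after normalization.
function sharedKillerTests(rivals: RivalExplanation[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < rivals.length; i++) {
    for (let j = i + 1; j < rivals.length; j++) {
      const left = normalizePrediction(rivals[i].disconfirmingPrediction);
      const right = normalizePrediction(rivals[j].disconfirmingPrediction);
      if (left === right) {
        pairs.push([rivals[i].label, rivals[j].label]);
      }
    }
  }
  return pairs;
}

const candidateRivals: RivalExplanation[] = [
  { label: "A", disconfirmingPrediction: "Gain disappears under exact-match" },
  { label: "B", disconfirmingPrediction: "gain disappears under exact match." },
  { label: "C", disconfirmingPrediction: "Gain disappears when demonstrations are content-swapped" },
];
console.log(sharedKillerTests(candidateRivals));
```
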
Use a table like this:
| Explanation | Unique disconfirming prediction | Falsifiability score (1-5) | Cost to test | Prior plausibility (1-5) |
|---|---|---|---|---|
| A | What result would kill A specifically? | 1-5 | dollars or low/med/high | 1-5 |
| B | What result would kill B specifically? | 1-5 | dollars or low/med/high | 1-5 |
| C | What result would kill C specifically? | 1-5 | dollars or low/med/high | 1-5 |
Ranking rule:
Do not let "it feels most likely" outrank "it is easiest to kill." Research gets cleaner when the chosen explanation is cheap to defeat.
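
One way to apply this is to sort candidates so that falsifiability dominates and prior plausibility only breaks ties. The sketch below is a possible ordering consistent with the sentence above, not the repo's rule: the RivalRow type, the rankRivals function, and the exact tie-break order (falsifiability, then cost, then prior) are assumptions. The researcher still makes the final choice.

```typescript
// Hypothetical row shape matching the table columns above.
interface RivalRow {
  label: string;
  falsifiability: number; // 1-5, higher means easier to kill
  costUsd: number; // estimated cost to test
  prior: number; // 1-5 prior plausibility
}

// Assumed ordering: falsifiability descending, then cost ascending,
// with prior plausibility only as the final tie-break.
function rankRivals(rows: RivalRow[]): RivalRow[] {
  return rows.slice().sort(function (a, b) {
    if (a.falsifiability !== b.falsifiability) return b.falsifiability - a.falsifiability;
    if (a.costUsd !== b.costUsd) return a.costUsd - b.costUsd;
    return b.prior - a.prior;
  });
}

const rivalRows: RivalRow[] = [
  { label: "A", falsifiability: 4, costUsd: 0.5, prior: 5 },
  { label: "B", falsifiability: 5, costUsd: 0.9, prior: 2 },
  { label: "C", falsifiability: 5, costUsd: 0.3, prior: 3 },
];
const rankedLabels = rankRivals(rivalRows).map(function (row) { return row.label; });
console.log(rankedLabels.join(","));
```

In this made-up example the most plausible explanation (A) ranks last because it is the hardest to kill, which is the behavior the rule asks for.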
Workflow:
- … claim candidate.
- … experiments/{id}/alternatives/.

Unchosen explanations are not trash. They are live audit material. If the chosen claim dies later, that archive is where honest follow-up starts.
- Use loadRepoState(cwd) if you need a top-level scaffold check.
- Use loadHypotheses(cwd) to inspect HYPOTHESES.md.
- Use parseHypotheses(content) instead of improvising a parser.
- Use getActiveHypothesis(entries) to see whether one idea is already OPEN or RUNNING.
- … id.
- … id before you archive alternatives.
- … experiments/{id}/alternatives/ immediately.
- … better, smarter, more robust, or more aligned when no metric is named.
- … RESULTS.md without extra storytelling.
- … HYPOTHESES.md.
- … n.
- … n = TBD.
- … n to the falsifier.
- … n must support a mean.
- … n must support a win rate.
- n = 1 is usually theater.
- If n does not fit the likely budget, narrow the claim instead of pretending the sample is enough.
- … model, prompt, temperature, and seed.
- … latest, default, current prompt, and other drifting placeholders.
- … judgeRef.
- Use loadBaselines(cwd) to see whether the comparator already exists in BASELINES.md.
- … getBaselineAgeDays(entry).
- … computeTarget.
- local: fastest feedback loop, lowest setup overhead, easiest interactive debugging, highest environment drift risk
- docker: slower upfront, better dependency and OS pinning, better handoff and reproducibility, usually the safest default when environment matters
- modal: best for remote parallelism or heavier managed compute, but adds orchestration, secrets, cold starts, and extra spend; not a free default
- … modal, include infra overhead and retries in the budget.
- … docker, include image build and environment prep costs when they are real.
- … local, make sure the hardware and ambient environment assumptions are actually credible.
- Use getHypothesisSpend(cwd, id) for resumed ideas and getAllHypothesisSpends(cwd) when several open ideas are already burning money.
- … n under the chosen compute target.
- … $0, uncapped, or whatever it takes.

… bestCaseConclusion before results can seduce you

- n must be large enough for the falsifier.
- … n under the chosen compute target.
- bestCaseConclusion must be narrower than the claim surface, not broader.

… HYPOTHESES.md and stop there

- … id.
- … status: "OPEN" and timestamp: Date.now().
- … loadHypotheses(cwd).
- … id, claim, falsifier, bestCaseConclusion, n, judgeRef, baselineRef, costCap, computeTarget, status, and timestamp.
- Use saveHypotheses(cwd, entries) so the file stays parseable by parseHypotheses(content).
- … experiments/{id}/alternatives/.
- Do not write experiments/{id}/prereg.md yet.
- Do not write experiments/{id}/judge.lock yet.
- … OPEN draft in HYPOTHESES.md, the rival explanations are archived, and every research field is specific enough to survive preregistration.

| Excuse | Reality |
|---|---|
| "The first explanation is probably the right one." | Probably is exactly why you generate rivals first. |
| "We already know what caused the observation." | Then it should survive distinct competing explanations and distinct killer predictions. |
| "Two alternatives can share the same disconfirming test." | Then they are not separated explanations yet. Tighten or delete one. |
| "I'll just ask for all the fields at once." | Then you will get polished vagueness all at once. |
| "The claim is directionally clear." | Direction is not a metric, comparator, or falsifier. |
| "We can define the falsifier later." | Then the claim will drift to protect itself. |
| "The alternatives don't need to be saved." | Unwritten rejected explanations come back later as convenient excuses. |
| "The baseline is obvious." | If it is not named and sourced, it is folklore. |
| "n depends on how the first runs look." | That is optional stopping in a lab coat. |
| "bestCaseConclusion is just wording." | It is the ceiling that prevents post-hoc marketing. |
| "computeTarget is implied." | Hidden infrastructure assumptions are still assumptions. Record them. |
| "We'll decide local vs Docker vs Modal after a smoke run." | Environment choice changes cost, runtime, and reproducibility. Decide before evidence exists. |
| "Docker is overkill for a small experiment." | If reproducibility matters, the overhead is the point. |
| "Modal is just more cores." | Remote infrastructure changes billing, secrets, startup, and failure modes. |
| "The registry can stay compact; we'll remember the rest." | If it is not in the record, it will drift. |
| "The gates will catch mistakes later." | Gates do not rescue vague thinking. |
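
Several rows above turn on the point that the compute target changes what n costs. A minimal sketch of that arithmetic follows, assuming a simple model: per-run cost times n, inflated by a retry rate, plus fixed overhead such as image builds or cold starts. The RunPlanSketch type, its field names, and the example numbers are hypothetical; real figures would come from the cost ledger and prior spend.

```typescript
// Hypothetical planning inputs for one hypothesis under one compute target.
interface RunPlanSketch {
  n: number; // planned sample size
  costPerRunUsd: number; // assumed average cost of one run
  retryRate: number; // assumed fraction of runs repeated, e.g. 0.1
  fixedOverheadUsd: number; // image builds, cold starts, environment prep
}

// Estimated total spend for the plan under the simple model described above.
function estimateCostUsd(plan: RunPlanSketch): number {
  return plan.n * plan.costPerRunUsd * (1 + plan.retryRate) + plan.fixedOverheadUsd;
}

// A cap of zero or less is treated as "no cap was set" and never fits.
function fitsCap(plan: RunPlanSketch, costCapUsd: number): boolean {
  return costCapUsd > 0 && estimateCostUsd(plan) <= costCapUsd;
}

const dockerPlan: RunPlanSketch = { n: 30, costPerRunUsd: 0.4, retryRate: 0.1, fixedOverheadUsd: 2 };
console.log(estimateCostUsd(dockerPlan).toFixed(2), fitsCap(dockerPlan, 18));
```

If the estimate does not fit the cap, the options are to narrow the claim or raise the cap explicitly, not to shrink n until the number looks acceptable.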

- … better, smarter, or more robust without a metric.
- n is still TBD, until stable, or whatever fits.
- … latest, default, or a mutable scratch prompt.
- getBaselineAgeDays(entry) shows the only local comparator is stale and you are ignoring it.
- computeTarget is blank, implied, or later.
- … local but the planned run is actually modal.
- … experiments/{id}/alternatives/.
- bestCaseConclusion is broader than the claim.
- getActiveHypothesis(entries) points at a different live idea and you are still about to overwrite it.

Observation:
Prompt B beat Prompt A on GSM8K in three pilot traces.
| Explanation | Unique disconfirming prediction | Falsifiability | Cost | Prior |
| --- | --- | --- | --- | --- |
| Reasoning scaffold helps multi-step arithmetic | Gain persists under exact-match and concentrates on multi-step items | 5/5 | $0.60 | 4/5 |
| Judge prefers verbose answers | Gain disappears under exact-match or a terse-answer rubric | 5/5 | $0.40 | 3/5 |
| Prompt B leaks answer templates | Gain disappears when demonstrations are length-matched but content-swapped | 4/5 | $0.80 | 2/5 |
Why it is good:
Bad
Observation:
Prompt B looks smarter.
Explanation 1:
Prompt B is better.
Explanation 2:
Maybe the judge liked it.
Explanation 3:
Maybe randomness.
Why it is bad:
Q: Where will experiments run? local, Docker, or Modal?
A: Docker. The run is CPU-only, dependency-sensitive, and another researcher must be able to rerun it unchanged.
Why it is good:
- computeTarget now has real methodological meaning

Bad

We'll start local and move it wherever if the numbers look good.
Why it is bad:
- wherever is not a registered compute target

… HYPOTHESES.md

    // Read the current registry first so existing entries are preserved.
    const entries = await loadHypotheses(cwd);
    entries.push({
      id: "prompt-b-vs-prompt-a-gsm8k-2026-05-31",
      claim: "Prompt B improves exact-match over Prompt A on GSM8K under the named judge.",
      falsifier: "If mean exact-match improvement is less than 2 points across n=30 runs, the claim is falsified.",
      bestCaseConclusion: "Under the locked judge on GSM8K, Prompt B may outperform Prompt A by a modest margin.",
      n: 30,
      judgeRef: "model=gpt-5.4-mini,prompt=prompts/gsm8k-judge-v3.md@9f3e2c1,temp=0,seed=17",
      baselineRef: "prompt-a|url=https://example.com/prompts/a|version=2026-05-31|retrieved=2026-05-31",
      costCap: 18,
      computeTarget: "docker",
      status: "OPEN",
      timestamp: Date.now(),
    });
    // Persist through the repo helper so the file stays parseable.
    await saveHypotheses(cwd, entries);

Why it is good:
- … bestCaseConclusion and computeTarget
- … OPEN draft for preregistration

Bad

## Hypothesis: big-win
- **Claim:** Prompting is better
- **Falsifier:** If the vibe is off
- **N:** We'll see
- **Judge:** latest
- **Baseline:** SOTA
- **Cost cap:** whatever it takes
- **Compute target:** later
- **Best-case conclusion:** This changes everything
Why it is bad:
# Alternative 02: judge-format-bias
- Observation: Prompt B beat Prompt A on GSM8K in three pilot traces.
- Explanation: The judge prefers verbose step-by-step answers rather than better arithmetic.
- Unique disconfirming prediction: The win disappears under exact-match or a terse-answer rubric.
- Falsifiability score: 5/5
- Cost to test: $0.40
- Prior plausibility: 3/5
- Not chosen now because: The researcher selected the reasoning-scaffold explanation as the primary claim.
Why it is good:
Bad
We considered some other ideas but they were weaker.
Why it is bad:
Most bad research does not begin with fabricated numbers. It begins when the first plausible story becomes the only story, and nobody records what else could have explained the same observation.
Competing hypotheses make the question honest.
bestCaseConclusion keeps the future writeup small enough to deserve trust.
computeTarget stops environment drift from masquerading as methodological detail.
This stage is cheap. That is why people try to skip it.
After the draft exists in HYPOTHESES.md, the non-chosen explanations are written to experiments/{id}/alternatives/, and the prereg-ready fields are explicit, use /skill:preregistration.
When all slots are filled and confirmed by the user:
Generate the complete Research Document using the template from docs/research-document.md
Write it to RESEARCH.md in the repo root (overwrite if it already exists)
For each Research Story in section 10 (RS-001, RS-002, …), append an entry to HYPOTHESES.md:
Before appending: if HYPOTHESES.md does not yet exist, create it with the following header (including the trailing blank line):
# Hypotheses
Registered hypotheses for this research project.
Then append each entry. Get the current timestamp by running:
node -e "console.log(Date.now())"
This produces a 13-digit Unix millisecond integer (e.g. 1748995200000). Use that value for the Timestamp field.
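
A common mistake here is pasting a 10-digit seconds value. The sketch below is a small guard, assuming only what the sentence above states (a 13-digit millisecond integer); the isUnixMillis name is hypothetical and not a repo helper.

```typescript
// True when the value looks like a Unix timestamp in milliseconds:
// an integer with exactly 13 digits, i.e. in the range [1e12, 1e13).
function isUnixMillis(value: number): boolean {
  return isFinite(value) && Math.floor(value) === value && value >= 1e12 && value < 1e13;
}

// The example value from the text decodes to a plausible calendar date.
const exampleTimestamp = 1748995200000;
console.log(isUnixMillis(exampleTimestamp), new Date(exampleTimestamp).toISOString());
```
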
Set Compute target to local, docker, or modal based on the compute requirements described in the Research Story.
Each entry must use this exact format (field names match hypothesisToMarkdown exactly):
## Hypothesis: RS-NNN
- **Status:** OPEN
- **Claim:** <story title as a testable claim>
- **Falsifier:** <the observation that would falsify this claim>
- **Best case conclusion:** <one sentence on what a positive result would let you conclude>
- **N:** 30
- **Judge:** <judge model or script ref, e.g. gpt-4o or scripts/judge.py>
- **Baseline:** <baseline description, e.g. zero-shot GPT-4o>
- **Cost cap:** 50
- **Compute target:** local
- **Timestamp:** 1748995200000
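
To make the field order concrete, here is a sketch that renders one entry in the format shown above and prepends the registry header when the file does not exist yet. It is a hypothetical stand-in, not the repo's hypothesisToMarkdown: the StoryEntry type and both function names are assumptions, and the blank line between the header's title and its description is a guess at the original layout.

```typescript
// Hypothetical entry shape for illustration; the real type is HypothesisEntry.
interface StoryEntry {
  id: string;
  status: string;
  claim: string;
  falsifier: string;
  bestCaseConclusion: string;
  n: number;
  judgeRef: string;
  baselineRef: string;
  costCap: number;
  computeTarget: string;
  timestamp: number;
}

// Header text from the step above; the internal blank line is an assumption.
const REGISTRY_HEADER = "# Hypotheses\n\nRegistered hypotheses for this research project.\n\n";

// Renders one entry with the same labels and order as the format above.
function renderEntrySketch(e: StoryEntry): string {
  return [
    "## Hypothesis: " + e.id,
    "- **Status:** " + e.status,
    "- **Claim:** " + e.claim,
    "- **Falsifier:** " + e.falsifier,
    "- **Best case conclusion:** " + e.bestCaseConclusion,
    "- **N:** " + e.n,
    "- **Judge:** " + e.judgeRef,
    "- **Baseline:** " + e.baselineRef,
    "- **Cost cap:** " + e.costCap,
    "- **Compute target:** " + e.computeTarget,
    "- **Timestamp:** " + e.timestamp,
  ].join("\n") + "\n";
}

// Pass null when HYPOTHESES.md does not exist yet; the header is added once.
function appendEntrySketch(existing: string | null, e: StoryEntry): string {
  const base = existing === null ? REGISTRY_HEADER : existing;
  const separator = base.length === 0 || base.charAt(base.length - 1) === "\n" ? "" : "\n";
  return base + separator + renderEntrySketch(e) + "\n";
}
```

Where the repo helpers are available, prefer them over hand-built strings so the file stays parseable by parseHypotheses(content).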
Notify the user: "Research Document written to RESEARCH.md. Open the graph to see your hypotheses — the graph auto-refreshes within 2 seconds."
The graph panel reads RESEARCH.md and HYPOTHESES.md and will display the Research Document as the root node with Research Stories as circles.
npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills